Replacing missing values in a dataset in R

Hi, I have a dataset with 4 columns (all numeric) and I want to replace each missing value with the mean of its column. The code below neither gives an error nor replaces any values.
mi <- function(x){
  for( col in 1:ncol(x)){
    for( row in 1:nrow(x)){
      ifelse(is.na(x[row, col]), x[row,col] <- mean(x[, col], na.rm = TRUE), x[row, col])
    }
  }
}
Please suggest a fix.

Here's a pretty straightforward approach, starting with some reproducible sample data:
set.seed(1)
df <- data.frame(matrix(sample(c(NA, 1:10), 100, TRUE), ncol = 4))
head(df)
# X1 X2 X3 X4
# 1 2 4 5 9
# 2 4 NA 9 9
# 3 6 4 4 4
# 4 9 9 2 8
# 5 2 3 NA 10
# 6 9 5 1 4
Let's make a copy and replace NA with the column means.
df2 <- df
df2[] <- lapply(df2, function(x) { x[is.na(x)] <- mean(x, na.rm=TRUE); x })
head(df2)
# X1 X2 X3 X4
# 1 2 4.000000 5 9
# 2 4 5.956522 9 9
# 3 6 4.000000 4 4
# 4 9 9.000000 2 8
# 5 2 3.000000 5 10
# 6 9 5.000000 1 4
Verify the correct values were inserted. Compare df2[2, 2] with the following:
mean(df$X2, na.rm = TRUE)
# [1] 5.956522

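Inside the function, the argument x is a local copy of the data you passed in, so the assignments never reach the original object. You need to return the modified value and assign the result back in the caller (for example, df <- mi(df)):
mi <- function(x){
  for( col in 1:ncol(x)){
    for( row in 1:nrow(x)){
      ifelse(is.na(x[row, col]), x[row,col] <- mean(x[, col], na.rm = TRUE), x[row, col])
    }
  }
  return(x)
}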
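As a minimal sketch of the point above: the function below (impute_col_means is a hypothetical name, not from the original post) fills each column's NAs in one vectorised assignment and returns the modified copy. The caller has to capture the return value, because the object passed in is left untouched.

```r
# Hypothetical helper: replace NAs in every column with that column's mean.
impute_col_means <- function(x) {
  for (col in seq_len(ncol(x))) {
    col_mean <- mean(x[, col], na.rm = TRUE)
    x[is.na(x[, col]), col] <- col_mean   # modifies the local copy only
  }
  x                                       # return the modified copy
}

d <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 6))
filled <- impute_col_means(d)   # capture the return value
filled$a                        # 1 2 3
filled$b                        # 5 4 6
anyNA(d)                        # TRUE: the original is unchanged
```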

Or like this:
x <- matrix(sample(c(NA,1:10),100,TRUE),nrow=10)
x
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 7 7 1 6 7 3 10 4 NA 2
[2,] 3 2 7 9 1 4 2 5 10 1
[3,] 10 4 2 8 7 4 1 8 8 3
[4,] 7 7 6 9 2 6 NA 6 6 10
[5,] 1 NA 5 9 9 4 NA 5 8 2
[6,] 4 4 9 3 9 4 5 NA 5 1
[7,] NA 2 2 2 9 2 10 NA 8 7
[8,] 10 8 7 1 5 2 9 7 10 5
[9,] 6 3 10 9 8 6 7 10 3 10
[10,] 7 9 5 2 2 9 5 6 NA 9
means <- colMeans(x,na.rm=TRUE)
for(i in 1:ncol(x)){
  x[is.na(x[,i]),i] <- means[i]
}
x
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 7.000000 7.000000 1 6 7 3 10.000 4.000 7.25 2
[2,] 3.000000 2.000000 7 9 1 4 2.000 5.000 10.00 1
[3,] 10.000000 4.000000 2 8 7 4 1.000 8.000 8.00 3
[4,] 7.000000 7.000000 6 9 2 6 6.125 6.000 6.00 10
[5,] 1.000000 5.111111 5 9 9 4 6.125 5.000 8.00 2
[6,] 4.000000 4.000000 9 3 9 4 5.000 6.375 5.00 1
[7,] 6.111111 2.000000 2 2 9 2 10.000 6.375 8.00 7
[8,] 10.000000 8.000000 7 1 5 2 9.000 7.000 10.00 5
[9,] 6.000000 3.000000 10 9 8 6 7.000 10.000 3.00 10
[10,] 7.000000 9.000000 5 2 2 9 5.000 6.000 7.25 9
This is not exactly what you are looking for, but it might be useful: na.roughfix() from the randomForest package replaces every NA in a numeric column with that column's median:
library(randomForest)
x <- matrix(sample(c(NA,1:10),100,TRUE),nrow=10)
na.roughfix(x)
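If you would rather not depend on a package, median imputation can be sketched in base R as follows. The name fill_with_median and the small matrix are made up for illustration; for a numeric matrix the result should correspond to what na.roughfix() returns, though that equivalence is an assumption worth checking on your own data.

```r
# Hypothetical helper: replace NAs in each column of a numeric matrix
# with the median of that column's observed values.
fill_with_median <- function(m) {
  apply(m, 2, function(v) {
    v[is.na(v)] <- median(v, na.rm = TRUE)
    v
  })
}

m <- matrix(c(1, NA, 3, 10,
              2, 4, NA, 6), ncol = 2)   # filled column by column
out <- fill_with_median(m)
out[, 1]   # 1 3 3 10
out[, 2]   # 2 4 4 6
```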

Related

Number of observations used by cor function in R

I have a big matrix in R with more than 2000 columns and 10,000 rows, and many missing values. This line of code calculates the correlation matrix in R.
cor(data, use = "complete.obs")
My question is: how can I find the number of observations that have been used to calculate each correlation in the output matrix?
The output should be something like this:
   v1 v2 v3 v4
v1 20 12 15 18
v2 12 15 10 11
v3 15 10 25 20
v4 18 11 20 20
Thanks for any suggestion
Let's use a sample matrix, data, with NAs at random positions:
library(dplyr)
set.seed(1234)
data <- rnorm(100) %>%
  matrix(nrow = 10) %>%
  {
    m <- .
    m[rnorm(100) > .5] <- NA
    m
  }
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.48522682 NA 0.8951720 -0.32439330 0.05913517 0.4369306
[2,] 0.69676878 -0.4002352 0.6602126 NA 0.41339889 NA
[3,] 0.18551392 1.4934931 2.2734835 -0.93350334 NA 0.4521904
[4,] NA -1.6070809 1.1734976 NA NA 0.6631986
[5,] 0.31168103 -0.4157518 0.2877097 0.31916024 0.71888873 -1.1363736
[6,] 0.76046236 NA -0.6597701 -1.07754212 NA NA
[7,] 1.84246363 -0.1517365 NA -3.23315213 1.35727444 NA
[8,] NA NA 0.6774155 NA 0.40446847 -1.2239038
[9,] 0.03266396 -0.3047211 NA 0.02951783 0.26436427 0.2580684
[10,] NA 0.6295361 0.1864921 0.59427377 0.26804390 NA
[,7] [,8] [,9] [,10]
[1,] NA -0.3046139 -1.0118219 NA
[2,] NA 1.8250111 0.4701675 0.1832475
[3,] 0.1586254 0.6705594 -0.7009703 -1.7662292
[4,] -1.7632551 0.9486326 NA NA
[5,] 0.3385960 2.0494030 NA NA
[6,] NA -0.6511136 NA NA
[7,] -0.2386466 0.8086193 NA -1.1750368
[8,] -1.1877653 0.9865806 -0.2457632 NA
[9,] 0.3849353 NA -1.5528590 0.3536254
[10,] NA 0.3190524 0.1284340 0.3191562
You can transform it into a logical matrix dna where dna[i,j] == TRUE means that data[i,j] is not NA:
dna <- !is.na(data)
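cor() correlates columns, so the number of observations behind each correlation is the number of rows in which both columns are non-missing. That is the cross-product of dna with itself:
crossprod(dna) # same as t(dna) %*% dna
These are the counts for use = "pairwise.complete.obs". With use = "complete.obs", every correlation is computed from the same rows, sum(complete.cases(data)) of them.
The answer as originally posted multiplied in the other order, which counts the shared non-missing columns for each pair of rows and has the expected shape only because the sample matrix is square. Its output was:
dna %*% t(dna)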
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 8 7 4 6 4 3 4 8 5 7
[2,] 7 9 6 6 5 4 6 8 6 8
[3,] 4 6 6 4 4 3 4 5 4 5
[4,] 6 6 4 7 3 3 3 6 5 6
[5,] 4 5 4 3 5 2 4 5 3 4
[6,] 3 4 3 3 2 5 4 4 3 5
[7,] 4 6 4 3 4 4 6 5 4 6
[8,] 8 8 5 6 5 4 5 9 5 8
[9,] 5 6 4 5 3 3 4 5 6 5
[10,] 7 8 5 6 4 5 6 8 5 9
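Because cor() works on pairs of columns, the count for columns j and k is the number of rows where both are observed. A small sketch on a non-square matrix makes the orientation easy to check; the matrix m below is made up for illustration.

```r
m <- matrix(c( 1, NA,  3,  4,
               2,  5, NA,  7,
              NA,  1,  2,  3), nrow = 4)   # 4 rows, 3 columns, filled by column

dna   <- (!is.na(m)) + 0L    # 1 where observed, 0 where missing
n_obs <- crossprod(dna)      # t(dna) %*% dna: one count per pair of columns
n_obs                        # 3 x 3: 3 on the diagonal, 2 off the diagonal

# Cross-check one pair against complete.cases()
n_obs[1, 2] == sum(complete.cases(m[, c(1, 2)]))   # TRUE
```

The diagonal holds the number of non-missing values in each column, and the result always has ncol(m) rows and columns, matching the shape of the correlation matrix.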

Efficient creation of a matrix of offsets

Goal
I want to use a long vector of numbers, to create a matrix where each column is a successive offset (lag or lead) of the original vector. If n is the maximum offset, the matrix will have dimensions [length(vector), n * 2 + 1] (because we want offsets in both directions, and include the 0 offset, i.e. the original vector).
Example
To illustrate, consider the following vector:
test <- c(2, 8, 1, 10, 7, 5, 9, 3, 4, 6)
[1] 2 8 1 10 7 5 9 3 4 6
Expected output
Now we create offsets of values, let's say for n == 3:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] NA NA NA 2 8 1 10
[2,] NA NA 2 8 1 10 7
[3,] NA 2 8 1 10 7 5
[4,] 2 8 1 10 7 5 9
[5,] 8 1 10 7 5 9 3
[6,] 1 10 7 5 9 3 4
[7,] 10 7 5 9 3 4 6
[8,] 7 5 9 3 4 6 NA
[9,] 5 9 3 4 6 NA NA
[10,] 9 3 4 6 NA NA NA
I am looking for an efficient solution. data.table or tidyverse solutions more than welcome.
Returning only the rows that have no NA's (i.e. rows 4 to 7) is also ok.
Current solution
lags <- lapply(3:1, function(x) dplyr::lag(test, x))
leads <- lapply(1:3, function(x) dplyr::lead(test, x))
l <- c(lags, test, leads)
matrix(unlist(l), nrow = length(test))
In base R, you can use embed to get rows 4 through 7. You have to reverse the column order, however.
embed(test, 7)[, 7:1]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 2 8 1 10 7 5 9
[2,] 8 1 10 7 5 9 3
[3,] 1 10 7 5 9 3 4
[4,] 10 7 5 9 3 4 6
data
test <- c(2, 8, 1, 10, 7, 5, 9, 3, 4, 6)
This will produce what you need...
n <- 3
t(embed(c(rep(NA,n), test, rep(NA,n)), length(test)))[length(test):1,]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] NA NA NA 2 8 1 10
[2,] NA NA 2 8 1 10 7
[3,] NA 2 8 1 10 7 5
[4,] 2 8 1 10 7 5 9
[5,] 8 1 10 7 5 9 3
[6,] 1 10 7 5 9 3 4
[7,] 10 7 5 9 3 4 6
[8,] 7 5 9 3 4 6 NA
[9,] 5 9 3 4 6 NA NA
[10,] 9 3 4 6 NA NA NA
This can be solved by constructing the matrix from a long vector and returning only the wanted columns and rows:
test <- c(2, 8, 1, 10, 7, 5, 9, 3, 4, 6)
n_offs <- 3L
n_row <- length(test) + n_offs + 1L
matrix(rep(c(rep(NA, n_offs), test), n_row), nrow = n_row)[1:length(test), 1:(n_offs * 2L + 1L)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] NA NA NA 2 8 1 10
[2,] NA NA 2 8 1 10 7
[3,] NA 2 8 1 10 7 5
[4,] 2 8 1 10 7 5 9
[5,] 8 1 10 7 5 9 3
[6,] 1 10 7 5 9 3 4
[7,] 10 7 5 9 3 4 6
[8,] 7 5 9 3 4 6 NA
[9,] 5 9 3 4 6 NA NA
[10,] 9 3 4 6 NA NA NA
A variant which just returns the same result as embed(test, 7)[, 7:1] is:
matrix(rep(test, length(test) + 1L), nrow = length(test) + 1L)[
seq_len(length(test) - 2L * n_offs), seq_len(n_offs * 2L + 1L)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 2 8 1 10 7 5 9
[2,] 8 1 10 7 5 9 3
[3,] 1 10 7 5 9 3 4
[4,] 10 7 5 9 3 4 6
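Another way to sketch the same offset matrix in base R is to build a matrix of positions first and then look the values up. The names idx, offsets and complete are chosen here for illustration and are not from the original answers.

```r
test <- c(2, 8, 1, 10, 7, 5, 9, 3, 4, 6)
n <- 3L

idx <- outer(seq_along(test), -n:n, "+")     # idx[i, j] = i + (j-th offset)
idx[idx < 1L | idx > length(test)] <- NA     # positions outside the vector
offsets <- matrix(test[idx], nrow = length(test))
offsets[1, ]   # NA NA NA 2 8 1 10
offsets[4, ]   # 2 8 1 10 7 5 9

# Keep only the rows without NA (rows 4 to 7)
complete <- offsets[complete.cases(offsets), ]
```

Indexing with NA yields NA, which is what pads the first and last n rows.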

How to fill a matrix with a for loop, using the same value within each column

I have this matrix:
mat_A <- matrix(ncol=7,nrow=12)
I would like to fill mat_A so that each column holds a single repeated value, with the columns taking the values 5 to 11. The expected result is:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 5 6 7 8 9 10 11
[2,] 5 6 7 8 9 10 11
[3,] 5 6 7 8 9 10 11
[4,] 5 6 7 8 9 10 11
[5,] 5 6 7 8 9 10 11
[6,] 5 6 7 8 9 10 11
[7,] 5 6 7 8 9 10 11
[8,] 5 6 7 8 9 10 11
[9,] 5 6 7 8 9 10 11
[10,] 5 6 7 8 9 10 11
[11,] 5 6 7 8 9 10 11
[12,] 5 6 7 8 9 10 11
I know that I can fill it column by column, like this:
mat_A[,1] <- 5
....
mat_A[,7] <- 11
But how can I do this with a for loop?
I tried:
pippo <- rep(5:11,each=12)
for(j in 1:ncol(mat_A)){
mat_A[j,] <- pippo
}
but I get this error:
Error in mat_A[j, ] <- pippo :
number of items to replace is not a multiple of replacement length
Any idea?
You don't need a loop. Try
mat_A <- matrix(ncol=7,nrow=12)
mat_A <- col(mat_A)+4
mat_A
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 5 6 7 8 9 10 11
# [2,] 5 6 7 8 9 10 11
# [3,] 5 6 7 8 9 10 11
# [4,] 5 6 7 8 9 10 11
# [5,] 5 6 7 8 9 10 11
# [6,] 5 6 7 8 9 10 11
# [7,] 5 6 7 8 9 10 11
# [8,] 5 6 7 8 9 10 11
# [9,] 5 6 7 8 9 10 11
#[10,] 5 6 7 8 9 10 11
#[11,] 5 6 7 8 9 10 11
#[12,] 5 6 7 8 9 10 11
Alternatively, if you want to use the loop as described in the OP, the code works after two modifications:
1. remove each=12, so that pippo has 7 elements (one per column) instead of 84, and
2. loop over the rows, not the columns, because mat_A[j,] addresses row j.
Therefore, this works, too:
pippo <- rep(5:11)
for(j in 1:nrow(mat_A)){
  mat_A[j,] <- pippo
}
The matrix function has a byrow argument, which can be combined with R's recycling behavior for this purpose:
matrix(5:11,ncol=7,nrow=12,byrow=TRUE)
You can simply construct the matrix:
mat_A <- matrix(rep(5:11, each=12), 12)
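As a quick sanity check, the approaches suggested so far can be compared directly. The variable names below are chosen for illustration. The sketch also shows why the loop in the question failed: its replacement vector has 84 elements, while one row of the matrix holds only 7.

```r
length(rep(5:11, each = 12))   # 84, too long for a 7-element row

by_row   <- matrix(5:11, ncol = 7, nrow = 12, byrow = TRUE)
by_rep   <- matrix(rep(5:11, each = 12), 12)
col_plus <- col(matrix(ncol = 7, nrow = 12)) + 4

loop_mat <- matrix(ncol = 7, nrow = 12)
for (j in seq_len(nrow(loop_mat))) loop_mat[j, ] <- 5:11

# All four constructions hold the same values
all(by_row == by_rep, by_row == col_plus, by_row == loop_mat)   # TRUE
```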
Here are the microbenchmark results for the approaches above:
> library(microbenchmark)
> microbenchmark(
+ by.row= matrix(5:11,ncol=7,nrow=12,byrow=TRUE),
+ rep=matrix(rep(5:11, each=12), 12),
+ col.plus=col(matrix(ncol=7,nrow=12))+4,
+ loop={mat_A <- matrix(ncol=7,nrow=12); pippo <- rep(5:11); for(j in 1:nrow(mat_A)) mat_A[j,] <- pippo }
+ )
Unit: microseconds
expr min lq mean median uq max neval cld
by.row 2.681 2.9505 3.27668 3.0955 3.3025 14.087 100 a
rep 3.780 4.0580 4.26584 4.2170 4.3485 5.707 100 ab
col.plus 4.230 4.5000 4.81078 4.6905 4.8680 10.853 100 b
loop 17.946 18.4055 19.87737 18.6970 19.1745 65.719 100 c

Difference between rows in a data frame with NAs

My sample data looks like this
DF
n a b c d
1 NA NA NA NA
2 1 2 3 4
3 5 6 7 8
4 9 NA 11 12
5 NA NA NA NA
6 4 5 6 NA
7 8 9 10 11
8 12 13 15 16
9 NA NA NA NA
I need to subtract row 2 from row 3 and from row 4.
Similarly, I need to subtract row 6 from row 7 and from row 8.
My real data is huge; is there a way of doing this automatically? It seems a for loop could do it, but as I am an R beginner my attempts were not successful.
Thank you for any help and tips.
UPDATE
I want to achieve something like this
DF2
rowN1<-DF$row3-DF$row2
rowN2<-DF$row4-DF$row2
rowN3<-DF$row7-DF$row6 # there is NA in row 6 so after subtracting there should be NA also
rowN4<-DF$row8-DF$row6
Here's one idea, shown on a sample matrix in which the base rows sit at positions 2, 6, 10, and so on:
set.seed(1)
(m <- matrix(sample(c(1:9, NA), 60, TRUE), ncol=5))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 3 7 3 8 8
# [2,] 4 4 4 2 7
# [3,] 6 8 1 8 5
# [4,] NA 5 4 5 9
# [5,] 3 8 9 9 5
# [6,] 9 NA 4 7 3
# [7,] NA 4 5 8 1
# [8,] 7 8 6 6 1
# [9,] 7 NA 5 6 4
# [10,] 1 3 2 8 6
# [11,] 3 7 9 1 7
# [12,] 2 2 7 5 5
idx <- seq(2, nrow(m)-2, 4)
do.call(rbind, lapply(idx, function(x) {
  rbind(m[x+1, ]-m[x, ], m[x+2, ]-m[x, ])
}))
# [1,] 2 4 -3 6 -2
# [2,] NA 1 0 3 2
# [3,] NA NA 1 1 -2
# [4,] -2 NA 2 -1 -2
# [5,] 2 4 7 -7 1
# [6,] 1 -1 5 -3 -1
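Applied to the data from the question, the same idea can be sketched as follows. The row labels rowN1 to rowN4 follow the naming in the question's update; base_rows is a name chosen here, and the sketch assumes the base rows always sit four rows apart, as they do in the sample.

```r
DF <- data.frame(
  a = c(NA, 1, 5,  9, NA,  4,  8, 12, NA),
  b = c(NA, 2, 6, NA, NA,  5,  9, 13, NA),
  c = c(NA, 3, 7, 11, NA,  6, 10, 15, NA),
  d = c(NA, 4, 8, 12, NA, NA, 11, 16, NA)
)

m <- as.matrix(DF)
base_rows <- seq(2, nrow(m) - 2, by = 4)     # rows 2 and 6

# For each base row, subtract it from the two rows that follow it
DF2 <- do.call(rbind, lapply(base_rows, function(i) {
  rbind(m[i + 1, ] - m[i, ], m[i + 2, ] - m[i, ])
}))
rownames(DF2) <- paste0("rowN", seq_len(nrow(DF2)))
DF2
```

Any NA in either row propagates into the difference, so column d of rowN3 and rowN4 is NA because row 6 has a missing d.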

Omit inf from row sum in R

I am trying to sum the rows of a matrix that contains Inf values. How do I sum each row while omitting the Infs?
Multiply your matrix by the result of is.finite(m) and call rowSums on the product with na.rm=TRUE. This works because Inf*0 is NaN, which na.rm=TRUE then drops.
m <- matrix(c(1:3,Inf,4,Inf,5:6),4,2)
rowSums(m*is.finite(m),na.rm=TRUE)
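To see each step of the multiplication trick, here is a short sketch on the same matrix (masked and row_totals are names chosen for illustration). One caveat: any NA or NaN already present in the matrix is dropped as well, since na.rm=TRUE cannot tell it apart from a masked Inf.

```r
m <- matrix(c(1:3, Inf, 4, Inf, 5:6), 4, 2)

Inf * 0                       # NaN
masked <- m * is.finite(m)    # finite values kept, Inf (and -Inf) become NaN
row_totals <- rowSums(masked, na.rm = TRUE)
row_totals                    # 5 2 8 6
```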
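Another option is to set the infinite entries to NA first and then sum with na.rm=TRUE (A is your matrix):
A[is.infinite(A)]<-NA
rowSums(A,na.rm=TRUE)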
Some benchmarking for comparison:
library(microbenchmark)
rowSumsMethod<-function(A){
  A[is.infinite(A)]<-NA
  rowSums(A,na.rm=TRUE)
}
applyMethod<-function(A){
  apply( A , 1 , function(x){ sum(x[!is.infinite(x)])})
}
rowSumsMethod2<-function(m){
  rowSums(m*is.finite(m),na.rm=TRUE)
}
rowSumsMethod0<-function(A){
  A[is.infinite(A)]<-0
  rowSums(A)
}
A1 <- matrix(sample(c(1:5, Inf), 50, TRUE), ncol=5)
A2 <- matrix(sample(c(1:5, Inf), 5000, TRUE), ncol=5)
microbenchmark(rowSumsMethod(A1),rowSumsMethod(A2),
rowSumsMethod0(A1),rowSumsMethod0(A2),
rowSumsMethod2(A1),rowSumsMethod2(A2),
applyMethod(A1),applyMethod(A2))
Unit: microseconds
expr min lq median uq max neval
rowSumsMethod(A1) 13.063 14.9285 16.7950 19.3605 1198.450 100
rowSumsMethod(A2) 212.726 220.8905 226.7220 240.7165 307.427 100
rowSumsMethod0(A1) 11.663 13.9960 15.3950 18.1940 112.894 100
rowSumsMethod0(A2) 103.098 109.6290 114.0610 122.9240 159.545 100
rowSumsMethod2(A1) 8.864 11.6630 12.5960 14.6955 49.450 100
rowSumsMethod2(A2) 57.380 60.1790 63.4450 67.4100 81.172 100
applyMethod(A1) 78.839 84.4380 92.1355 99.8330 181.005 100
applyMethod(A2) 3996.543 4221.8645 4338.0235 4552.3825 6124.735 100
So Joshua's multiplication method (rowSumsMethod2) wins! And the apply method is clearly slower than the other three (relatively speaking, of course).
I'd use apply and is.infinite in order to avoid replacing Inf values by NA as in @Hemmo's answer.
> set.seed(1)
> Mat <- matrix(sample(c(1:5, Inf), 50, TRUE), ncol=5)
> Mat # this is an example
[,1] [,2] [,3] [,4] [,5]
[1,] 2 2 Inf 3 5
[2,] 3 2 2 4 4
[3,] 4 5 4 3 5
[4,] Inf 3 1 2 4
[5,] 2 5 2 5 4
[6,] Inf 3 3 5 5
[7,] Inf 5 1 5 1
[8,] 4 Inf 3 1 3
[9,] 4 3 Inf 5 5
[10,] 1 5 3 3 5
> apply(Mat, 1, function(x) sum(x[!is.infinite(x)]))
[1] 12 15 21 10 18 16 12 11 17 17
Try this...
m <- c( 1 ,2 , 3 , Inf , 4 , Inf ,5 )
sum(m[!is.infinite(m)])
Or
m <- matrix( sample( c(1:10 , Inf) , 100 , replace = TRUE ) , nrow = 10 )
sums <- apply( m , 1 , FUN = function(x){ sum(x[!is.infinite(x)])})
> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 8 9 7 Inf 9 2 2 6 1 Inf
[2,] 8 7 4 5 9 5 8 4 7 10
[3,] 7 9 3 4 7 3 3 6 9 4
[4,] 7 Inf 2 6 4 8 3 1 9 9
[5,] 4 Inf 7 5 9 5 3 5 9 9
[6,] 7 3 7 Inf 7 3 7 3 7 1
[7,] 5 7 2 1 Inf 1 9 8 1 5
[8,] 4 Inf 10 Inf 8 10 4 9 7 2
[9,] 10 7 9 7 2 Inf 4 Inf 4 6
[10,] 9 4 6 3 9 6 6 5 1 8
> sums
[1] 44 67 55 49 56 45 39 54 49 57
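This is a "non-apply" and non-destructive approach. Note that match() returns positions in the lookup vector rather than values, so the finite values have to be indexed by its result (Inf entries find no match and become NA):
fin <- A[is.finite(A)]
rowSums( matrix(fin[match(A, fin)], nrow(A)), na.rm=TRUE)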
[1] 2 4
Although it is reasonably efficient, it is not as fast as Joshua's multiplication method.
