Adding data by row into an empty matrix and handling missing data - r

I have an empty matrix with a fixed number of columns that I'm trying to fill row by row with the output vectors of a for loop. However, some of the outputs are shorter than the number of columns reserved for them, and I just want to fill those "empty spaces" with NAs.
For example:
matrix.names <- c("x1", "x2", "x3", "x4", "y1", "y2", "y3", "y4", "z1", "z2", "z3", "z4")
my.matrix <- matrix(ncol = length(matrix.names))
colnames(my.matrix) <- matrix.names
This would be the output from one iteration:
x <- c(1,2)
y <- c(4,2,1,5)
z <- c(1)
Where I would want it in the matrix like this:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
The output from the next iteration would be, for example:
x <- c(1,1,1,1)
y <- c(0,4)
z <- c(4,1,3)
And added as a new row in the matrix:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
[2,] 1 1 1 1 0 4 NA NA 4 1 3 NA
It's not really a concern if I have a 0, it's just where there is no data. Also, the data is saved in such a way that whatever is there is listed in the row first, followed by NAs in empty slots. In other words, I'm not worried if an NA may pop up first.
Also, is such a thing better handled in data frames rather than matrices?

Not the most efficient answer, just an attempt.
Logic: extend each vector with four NAs, then take only the first four elements when rbinding. The subsetting matters because a vector that already has length 4 ends up with length 8 after the extension.
x[length(x)+1:4] <- NA
y[length(y)+1:4] <- NA
z[length(z)+1:4] <- NA
my.matrix <- rbind(my.matrix,c(x[1:4],y[1:4],z[1:4]))
Note: the exception mentioned above looks like this:
> x <- c(1,1,1,1)
> x
[1] 1 1 1 1
> x[length(x)+1:4] <- NA
> x
[1] 1 1 1 1 NA NA NA NA # therefore I extract only the first four
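A more compact way to do the same padding, sketched here under the assumption that each iteration yields three vectors of at most four elements, is to set the length directly with `length<-` and bind all rows once after the loop. The helper name `pad` and the `iterations` list are hypothetical stand-ins for the loop output.

```r
# Hypothetical helper: pad (or truncate) a vector to exactly n elements.
pad <- function(v, n = 4) {
  length(v) <- n  # `length<-` fills any new slots with NA
  v
}

matrix.names <- c(paste0("x", 1:4), paste0("y", 1:4), paste0("z", 1:4))

# Stand-in for the results of two loop iterations (values from the question).
iterations <- list(
  list(x = c(1, 2),       y = c(4, 2, 1, 5), z = c(1)),
  list(x = c(1, 1, 1, 1), y = c(0, 4),       z = c(4, 1, 3))
)

# Build one padded row per iteration, then bind them in a single call.
rows <- lapply(iterations, function(it) c(pad(it$x), pad(it$y), pad(it$z)))
my.matrix <- do.call(rbind, rows)
colnames(my.matrix) <- matrix.names
my.matrix
```

Collecting the rows in a list and calling rbind once avoids growing the matrix inside the loop.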

Here is an option to do this programmatically
d1 <- stack(mget(c("x", "y", "z")))[2:1]
nm <- with(d1, paste0(ind, ave(seq_along(ind),ind, FUN = seq_along)))
my.matrix[,match(nm,colnames(my.matrix), nomatch = 0)] <- d1$values
my.matrix
# x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
#[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
Or another option is stri_list2matrix from stringi
library(stringi)
m1 <- as.numeric(stri_list2matrix(list(x,y, z)))
Then change the 'x', 'y', 'z' values (the next iteration) and repeat:
m2 <- as.numeric(stri_list2matrix(list(x,y, z)))
rbind(m1, m2)
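One detail of the question's setup is worth checking: `matrix(ncol = n)` is not actually empty, it contains one row of NA. The following sketch (variable names such as `zero_row` and `new_row` are illustrative, not from the original post) contrasts it with a matrix that really has zero rows, which matters when rows are appended with rbind.

```r
matrix.names <- c(paste0("x", 1:4), paste0("y", 1:4), paste0("z", 1:4))

# As in the question: this matrix already holds one row of NA.
one_row <- matrix(ncol = length(matrix.names))

# A matrix with zero rows and the same number of columns.
zero_row <- matrix(nrow = 0, ncol = length(matrix.names),
                   dimnames = list(NULL, matrix.names))

new_row <- c(1, 2, NA, NA, 4, 2, 1, 5, 1, NA, NA, NA)

with_na_row <- rbind(one_row, new_row)   # keeps the initial all-NA row
clean       <- rbind(zero_row, new_row)  # contains only the appended row

dim(one_row)
dim(zero_row)
nrow(with_na_row)
nrow(clean)
```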

Related

how to add a new row with extra column in R?

I was trying to add the results of a for loop to a data frame as new rows, but I get an error when a new result has more columns than the original data frame. How can I add such a result so that the extra column names are also added to the original data frame?
e.g.
original dataframe:
   A B C
x1 1 1 1
x2 2 2 2
x3 3 3 3
I want to get
   A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
X4 4 4 4 4
I tried rbind (Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match),
rbind.fill (Error: All inputs to rbind.fill must be data.frames)
and bind_rows (Argument 2 must have names).
In base R, this can be done by creating a new column 'D' filled with NA and then assigning a new row with the value 4.
df1$D <- NA
df1['x4', ] <- 4
Output:
> df1
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Or in a single line
rbind(cbind(df1, D = NA), x4 = 4)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Regarding the error in bind_rows, it happens when the for loop output is not a named vector
library(dplyr)
> vec1 <- c(4, 4, 4, 4)
> bind_rows(df1, vec1)
Error: Argument 2 must have names.
Run `rlang::last_error()` to see where the error occurred.
If it is a named vector, then it should work
> vec1 <- c(A = 4, B = 4, C = 4, D = 4)
> bind_rows(df1, vec1)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
...4 4 4 4 4
data
df1 <- structure(list(A = 1:3, B = 1:3, C = 1:3),
class = "data.frame", row.names = c("x1",
"x2", "x3"))
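If the loop can yield named vectors whose columns vary from iteration to iteration, the base R approach above can be wrapped in a small function. This is only a sketch: `add_row` is a hypothetical helper, not part of base R or dplyr, and it assumes every new result is a named atomic vector.

```r
df1 <- data.frame(A = 1:3, B = 1:3, C = 1:3,
                  row.names = c("x1", "x2", "x3"))

# Hypothetical helper: append a named vector as a new row, adding any
# column that is missing on either side as NA before binding.
add_row <- function(df, new, row_name) {
  new_df <- as.data.frame(as.list(new))
  rownames(new_df) <- row_name
  for (nm in setdiff(names(new_df), names(df))) df[[nm]] <- NA
  for (nm in setdiff(names(df), names(new_df))) new_df[[nm]] <- NA
  rbind(df, new_df[names(df)])  # same column order on both sides
}

out <- add_row(df1, c(A = 4, B = 4, C = 4, D = 4), "x4")
out
```

Unlike bind_rows, this keeps the row name supplied for the new row.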
You probably have something like this, if you list the elements of your for loop.
(l <- list(x1, x2, x3, x4, x5))
# [[1]]
# [1] 1 1 1
#
# [[2]]
# [1] 2 2 2 2
#
# [[3]]
# [1] 3 3
#
# [[4]]
# [1] 4
#
# [[5]]
# NULL
Multiple elements can be row-bound using a do.call(rbind, .) approach. Your problem is how to rbind multiple elements that differ in length.
There's a `length<-` function with which you may adjust the length of a vector. To find the target length, there's another function, lengths, that gives you the length of each list element; you are interested in the maximum.
I include the special case where an element is NULL, i.e. has length zero (the 5th element of l). Since the length of NULL cannot be changed, those elements are replaced with NA.
So altogether you may do:
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, max(lengths(l))))
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 NA
# [2,] 2 2 2 2
# [3,] 3 3 NA NA
# [4,] 4 NA NA NA
# [5,] NA NA NA NA
Or, since you probably want a data frame with pretty row and column names:
ml <- max(lengths(l))
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, ml)) |>
as.data.frame() |> `dimnames<-`(list(paste0('x', 1:length(l)), LETTERS[1:ml]))
# A B C D
# x1 1 1 1 NA
# x2 2 2 2 2
# x3 3 3 NA NA
# x4 4 NA NA NA
# x5 NA NA NA NA
Note: R >= 4.1 used.
Data:
x1 <- rep(1, 3); x2 <- rep(2, 4); x3 <- rep(3, 2); x4 <- rep(4, 1); x5 <- NULL

Combine matrices in list by identifier column [duplicate]

The problem is to join the matrices in a set by an identifier column.
It can be expressed in the following form:
Setup
mat1 <- data.frame(matrix(nrow=4, ncol =3, rnorm(12,0,1)))
mat2 <- data.frame(matrix(nrow =5, ncol=3, rnorm(15,0,1)))
mat3 <- data.frame(matrix(nrow=3, ncol =3, rnorm(9,0,1)))
mat4 <- data.frame(matrix(nrow =6, ncol =3, rnorm(18,0,1)))
colnames(mat1) = colnames(mat2) = colnames(mat3) = colnames(mat4) <- c("Code", "x1", "x2")
mat1$Code <- c(1,2,3,4)
mat2$Code <- c(2,3,4,5,6)
mat3$Code <- c(6,7,8)
mat4$Code <- c(1,2,3,4,5,6)
mat_set <- list(mat1, mat2, mat3, mat4)  # list(), not c(): c() would flatten the data frames into one list of columns
> mat1
Code x1 x2
1 1 0.6425172 -1.9404704
2 2 -0.1278021 0.8485476
3 3 -0.5525808 -0.9060624
4 4 -1.3013592 0.7350129
> mat2
Code x1 x2
1 2 -0.06543585 -1.1244444
2 3 0.03773743 -0.8124004
3 4 3.53421807 -0.4935844
4 5 0.56686927 0.3433276
5 6 0.41849489 0.8782866
> mat3
Code x1 x2
1 6 1.0821070 0.08006585
2 7 0.1038577 0.61057716
3 8 2.7002036 0.19693561
> mat4
Code x1 x2
1 1 -0.1188262 0.6338566
2 2 0.6128098 1.3759910
3 3 -1.3504901 -0.2830859
4 4 -1.2153638 -1.1611660
5 5 -1.7420065 0.2470048
6 6 -0.9786468 -1.2214594
I then want to bind all matrices in the set by column, matched on "Code", preserving the ordering. This yields output of the form:
output <- data.frame(matrix(nrow = 8, ncol =9))
output[,1] <- c(1,2,3,4,5,6,7,8)
output[,2] <- c(mat1$x1, NA, NA,NA,NA)
output[,3] <- c(mat1$x2, NA,NA,NA,NA)
output[,4] <- c(NA, mat2$x1, NA, NA)
output[,5] <- c(NA, mat2$x2, NA, NA)
output[,6] <- c(NA,NA,NA,NA,NA,mat3$x1)
output[,7] <- c(NA,NA,NA,NA,NA,mat3$x2)
output[,8] <- c(mat4$x1, NA,NA)
output[,9] <- c(mat4$x2, NA,NA)
output
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 0.6425172 -1.9404704 NA NA NA NA -0.1188262 0.6338566
2 2 -0.1278021 0.8485476 -0.06543585 -1.1244444 NA NA 0.6128098 1.3759910
3 3 -0.5525808 -0.9060624 0.03773743 -0.8124004 NA NA -1.3504901 -0.2830859
4 4 -1.3013592 0.7350129 3.53421807 -0.4935844 NA NA -1.2153638 -1.1611660
5 5 NA NA 0.56686927 0.3433276 NA NA -1.7420065 0.2470048
6 6 NA NA 0.41849489 0.8782866 1.0821070 0.08006585 -0.9786468 -1.2214594
7 7 NA NA NA NA 0.1038577 0.61057716 NA NA
8 8 NA NA NA NA 2.7002036 0.19693561 NA NA
A final point is that the code must be replicable over a large set of matrices. Thanks!
You can use merge in Reduce
out <- Reduce(function(x, y) merge(x, y, by = "Code", all = TRUE), mat_set)
colnames(out) <- paste0("x", seq_along(out))
out
# x1 x2 x3 x4 x5 x6 x7 x8 x9
#1 1 0.4291247 -0.5644520 NA NA NA NA -0.8553646 -0.5238281
#2 2 0.5060559 -0.8900378 -0.9111954 -0.4405479 NA NA -0.2806230 -0.4968500
#3 3 -0.5747400 -0.4771927 -0.8371717 0.4595894 NA NA -0.9943401 -1.8060313
#4 4 -0.5466319 -0.9983864 2.4158352 -0.6937202 NA NA -0.9685143 -0.5820759
#5 5 NA NA 0.1340882 -1.4482049 NA NA -1.1073182 -1.1088896
#6 6 NA NA -0.4906859 0.5747557 1.1022975 -0.5012581 -1.2519859 -1.0149620
#7 7 NA NA NA NA -0.4755931 -1.6290935 NA NA
#8 8 NA NA NA NA -0.7094400 -1.1676193 NA NA
data
set.seed(1234)
mat1 <- data.frame(matrix(nrow=4, ncol =3, rnorm(12,0,1)))
mat2 <- data.frame(matrix(nrow =5, ncol=3, rnorm(15,0,1)))
mat3 <- data.frame(matrix(nrow=3, ncol =3, rnorm(9,0,1)))
mat4 <- data.frame(matrix(nrow =6, ncol =3, rnorm(18,0,1)))
colnames(mat1) = colnames(mat2) = colnames(mat3) = colnames(mat4) <- c("Code", "x1", "x2")
mat1$Code <- c(1,2,3,4)
mat2$Code <- c(2,3,4,5,6)
mat3$Code <- c(6,7,8)
mat4$Code <- c(1,2,3,4,5,6)
mat_set <- list(mat1, mat2, mat3, mat4)
As noted by @G. Grothendieck, the above code gives a warning because the non-key column names (x1, x2) are identical across the data frames in the list. Since we only want to join by "Code", we can give the columns unique names when creating the data frames, which avoids both the warning and the renaming afterwards.
Updated data
set.seed(1234)
mat1 <- setNames(data.frame(matrix(nrow=4, ncol =3, rnorm(12,0,1))), c("Code", "x1", "x2"))
mat2 <- setNames(data.frame(matrix(nrow =5, ncol=3, rnorm(15,0,1))), c("Code", "x3", "x4"))
mat3 <- setNames(data.frame(matrix(nrow=3, ncol =3, rnorm(9,0,1))), c("Code", "x5", "x6"))
mat4 <- setNames(data.frame(matrix(nrow =6, ncol =3, rnorm(18,0,1))), c("Code", "x7", "x8"))
mat1$Code <- c(1,2,3,4)
mat2$Code <- c(2,3,4,5,6)
mat3$Code <- c(6,7,8)
mat4$Code <- c(1,2,3,4,5,6)
mat_set <- list(mat1, mat2, mat3, mat4)
and then we can use Reduce
Reduce(function(x, y) merge(x, y, by = "Code", all = TRUE), mat_set)
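To make the folding explicit, here is a small deterministic sketch with made-up data frames (`a`, `b`, `d` and the helper name `full_join_by_code` are illustrative only). Reduce applies the two-argument merge from left to right, so the result equals merging step by step.

```r
a <- data.frame(Code = c(1, 2), x1 = c(10, 20))
b <- data.frame(Code = c(2, 3), x2 = c(200, 300))
d <- data.frame(Code = c(1, 3), x3 = c(1000, 3000))

# Full outer join on "Code": all = TRUE keeps codes found in either input.
full_join_by_code <- function(x, y) merge(x, y, by = "Code", all = TRUE)

# What Reduce does internally, written out by hand:
step1 <- full_join_by_code(a, b)
step2 <- full_join_by_code(step1, d)

out <- Reduce(full_join_by_code, list(a, b, d))
out
identical(out, step2)
```

Because the inputs must be passed as a list of data frames, the set has to be built with list(); c() would concatenate the columns instead.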

How to fill the matrix with values from other matrix matching rows and columns?

I'm trying to make a new matrix in R using values from another matrix, matching row and column names while copying the values. This is what I'm trying to do:
I have two matrices;
X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X4 6 1 2 4
X1 X2 X3 X4
X1 NA NA NA NA
X2 NA NA NA NA
X3 NA NA NA NA
X4 NA NA NA NA
I want to do
X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X3 NA NA NA NA
X4 6 1 2 4
These matrices are just simple examples of my dataset, my real data is more complicated.
Many thanks,
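Checking for row name and column name matches in both matrices prevents a "subscript out of bounds" error. Note that the logical masks pair elements by position, so this assumes the shared names appear in the same order in both matrices. See below.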
mat2[rownames(mat2) %in% rownames(mat1),
colnames(mat2) %in% colnames(mat1)] <- mat1[rownames(mat1) %in% rownames(mat2),
colnames(mat1) %in% colnames(mat2)]
mat2
# X1 X2 X3 X4
# X1 0 9 8 0
# X2 1 2 3 5
# X3 NA NA NA NA
# X4 6 1 2 4
Data:
mat1 <- read.table(text = ' X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X4 6 1 2 4', header = TRUE)
mat1 <- as.matrix(mat1)
mat2 <- matrix(NA, nrow = 4, ncol = 4, dimnames = list(paste0("X", 1:4),
paste0("X", 1:4)))
If I understood your question you can do this:
# Building your matrices
mat1 <- matrix(runif(12), nrow = 3, ncol = 4)
mat2 <- matrix(NA, nrow = 4, ncol = 4)
labs <- paste0("x", 1:4)
colnames(mat1) <- colnames(mat2) <- labs
rownames(mat2) <- labs
rownames(mat1) <- labs[c(1:2, 4)]
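# Fill a result matrix, placing mat1's rows at the positions of their names
rows <- sort(unique(c(rownames(mat1), rownames(mat2))))
result <- matrix(NA, nrow = length(rows), ncol = ncol(mat1),
                 dimnames = list(rows, colnames(mat1)))
result[match(rownames(mat1), rows), ] <- mat1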
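An alternative sketch indexes both matrices by name rather than by logical mask or position, so the row and column order of the two matrices does not need to agree. It assumes both matrices carry row and column names; `r` and `cl` are just illustrative variable names.

```r
mat1 <- matrix(c(0, 1, 6,  9, 2, 1,  8, 3, 2,  0, 5, 4), nrow = 3,
               dimnames = list(c("X1", "X2", "X4"), paste0("X", 1:4)))
mat2 <- matrix(NA_real_, nrow = 4, ncol = 4,
               dimnames = list(paste0("X", 1:4), paste0("X", 1:4)))

# Names present in both matrices.
r  <- intersect(rownames(mat1), rownames(mat2))
cl <- intersect(colnames(mat1), colnames(mat2))

# Character indexing pairs the cells by name on both sides.
mat2[r, cl] <- mat1[r, cl]
mat2
```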

Call col name with min value (NA included)

I have a data frame that includes NAs.
df <- data.frame( X1= c(NA, 1, 4, NA),
X2 = c(34, 75, 1, 4),
X3= c(2,9,3,5))
My ideal outcome looks like this:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
I have tried
df$Min <- colnames(df)[apply(df,1,which.min, na.rm=TRUE)]
but this one didn't work
which.min() has no na.rm argument, and it already skips NA values, so you can simply drop it:
df$Min <- colnames(df)[apply(df,1,which.min)]
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
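The behaviour can be checked directly. The sketch below also covers a case the question does not contain, a row that is entirely NA, where which.min() returns a zero-length result; `min_col` is a hypothetical helper that returns NA for such rows.

```r
df <- data.frame(X1 = c(NA, 1, 4, NA),
                 X2 = c(34, 75, 1, 4),
                 X3 = c(2, 9, 3, 5))

# which.min() discards NA on its own.
which.min(c(NA, 34, 2))

# Passing na.rm raises an error because which.min() takes only x.
msg <- tryCatch(which.min(c(NA, 34, 2), na.rm = TRUE),
                error = function(e) conditionMessage(e))

# Hypothetical helper: name of the minimum, or NA if the row is all NA.
min_col <- function(row, nms) {
  i <- which.min(row)
  if (length(i)) nms[i] else NA_character_
}

cols <- names(df)
m <- as.matrix(df[cols])
df$Min <- vapply(seq_len(nrow(m)), function(r) min_col(m[r, ], cols),
                 character(1))
df
```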
Code:
foo <- names(df)
df$Min <- apply(df, 1, function(x) foo[which.min(x)])
df
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
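Here's an idea that will likely be faster and does not require any looping. You could replace NA with Inf, take the negative of the data, then find the column holding the maximum of each row via max.col(). Note that max.col() breaks ties at random by default; pass ties.method = "first" for a deterministic result.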
names(df)[max.col(-replace(df, is.na(df), Inf))]
# [1] "X3" "X1" "X2" "X2"
Also, not to forget, a data.table solution, given that dt <- as.data.table(df)
dt[ , Min:=names(dt)[match(min(.SD, na.rm=TRUE), .SD)], by=1:nrow(dt)][]
# X1 X2 X3 Min
#1: NA 34 2 X3
#2: 1 75 9 X1
#3: 4 1 3 X2
#4: NA 4 5 X2
Not much simpler than the solutions above, just extending the choices here.

Deleting all variables with over 30% missing values

I found this function to detect proportions of missing values for each column in any given dataframe:
propmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(is.na(x)), n=length(x), propmiss=sum(is.na(x))/length(x)))
I assign it to a variable like this:
propmissdf <- propmiss(df)
Then I loop through the dataframe to NULL variables in my data like this:
for(i in (1:length(df))){
var = names(df)[i]
if((propmissdf[[var]][[3]]) > 0.3) { #the 3 index represents the proportion inside propmissdf
df[var] <- NULL
}
}
This gives me an error:
Error in if ((propmissdf[[var]][[3]]) > 0.3) { :argument is of length zero
But it works, somehow. It gets rid of several variables with a proportion of missing values greater than 0.3, but if I run the for loop again it removes more, and only after 3 or 4 more runs are all of them gone. Why is this happening? Please feel free to correct my problem, or to come up with a better way to remove variables with over 30% NAs.
You can use something like this:
df <- df[colSums(is.na(df))/nrow(df) < .3]
colSums(is.na(df)) would calculate how many NA values there are in each column.
Divide that output by the number of rows in the data.frame to get the proportion.
Use < .3 to create a logical comparison that can be used to subset the relevant columns.
Sample data and example:
set.seed(2)
df <- data.frame(matrix(sample(c(NA, 1:4), 20, TRUE), nrow = 4))
df
# X1 X2 X3 X4 X5
# 1 NA 4 2 3 4
# 2 3 4 2 NA 1
# 3 2 NA 2 2 2
# 4 NA 4 1 4 NA
colSums(is.na(df))/nrow(df)
# X1 X2 X3 X4 X5
# 0.50 0.25 0.00 0.25 0.25
df[colSums(is.na(df))/nrow(df) < .3]
# X2 X3 X4 X5
# 1 4 2 3 4
# 2 4 2 NA 1
# 3 NA 2 2 2
# 4 4 1 4 NA
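As for why the original loop misbehaves, a likely explanation is that it deletes columns while still iterating over the original column positions: after each deletion the remaining columns shift left, so some are skipped and the last indices point past the end of the data frame. The sketch below reproduces that drift on a small made-up data frame (the data and the names `buggy`, `visited`, `drop` are illustrative) and contrasts it with deciding first and deleting afterwards. colMeans(is.na(df)) is used as an equivalent of colSums(is.na(df))/nrow(df).

```r
df <- data.frame(a = c(NA, NA, 1, 2),
                 b = c(NA, NA, NA, 1),
                 c = 1:4,
                 d = c(NA, NA, 3, 4))

prop_na <- colMeans(is.na(df))  # proportion of NA per column

# Decide first, delete afterwards: nothing changes during the decision.
drop <- names(prop_na)[prop_na > 0.3]
cleaned <- df[setdiff(names(df), drop)]
names(cleaned)

# For contrast: deleting inside a loop over the original positions.
buggy <- df
visited <- character(0)
for (i in seq_along(df)) {
  var <- names(buggy)[i]            # position i in the shrinking data frame
  visited <- c(visited, var)
  if (!is.na(var) && prop_na[[var]] > 0.3) buggy[var] <- NULL
}
visited       # column "b" is never visited; the last lookup is NA
names(buggy)  # "b" survives although 75% of it is missing
```

Running the loop again visits the columns that were skipped, which matches the observation that each rerun removes a few more variables.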
For reference, here's a quick timing comparison:
set.seed(1)
df <- data.frame(matrix(sample(c(NA, 1:4), 4000, TRUE), ncol = 1000))
akfun <- function() {
i1 <-sapply(df, function(x) {
pr <- prop.table(table(factor(is.na(x), levels=c(TRUE, FALSE))))
pr[as.logical(names(pr))]< 0.3
})
df[i1]
}
amfun <- function() df[colSums(is.na(df))/nrow(df) < .3]
identical(amfun(), akfun())
# [1] TRUE
system.time(akfun())
# user system elapsed
# 0.172 0.000 0.173
system.time(amfun())
# user system elapsed
# 0.000 0.000 0.001
We can loop over the columns with sapply, get the counts of NA and non-NA values with table, use prop.table to find the proportions, and create a logical vector.
i1 <-sapply(df, function(x) {
pr <- prop.table(table(factor(is.na(x), levels=c(TRUE, FALSE))))
pr[as.logical(names(pr))]< 0.3
})
This vector can be used for subsetting the columns.
df[i1]
If we need to remove the columns
df[!i1] <- list(NULL) # contributed by @Ananda Mahto
df
# X2 X3 X4 X5
#1 4 2 3 4
#2 4 2 NA 1
#3 NA 2 2 2
#4 4 1 4 NA
NOTE: df taken from @Ananda Mahto's post
