Merging list elements of two lists if unequal length - r

I have two lists. List elements are data.tables.
One list contains all Keys:
listA <- list(Key1 = data.table(A = rnorm(5), B = rnorm(5), C = rnorm(5)),
Key2 = data.table(A = rnorm(5), B = rnorm(5), C = rnorm(5)),
Key3 = data.table(A = rnorm(5), B = rnorm(5), C = rnorm(5)))
The other list is a subset with additional information:
listB <- list(Key1 = data.table(D = "B"),
Key2 = data.table(D = "N"))
I want to add column D from the tables in listB to the tables in listA where the Key matches. I have tried:
mapply(FUN = function(x, y) x[, D := y[, D]], x = listA, y = listB, SIMPLIFY = F)
but this throws a warning:
Warning message:
In mapply(FUN = function(x, y) x[, `:=`(D, y[, D])], x = listA, :
longer argument not a multiple of length of shorter
In the end it does the job, but it recycles the D column for the Key that is not in listB.
How can I make sure column D is only added to the tables whose Key matches? Or, even better, add column D to all tables in listA and fill it with NA where there is no match? Thanks.

What is happening here is that if you use mapply with vectors or lists of different lengths, it recycles the shorter one. Thus, if you have the two lists:
mapply(FUN, list(a1, a2, a3), list(b1, b2))
Is equivalent to doing (throwing a warning):
mapply(FUN, list(a1, a2, a3), list(b1, b2, b1))
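The recycling described above can be seen with plain numbers. This is a minimal base-R sketch, not the data.table case itself: the lists a, b, keys and extra are made-up illustrations, and the name-based lookup at the end is one possible way to pair elements by name instead of by position.

```r
# Recycling: the shorter list is repeated to the length of the longer one.
# 4 is a multiple of 2, so no warning is raised here.
a <- list(a1 = 1, a2 = 2, a3 = 3, a4 = 4)
b <- list(b1 = 10, b2 = 20)
recycled <- mapply(function(x, y) x + y, a, b)
# a3 was paired with b1, and a4 with b2

# Pairing by name instead (illustrative data): look each key up, NA if absent.
keys  <- list(Key1 = 1, Key2 = 2, Key3 = 3)
extra <- list(Key1 = "B", Key2 = "N")
paired <- lapply(names(keys), function(k) {
  if (k %in% names(extra)) extra[[k]] else NA_character_
})
names(paired) <- names(keys)
```

Because mapply pairs elements purely by position, any name-based matching has to happen before the call or inside the function.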
To avoid this, add to listB a data table with a column D of NAs for every key of listA that has no match in listB, and pass listB in the same order as listA (mapply pairs elements by position, not by name). Something like:
missing_keys <- setdiff(names(listA), names(listB))
listB[missing_keys] <- rep(list(data.table(D = NA_character_)), length(missing_keys))
mapply(FUN = function(x, y) x[, D := y[, D]], x = listA, y = listB[names(listA)], SIMPLIFY = FALSE)
Now it does not throw a warning, and column D is filled with NA for the non-matching elements.

Related

subset a list within a list by column's name start with all requested pattern

I have a list of data frames (L) with different column names. I want the subset of the list whose data frames contain both a column name starting with A and a column name starting with B (the order of A and B is not important).
L1 = data.frame(A1 = c(1:4) , Ab = c("u","v","w","x"))
L2 = data.frame(A2 = c(1:4) , Bc = c("u","v","w","x"))
L3 = data.frame(A3 = c(1:4) , Bd = c("u","v","w","x"))
L4 = data.frame(A = c(1:4) , B = c("u","v","w","x"))
L<-list(L1,L2, L3, L4)
The result should be a list with L2, L3, and L4, which contain columns starting with both A and B.
#######
The following command returns every data frame that has columns starting with A or B, rather than only those that have both:
lapply(L, function(x) x[ , grepl( '^A|^B' , names(x))])
#######
and this function only keeps data frames with exactly the requested column names, not columns that start with A and B:
trial <- function(x)
{
reqnames <- c('A', 'B')
L <- lapply(L, function(x) all(reqnames %in% names(x)))
L <- which(L==1)
x[L]
}
trial(L)
Try this (the \(x) lambda shorthand needs R 4.1 or later):
new_list <- lapply(L, \(x) x[
all(
any(grepl("^A", names(x))),
any(grepl("^B", names(x)))
)
]
)
This will return an empty data frame (zero columns) in place of L1, and the contents of L2 to L4 unchanged.
If you don't want the empty data frame for L1, you can subset the result again:
new_list[sapply(new_list, length)>0]
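The two steps above can also be collapsed into one with base R's Filter(). This is only a sketch on the example data from the question; has_a_and_b is a hypothetical helper name, and startsWith() is used in place of grepl() for the prefix test.

```r
# Example data from the question
L <- list(
  data.frame(A1 = 1:4, Ab = c("u", "v", "w", "x")),
  data.frame(A2 = 1:4, Bc = c("u", "v", "w", "x")),
  data.frame(A3 = 1:4, Bd = c("u", "v", "w", "x")),
  data.frame(A  = 1:4, B  = c("u", "v", "w", "x"))
)

# Hypothetical helper: TRUE only if some column name starts with "A"
# and some column name starts with "B"
has_a_and_b <- function(df) {
  any(startsWith(names(df), "A")) && any(startsWith(names(df), "B"))
}

# Filter() keeps the list elements for which the predicate is TRUE
kept <- Filter(has_a_and_b, L)
```

Here kept holds the last three data frames with all of their columns; the first one is dropped because none of its column names starts with B.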

applying a created function in a list with data frames

I would like to calculate the coefficient of variation for each data frame in a list. However, when I apply my function to the list of data frames, I get this error:
coef_var = lapply(dists_log, cvs)
Error in is.data.frame(x) :
'list' object cannot be coerced to type 'double'
Here is what I did:
List = list (A = data.frame(A = rnorm(30), B = rnorm(30), C =rnorm (30), D = rnorm(30)),
B = data.frame(A = rnorm(30), B = rnorm(30), C =rnorm (30), D = rnorm(30)),
C = data.frame(A = rnorm(30), B = rnorm(30), C =rnorm (30), D = rnorm(30)),
D = data.frame(A = rnorm(30), B = rnorm(30), C =rnorm (30), D = rnorm(30)))
# function to calculate the coefficient of variation
cvs <- function (dist){
cv <- sd(dist, na.rm=T) / mean(dist, na.rm=T) * 100
return(cv)
}
Then I run:
coef_var = lapply(dists_log, cvs)
and get the error message above.
Can someone help me with this error?
We need a nested lapply, because sd and mean require a vector as input, not a data.frame. So we loop over the columns of each data.frame with an inner lapply, apply the 'cvs' function, assign the results back to the object, and return the data.frame:
lapply(dists_log, function(x) {x[] <- lapply(x, cvs); x})
If we expect only a single value per column as output:
lapply(dists_log, function(x) unlist(lapply(x, cvs)))
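As a small worked check of the per-column approach, here is a sketch on made-up deterministic data (the list dists and its values are assumptions chosen so the result is easy to verify by hand). It uses vapply() instead of unlist(lapply()) so that each column is guaranteed to yield exactly one number.

```r
# Coefficient of variation in percent, as defined in the question
cvs <- function(v) sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE) * 100

# Made-up data: sd(c(1, 2, 3)) is 1 and the mean is 2, so the CV is 50
dists <- list(
  first  = data.frame(A = c(1, 2, 3), B = c(2, 4, 6)),
  second = data.frame(A = c(10, 20, 30), B = c(5, 5, 5))
)

# Outer lapply over data frames, inner vapply over their columns
coef_var <- lapply(dists, function(df) vapply(df, cvs, numeric(1)))
```

Each element of coef_var is a named numeric vector with one coefficient of variation per column.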

rev() in r and how to apply it to a list using loops

I have a list, say {a, b, c, d, ...}, where each element is a data.table whose column order I need to reverse. However, I only want to rev() the columns after the first one, since the first column is an ID. I tried using a loop, but it returned:
Error in `[<-.data.table`(`*tmp*`, , -1, value = list(code_a = c("a", :
Item 1 of column numbers in j is -1 which is outside range [1,ncol=4]. Use column names instead in j to add new columns.
Example:
a <- c("a","b","c","d","e","f")
b <- 1:6
c <- c("F","E","D","C","B","A")
d <- 10:15
dt1 <- data.table("ID" = b, "code_a" = a)
dt2 <- data.table("ID" = b, "code_c" = c)
dt3 <- data.table("ID" = b, "code_d" = d)
dt <- list(dt1,dt2,dt3)
rev_dt <- rev(dt)
merged_list <- list()
rev_merged_list <- list()
rev_merged_list <- Reduce(merge, rev_dt, accumulate = TRUE)
merged_list <- rev_merged_list
merged_list <- rev(merged_list)
for(z in 1:length(dt)){
merged_list[[z]][,-1] = rev(merged_list[[z]][,-1])
}
More Information:
The for loop here is supposed to be:
- for z from 1 to the length of dt
- the merged_list element z (accessed with double square brackets) should be a data.table
- where the data does not include the first column
- should be assigned to the rev of the same element z, where the first column is also excluded
Does this logic hold for the above loop? I am unsure what is wrong!
Expected Output:
output_ <- list()
a_ <- data.table("ID" = b, "code_a" = a, "code_c" = c, "code_d" = d)
b_ <- data.table("ID" = b, "code_c" = c, "code_d" = d)
c_ <- data.table("ID" = b, "code_d" = d)
output_[[1]] <- a_
output_[[2]] <- b_
output_[[3]] <- c_
output_
I was told yesterday that I can specify a right-hand merge above; however, to do so I need to specify by = "ID" in the merge, and I am unsure what the x and y values are when merging multiple data sets.
I am also under the impression that lapply() can do the same thing instead of a loop, but I am unsure how to achieve that in this case. Thanks~
We can use setcolorder, which reorders the columns by reference, instead of assigning into merged_list[[z]][, -1] (as the error message shows, that assignment does not accept the negative column index):
for(i in seq_along(merged_list)){
setcolorder(merged_list[[i]],
c(names(merged_list[[i]])[1], rev(names(merged_list[[i]])[-1])))
}
all.equal(merged_list, output_, check.attributes = FALSE)
#[1] TRUE
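On the lapply() part of the question: a minimal sketch, assuming plain data.frames rather than data.tables so that it runs without extra packages. The list merged and the helper reverse_non_id are made up for illustration. Unlike setcolorder, this builds reordered copies instead of modifying the tables by reference.

```r
ids <- 1:3
# Stand-in for merged_list: ID first, then the other columns in "wrong" order
merged <- list(
  data.frame(ID = ids, code_d = 10:12, code_c = c("C", "B", "A"),
             code_a = c("a", "b", "c")),
  data.frame(ID = ids, code_d = 10:12, code_c = c("C", "B", "A")),
  data.frame(ID = ids, code_d = 10:12)
)

# Hypothetical helper: keep column 1 in place, reverse the remaining columns
reverse_non_id <- function(df) df[c(1, rev(seq_along(df)[-1]))]

reordered <- lapply(merged, reverse_non_id)
```

The column index vector is built from positions, so a table with only the ID and one other column is returned unchanged.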

Automatically add any variables that exist in one data.frame but missing in other data.frames in R

Suppose I have a reference data.frame called a. How can I automatically add any variables that exist in a but are missing in other data.frames b and d?
NOTE: My goal is to make a function out of this such that any number of data.frames, and any number of variables can be completed based on a single reference data.frame.
a <- data.frame(x = 2:3, y = 4:5, z = c(T, F)) ## reference data.frame
b <- data.frame(x = 6:7) ## Add y and z here
d <- data.frame(x = 7:8) ## Add y and z here
Supposing all the data.frames involved share the same number of rows, you can simply copy the missing columns over from a:
toadd<-setdiff(colnames(a),colnames(b))
b[toadd]<-a[toadd]
Wrapping the above in a function:
f<-function(refdf, ...) {
res<-listdf<-list(...)
res<-lapply(listdf, function(x) {
toadd<-setdiff(names(refdf),names(x))
x[toadd]<-refdf[toadd]
x
})
c(list(refdf),res)
}
Then try for instance:
f(a,b)
f(a,b,d)
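The function above copies the values of a into the other data.frames, which is why the row counts must agree. If the missing columns should instead be added as empty (NA) placeholders of the right type, something along these lines might work. It is only a sketch: add_missing_cols is a hypothetical name, and d is given three rows here on purpose to show that the row counts no longer need to match.

```r
a <- data.frame(x = 2:3, y = 4:5, z = c(TRUE, FALSE))  # reference data.frame
b <- data.frame(x = 6:7)
d <- data.frame(x = 7:9)  # different row count on purpose

# Hypothetical helper: add every column of refdf that is missing, filled with NA
add_missing_cols <- function(refdf, ...) {
  lapply(list(...), function(df) {
    for (col in setdiff(names(refdf), names(df))) {
      # Indexing with NA yields an NA of the same type as the reference column
      typed_na <- refdf[[col]][NA_integer_]
      df[[col]] <- rep(typed_na, nrow(df))
    }
    # Reference columns first, then any extra columns the data.frame already had
    df[union(names(refdf), names(df))]
  })
}

res <- add_missing_cols(a, b, d)
```

The result is an unnamed list with one completed data.frame per input, in the order the inputs were given.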
# Using a reference data.frame, perform a right join in order
# to append the required vectors to the provided data.frames:
add_empty_vecs <- function(refdf, ...){
# Store the names of the other data.frames: df_names => character vector
df_names <- vapply(as.list(substitute(list(...)))[-1L], function(e) deparse(e)[1L], character(1))
# Right join the reference data.frame onto each provided data.frame
# and return the results as a named list
setNames(lapply(list(...), function(y){
merge(refdf, y, by = intersect(names(refdf), names(y)), all.y = TRUE)
}
), df_names)
}
# Apply function only to df b:
add_empty_vecs(a, b)
# Apply function to both df b & df d:
add_empty_vecs(a, b, d)
# Apply function to all b, d, e:
add_empty_vecs(a, b, d, e)
Data:
a <- data.frame(x = 2:3, y = 4:5, z = c(T, F)) ## reference data.frame
b <- data.frame(x = 6:7) ## Add y and z here
d <- data.frame(x = 7:8) ## Add y and z here
e <- data.frame(x = 9:10)

Looping correlation tests within nested lists on same variables across more than two dataframes

Consider these three dataframes in a nested list:
df1 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
df2 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
df3 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
dflist1 <- list(df1,df2,df3)
dflist2 <- list(df1,df2,df3)
nest_list <- list(dflist1, dflist2)
I want to run a 'cor.test' of column 'a' against column 'a', 'b' against 'b', and 'c' against 'c' across all data frames in each dflist. I can do it individually if I assign each data frame to the global environment with the code below, thanks to this post:
for (i in seq_along(nest_list)) { # extract data frames from the list into individual objects
for (j in seq_along(nest_list[[i]])) {
temp_df <- nest_list[[i]][[j]]
# assumes nest_list and its sub-lists are named
ds <- paste(names(nest_list)[i], names(nest_list[[i]])[j], sep = "_")
assign(ds, temp_df)
}
}
combn(paste0("df", 1:3), 2, FUN = function(x) { # actual cor.test
x1 <- mget(x, envir = .GlobalEnv)
Map(function(x,y) cor.test(x,y, method = "spearman")$p.value, x1[[1]], x1[[2]])})
I am not sure I understand exactly what you want to do, but could something like this help?
# vector of your column names
columns <- c("a","b","c")
# correlation function: Spearman p-value between the i-th and j-th vectors in data
correl <- function(i,j,data) {cor.test(unlist(data[i]),unlist(data[j]), method = "spearman")$p.value}
correlfun <- Vectorize(correl, vectorize.args=c("i","j"))
# "Loop" over the columns vector (u is each column name in turn: "a", then "b", then "c")
res <- sapply(columns,function(u){
# For each dflist, keep only the column named u from every data frame
lapply(lapply(nest_list,function(x){sapply(x,function(x){x[which(names(x)==u)]})}),function(z)
# then use outer to apply correlfun to every pair of those vectors
{outer(seq_along(z),seq_along(z),correlfun,data=z)})},simplify = FALSE,USE.NAMES = TRUE)
Does this help? I am not sure my explanation is entirely clear :)
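A somewhat flatter alternative, offered only as a sketch: for each dflist, enumerate the pairs of data frames with combn() and run the column-wise test with mapply(), without assigning anything to the global environment. The helpers make_df and pairwise_pvalues and the "df1_vs_df2" labels are made up for illustration, and the data is regenerated here so the block runs on its own.

```r
set.seed(42)
make_df <- function() {
  data.frame(a = runif(10, 1, 10), b = runif(10, 1, 10), c = runif(10, 1, 10))
}
nest_list <- list(
  list(make_df(), make_df(), make_df()),
  list(make_df(), make_df(), make_df())
)

# Hypothetical helper: Spearman p-values for matching columns of every pair
# of data frames in one list
pairwise_pvalues <- function(dfs) {
  pairs <- combn(seq_along(dfs), 2, simplify = FALSE)
  out <- lapply(pairs, function(p) {
    # mapply walks over the columns of both data frames in parallel
    mapply(function(x, y) cor.test(x, y, method = "spearman")$p.value,
           dfs[[p[1]]], dfs[[p[2]]])
  })
  names(out) <- vapply(pairs, function(p) paste0("df", p[1], "_vs_df", p[2]),
                       character(1))
  out
}

res <- lapply(nest_list, pairwise_pvalues)
```

res[[1]][["df1_vs_df2"]] is then a named vector with one p-value each for columns a, b and c. This relies on the data frames having their columns in the same order, since mapply pairs columns by position.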
