data.table rbindlist column of lists - r

I want to transform a nested list to a data.table but I get this error:
library(data.table)
res = list(list(a=1, b=2, c=list(1,2,3,4,5)), list(a=2, b=3, c=list(1,2,3,4,5)))
rbindlist(res)
Error in rbindlist(res) :
Column 3 of item 1 is length 5, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table
My result should look like this:
data.table(a=c(1,2), b=c(2,3), c=list(c(1,2,3,4,5), c(1,2,3,4,5)))
a b c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5
Is there a way to transform this list? It should work without knowing the column names beforehand.

What you want to do is easy. But your input must be formatted as the following:
library(data.table)
res = list(list(a=1, b=2, c=list(c(1,2,3,4,5))), list(a=2, b=3, c=list(c(1,2,3,4,5))))
rbindlist(res)
You can transform your input with the following code
res = lapply(res, function(x)
{
x[[3]] <- list(unlist(x[[3]]))
return(x)
})

Generally, R wants a column of a data.frame or data.table to be like a vector, with one value per row. rbindlist expects a list of data.frames, or of things that can be treated as or converted to data.frames. Your code fails because it first tries to turn each item of your input into a data.frame, and in each one the third column is longer than the first and second.
But it is possible: you need to force R to construct two data.frames with just one row each. So the value of c must have length one, which we can achieve by making it a list with one element: a vector of length 5. We then need to tell R to really treat it "AsIs" and not convert it to something of length 5. For that we have the function I, which does nothing more than mark its input as "AsIs":
res <- list(data.frame(a=1, b=2, c=I(list(c(1,2,3,4,5)))),
data.frame(a=2, b=3, c=I(list(c(1,2,3,4,5)))))
res2 <- rbindlist(res)
The data.frame calls are not even necessary; it also works with list. But in general I think it works best not to rely on how other functions will first convert your input.
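To see what the "AsIs" marking changes, here is a minimal base R sketch. The object names wrapped and expanded are made up for illustration; only data.frame and I come from the answer above.

```r
# With I(): the list is kept as a single cell, so the data.frame has one row
wrapped <- data.frame(a = 1, b = 2, c = I(list(c(1, 2, 3, 4, 5))))
nrow(wrapped)            # 1
length(wrapped$c[[1]])   # 5, the whole vector sits in one cell

# Without I(): the list is converted to a 5-row column and a, b are recycled
expanded <- data.frame(a = 1, b = 2, c = list(c(1, 2, 3, 4, 5)))
nrow(expanded)           # 5
```

The one-row shape is what rbindlist needs from each item in order to stack them.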

I think you need to first process every sublist and convert it to achieve your example output, like so
library(data.table)
res = list(list(a=1, b=2, c=list(1,2,3,4,5)), list(a=2, b=3, c=list(1,2,3,4,5)))
DTlist <- lapply(res, function(row_){
lapply(row_, function(col_){
if(is.list(col_)){  # is.list() is safer than comparing class(), which can have length > 1
list(unlist(col_))
}else{
col_
}
})
})
rbindlist(DTlist)
The result would be
a b c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5
Sorry, this post has been edited because I initially didn't recognize what the OP was trying to do. This also works if the OP doesn't know which column is the sublist column.

Related

R: choose elements from list based on values in vector with same names

[Probably this question already has an answer here, but I didn't manage to find one, also because I have some difficulty in formulating it concisely. Suggestions for reformulating the title of the question are appreciated.]
I have
a list of matrices with different numbers of rows,
a vector of integer values with the same names as the list's,
a list of names that appear in the list and vector above,
an integer variable telling which column to choose from those matrices.
Let's construct, as a working example:
mynames <- c('a', 'c')
mylist <- list(a=matrix(1:4,2,2), b=matrix(1:6,3,2), c=matrix(1:8,4,2))
myvec <- 2:4
names(myvec) <- names(mylist)
chooseCol <- 2
I'd like to construct a vector having as elements the rows taken from myvec and column chooseCol, for the names appearing in mynames. My attempt is
sapply(mynames, function(elem){mylist[[elem]][myvec[elem], chooseCol]})
which correctly yields
a c
4 8
but I was wondering if there's a faster, base (non-tidyverse) method of doing this.
Also important or relevant: the order of the names in mylist and myvec can be different, so I can't rely on position indices.
I would use mapply -
mapply(function(x, y) x[y, chooseCol], mylist[mynames], myvec[mynames])
#a c
#4 8
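As a quick check that indexing by name makes the result independent of element order, here is a small sketch. The shuffled copy myvec_shuffled is a made-up name for illustration; everything else comes from the question.

```r
mynames <- c('a', 'c')
mylist <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2), c = matrix(1:8, 4, 2))
myvec <- setNames(2:4, names(mylist))
chooseCol <- 2

# Subsetting both inputs by mynames aligns them by name before mapply pairs them up
res <- mapply(function(m, i) m[i, chooseCol], mylist[mynames], myvec[mynames])

# Same vector in a different order: the name-based subset still lines up
myvec_shuffled <- myvec[c("c", "a", "b")]
res_shuffled <- mapply(function(m, i) m[i, chooseCol],
                       mylist[mynames], myvec_shuffled[mynames])
identical(res, res_shuffled)   # TRUE
```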

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into the 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
You can use the apply function, selecting the columns you need with c(i1, i2, ...).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length). If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not necessary.
5. Calculate the sums of the rows of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
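The steps above can also be compressed into a single assignment. This is just one possible sketch, using sapply so that the squared columns come back as a matrix that rowSums accepts directly:

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
square <- function(x) x^2

# sapply returns a 3 x 2 matrix of squared values; rowSums adds across columns
x$squareRowSums <- rowSums(sapply(x[2:3], square))
x$squareRowSums   # 65 89 117
```

No temporary object is left behind, so there is nothing to remove afterwards.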
Here is another apply solution
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

Basic Loop through a matrix with pasting column names into result objects pt II

This is an edited version of my initial question, which I obviously explained poorly, so I'll try again.
I want to perform a function with every column of the dataframe, and name the resulting objects (here values of the class dist) according to the original dataframe and the colname:
library(vegan)
d1 <- as.data.frame(matrix(rnorm(12), 4, 3, dimnames=list(NULL, LETTERS[1:3])))
Fun <-function(x){
vegdist(decostand(x,"standardize",MARGIN=2), method="euclidean")
}
d1.A <- Fun(d1$A) # A being the colname of the first column of d1
d1.B <- Fun(d1$B)
d1.C <- Fun(d1$C)
I want to do this for more than 100 columns in my dataframe.
So, in short, I want to apply my function to all columns of my dataframe and create result objects whose names are built from the name of the original dataframe and the name of the column the function was working on.
Thank you very much!
If you want to clutter your global environment with lots of objects, one option is list2env, or you can use assign (though I would not recommend it). Instead, you can do all the operations and analysis on the results stored in a list, and later write them to different files using write.table and lapply.
lst <- setNames(lapply(d1, Fun),
paste("d1", colnames(d1), sep="."))
The above list could be used for most of the analysis. If you need them as individual objects:
list2env(lst, envir=.GlobalEnv)
#<environment: R_GlobalEnv>
Now, you can get the individual objects by calling d1.A, d1.B etc.
d1.A
# 1 2 3
#2 1.9838499
#3 1.2754209 0.7084290
#4 2.2286961 0.2448462 0.9532752
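For readers without vegan installed, the list-based approach can be tried with a stand-in. In this sketch Fun2 is a hypothetical replacement for Fun, built from base R's scale and dist; it is an assumption for illustration and not the vegan computation itself.

```r
set.seed(1)
d1 <- as.data.frame(matrix(rnorm(12), 4, 3, dimnames = list(NULL, LETTERS[1:3])))

# Hypothetical stand-in for Fun: standardise one column, then Euclidean distances
Fun2 <- function(x) dist(scale(x), method = "euclidean")

# One named list entry per column instead of one global object per column
lst <- setNames(lapply(d1, Fun2), paste("d1", colnames(d1), sep = "."))
names(lst)             # "d1.A" "d1.B" "d1.C"
class(lst[["d1.A"]])   # "dist"
```

Keeping the results in lst means further steps can loop over the list with lapply instead of reconstructing object names with paste.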
I am assuming you need to create a number (equal to the number of columns of d1) of objects of class "dist".
If that is the case, you can do this:
for (i in 1:ncol(d1))
{
eval(parse(text=paste('d1.',colnames(d1)[i], "<-" ,"Fun(d1[,",i,"])", sep="")))
}
This evaluates in each iteration to:
d1.A <- Fun(d1[,1])
d1.B <- Fun(d1[,2])
d1.C <- Fun(d1[,3])

How to avoid listing out function arguments but still subset too?

I have a function myFun(a,b,c,d,e,f,g,h) which contains a vectorised expression of its parameters within it.
I'd like to add a new column: data$result <- with(data, myFun(A,B,C,D,E,F,G,H)) where A,B,C,D,E,F,G,H are column names of data. I'm using data.table but data.frame answers are appreciated too.
So far the parameter list (column names) can be tedious to type out, and I'd like to improve readability. Is there a better way?
> myFun <- function(a,b,c) a+b+c
> dt <- data.table(a=1:5,b=1:5,c=1:5)
> with(dt,myFun(a,b,c))
[1] 3 6 9 12 15
The ultimate thing I would like to do is:
dt[isFlag, newCol:=myFun(A,B,C,D,E,F,G,H)]
However:
> dt[a==1,do.call(myFun,dt)]
[1] 3 6 9 12 15
Notice that the j expression seems to ignore the subset. The result should be just 3.
Ignoring the subset aspect for now: df$result <- do.call("myFun", df). But that copies the whole df whereas data.table allows you to add the column by reference: df[,result:=myFun(A,B,C,D,E,F,G,H)].
To include the comment from @Eddi (and I'm not sure how to combine these operations in data.frame so easily):
dt[isFlag, newCol := do.call(myFun, .SD)]
Note that .SD can be used even when you aren't grouping, just subsetting.
Or, if your function literally just adds its arguments together element-wise:
dt[isFlag, newCol := Reduce(`+`, .SD)]
This automatically places NA into newCol where isFlag is FALSE.
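Here is a small runnable sketch of the .SD approach using the toy table from the question. The condition a == 1 stands in for isFlag, and the .SDcols argument is an addition of mine: it restricts .SD to the columns myFun expects, which matters once the table holds other columns such as a flag.

```r
library(data.table)

myFun <- function(a, b, c) a + b + c
dt <- data.table(a = 1:5, b = 1:5, c = 1:5)

# .SD holds only the rows selected by i, so the subset is respected.
# The .SD column names a, b, c match myFun's argument names.
dt[a == 1, newCol := do.call(myFun, .SD), .SDcols = c("a", "b", "c")]
dt$newCol   # 3 NA NA NA NA
```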
You can use
df$result <- do.call(myFun, df)

R: Test condition on column of dataframe elements within list; return smaller list

My goal is to take a list of dataframes, see if a specific column of each data frame has a max value of 0, and if so, remove that data frame from my list.
Right now I am looping over the names of the list. Given that this is R, there must be a better way. I feel I need some function applied through lapply() to get this right. I've also considered ddply() but I think that may be overkill. Here is what I have so far:
# Make df of First element
myColumn <- rep ("ElementA",times=10)
values <- seq(1,10)
a <- data.frame(myColumn,values)
# Make df of second element
myColumn <- rep ("ElementB",times=10)
values <- rep(0,10)
b <- data.frame(myColumn,values)
# Bind the dataframes together
df <- rbind(a,b)
#Now split the dataframes based on element name
myList <- split(df,df$myColumn)
# Now loop through element lists and check for max of 0 in values
for (name in names(myList)) { # Loop through List
if (max(myList[[name]]$values) == 0) { # Check Max for 0
myList <- myList[[-names]] # If 0, remove element from list
} # Close If
} # Close Loop
Error in -names : invalid argument to unary operator
I've tested my code outside the loop, and it all seems to work.
Any help is greatly appreciated. Thanks!
You can use this:
myList <- myList[sapply(myList, function(d) max(d$values) != 0)]
instead of the for() loop. Note that data frames with zero rows will pass this filter, with a warning, because max() of an empty vector returns -Inf.
To ensure empty dataframes are removed, use this:
myList <- myList[sapply(myList, function(d) if(nrow(d)==0) FALSE else max(d$values)!=0)]
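A variant of the same filter, sketched with vapply so that the result is guaranteed to be a logical vector even if the list is empty. The name keep is made up for illustration; the data is rebuilt from the question.

```r
a <- data.frame(myColumn = rep("ElementA", 10), values = 1:10)
b <- data.frame(myColumn = rep("ElementB", 10), values = rep(0, 10))
df <- rbind(a, b)
myList <- split(df, df$myColumn)

# TRUE only for non-empty data frames whose maximum is not 0;
# && short-circuits, so max() is never called on an empty column
keep <- vapply(myList, function(d) nrow(d) > 0 && max(d$values) != 0, logical(1))
myList <- myList[keep]
names(myList)   # "ElementA"
```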

Resources