randomForest::combine() and objects in a list

In order to run Random Forest models over very large datasets, I have divided my data into chunks and have run randomForest::randomForest() on each chunk. The resulting randomForest objects are contained in a list. I now need to use randomForest::combine() to combine the trees from each chunk of data.
My question is, how do I use a function such as combine() over all objects in a list? As I understand it, sapply() and friends apply a function to each object in a list, which is not what I want to do. I need to use combine() across all of the randomForest objects in the list; or, if that is not directly possible, I need to pull out each object separately and send it to combine(). Another issue is that different datasets produce a varying number of data chunks, so I want the code to be flexible about the number of chunks.
My list (rf.final) contains objects "1" through "5" which are each randomForest objects:
> class(rf.final)
[1] "list"
> names(rf.final)
[1] "1" "2" "3" "4" "5"
> class(rf.final[[1]])
[1] "randomForest.formula" "randomForest"
There are 5 objects just because I had 5 chunks of data for this particular dataset.
I haven't included str(rf.final) because the output is huge [even just for str(rf.final[[1]])] but I can if desired.

I finally found the solution! Use the do.call() function from base R, i.e.
rf.final2 <- do.call("combine", rf.final)
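For context, a minimal end-to-end sketch of the whole workflow (the chunking of iris below is purely illustrative, not the asker's data):
library(randomForest)
# Illustrative: split iris into 3 chunks and fit one forest per chunk
chunks <- split(iris, rep(1:3, length.out = nrow(iris)))
rf.list <- lapply(chunks, function(d) randomForest(Species ~ ., data = d, ntree = 100))
# do.call() unrolls the list, i.e. combine(rf.list[[1]], rf.list[[2]], ...);
# randomForest:: avoids clashes with other packages' combine()
rf.combined <- do.call(randomForest::combine, rf.list)
rf.combined$ntree
# [1] 300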

Related

How to use List of List of Dataframes

I'm not sure if this is possible, or even how to find a good solution, for the following R problem.
Data / Background / Structure:
I've collected a big dataset of project-based cooperation data, which maps specific projects to the participating companies (this can be understood as a bipartite edge list for social network analysis). For analytical reasons it is advisable to subset the whole dataset into different subsets for different locations and time periods. Therefore, I've created the following data structure
sna.location.list
[[1]] (location1)
[[1]] (is a dataframe containing the bip. edge-list for time-period1)
[[2]] (is a dataframe containing the bip. edge-list for time-period2)
...
[[20]] (is a dataframe containing the bip. edge-list for time-period20)
[[2]] (location2)
... (same as 1)
...
[[32]] (location32)
...
Every data frame contains a project id and the corresponding company ids.
My goal is now to transform the bipartite edge lists into one-mode networks and then do some further SNA-related calculations (degree, centralization, status, community detection, etc.) and save the results.
I know how to do these calculation steps with one(!) specific network, but I am having a really hard time automating this process for all of the networks in the described list structure at once, and saving the various outputs (node-level and network-level variables) in a similar structure.
I have already tried several for-loop and apply approaches, but it still gives me sleepless nights, and right now I feel rather helpless. Any help or suggestions would be highly appreciated. If you need more information or examples so you can give me a brief demo or code example of how to tackle such a nested structure and do such SNA-related calculations/modifications for all of the aforementioned subsets in an efficient, automatic way, please feel free to ask.
Let's say you have a function foo that you want to apply to each data frame. Those data frames are in lists, so lapply(that_list, foo) is what we want. But you've got a bunch of lists, so we actually want to lapply that first lapply across the outer list, hence lapply(that_list, lapply, foo). (foo is passed along to the inner lapply via R's ... mechanism.) If you wish to be more explicit, you can use an anonymous function instead: lapply(that_list, function(x) lapply(x, foo)).
You haven't given a reproducible example, so I'll demonstrate by applying the nrow function to a nested list of built-in data frames:
d = list(
list(mtcars, iris),
list(airquality, faithful)
)
result = lapply(d, lapply, nrow)
result
# [[1]]
# [[1]][[1]]
# [1] 32
#
# [[1]][[2]]
# [1] 150
#
#
# [[2]]
# [[2]][[1]]
# [1] 153
#
# [[2]][[2]]
# [1] 272
As you can see, the output is a list with the same structure. If you need the names, you can switch to sapply with simplify = FALSE.
This covers applying functions to a nested list and saving the returns in a similar data structure. If you need help with calculation efficiency, parallelization, etc., I'd suggest asking a separate question focused on that, with a reproducible example.
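As a concrete (hedged) illustration for the bipartite case, assuming igraph is acceptable and that the first column of each edge list holds the project ids (an assumption; adjust to your column layout), a foo along these lines would do the projection and a node-level measure in one pass:
library(igraph)
# Hypothetical foo: bipartite edge list -> one-mode company network + degree
# (assumes column 1 = project id, column 2 = company id)
foo <- function(el) {
  g <- graph_from_data_frame(el, directed = FALSE)
  V(g)$type <- V(g)$name %in% as.character(el[[1]])  # TRUE = project node
  proj <- bipartite_projection(g, which = "false")   # company-company network
  list(graph = proj, degree = degree(proj))
}
results <- lapply(sna.location.list, lapply, foo)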

Creating multiple matrices with a "for" loop

I am currently in a statistics class working on multivariate clustering and classification. For our homework we are trying to use 10-fold cross-validation to test how accurate different classification methods are on a 6-variable dataset with three classes. I was hoping I could get some help creating a for loop (or something better that I don't know about) to create and run the 10 classifications and validations so I don't have to repeat myself 10 times on everything. Here is what I have. It will run, but the first two matrices only show the first variable; because of this, I have not been able to troubleshoot the other parts.
index <- sample(1:10, 90, rep = TRUE)
table(index)
training <- NULL
leave <- NULL
Trfootball <- NULL
football.pred <- NULL
for (i in 1:10){
  training[i] <- football[index != i, ]
  leave[i] <- football[index == i, ]
  Trfootball[i] <- rpart(V1 ~ ., data = training[i], method = "class")
  football.pred[i] <- predict(Trfootball[i], leave[i], type = "class")
  table(Actual = leave[i]$"V1", classfied = football.pred[i])
}
Removing the "[i]" and replacing them with 1:10 individually works right now....
Your problem lies in the assignment of a data.frame or matrix to a vector that you initially set as NULL (training and leave). A way to think about it: you are trying to squeeze a whole matrix into an element that can only hold a single number. That's why R has a problem with your code. You need to initialise training and leave as something that can handle your iterative accumulation of values (the R list object, as @akrun points out).
The following example should give you a feel for what is happening and what you can do to fix your problem:
a <- NULL  # your setup at the moment
print(a)   # NULL as expected
# your football data is either a data.frame or a matrix
# try assigning those objects to the first element of a:
a[1] <- data.frame(1:10, 11:20)  # no good
a[1] <- matrix(1:10, nrow = 2)   # no good either
print(a)
## create "a" upfront as a list, instead of an empty object
# what you need:
a <- vector(mode = "list", length = 10)
print(a)  # empty list with 10 locations
## to assign and extract elements of a list, use the "[[" double brackets
a[[1]] <- data.frame(1:10, 11:20)
# access the data.frame in "a"
a[1]    # no good
a[[1]]  # what you need to extract the first element of the list
## how does it look when you add an extra element?
a[[2]] <- matrix(1:10, nrow = 2)
print(a)
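Applied to the question's code, a sketch of the corrected loop (assuming, as in the question, that football holds the class label in column V1 and that rpart is loaded) looks like this:
library(rpart)
index <- sample(1:10, 90, rep = TRUE)
training <- vector(mode = "list", length = 10)
leave <- vector(mode = "list", length = 10)
Trfootball <- vector(mode = "list", length = 10)
football.pred <- vector(mode = "list", length = 10)
conf <- vector(mode = "list", length = 10)
for (i in 1:10){
  training[[i]] <- football[index != i, ]
  leave[[i]] <- football[index == i, ]
  Trfootball[[i]] <- rpart(V1 ~ ., data = training[[i]], method = "class")
  football.pred[[i]] <- predict(Trfootball[[i]], leave[[i]], type = "class")
  conf[[i]] <- table(Actual = leave[[i]]$V1, Classified = football.pred[[i]])
}
conf  # one confusion matrix per fold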

Clean environment to speed up function in R

I have this example data and some example functions:
other_data <- c(1, 2, 3)  # data I have to keep
fun <- function(a, b, c){
  data <- c(a, b, c)
  return(data)
}  # first function
var_1 <- runif(20, 10, 20)  # variables
var_2 <- runif(20, 10, 20)
var_3 <- runif(20, 10, 20)
vars <- data.frame(var_1, var_2, var_3)  # data frame of variables
subfun <- function(x){
  res <- fun(vars[x, 1], vars[x, 2], vars[x, 3])
  return(res)
}  # sub-function of the first one, to feed it row-wise options and collect the results in a list
final <- lapply(1:nrow(vars), subfun)  # this should be the final result I want to get
The problem is that my real data is much bigger, and I have about 500 "data" objects (like in the first function) which have to be reloaded every time because of the different values of a, b, and c. This seems to slow the function down because of memory, i.e. the environment.
I don't want to write rm(data) 500 times in the first function before the line return(data).
So my questions:
Is there any straightforward way to remove all objects that were created during the function, but only those objects in fun(a,b,c)? Because I must NOT remove other_data.
Or, more simply, is there a straightforward way to delete all objects except some, something like rm(ls(), except = c("other_data"))?
If you only want to keep certain objects, you can use the keep function from the gdata package: http://www.inside-r.org/packages/cran/gdata/docs/keep
library(gdata)
var1 <- 1
var2 <- 2
ls()
# [1] "var1" "var2"
keep(var1, sure = TRUE)
ls()
# [1] "var1"
Setting sure to TRUE performs the removal. Otherwise, keep just returns the names of the objects that would have been removed.
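If you'd rather stay in base R, the question's second ask can be answered directly with a one-liner (here other_data is the object to spare):
rm(list = setdiff(ls(), "other_data"))  # removes everything except other_data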

Proper way to subset big.matrix

I would like to know if there is a 'proper' way to subset big.matrix objects in R. It is simple to subset a big.matrix, but the class always reverts to 'matrix'. This isn't a problem when working with small datasets like this one, but with extremely large datasets the subset could still benefit from the 'big.matrix' class.
require(bigmemory)
data(iris)
# I realize the warning about factors but not important for this example
big <- as.big.matrix(iris)
class(big)
[1] "big.matrix"
attr(,"package")
[1] "bigmemory"
class(big[,c("Sepal.Length", "Sepal.Width")])
[1] "matrix"
class(big[,1:2])
[1] "matrix"
I have since learned that the 'proper' way to subset a big.matrix is to use sub.big.matrix, although this only works for contiguous columns and/or rows. Non-contiguous subsetting is not currently implemented.
sm <- sub.big.matrix(big, firstCol=1, lastCol=2)
It doesn't seem to be possible without calling as.big.matrix on the subset.
From the big.matrix documentation,
If x is a big.matrix, then x[1:5,] is returned as an R matrix containing the first five rows of x.
I presume this applies to columns as well. So it seems you would need to call
a <- as.big.matrix(big[,1:2])
in order for the subset to also be a big.matrix object.
class(a)
# [1] "big.matrix"
# attr(,"package")
# [1] "bigmemory"

Applying lapply on Multiple Data Frames in a List, R

I have a list u of similar data frames (4 columns, all with the same headers) and would like to run an lapply function to get the correlation of columns 2 and 3 of each data frame. I want the code to work for any number of data frames (the list has 300+ CSVs).
I've tried this code but it hasn't worked:
i<-1:2
for (i) lapply(u, cor(u[[i]][,2],u[[i]][,3]))
Can someone please help me fix this code? Still fairly new to the program.
Edit: I've tried Metrics' code below and it works; unfortunately, one of the CSVs contains only headers and no data, so I get this error: Error in cor(u[, 2], u[, 3]) : 'x' is empty
Is there any way sapply can be modified so that cor returns 0 if there isn't any data available?
x contains the list of all data frames. In the following example, I have used two built-in data frames (mtcars and iris):
x <- list(mtcars, iris)
lapply(x, function(x) cor(x[, 2], x[, 3]))
[[1]]
[1] 0.9020329
[[2]]
[1] -0.4284401
Or, if you want the vector output:
sapply(x, function(x) cor(x[, 2], x[, 3]))
[1] 0.9020329 -0.4284401
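Regarding the edit about the empty CSV: a small wrapper that falls back to 0 when a data frame has no rows (u being the asker's list of data frames) should do it:
safe_cor <- function(df) if (nrow(df) == 0) 0 else cor(df[, 2], df[, 3])
sapply(u, safe_cor)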
