lapply(rep(list(sample(1:100)), 10), sort, partial = 1:10)
... is what I'm trying to do, but the partial = 1:10 argument is passed in full to every call rather than incremented. What is the simplest way to evaluate the list of ten sample(1:100) vectors with ten i++-style values passed to partial?
I apologize for being inarticulate.
A follow-up question is whether there is a more efficient method of generating these samples. What might this look like in a single custom function?
Thank you.
To answer your original question: you can use mapply, the multivariate version of sapply/lapply, e.g.
mapply(sort, x = list(sample(1:100, 10)), partial = 1:10)
If you want to return a list, set SIMPLIFY to FALSE.
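For the exact setup in the question (the same shuffled vector repeated ten times, with partial running from 1 to 10), a minimal sketch might look like this:

# One shuffled permutation of 1:100, repeated ten times:
samples <- rep(list(sample(1:100)), 10)
# Each call pairs samples[[i]] with partial = i; SIMPLIFY = FALSE keeps a list:
res <- mapply(sort, samples, partial = 1:10, SIMPLIFY = FALSE)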
A little introduction to the question:
I am developing an ecophysiological model, and I use a Reference Class list called S that stores every object the model needs for input/output (e.g. meteo data, physiological parameters, etc.).
This list contains 5 objects (see example below):
- two data.frames, S$Table_Day (the outputs from the model) and S$Met_c (the input meteo data), which both have variables in columns and observations (input or output) in rows
- a list of parameters, S$Parameters
- a matrix
- a vector
The model runs many functions with a daily time step. Each day is computed in a for loop that runs from the first day i = 1 to the last day i = n. The list is passed to functions that often take data from S$Met_c and/or S$Parameters as input and compute something that is stored in S$Table_Day by index (the ith day). S is a Reference Class object because Reference Classes avoid copy-on-modify, which is very important considering the number of computations.
The question itself:
As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two ways to store my data: storing data by index into one of the preallocated data.frames is slower than storing it into an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations when modifying by index (probably checking the length, type, etc.).
My question is: is there a better way to perform such operations? In other words, is there a more efficient way to use/store the inputs/outputs (in a data.frame, a list of vectors, or something else) to keep track of all the computations of each day? For example, would it be better to use many vectors (one per variable) and regroup them into more complex objects (e.g. a list of data.frames) at the end?
By the way, am I right to use Reference Classes to avoid copying the big object S when passing it to functions and modifying it from within them?
Reproducible example for the comparison:

# Packages needed for the benchmark and the plot:
library(microbenchmark)
library(ggplot2)

SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame",
                                             PerCohortFruitDemand_c = "matrix",
                                             Parameters = "list",
                                             Zero_then_One = "vector"))
S <- SimulationClass$new()

# Initializing the tables with dummy numbers:
S$Table_Day <- data.frame(one = 1:10000, two = rnorm(n = 10000),
                          three = runif(n = 10000), Bud_dd = rep(0, 10000))
S$Met_c <- data.frame(DegreeDays = rnorm(n = 10000, mean = 10, sd = 1))

# f1 stores the result in an undeclared vector, f2 by index into the
# preallocated data.frame:
f1 <- function(i){
  a <- cumsum(S$Met_c$DegreeDays[i:(i - 1000)])
}
f2 <- function(i){
  S$Table_Day$Bud_dd[(i - 1000):i] <- cumsum(S$Met_c$DegreeDays[i:(i - 1000)])
}

res <- microbenchmark(f1(1000), f2(1000), times = 10000)
autoplot(res)
And the result (the autoplot figure is not reproduced here; it showed f2, the indexed data.frame assignment, as the slower of the two):
Also if someone has any experience in programming such models, I am deeply interested in any advice for model development.
I read more about the question, and I'll write down here, for posterity, some of the solutions that were proposed in other posts.
Apparently, reading and writing are both worth considering when trying to reduce the computation time of assignment to a data.frame by index.
The sources are all found in other discussions:
- How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)
- Faster i, j matrix cell fill
- Time in getting single elements from data.table and data.frame objects
Several solutions appeared relevant:
- Use a matrix instead of a data.frame if possible, to leverage in-place modification (Advanced R).
- Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
- Write functions in C++ and use Rcpp (from this source).
- Use .subset2 to read instead of [ (third source).
- Use data.table, as recommended by @JulienNavarre and @Emmanuel-Lin and the different sources, and use either set for a data.frame or := if using a data.table is not a problem.
- Use [[ instead of [ when possible (indexing by a single value only). This one is not very effective and very restrictive, so I removed it from the following comparison.
Here is the analysis of performance using the different solutions:
The code:
# Loading packages:
library(data.table)
library(microbenchmark)
library(ggplot2)

# Creating dummy data:
SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame",
                                             PerCohortFruitDemand_c = "matrix",
                                             Parameters = "list",
                                             Zero_then_One = "vector"))
S <- SimulationClass$new()
S$Table_Day <- data.frame(one = 1:10000, two = rnorm(n = 10000),
                          three = runif(n = 10000), Bud_dd = rep(0, 10000))
S$Met_c <- data.frame(DegreeDays = rnorm(n = 10000, mean = 10, sd = 1))

# Transforming the data objects into simpler forms:
mat <- as.matrix(S$Table_Day)
Slist <- as.list(S$Table_Day)
Metlist <- as.list(S$Met_c)
MetDT <- as.data.table(S$Met_c)
SDT <- as.data.table(S$Table_Day)

# Setting up the functions for the tests:
f1 <- function(i){ # reference: base assignment by index into the data.frame
  S$Table_Day$Bud_dd[i] <- cumsum(S$Met_c$DegreeDays[i])
}
f2 <- function(i){ # matrix instead of data.frame
  mat[i, 4] <- cumsum(S$Met_c$DegreeDays[i])
}
f3 <- function(i){ # matrix + .subset2 for reading
  mat[i, 4] <- cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f4 <- function(i){ # list for writing, .subset2 for reading
  Slist$Bud_dd[i] <- cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f5 <- function(i){ # lists for both reading and writing
  Slist$Bud_dd[i] <- cumsum(Metlist$DegreeDays[i])
}
f6 <- function(i){ # data.table::set on the data.frame
  set(S$Table_Day, i = as.integer(i), j = "Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
}
f7 <- function(i){ # data.table::set + data.table for the cumulative sum
  set(S$Table_Day, i = as.integer(i), j = "Bud_dd", MetDT[i, cumsum(DegreeDays)])
}
f8 <- function(i){ # data.table :=
  SDT[i, Bud_dd := MetDT[i, cumsum(DegreeDays)]]
}

i <- 6000:6500
res <- microbenchmark(f1(i), f2(i), f3(i), f4(i), f5(i), f6(i), f7(i), f8(i),
                      times = 10000)
autoplot(res)
And the resulting autoplot (figure not reproduced here):
With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 using the combination of .subset2 and a matrix, f4 using a list and .subset2, f5 using two lists (for both reading and writing), f6 using data.table::set, f7 using data.table::set plus data.table for the cumulative sum, and f8 using data.table's :=.
As we can see, the best solution is to use lists for both reading and writing. It is pretty surprising to see that data.table is the worst solution; I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.
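For reference, here is a minimal sketch of how that winning list-based pattern could wrap the daily loop, assuming the loop only ever touches whole columns so the data.frame needs to exist again only at the very end (object names reuse the dummy data above):

# Work on plain lists of vectors inside the loop:
Slist <- as.list(S$Table_Day)
Metlist <- as.list(S$Met_c)
for (i in 1001:10000) {
  Slist$Bud_dd[(i - 1000):i] <- cumsum(Metlist$DegreeDays[(i - 1000):i])
}
# Regroup into a data.frame once, at the end:
S$Table_Day <- as.data.frame(Slist)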
I keep running into situations where I want to dynamically create variables using a for loop (or a similar / more efficient construct, perhaps using dplyr). However, it's unclear to me how to do it right now.
For example, the code below shows a construct that I would intuitively expect to generate 10 variables assigned the numbers 1:10, but it doesn't work:
for (i in 1:10) {paste("variable",i,sep = "") = i}
The error:
Error in paste("variable", i, sep = "") = i :
target of assignment expands to non-language object
Any thoughts on what method I should use to do this? I assume there are multiple approaches (including a more efficient dplyr method). Full disclosure: I'm relatively new to R and really appreciate the help. Thanks!
I've run into this problem myself many times. The solution is the assign command.
for (i in 1:10){
  assign(paste("variable", i, sep = ""), i)
}
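If you later need all of those scattered variables back in one object, base R's mget collects them by name; a small sketch:

# Gather variable1 ... variable10 into a named list:
vals <- mget(paste0("variable", 1:10))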
If you wanted to get everything into one vector, you could use sapply. The following code would give you a vector from 1 to 10, and the names of each item would be "variable i," where i is the value of each item. This may not be the prettiest or most elegant way to use the apply family for this, but I think it ought to work well enough.
var.names <- function(x){
  a <- x
  names(a) <- paste0("variable", x)
  return(a)
}
variables <- sapply(X = 1:10, FUN = var.names)
This sort of approach seems to be favored because it keeps all of those variables tucked away in one object, rather than scattered all over the global environment. This could make calling them easier in the future, preventing the need to use get to scrounge up variables you'd saved.
No need to use a loop; you can create a character expression with paste0, then turn it into an unevaluated expression with parse, and finally evaluate it with eval.
eval(parse(text = paste0("variable", 1:10, "=", 1:10, collapse = ";")))
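After evaluation, the ten variables exist in the calling environment, for example:

variable7
# [1] 7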
The code you have is really no more useful than a vector of elements:
x <- 1
for (i in 2:10){
  x <- c(x, i)
}
(Obviously, this example is trivial; you could just use x <- 1:10 and be done. I assume there's a reason you need to do non-vectorized calculations on each variable.)
Let's say I have a function that accepts a vector of parameters and returns a vector of results (of the same length). And let's say I want to call this function 100 times, always with the same parameter: a 100-element vector of 1s, ideally getting a list of vectors as a result.
The first thing that came to my mind was to use lapply, specifically to call lapply on a list of vectors. My testing on smaller data proved that it should work and that it returns data in the required format. The problem is that I'm unable to generate the list of vectors I need as the argument.
All I found online was how to generate a vector, which doesn't help me much as I already know how to do that. The problem is how to generate a list out of these vectors (using list(rep(1, 100), rep(1, 100), ...) is out of the question, as I'd have to repeat the rep(1, 100) part a hundred times).
The quickest way to do this is to use R's built in replicate function, like so:
replicate(100, rep(1, 100), simplify = FALSE)
where rep(1, 100) gets replaced by the vector you actually want a list of 100 copies of. An equivalent statement would be to use lapply and an anonymous function, like so:
lapply(1:100, function(x){ rep(1, 100) })
Essentially, what this is doing is writing a function that takes its input, throws it away, and outputs your vector of choice. In fact, that's not much different than what replicate does under the hood, according to the documentation:
replicate is a wrapper for the common use of sapply for repeated evaluation of an expression
The only difference from the standard use of replicate is that, by default, replicate returns your list of vectors simplified to an array. But as you can see it's easy enough to force it not to do that by passing simplify = FALSE.
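To make that default concrete, here is a small sketch contrasting the two shapes:

str(replicate(3, rep(1, 5)))                   # simplified: a 5 x 3 matrix
str(replicate(3, rep(1, 5), simplify = FALSE)) # a list of 3 numeric vectors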
I store my data in a multilevel nested list. This list may look like this:
my.list = list()
my.list[[1]] = list("naive" = list("a"=c(1,1)) )
my.list[[2]] = list("naive" = list("b"=c(2,1)) )
my.list[[3]] = list("naive" = list("c"=c(3,1)) )
my.list[[4]] = list("naive" = list("d"=c(4,1)) )
my.list[[5]] = list("naive" = list("e"=c(5,1)) )
Now I want to do some operations on the stored values, like selecting the first elements of the 2-d vectors and combining them into another vector. Of course this can be done like the following:
foo = c(my.list[[1]][["naive"]][[1]][1],
my.list[[2]][["naive"]][[1]][1],
my.list[[3]][["naive"]][[1]][1],
my.list[[4]][["naive"]][[1]][1],
my.list[[5]][["naive"]][[1]][1])
But this is really awkward if I have many more than 5 first-level sublists, and I have the same feeling about for-looping through K in my.list[[K]][["naive"]][[1]][1], K = 1:5.
The suggestions provided for similar questions using the neat mapply, Reduce, or sapply cannot be applied directly due to the presence of the "naive" second-level sublist.
Can someone help me with a clever and instructive solution where values stored according to such a multilevel list structure can be concatenated, added to each other, or both? In the example above, the final desired result could be adding the respective components of all vectors and storing the result in a new 2-d vector, which in this case would be foo = c(15,5). Your answer would help me a lot.
Yeah, sapply will definitely do the trick for you in this situation:
sapply(my.list, function(x) x[["naive"]][[1]][1])
And you can generalize from here or sapply/lapply over the indices of the naive vector if you need to.
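For the second half of the question (adding the respective components across all sublists), one possibility is to extract each 2-d vector and fold the collection with Reduce; on the example data this yields foo = c(15, 5):

# Pull out every "naive" vector, then add them component-wise:
vecs <- lapply(my.list, function(x) x[["naive"]][[1]])
foo <- Reduce(`+`, vecs)
foo
# [1] 15  5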
I have a list of samples, each of varying lengths. I need to compare sample means (using a Mann-Whitney-Wilcoxon test) for all samples in the list. Current code is as follows:
wilcox.v = list() ## This creates the list of samples
for (i in df){
  treat = list(i$treatment)
  wilcox.v = c(wilcox.v, treat)
}

### This *should* iterate over all items in the list
wilcox = sapply(wilcox.v, function(i){
  wilcox.test(as.numeric(wilcox.v[i,]), as.numeric(wilcox.v[-i,]),
              exact = FALSE)$p.value
})
I'd like to have the function return a vector of p-values, so that the broader function can re-sample if necessary.
The problem seems to lie in the need to compare a sample mean to all other sample means in the list.
I'm sure there's an easy way to do this (and I think it has something to do with calling indices correctly), but I'm not sure!
As @joran said, you wrote your apply function a little wonky. There are two ways you can fix this.
1. Modify it so that i is in fact an index reference:
wilcox = sapply(seq_along(wilcox.v),
                function(i){
                  # Note: wilcox.v[[-i]] errors on lists with more than two
                  # elements; wilcox.v[-i] plus unlist pools all other samples.
                  wilcox.test(as.numeric(wilcox.v[[i]]),
                              as.numeric(unlist(wilcox.v[-i])),
                              exact = FALSE)$p.value
                })
2. Modify your function so it appropriately treats i as a list element. I'll leave this as an exercise to you (primarily since I don't want to deal with the wilcox.v[-i,] term).
Thanks for your help! This is the solution I ended up using. It's hardly elegant, but it gets the job done.
# mannwhit.v holds the list of samples (the wilcox.v built above, renamed):
mannwhit = vector()
for (i in mannwhit.v){
  for (j in mannwhit.v){
    if (identical(i, j) == FALSE){
      p.val = wilcox.test(i, j, paired = FALSE)$p.value
      mannwhit = c(mannwhit, p.val)
    }
  }
}
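For what it's worth, a more compact sketch of the same idea uses combn to enumerate the sample pairs (assuming mannwhit.v is a list of numeric samples); note that it tests each unordered pair once, whereas the double loop above tests every ordered pair twice:

# Indices of every unordered pair of samples:
pairs <- combn(length(mannwhit.v), 2)
# One p-value per pair:
mannwhit <- apply(pairs, 2, function(k)
  wilcox.test(mannwhit.v[[k[1]]], mannwhit.v[[k[2]]], paired = FALSE)$p.value)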