library(xml2)
library(rvest)
datpackage <- paste0("dat",1:10)
for(i in 1:10){
assign(datpackage[i], runif(2))
}
datlist <- list(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10)
"datlist" is what I want, but is there easier way to make a list ?
datlist2 <- for (i in 1:10) {
list(paste0("dat",i))
}
datlist3 <- list(datpackage)
I've tried datlist2 and datlist3, but neither is the same as "datlist".
What should I do when I need to make a list out of thousands of data sets?
We can use paste with mget if the objects are already created:
datlist <- mget(paste0("dat", 1:10))
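For reference, mget returns a named list (names taken from the character vector), which may matter downstream; a quick check, assuming the dat1..dat10 objects created above:
# sketch: the result is a named list, one element per object
names(datlist)                 # "dat1" "dat2" ... "dat10"
identical(datlist$dat3, dat3)  # TRUE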
But if we need to create the list of random uniform numbers directly:
datlist <- replicate(10, runif(2), simplify = FALSE)
For creating lists with random numbers I would also suggest:
datlist2 <- lapply(vector("list", 10), function(x) {runif(2)})
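If you want the freshly generated elements to carry the same names as the individual objects (dat1, dat2, ...), one option is setNames; a small sketch (datlist_named is just an illustrative name):
# sketch: name the generated list elements to match dat1..dat10
datlist_named <- setNames(replicate(10, runif(2), simplify = FALSE),
                          paste0("dat", 1:10))
names(datlist_named)   # "dat1" "dat2" ... "dat10"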
Benchmarking
It may be worth adding that the lapply/vector approach appears to be faster:
funA <- function(x) {replicate(10, runif(2), simplify = FALSE)}
funB <- function(x) {lapply(vector("list", 10), function(x) {runif(2)})}
microbenchmark::microbenchmark(funA(), funB(), times = 1e4)
Results
Unit: microseconds
expr min lq mean median uq max neval cld
funA() 24.053 27.3305 37.98530 28.6665 34.4045 2478.510 10000 b
funB() 19.507 21.6400 30.37437 22.9235 27.0500 2547.145 10000 a
Related
I have a list composed of nested lists. Each of these nested lists contains data frames that share the same columns. I want to merge the data frames within each nested list, maintaining the higher-order list.
I've tried doing this with lapply and do.call, but it's taking far too long. In fact, I'm getting the following error:
Error: vector memory exhausted (limit reached?)
my.list <- replicate(100, replicate(10, data.frame(a = 1:5, b = 6:10), simplify = F), simplify = F)
my.list <- lapply(my.list, function(l) do.call("rbind", l))
This gives me exactly the data structure I want, but it runs far too slowly with large data.
Another option would be to use purrr::map with dplyr::bind_rows:
library(purrr)
library(dplyr)
map(my.list, bind_rows)
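As a quick sanity check (a sketch, reusing my.list and the libraries loaded above), both approaches should return one combined data frame per outer list element:
# sketch: 100 outer elements, each combining 10 inner data frames of 5 rows
out_rbind <- lapply(my.list, function(l) do.call("rbind", l))
out_dplyr <- map(my.list, bind_rows)
length(out_dplyr)      # 100
nrow(out_dplyr[[1]])   # 50
all.equal(out_rbind[[1]], out_dplyr[[1]], check.attributes = FALSE)  # TRUE (values match)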
Here is a microbenchmark comparison of the different methods:
library(purrr)
library(dplyr)
library(data.table)
library(microbenchmark)
res <- microbenchmark(
lapply_do_call_rbind = {
lapply(my.list, function(l) do.call("rbind", l))
},
map_bind_rows = {
map(my.list, bind_rows)
},
lapply_rbindlist = {
lapply(my.list, rbindlist)
}
)
#Unit: milliseconds
# expr min lq mean median uq
# lapply_do_call_rbind 46.104965 49.801469 54.567249 51.815901 54.085547
# map_bind_rows 3.257474 3.490079 4.055779 3.620804 4.002505
# lapply_rbindlist 9.446331 10.009678 11.429870 10.796956 12.252741
library(ggplot2)
autoplot(res)
I am trying to change the first row of every xts object contained within a list, but I can't seem to figure out the right lapply syntax to do this. I have tried:
b <- lapply(a, function(a) a[1, ] <- 1)
But this erases all the other rows' data. Does anyone know the right syntax to address the first row and modify it?
Thanks
Your anonymous function returns the result of a[1, ] <- 1 (i.e. just the assigned value), so the whole xts object is never returned or stored.
Use it like this:
b <- lapply(a, function(a) { a[1,] = 1; a })
Another way is to call the replacement function `[<-` directly:
b <- lapply(a, `[<-`, 1, TRUE, 1)
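For clarity, `[<-`(x, 1, TRUE, 1) is just the prefix (functional) form of x[1, TRUE] <- 1, so on a single object the two are equivalent; a small sketch with a toy xts object:
library(xts)
# sketch: both forms replace the whole first row with 1 and return the object
x <- xts(matrix(1:6, ncol = 2), order.by = Sys.Date() + 0:2)
y1 <- `[<-`(x, 1, TRUE, 1)   # prefix form
y2 <- x
y2[1, TRUE] <- 1             # same as y2[1, ] <- 1
identical(y1, y2)            # TRUE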
library(microbenchmark)
library(xts)
data(sample_matrix)
sample.xts <- as.xts(sample_matrix, descr='my new xts object')
a <- rep(list(sample.xts), 2000)
microbenchmark(assign = lapply(a, function(a) { a[1,] = 1; a }),
anon_assign = lapply(a, `[<-`, 1, TRUE, 1))
Unit: milliseconds
expr min lq mean median uq max neval
assign 33.50660 39.90533 58.75338 43.74316 88.39256 128.15991 100
anon_assign 29.95665 32.37879 44.80245 34.11000 38.87301 97.35795 100
Therefore, the `[<-` version is noticeably faster.
I am currently using the following code to merge >130 data frames, and it takes many hours to run (I never actually reached completion on the full dataset, only on subsets). Each table has two columns: unit (string) and counts (integer). I am merging by unit.
tables <- lapply(files, function(x) read.table(x, col.names = c("unit", x)))
MyMerge <- function(x, y){
df <- merge(x, y, by="unit", all.x= TRUE, all.y= TRUE)
return(df)
}
data <- Reduce(MyMerge, tables)
Is there any way to speed this up easily? Each table/data frame has around 500,000 rows, and many of those rows are unique to that table, so merging multiple tables quickly pushes the merged data frame to many millions of rows.
At the end I will drop rows with too low summary counts from the big merged table, but I don't want to do that during merging, as the order of my files would then matter.
Here is a small comparison, first with a rather small dataset, then with a larger one:
library(data.table)
library(plyr)
library(dplyr)
library(microbenchmark)
# sample size:
n = 4e3
# create some data.frames:
df_list <- lapply(1:100, function(x) {
out <- data.frame(id = c(1:n),
type = sample(c("coffee", "americano", "espresso"),n, replace=T))
names(out)[2] <- paste0(names(out)[2], x)
out})
# transform dfs into data.tables:
dt_list <- lapply(df_list, function(x) {
out <- as.data.table(x)
setkey(out, "id")
out
})
# set options to outer join for all methods:
mymerge <- function(...) base::merge(..., by="id", all=T)
mydplyr <- function(...) dplyr::full_join(..., by="id")
myplyr <- function(...) plyr::join(..., by="id", type="full")
mydt <- function(...) merge(..., by="id", all=T)
# Compare:
microbenchmark(base = Reduce(mymerge, df_list),
dplyr= Reduce(mydplyr, df_list),
plyr = Reduce(myplyr, df_list),
dt = Reduce(mydt, dt_list), times=50)
This gives the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
base 944.0048 956.9049 974.8875 962.9884 977.6824 1221.5301 50 c
dplyr 316.5211 322.2476 329.6281 326.9907 332.6721 381.6222 50 a
plyr 2682.9981 2754.3139 2788.7470 2773.8958 2812.5717 3003.2481 50 d
dt 537.2613 554.3957 570.8851 560.5323 572.5592 757.6631 50 b
We can see that the two main contenders are dplyr and data.table. Regenerating the data with the sample size increased to n = 5e5 and rerunning the comparison shows that data.table indeed dominates. Note that I added this part after @BenBolker's suggestion.
microbenchmark(dplyr= Reduce(mydplyr, df_list),
dt = Reduce(mydt, dt_list), times=50)
Unit: seconds
expr min lq mean median uq max neval cld
dplyr 34.48993 34.85559 35.29132 35.11741 35.66051 36.66748 50 b
dt 10.89544 11.32318 11.61326 11.54414 11.87338 12.77235 50 a
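Applied to the original workflow (reading the >130 files and merging on unit), the data.table route could look roughly like the sketch below; it assumes files is the asker's vector of file paths and that each file has the two-column unit/count layout described in the question:
library(data.table)
# sketch: read each file as a keyed data.table, then full outer joins on "unit"
tables <- lapply(files, function(x) {
  out <- fread(x, col.names = c("unit", x))  # name the count column after the file
  setkey(out, unit)
  out
})
merged <- Reduce(function(x, y) merge(x, y, by = "unit", all = TRUE), tables)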
I have a list whose elements are of different lengths (sometimes of length 1).
I would like to apply sample to each element by using:
sapply(1:99, function(x) sample(mat[[x]], 1))
The problem, of course, is that whenever an element has length one, sample will choose from 1:x instead of always returning that single number.
Is there a way to force sample to return the element itself whenever its length is 1?
What is an alternative way to avoid this problem?
Since the 1:x behaviour is hard-coded into sample, the best option is just to use an if/else:
sapply(mat[1:99], function(x) if(length(x)==1) x else sample(x, 1))
You could use the example on the help page ?sample:
resample <- function(x, ...) x[sample.int(length(x), ...)]
Just use the above resample function in place of sample. Or rename it, modify it, etc. if you want it to work a little differently.
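A quick illustration of the difference (a sketch, using the resample() defined above):
# sketch: resample() leaves length-1 vectors alone, sample() treats them as 1:x
resample(5, 1)     # always 5
sample(5, 1)       # a random value from 1:5
resample(8:10, 2)  # two values drawn from 8:10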
To satisfy my own curiosity I did a quick benchmark of the suggestions so far:
library(microbenchmark)
mylist <- lapply( sample( rep( 1:10, 10 ) ), rpois, lambda=3 )
resample <- function(x, ...) x[sample.int(length(x), ...)]
sample1 <- function(x) x[sample.int(length(x), 1)]
ie1 <- function(x) if(length(x)==1) x else sample(x,1)
ie2 <- function(x) ifelse( length(x)==1, x, sample(x,1) )
rep1 <- function(x) { if( length(x) < 2 ) x <- rep(x,2); sample(x,1) }
(out <- microbenchmark(
sapply(mylist, resample, size=1),
sapply(mylist, sample1),
sapply(mylist, ie1),
sapply(mylist, ie2),
sapply(mylist, rep1)
))
With results:
Unit: microseconds
expr min lq median uq max neval
sapply(mylist, resample, size = 1) 360.846 388.1455 398.4085 409.4925 2036.169 100
sapply(mylist, sample1) 339.499 365.7720 375.8300 391.6345 1846.100 100
sapply(mylist, ie1) 493.853 534.2900 543.3205 561.3840 2091.589 100
sapply(mylist, ie2) 1225.397 1291.6955 1328.4365 1395.1455 3787.850 100
sapply(mylist, rep1) 566.926 614.3405 627.2720 649.4405 2178.209 100
Once you have the matrix vs. data frame question (or whatever it is) straightened out, here's a workaround I've used:
vec.len <- length(my_vector)
if (vec.len < 2) my_vector <- rep(my_vector, 2)
sample(my_vector, 1)
Suppose I have some object (any object), for example:
X <- array(NA,dim=c(2,2))
Also I have some list:
L <- list()
I want L[[1]], L[[2]], L[[3]], ..., L[[100]], ..., L[[1000]] all to contain the object X. That is, if I type L[[i]] into the console, it should return X, for any i in {1, 2, ..., 1000}.
How do I do this efficiently without relying on a for loop or lapply?
Make a list of length 1 and replicate it:
L <- rep(list(X), 1000)
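A quick check of the result (a sketch, assuming X and L from above):
length(L)                                      # 1000
all(vapply(L, identical, logical(1), y = X))   # TRUE: every element is X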
Using replicate, even though it is still a kind of loop solution:
L <- replicate(1000,X,simplify=FALSE)
EDIT: benchmarking the two solutions:
X <- array(NA,dim=c(2,2))
library(microbenchmark)
microbenchmark( rep(list(X), 10000),
replicate(10000,X,simplify=FALSE))
expr min lq median uq max neval
rep(list(X), 10000) 1.743070 2.114173 3.088678 5.178768 25.62722 100
replicate(10000, X, simplify = FALSE) 5.977105 7.573593 10.557783 13.647407 80.69774 100
rep is roughly 3 to 4 times faster here (comparing medians); I guess this is because replicate evaluates the expression at each iteration.