Populate list with same object efficiently - r

Suppose I have some object (any object), for example:
X <- array(NA,dim=c(2,2))
Also I have some list:
L <- list()
I want L[[1]], L[[2]], L[[3]], ..., L[[1000]] all to contain the object X. That is, if I type L[[i]] into the console, it should return X, for any i in {1, 2, ..., 1000}.
How do I do this efficiently without relying on a for loop or lapply?

Make a list of length 1 and replicate it:
L <- rep(list(X), 1000)
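A quick sanity check (a minimal sketch, reusing the X from the question) that every element really is the 2x2 array:
identical(L[[1]], X)     # TRUE
identical(L[[1000]], X)  # TRUE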

Using replicate, even though it is still a kind of loop-based solution:
L <- replicate(1000,X,simplify=FALSE)
EDIT: benchmarking the two solutions:
X <- array(NA,dim=c(2,2))
library(microbenchmark)
microbenchmark(rep(list(X), 10000),
               replicate(10000, X, simplify = FALSE))
expr min lq median uq max neval
rep(list(X), 10000) 1.743070 2.114173 3.088678 5.178768 25.62722 100
replicate(10000, X, simplify = FALSE) 5.977105 7.573593 10.557783 13.647407 80.69774 100
rep is roughly three times faster here; I guess this is because replicate evaluates the expression X at each iteration.

Is using comment() from base to assign information to R object slowing the code down?

Is using comment() from base R to attach information to an R object slowing the code down?
That is, should it be used carefully?
Context: I have a function that creates several tibbles/dataframes that are saved in a list, and I'm thinking of attaching a comment to each dataframe (or just one comment to the entire list).
From the comment() documentation it seems that the function is just an interface to get/set a comment attribute on any R object. I can't see it becoming a burden in the vast majority of real-world use cases.
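As a minimal illustration of that interface (a sketch, not tied to the benchmark function below), comment() simply reads and writes the "comment" attribute of an object:
x <- data.frame(a = 1:3)
comment(x) <- "created for testing"
comment(x)
# [1] "created for testing"
attr(x, "comment")
# [1] "created for testing"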
To get an idea of how it behaves under load, I've written a simple function that generates n dataframes (2000 rows, 3 columns) and optionally annotates them. The results are appended to a list:
df_and_comment <- function(n, add_comment = FALSE) {
  res_list <- list()
  for (i in seq(1:n)) {
    x <- data.frame(
      x = rnorm(2000),
      y = rnorm(2000),
      z = rnorm(2000)
    )
    if (add_comment) {
      comment(x) <- sprintf("this is df no: %d", i)
    }
    res_list[[i]] <- x
  }
  res_list
}
Normal load - creating 50 dataframes
library(microbenchmark)
microbenchmark(
  df_and_comment(n = 50),
  df_and_comment(n = 50, add_comment = TRUE),
  times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
df_and_comment(n = 50) 25.34398 25.51473 26.70731 25.74472 25.97483 33.81251 10
df_and_comment(n = 50, add_comment = TRUE) 26.32009 26.39826 27.49835 26.60218 27.80038 32.47273 10
Heavy load - creating 15,000 dataframes
microbenchmark(
  df_and_comment(n = 15000),
  df_and_comment(n = 15000, add_comment = TRUE),
  times = 10
)
Unit: seconds
expr min lq mean median uq max neval
df_and_comment(n = 15000) 8.218535 8.254919 8.324075 8.317126 8.354637 8.469191 10
df_and_comment(n = 15000, add_comment = TRUE) 8.414405 8.561279 8.687380 8.571137 8.685309 9.591972 10
In both cases, the performance differences are completely negligible. I wouldn't worry about the performance implications of annotating dataframes/regression results iteratively.
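If you do annotate each dataframe, reading the comments back is equally cheap; a small sketch using the function above:
res <- df_and_comment(n = 3, add_comment = TRUE)
vapply(res, comment, character(1))
# [1] "this is df no: 1" "this is df no: 2" "this is df no: 3"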

Use apply() with a simple features (SF) function

I've written a function to calculate the maximum distance between a centroid and the edge of its polygon, but I can't figure out how to run it on each individual polygon of a simple features ("sf") data.frame.
library(sf)
distance.func <- function(polygon){
  max(st_distance(st_cast(polygon, "POINT"), st_centroid(polygon)))
}
If I test the function on a single polygon it works. (The warning messages are irrelevant to the current issue).
nc <- st_read(system.file("shape/nc.shp", package="sf")) # built in w/package
nc.1row <- nc[c(1),] # Just keep the first polygon
distance.func(nc.1row)
24309.07 m
Warning messages:
1: In st_cast.sf(polygon, "POINT") :
repeating attributes for all sub-geometries for which they may not be constant
2: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
st_centroid does not give correct centroids for longitude/latitude data
The problem is applying this function to the entire data.frame.
nc$distance <- apply(nc, 1, distance.func)
Error in UseMethod("st_cast") :
no applicable method for 'st_cast' applied to an object of class "list"
What can I do to run this function (or one like it) for each individual polygon in an object of class "sf"?
The problem here is that using apply-like functions directly on an sf object is "problematic", because the geometry column is a list-column, which does not interact well with "apply" constructs.
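A quick way to see this (an illustrative sketch): apply() coerces the sf data.frame to a matrix, so each "row" reaches the function as a plain list, which st_cast() has no method for; that is exactly the error shown above.
apply(nc[1:2, ], 1, class)
# every element is "list", not an sf/sfc object, so st_cast() cannot dispatch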
The simplest workaround could be to just use a for loop:
library(sf)
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>%
  st_transform(3857)
distance.func <- function(polygon){
  max(st_distance(st_cast(polygon, "POINT"), st_centroid(polygon)))
}
dist <- list()
for (i in seq_along(nc[[1]])) dist[[i]] <- distance.func(nc[i,])
head(unlist(dist))
# [1] 30185.34 27001.39 34708.57 52751.61 57273.54 34598.17
This works, but it is quite slow.
To be able to use apply-like functions, you need to pass only the geometry column of the object to the function. Something like this works:
library(purrr)
distance.func_lapply <- function(polygon){
  polygon <- st_sfc(polygon)
  max(st_distance(st_cast(polygon, "POINT"), st_centroid(polygon)))
}
dist_lapply <- lapply(st_geometry(nc), distance.func_lapply)
dist_map <- purrr::map(st_geometry(nc), distance.func_lapply)
all.equal(dist, dist_lapply)
# [1] TRUE
all.equal(dist, dist_map)
# [1] TRUE
Note however that I had to slightly modify the distance function by adding an st_sfc call, because otherwise you get a lot of "In st_cast.MULTIPOLYGON(polygon, "POINT") : point from first coordinate only" warnings and the results are not correct (I did not investigate the reason; apparently st_cast behaves differently on sfg objects than on sfc ones).
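One way to see that difference (an illustrative check): iterating over st_geometry(nc) yields bare sfg geometries, and st_sfc() wraps one back into an sfc, which is what the distance function then operates on.
class(st_geometry(nc)[[1]])          # "XY" "MULTIPOLYGON" "sfg"
class(st_sfc(st_geometry(nc)[[1]]))  # "sfc_MULTIPOLYGON" "sfc"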
In terms of speed, both the lapply and the map solutions outperform the for loop by almost an order of magnitude:
microbenchmark::microbenchmark(
  forloop = {for (i in seq_along(nc[[1]])) dist[[i]] <- distance.func(nc[i,])},
  map = {dist_map <- purrr::map(st_geometry(nc), distance.func_lapply)},
  lapply = {dist_lapply <- lapply(st_geometry(nc), distance.func_lapply)},
  times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
forloop 904.8827 919.5636 936.2214 920.7451 929.7186 1076.9646 10
map 122.7597 124.9074 126.1796 126.3326 127.6940 128.7551 10
lapply 122.9131 125.3699 126.9642 126.8100 129.3791 131.2675 10
There is another way to apply over simple features, albeit not really better than using a for loop. You can first create a list of simple features with lapply before applying your distance function.
distance.func <- function(polygon){
  max(st_distance(st_cast(polygon, "POINT"), st_centroid(polygon)))
}
distance.func.ls_sf <- function(sf){
  ls_sf <- lapply(1:nrow(sf), function(x, sf) {sf[x,]}, sf)
  dist <- lapply(ls_sf, distance.func)
}
dist_lapply_ls_sf <- distance.func.ls_sf(nc)
all.equal(dist, dist_lapply_ls_sf)
# [1] TRUE
The performance is almost as poor as the for loop, and it even seems that four years later (now R 4.1.1 with sf 1.0-3) it is almost two orders of magnitude worse (instead of one) than lapply or map on st_geometry(nc):
microbenchmark::microbenchmark(
  forloop = {for (i in seq_along(nc[[1]])) dist[[i]] <- distance.func(nc[i,])},
  map = {dist_map <- purrr::map(st_geometry(nc), distance.func_lapply)},
  lapply = {dist_lapply <- lapply(st_geometry(nc), distance.func_lapply)},
  ls_sf = {dist_lapply_ls_sf <- distance.func.ls_sf(nc)},
  times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
forloop 7726.9337 7744.7534 7837.6937 7781.2301 7850.7447 8221.2092 10
map 124.1067 126.2212 135.1502 128.4745 130.2372 182.1479 10
lapply 122.0224 125.6585 130.6488 127.9388 134.1495 147.9301 10
ls_sf 7722.1066 7733.8204 7785.8104 7775.5011 7814.3849 7911.3466 10
So it is a bad solution unless the function you are applying to the simple feature object takes much more time to compute than st_distance() does.
What if you need the attributes?
If your function needs both the geometries and the attributes of the sf object, using mapply is a good way to go. Here is an example computing the Sudden Infant Death density (SID/km²) using three methods:
a for loop
extracting each feature before using lapply
mapply
microbenchmark::microbenchmark(
  forLoop = {
    sid.density.for <- vector("list", nrow(nc))
    for (i in seq(nrow(nc))) sid.density.for[[i]] <- nc[i,][["SID74"]] / st_area(nc[i,]) / 1000^2
  },
  list_nc = {
    list_nc <- lapply(seq(nrow(nc)), function(x, nc) { nc[x,] }, nc)
    sid.density.lapply <- lapply(list_nc, function(x) { x[["SID74"]] / as.numeric(st_area(x)) / 1000^2 })
  },
  mapply = {
    sid.density.func <- function(geometry, attribute) { attribute / st_area(geometry) / 1000^2 }
    sid.density.mapply <- mapply(sid.density.func, st_geometry(nc), nc[["SID74"]], SIMPLIFY = FALSE)
  },
  times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
forLoop 4511.7203 4515.5997 4557.73503 4542.75200 4560.5508 4707.2877 10
list_nc 4356.3801 4400.5640 4455.35743 4440.38775 4475.2213 4717.5218 10
mapply 17.4783 17.6885 18.20704 17.99295 18.3078 20.1121 10
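If you then want the densities back on the sf object, it is one more assignment (a sketch based on the mapply result above):
nc$sid_density <- unlist(sid.density.mapply)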

Use lapply to modify the data of an xts contained on a list

I am trying to change the first row of each xts object contained in a list, but I can't figure out the lapply syntax to do this. I have tried:
b = lapply(a, function(a) a[1,]=1)
But this erases all the other rows' data. Does anyone know the right syntax to address the first row and modify it?
Thanks
Your anonymous function returns the value of a[1,] = 1 (which is just 1), so the whole modified xts object is never stored.
Use it like this:
b <- lapply(a, function(a) { a[1,] = 1; a })
Another way is to use [<- (anonymous assignment):
b <- lapply(a, `[<-`, 1, TRUE, 1)
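To make that call less cryptic (a sketch, assuming a is a list of xts objects as in the question): `[<-`(x, 1, TRUE, 1) is the functional spelling of x[1, TRUE] <- 1, i.e. set the first row, all columns, to 1.
x <- a[[1]]
y1 <- `[<-`(x, 1, TRUE, 1)   # functional form, as used inside lapply
y2 <- x
y2[1, TRUE] <- 1             # ordinary replacement syntax
identical(y1, y2)
# [1] TRUE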
library(microbenchmark)
library(xts)
data(sample_matrix)
sample.xts <- as.xts(sample_matrix, descr='my new xts object')
a <- rep(list(sample.xts), 2000)
microbenchmark(assign = lapply(a, function(a) { a[1,] = 1; a }),
               anon_assign = lapply(a, `[<-`, 1, TRUE, 3))
Unit: milliseconds
expr min lq mean median uq max neval
assign 33.50660 39.90533 58.75338 43.74316 88.39256 128.15991 100
anon_assign 29.95665 32.37879 44.80245 34.11000 38.87301 97.35795 100
Therefore, the anonymous-assignment version is consistently faster (about 22% on the median timings above).

Faster way to iterate over rows

I'm trying to divide each row of a dataframe by a number stored in a second mapping dataframe.
for(g in rownames(data_table)){
  print(g)
  data_table[g,] <- data_table[g,]/mapping[g,2]
}
However, this is incredibly slow, each row takes almost 1-2 seconds to run. I know iteration is usually not the best way to do things in R, but I don't know how else to do it. Is there any way I can speed up the runtime?
Try this:
sweep(data_table, 1, mapping[[2]], "/")
In terms of speed, here is a benchmark of the possibilities using the iris dataset, including your version:
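Note that this benchmark divides by the data's own second column rather than by mapping, and the setup code for test is not shown; it assumes something like the numeric columns of iris, for example:
test <- iris[, 1:4]   # assumption: keep only the numeric columns so row-wise division works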
microbenchmark::microbenchmark(
  A = {
    for(g in rownames(test)){
      # print(g)
      test[g,] <- test[g,]/test[g,2]
    }
  },
  B = sweep(test, 1, test[[2]], "/"),
  C = test / test[[2]],
  times = 100
)
#Unit: microseconds
#expr min lq mean median uq max neval
#A 82374.693 83722.023 101688.1254 84582.052 147280.057 157507.892 100
#B 453.652 484.393 514.4094 513.850 539.480 623.688 100
#C 404.506 423.794 456.0063 446.101 470.675 729.205 100
You can vectorize this operation if the two data.frames have the same number of rows:
dt <- data.frame(a = rnorm(100), b = rnorm(100))
mapping <- data.frame(x = rnorm(100), y = rnorm(100))
dt / mapping[,2]
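As a quick consistency check (a minimal sketch with the toy data above), sweep() and the direct vectorized division agree:
all.equal(sweep(dt, 1, mapping[, 2], "/"), dt / mapping[, 2])
# [1] TRUE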

Make List easier with For-Loop

library(xml2)
library(rvest)
datpackage <- paste0("dat",1:10)
for(i in 1:10){
  assign(datpackage[i], runif(2))
}
datlist <- list(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10)
"datlist" is what I want, but is there an easier way to make such a list?
datlist2 <- for (i in 1:10) {
  list(paste0("dat",i))
}
datlist3 <- list(datpackage)
I've tried datlist2 and datlist3, but they are not the same as "datlist".
What should I do when I need to make a list out of thousands of datasets?
We can use paste with mget if the objects are already created:
datlist <- mget(paste0("dat", 1:10))
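One small difference worth noting (a short sketch): mget() returns a named list, while the hand-built datlist is unnamed, so drop the names if you need an exact match.
names(datlist)
# [1] "dat1"  "dat2"  "dat3"  "dat4"  "dat5"  "dat6"  "dat7"  "dat8"  "dat9"  "dat10"
identical(unname(datlist), list(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10))
# [1] TRUE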
But, if we need to create a list of random uniform numbers from scratch:
datlist <- replicate(10, runif(2), simplify = FALSE)
For creating lists with random numbers I would also suggest:
datlist2 <- lapply(vector("list", 10), function(x) {runif(2)})
Benchmarking
It may be worth adding that the lapply/vector approach appears to be faster:
funA <- function(x) {replicate(10, runif(2), simplify = FALSE)}
funB <- function(x) {lapply(vector("list", 10), function(x) {runif(2)})}
microbenchmark::microbenchmark(funA(), funB(), times = 1e4)
Results
Unit: microseconds
expr min lq mean median uq max neval cld
funA() 24.053 27.3305 37.98530 28.6665 34.4045 2478.510 10000 b
funB() 19.507 21.6400 30.37437 22.9235 27.0500 2547.145 10000 a
