Why is `unlist(lapply)` faster than `sapply`?

If `unlist(lapply(...))` is faster than `sapply(...)`, why do we need sapply at all?
x <- list(a=1, b=1)
y <- list(a=1)
JSON <- rep(list(x,y),10000)
microbenchmark(sapply(JSON, function(x) x$a),
               unlist(lapply(JSON, function(x) x$a)),
               sapply(JSON, "[[", "a"),
               unlist(lapply(JSON, "[[", "a")))
Unit: milliseconds
expr min lq median uq max neval
sapply(JSON, function(x) x$a) 25.22623 28.55634 29.71373 31.76492 88.26514 100
unlist(lapply(JSON, function(x) x$a)) 17.85278 20.25889 21.61575 22.67390 78.54801 100
sapply(JSON, "[[", "a") 18.85529 20.06115 21.53790 23.42480 38.56610 100
unlist(lapply(JSON, "[[", "a")) 11.33859 11.69198 12.25329 13.37008 27.81361 100

In addition to running lapply, sapply runs simplify2array to try to fit the output into an array. To figure out whether that is possible, the function needs to check whether all the individual outputs have the same length; this is done via a costly unique(lapply(..., length)), which accounts for most of the time difference you were seeing:
b <- lapply(JSON, "[[", "a")
microbenchmark(lapply(JSON, "[[", "a"),
unlist(b),
unique(lapply(b, length)),
sapply(JSON, "[[", "a"),
sapply(JSON, "[[", "a", simplify = FALSE),
unlist(lapply(JSON, "[[", "a"))
)
# Unit: microseconds
# expr min lq median uq max neval
# lapply(JSON, "[[", "a") 14809.151 15384.358 15774.26 16905.226 24944.863 100
# unlist(b) 920.047 1043.719 1158.62 1223.091 8056.231 100
# unique(lapply(b, length)) 10778.065 11060.452 11456.11 12581.414 19717.740 100
# sapply(JSON, "[[", "a") 24827.206 25685.535 26656.88 30519.556 93195.751 100
# sapply(JSON, "[[", "a", simplify = FALSE) 14283.541 14922.780 15526.42 16654.058 26865.022 100
# unlist(lapply(JSON, "[[", "a")) 15334.026 16133.146 16607.12 18476.182 30080.544 100
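To confirm that the simplification step is the whole difference, note that with `simplify = FALSE` sapply skips `simplify2array` entirely and returns exactly the `lapply` result:

```r
# Rebuild a small version of the input from the question
x <- list(a = 1, b = 1)
y <- list(a = 1)
JSON <- rep(list(x, y), 10)

# With simplify = FALSE, sapply is just lapply plus argument handling
identical(sapply(JSON, "[[", "a", simplify = FALSE),
          lapply(JSON, "[[", "a"))
# [1] TRUE
```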

As droopy and Roland pointed out, sapply is a wrapper around lapply designed for convenience. sapply uses simplify2array, which is slower than unlist:
> microbenchmark(unlist(as.list(1:1000)), simplify2array(as.list(1:1000)), times=1000)
Unit: microseconds
expr min lq median uq max neval
unlist(as.list(1:1000)) 99.734 109.0230 113.912 118.3120 21343.92 1000
simplify2array(as.list(1:1000)) 892.712 931.0895 947.957 976.3125 22241.52 1000
Also, when the output is a matrix, sapply is slower than combinations of other base functions, for example:
a <- list(c(1,2,3,4), c(1,2,3,4), c(1,2,3,4))
microbenchmark(t(do.call(rbind, lapply(a, function(x)x))), sapply(a, function(x)x))
Unit: microseconds
expr min lq median uq max neval
t(do.call(rbind, lapply(a, function(x) x))) 29.823 30.801 32.512 33.734 94.845 100
sapply(a, function(x) x) 57.201 58.179 59.156 60.134 111.956 100
But especially in the second case, sapply is much easier to use.
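One more trade-off worth knowing (my addition, not part of the answers above): `vapply` keeps sapply-style convenience but is told the shape of each result up front, so it never has to inspect the outputs to decide whether simplification is possible, and it errors out early if any element has an unexpected type or length:

```r
x <- list(a = 1, b = 1)
y <- list(a = 1)
JSON <- rep(list(x, y), 10)

# FUN.VALUE = numeric(1) declares that every call returns one number,
# so vapply can allocate the result vector directly
a <- vapply(JSON, "[[", numeric(1), "a")
identical(a, unlist(lapply(JSON, "[[", "a")))
# [1] TRUE
```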

Related

How to substitute multiple words with spaces in R?

Here is an example:
drugs<-c("Lapatinib-Ditosylate", "Caffeic-Acid-Phenethyl-Ester", "Pazopanib-HCl", "D-Pantethine")
ads<-"These are recently new released drugs Lapatinib Ditosylate, Pazopanib HCl, and Caffeic Acid Phenethyl Ester"
What I want is to correct the drug names in ads using the names in drugs, so that the desired output would be:
"These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"
If you create a vector of the words to be replaced, you can loop over that vector and the vector of words to replace them (drugs), replacing all instances of one element in each iteration of the loop.
to_repl <- gsub('-', ' ', drugs)
for(i in seq_along(drugs))
ads <- gsub(to_repl[i], drugs[i], ads)
ads
# "These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"
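A small refinement, not part of the original answer: the patterns here are literal strings, so `fixed = TRUE` skips regex interpretation entirely, which is usually a bit faster and also safe should a name contain regex metacharacters:

```r
drugs <- c("Lapatinib-Ditosylate", "Caffeic-Acid-Phenethyl-Ester",
           "Pazopanib-HCl", "D-Pantethine")
ads <- "These are recently new released drugs Lapatinib Ditosylate, Pazopanib HCl, and Caffeic Acid Phenethyl Ester"

to_repl <- gsub('-', ' ', drugs, fixed = TRUE)
for (i in seq_along(drugs))
  ads <- gsub(to_repl[i], drugs[i], ads, fixed = TRUE)  # literal match, no regex
ads
# [1] "These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"
```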
Contrary to popular belief, for-loops in R are no slower than lapply
f_lapply <- function(ads){
to_repl <- gsub('-', ' ', drugs)
invisible(lapply(seq_along(to_repl), function(i) {
ads <<- gsub(to_repl[i], drugs[i], ads)
}))
ads
}
f_loop <- function(ads){
to_repl <- gsub('-', ' ', drugs)
for(i in seq_along(to_repl))
ads <- gsub(to_repl[i], drugs[i], ads)
ads
}
f_loop(ads) == f_lapply(ads)
# [1] TRUE
microbenchmark::microbenchmark(f_loop(ads), f_lapply(ads), times = 1e4)
# Unit: microseconds
# expr min lq mean median uq max neval
# f_loop(ads) 59.488 95.180 118.0793 107.487 120.205 7426.866 10000
# f_lapply(ads) 69.333 114.462 147.9732 130.872 152.205 27283.670 10000
Or, using more general examples:
loop_over <- 1:1e5
microbenchmark::microbenchmark(
for_loop = {for(i in loop_over) 1},
lapply = {lapply(loop_over, function(x) 1)}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# for_loop 4.66174 5.865842 7.725975 6.354867 7.449429 35.26807 100
# lapply 94.09223 114.378778 125.149863 124.665128 134.217326 170.16889 100
loop_over <- 1:1e5
microbenchmark::microbenchmark(
for_loop = {y <- numeric(1e5); for(i in seq_along(loop_over)) y[i] <- loop_over[i]},
lapply = {lapply(loop_over, function(x) x)}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# for_loop 11.00184 11.49455 15.24015 12.10461 15.26050 134.139 100
# lapply 71.41820 81.14660 93.64569 87.05162 98.59295 357.219 100
This can also be done using lapply(), which will be faster than a for loop. Modifying @IceCreamToucan's answer, it can be done with lapply as follows
to_repl <- gsub('-', ' ', drugs)
invisible(lapply(seq_along(to_repl), function(i) {
ads <<- gsub(to_repl[i], drugs[i], ads)
}))
# [1] "These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"
Microbenchmark
Unit: microseconds
expr min lq mean median uq max neval
lapply 80.514 87.4935 110.1103 93.304 96.1995 1902.861 100
for.loop 2285.164 2318.5665 2463.1554 2338.216 2377.4120 7510.763 100
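If you like the functional style but want to avoid `<<-` (which reaches out of the function to mutate `ads` in an enclosing environment), `Reduce` threads the partially substituted string through each step explicitly. A sketch of the same computation, not from the original answers:

```r
drugs <- c("Lapatinib-Ditosylate", "Caffeic-Acid-Phenethyl-Ester",
           "Pazopanib-HCl", "D-Pantethine")
ads <- "These are recently new released drugs Lapatinib Ditosylate, Pazopanib HCl, and Caffeic Acid Phenethyl Ester"

f_reduce <- function(ads) {
  to_repl <- gsub('-', ' ', drugs)
  # Fold over the indices, carrying the current string as the accumulator
  Reduce(function(txt, i) gsub(to_repl[i], drugs[i], txt, fixed = TRUE),
         seq_along(to_repl), init = ads)
}
f_reduce(ads)
# [1] "These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"
```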

dictionary and list comprehension in R

Is there any generic way of making the following R code faster? For example, in Python a dict comprehension (see the equivalent below) would be a nice, faster alternative.
R:
l1 <- 1:3
l2 <- c("a", "b", "c")
foo <- function(x) {return(5*x)}
bar <- list()
for (i in 1:length(l1)) bar[l2[i]] <- foo(l1[i])
Python
l1 = range(1, 4)
l2 = ["a", "b", "c"]
def foo(x):
    return 5*x
{b: foo(a) for a, b in zip(l1, l2)}
We're talking about speed, so let's do some benchmarking:
library(microbenchmark)
microbenchmark(op = {for (i in 1:length(l1)) bar[l2[i]] <- foo(l1[i])},
lapply = setNames(lapply(l1,foo),l2),
vectorised = setNames(as.list(foo(l1)), l2))
Unit: microseconds
expr min lq mean median uq max neval
op 7.982 9.122 10.81052 9.693 10.548 36.206 100
lapply 5.987 6.557 7.73159 6.842 7.270 55.877 100
vectorised 4.561 5.132 6.72526 5.417 5.987 80.964 100
But these small values don't mean much, so I pumped up the vector length to 10,000 where you'll really see a difference:
l <- 10000
l1 <- seq_len(l)
l2 <- sample(letters, l, replace = TRUE)
microbenchmark(op = {bar <- list(); for (i in 1:length(l1)) bar[l2[i]] <- foo(l1[i])},
lapply = setNames(lapply(l1,foo),l2),
vectorised = setNames(as.list(foo(l1)), l2),
times = 100)
Unit: microseconds
expr min lq mean median uq max neval
op 30122.865 33325.788 34914.8339 34769.8825 36721.428 41515.405 100
lapply 13526.397 14446.078 15217.5309 14829.2320 15351.933 19241.767 100
vectorised 199.559 259.997 349.0544 296.9155 368.614 3189.523 100
But tacking onto what everyone else said, it doesn't have to be a list. If you remove the list requirement:
microbenchmark(setNames(foo(l1), l2))
Unit: microseconds
expr min lq mean median uq max neval
setNames(foo(l1), l2) 22.522 23.8045 58.06888 25.0875 48.322 1427.417 100
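To round this off (my addition): whichever construction you pick, the named result behaves like a small dictionary, with `[[` as the lookup:

```r
l1 <- 1:3
l2 <- c("a", "b", "c")
foo <- function(x) 5 * x

# The R analogue of {b: foo(a) for a, b in zip(l1, l2)}
d <- setNames(as.list(foo(l1)), l2)
d[["b"]]
# [1] 10
```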

R *apply vector as input; matrix as output

I'd like to apply over each element of a vector, a function that outputs a vector.
After applying the function to each element of that vector, I should have many vectors, which I'd like to rbind in order to have a matrix.
The code should be equivalent to the following:
my_function <- function(x) x:(x+10)
my_vec <- 1:10
x <- vector()
for(i in seq_along(my_vec)){
  x <- rbind(x, my_function(my_vec[i]))
}
Of course, my_function and my_vec are just examples.
try:
tmp <- lapply(my_vec, my_function)
do.call(rbind, tmp)
or, like Heroka suggested, use sapply. I prefer lapply and then binding my output the way I like (rbind/cbind) instead of potentially having to transpose.
Here is an alternative:
matrix( unlist(lapply(my_vec,my_function)), length(my_vec), byrow=TRUE )
Speed is almost the same:
library(microbenchmark)
my_function <- function(x) sin(x:(x+10))
for ( n in 1:4 )
{
my_vec <- 1:10^n
print(
microbenchmark( mra68 = matrix( unlist(lapply(my_vec,my_function)), length(my_vec), byrow=TRUE ),
stas.g = do.call(rbind, lapply(my_vec, my_function)),
times = 1000 )
)
print("identical?")
print( identical( matrix( unlist(lapply(my_vec,my_function)), length(my_vec), byrow=TRUE ),
do.call(rbind, lapply(my_vec, my_function)) ) )
}
Unit: microseconds
expr min lq mean median uq max neval
mra68 38.496 40.307 68.00539 41.213 110.052 282.148 1000
stas.g 41.213 42.572 72.86443 43.930 115.939 445.186 1000
[1] "identical?"
[1] TRUE
Unit: microseconds
expr min lq mean median uq max neval
mra68 793.002 810.212 850.4857 818.3640 865.2375 7231.669 1000
stas.g 876.786 894.901 946.8165 906.2235 966.9100 7051.873 1000
[1] "identical?"
[1] TRUE
Unit: milliseconds
expr min lq mean median uq max neval
mra68 2.605448 3.028442 5.269003 4.020940 7.807512 14.51225 1000
stas.g 2.959604 3.390071 5.823661 4.500546 8.800462 92.54977 1000
[1] "identical?"
[1] TRUE
Unit: milliseconds
expr min lq mean median uq max neval
mra68 27.29810 30.99387 51.44223 41.20167 79.46185 559.0059 1000
stas.g 33.63622 37.22420 60.10224 49.07643 92.94333 395.3315 1000
[1] "identical?"
[1] TRUE
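Another alternative, not from the original answers: `vapply` with a declared per-element shape returns a matrix directly. Like `sapply` it binds results as columns, so transpose to get one row per input element:

```r
my_function <- function(x) x:(x + 10)
my_vec <- 1:10

# numeric(11) tells vapply each call yields 11 numbers, so the
# 11 x 10 result matrix is allocated up front; t() flips it to 10 x 11
m <- t(vapply(my_vec, my_function, numeric(11)))
dim(m)
# [1] 10 11
```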

Convert large-scale characters to date-format-like characters in R

I have a data frame df with 10 million rows. I want to convert the character format of the "birthday" column from "xxxxxxxx" to "xxxx-xx-xx", e.g. from "20051023" to "2005-10-23". I can use df$birthday <- lapply(df$birthday, as.Date, "%Y%m%d") to do that, but it wastes a lot of memory and computing time on the conversion. I only want a date-like character, not a Date object, so I tried the stringi package because it is implemented in C. Unfortunately, df$birthday <- stri_join(stri_sub(df$birthday, from=c(1,5,7), to=c(4,6,8)), collapse = "-") doesn't work, because the function doesn't support vector input this way. Is there any way to solve this problem? Thanks a lot.
Go with sub.
date <- c("20051023", "20151023")
sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1-\\2-\\3", date)
# [1] "2005-10-23" "2015-10-23"
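If you'd rather avoid regular expressions altogether, base `substr` and `paste` are themselves vectorised, which gives you the slicing approach the question was after without stringi (my sketch):

```r
date <- c("20051023", "20151023")

# substr slices every element at once; paste glues the pieces back together
paste(substr(date, 1, 4), substr(date, 5, 6), substr(date, 7, 8), sep = "-")
# [1] "2005-10-23" "2015-10-23"
```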
as.Date works on vectors
df$birthday <- format(as.Date(df$birthday, "%Y%m%d"), "%Y-%m-%d")
A vectorised function is much faster than apply
library(microbenchmark)
n <- 1e3
df <- data.frame(birthday = rep("20051023", n))
microbenchmark(
lapply(df$birthday, as.Date, "%Y%m%d"),
sapply(df$birthday, as.Date, "%Y%m%d"),
as.Date(df$birthday, "%Y%m%d")
)
Unit: microseconds
expr min lq mean median uq max neval cld
lapply(df$birthday, as.Date, "%Y%m%d") 22833.624 25340.118 29064.7188 28406.154 32346.245 58522.360 100 b
sapply(df$birthday, as.Date, "%Y%m%d") 24048.493 26252.660 29797.9074 28437.156 33119.381 47966.133 100 b
as.Date(df$birthday, "%Y%m%d") 431.469 447.719 481.5221 461.189 475.086 1984.158 100 a
A regular expression is of course even faster.
microbenchmark(
as.character(as.Date(df$birthday, "%Y%m%d")),
format(as.Date(df$birthday, "%Y%m%d"), "%Y-%m-%d"),
sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1-\\2-\\3", df$birthday)
)
Unit: microseconds
expr min lq mean
as.character(as.Date(df$birthday, "%Y%m%d")) 4923.189 5057.462 5390.313
format(as.Date(df$birthday, "%Y%m%d"), "%Y-%m-%d") 3428.657 3553.736 3697.660
sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1-\\2-\\3", df$birthday) 713.699 739.997 815.737
median uq max neval cld
5150.0420 5394.4265 8225.270 100 c
3594.7875 3665.9865 5753.200 100 b
763.0885 783.1865 2433.585 100 a
sub() works on matrices, but not on data.frames, hence the as.matrix:
df <- as.data.frame(matrix("20051023", ncol = 3, nrow = 3))
df$ID <- seq_len(nrow(df))
df[, 1:3] <- sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1-\\2-\\3", as.matrix(df[, 1:3]))
The matrix solution is faster than the for loop. The difference increases with the number of columns you need to loop over.
df <- as.data.frame(matrix("20051023", ncol = 20, nrow = 3))
df$ID <- seq_len(nrow(df))
library(microbenchmark)
microbenchmark(
matrix = df[, seq_len(ncol(df) - 1)] <- sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1-\\2-\\3", as.matrix(df[, seq_len(ncol(df) - 1)])),
forloop = {
for(i in seq_len(ncol(df) - 1)){
df[, i] <- sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1-\\2-\\3", df[, i])
}
}
)
Unit: microseconds
expr min lq mean median uq max neval cld
matrix 460.555 476.805 504.3012 494.1235 507.594 1122.522 100 a
forloop 1554.425 1590.774 1677.3038 1625.8390 1670.312 3563.845 100 b
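A middle ground, my addition rather than part of the answer: a data.frame is a list of columns, so `lapply` over the selected columns avoids both the explicit loop and the `as.matrix` round trip; assigning back with `df[cols] <-` keeps the data.frame shape:

```r
df <- as.data.frame(matrix("20051023", ncol = 3, nrow = 3))
df$ID <- seq_len(nrow(df))

cols <- seq_len(ncol(df) - 1)
# Each column is passed to sub() as its x argument; pattern and
# replacement are matched by name
df[cols] <- lapply(df[cols], sub,
                   pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
                   replacement = "\\1-\\2-\\3")
df[1, 1]
# [1] "2005-10-23"
```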

replace loop with an *pply alternative

I am trying to speed up my code by replacing some lookup loops with tapply (How to do vlookup and fill down (like in Excel) in R?) and I stumbled upon this piece of code:
DF<-data.frame(id=c(rep("A", 5),rep("B", 7),rep("C", 9)), series=NA, chi=c(letters[1:5], LETTERS[6:12], letters[13:21]))
for (i in unique(DF$id)){
DF$series[ DF$id==i ]<-1:length(DF$id[ DF$id==i ])
}
DF
Is it possible to replace this with an *apply family function? Or any other way to speed it up?
You may try ave:
DF$series <- ave(DF$id, DF$id, FUN = seq_along)
For larger data sets, dplyr is faster though.
library(dplyr)
fun_ave <- function(df) transform(df, series = ave(id, id, FUN = seq_along))
fun_dp <- function(df) df %>%
  group_by(id) %>%
  mutate(series = seq_along(id))
df <- data.frame(id= sample(letters[1:3], 100000, replace = TRUE))
microbenchmark(fun_ave(df))
# Unit: milliseconds
# expr min lq median uq max neval
# fun_ave(df) 38.59112 39.40802 50.77921 51.2844 128.6791 100
microbenchmark(fun_dp(df))
# Unit: milliseconds
# expr min lq median uq max neval
# fun_dp(df) 4.977035 5.034244 5.060663 5.265173 17.16018 100
Could also use data.table
library(data.table)
DT <- data.table(DF)
DT[, series_new := 1:.N, by = id]
and using tapply (note that this relies on the rows already being grouped by id, since unlist concatenates the per-group results in factor-level order)
DF$series_new <- unlist(tapply(DF$id, DF$id, function(x) 1:length(x)))
Extending @Henrik's comparison above, both data.table and dplyr are quite a bit faster for large data sets.
library(data.table)
library(dplyr)
df <- data.frame(id= sample(letters[1:3], 100000, replace = TRUE), stringsAsFactors = F)
dt <- data.table(df)
fun_orig <- function(df){
  for (i in unique(df$id)){
    df$series[df$id == i] <- 1:length(df$id[df$id == i])
  }
  df
}
fun_tapply <- function(df){
df$series <- unlist(tapply(df$id, df$id, function(x) 1:length(x)))
}
fun_ave <- function(df){
transform(df, series = ave(df$id, df$id, FUN = seq_along))
}
fun_dp <- function(df){
  df %>%
    group_by(id) %>%
    mutate(series = seq_along(id))
}
fun_dt <- function(dt) dt[, 1:.N, by = id]
microbenchmark(fun_dt(dt), times = 1000)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_dt(dt) 2.473253 2.597031 2.771771 3.76307 40.59909 1000
microbenchmark(fun_dp(df), times = 1000)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_dp(df) 2.71375 2.786829 2.914569 3.081609 40.48445 1000
microbenchmark(fun_orig(df), times = 1000)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_orig(df) 30.65534 31.93449 32.72991 33.88885 75.13967 1000
microbenchmark(fun_tapply(df), times = 1000)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_tapply(df) 56.67636 61.72207 66.37193 102.4189 124.6661 1000
microbenchmark(fun_ave(df), times = 1000)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_ave(df) 97.36992 103.161 107.5007 139.1362 157.9464 1000
