Fastest way to add elements in a list in R

zii <- list()
zii[[1]] <- c(1, 2, 3)
zii[[2]] <- c(1, 2, 3)
zii[[3]] <- c(1, 2, 3)
What is the best way to perform element-wise addition over the list, i.e.:
sum = c(1+1+1, 2+2+2, 3+3+3) = c(3, 6, 9)
I tried Reduce("+", zii) and it is slow. Any other suggestions?

I'm not sure whether this will be any faster, since data.frame() does a lot of validity checking:
> rowSums(data.frame(zii))
[1] 3 6 9
Could also try these if you get around to using microbenchmark. I'm guessing one of these will win, and my money would be on the second one:
> rowSums(do.call(cbind, zii))
[1] 3 6 9
> colSums(do.call(rbind, zii))
[1] 3 6 9
Looks like I lost my bet:
require(microbenchmark)
microbenchmark(Reduce("+", zii),
               rowSums(data.frame(zii)),
               rowSums(do.call(cbind, zii)),
               colSums(do.call(rbind, zii)))
#------------------------------------------------------
Unit: microseconds
                         expr     min       lq      mean   median       uq      max neval cld
             Reduce("+", zii)  26.975  28.1870  31.02119  30.0560  30.9695   57.958   100 a
    rowSums(data.frame(zii)) 730.933 744.9015 776.36775 753.5785 787.2765 1129.953   100   c
 rowSums(do.call(cbind, zii))  65.770  67.3800  71.94039  68.7050  70.1335  176.627   100  b
 colSums(do.call(rbind, zii))  61.202  62.8830  66.21362  64.1060  65.9130  127.259   100  b

Related

What is the fastest way to detect whether a vector has at least one non-NA element? (i.e., opposite to `base::anyNA()`)

As we learn from this answer, there's a substantial performance increase when using anyNA() over any(is.na()) to detect whether a vector has at least one NA element. This makes sense, as the algorithm of anyNA() stops after the first NA value it finds, whereas any(is.na()) has to first run over the entire vector with is.na().
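As a quick illustration of that early exit (timings are machine-dependent; only the shape of the difference matters):
x <- c(NA, runif(1e7))  # NA right at the front
microbenchmark::microbenchmark(
  anyNA(x),       # can stop at element 1
  any(is.na(x))   # is.na() still allocates and scans a full-length logical vector
)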
By contrast, I want to know whether a vector has at least one non-NA value, which means I'm looking for an implementation that stops after the first encounter with a non-NA value. Yes, I can use any(!is.na()), but then I face the same issue: is.na() runs over the entire vector first.
Is there a performant opposite equivalent to anyNA(), i.e., an "anyNonNA()"?
I'm not aware of a native function that stops if it comes across a non-NA value, but we can write a simple one using Rcpp:
Rcpp::cppFunction("bool any_NonNA(NumericVector v) {
  for (size_t i = 0; i < v.length(); i++) {
    if (!Rcpp::traits::is_na<REALSXP>(v[i])) return true;
  }
  return false;
}")
This creates an R function called any_NonNA which does what we need. Let's test it on a large vector of 100,000 NA values:
test <- rep(NA, 1e5)
any_NonNA(test)
#> [1] FALSE
any(!is.na(test))
#> [1] FALSE
Now let's make the first element a non-NA:
test[1] <- 1
any_NonNA(test)
#> [1] TRUE
any(!is.na(test))
#> [1] TRUE
So it gives the correct result, but is it faster?
Certainly in this example, since it stops after the first element, it should be much quicker. This is indeed the case in a head-to-head comparison:
microbenchmark::microbenchmark(
  baseR = any(!is.na(test)),
  Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 275.1 525.0 670.948 533.05 568.7 13029.9 100 b
#> Rcpp 1.6 2.1 4.319 3.30 5.1 33.7 100 a
As expected, this is a couple of orders of magnitude faster. What if our first non-NA value is midway through the vector?
test[1] <- NA
test[50000] <- 1
microbenchmark::microbenchmark(
  baseR = any(!is.na(test)),
  Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 332.1 579.35 810.948 597.95 624.40 12010.4 100 b
#> Rcpp 299.4 300.70 311.516 305.10 309.25 370.1 100 a
Still faster, but not by much.
If we put our non-NA value at the end we shouldn't see much difference:
test[50000] <- NA
test[100000] <- 1
microbenchmark::microbenchmark(
  baseR = any(!is.na(test)),
  Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 395.6 631.65 827.173 642.6 663.8 11357.0 100 a
#> Rcpp 596.3 602.25 608.011 605.8 612.6 632.6 100 a
So this does indeed look to be faster than the base R solution (at least for large vectors).
anyNA() seems to be a contribution from Google. Checking whether there are any NA values is far more common than the opposite, which presumably justifies the existence of that "special" function.
Here is my attempt, for numeric vectors only:
anyNonNA <- Rcpp::cppFunction('
  bool anyNonNA(NumericVector x) {
    for (double i : x) if (!Rcpp::NumericVector::is_na(i)) return true;
    return false;
  }
')
var <- rep(NA_real_, 1e7)
any(!is.na(var)) #FALSE
anyNonNA(var) #FALSE
var[5e6] <- 0
any(!is.na(var)) #TRUE
anyNonNA(var) #TRUE
microbenchmark::microbenchmark(any(!is.na(var)))
#Unit: milliseconds
# expr min lq mean median uq max neval
# any(!is.na(var)) 41.1922 46.6087 55.57655 59.1408 61.87265 74.4424 100
microbenchmark::microbenchmark(anyNonNA(var))
#Unit: milliseconds
# expr min lq mean median uq max neval
# anyNonNA(var) 10.6333 10.71325 11.05704 10.8553 11.2082 14.871 100
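For completeness: base R does have a functional that stops at the first match, Position(), though it loops at the R level and so is typically much slower per element than the Rcpp versions. A minimal sketch (anyNonNA_base is a hypothetical name):
# Position() returns the index of the first element where the predicate is
# TRUE (NA_integer_ if there is none), so it exits early, but it pays an
# R-level function call per element.
anyNonNA_base <- function(x) !is.na(Position(Negate(is.na), x))
anyNonNA_base(rep(NA_real_, 10)) #FALSE
anyNonNA_base(c(NA, NA, 1, NA))  #TRUE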

Subset list of vectors by position in a vectorized way

I have a list of vectors and I'm trying to select (for example) the 2nd and 4th element in each vector. I can do this using lapply:
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
[[1]]
[1] 2 4
[[2]]
[1] 9 7
[[3]]
[1] 2 4
[[4]]
[1] 9 7
[[5]]
[1] 2 4
But is there a way to do this in a vectorized way, avoiding the apply functions? My problem is that my actual list_of_vec is fairly long, so lapply takes a while.
Solutions:
Option 1: @Athe's clever solution using do.call:
do.call(rbind, list_of_vec)[, c(2, 4)]
Option 2: using lapply more efficiently:
lapply(list_of_vec, `[`, c(2, 4))
Option 3: a vectorized solution (the start offsets are the cumulative lengths of the preceding vectors, so this also works when the vectors have unequal lengths):
starts <- c(0, cumsum(lengths(list_of_vec)[-length(list_of_vec)]))
matrix(unlist(list_of_vec)[c(starts + 2, starts + 4)], ncol = 2)
Option 4: the lapply solution you wanted to improve:
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
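A quick sanity check that the four options agree on the example data (the list outputs are compared against the matrix ones via rbind):
o1 <- do.call(rbind, list_of_vec)[, c(2, 4)]
o2 <- lapply(list_of_vec, `[`, c(2, 4))
o3 <- matrix(unlist(list_of_vec)[c(starts + 2, starts + 4)], ncol = 2)
o4 <- lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2, 4)])
identical(o1, do.call(rbind, o2)) # TRUE
identical(o1, o3)                 # TRUE
identical(o2, o4)                 # TRUE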
Data:
And a few datasets I will test them on:
# The original data
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
# A long list with short elements
list_of_vec2 <- rep(list_of_vec, 1e5)
# A long list with long elements
list_of_vec3 <- lapply(list_of_vec, rep, 1e3)
list_of_vec3 <- rep(list_of_vec3, 1e4)
Benchmarking:
Original list:
Unit: microseconds
expr min lq mean median uq max neval cld
o1 2.276 2.8450 3.00417 2.845 3.129 10.809 100 a
o2 2.845 3.1300 3.59018 3.414 3.414 23.325 100 a
o3 3.698 4.1250 4.60558 4.267 4.552 20.480 100 a
o4 5.689 5.9735 17.52222 5.974 6.258 1144.606 100 a
Longer list, short elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 146.30778 146.88037 155.04077 149.89164 159.52194 184.92028 10 b
o2 185.40526 187.85717 192.83834 188.42749 190.32103 213.79226 10 c
o3 26.55091 27.27596 28.46781 27.48915 28.84041 32.19998 10 a
o4 407.66430 411.58054 426.87020 415.82161 437.19193 473.64265 10 d
Longer list, long elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 4855.59146 4978.31167 5012.0429 5025.97619 5072.9350 5095.7566 10 c
o2 17.88133 18.60524 103.2154 21.28613 195.0087 311.4122 10 a
o3 855.63128 872.15011 953.8423 892.96193 1069.7526 1106.1980 10 b
o4 37.92927 38.87704 135.6707 124.05127 214.6217 276.5814 10 a
Summary:
Looks like the vectorized solution wins out if the list is long and the elements are short, but lapply is the clear winner for a long list with longer elements. Some of the options output a list, others a matrix. So keep in mind what you want your output to be. Good luck!!!
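If you need the matrix output of options 1 or 3 as a list again, one route (assuming R >= 3.6, where asplit() was added) is to split the matrix by rows:
m <- do.call(rbind, list_of_vec)[, c(2, 4)]
asplit(m, 1) # one length-2 slice per original vector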
If your list is composed of vectors of the same length, you could first transform it into a matrix and then get the columns you want.
matrix_of_vec <- do.call(rbind,list_of_vec)
matrix_of_vec[ ,c(2,4)]
Otherwise I'm afraid you'll have to stick to the apply family. The most efficient way I found is, surprisingly, to use the parallel package to compute in parallel:
corenum <- parallel::detectCores() - 1
cl <- parallel::makeCluster(corenum)
parallel::clusterExport(cl, "list_of_vec")
parallel::parSapply(cl, list_of_vec, `[`, c(2, 4))
parallel::stopCluster(cl) # release the workers when done
In this piece of code, `[` is the name of the subsetting function and c(2, 4) is the argument passed to it.
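For reference, the same idiom works single-threaded with plain sapply, which simplifies the result to a matrix (one column per vector):
sapply(list_of_vec, `[`, c(2, 4))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    2    9    2    9    2
# [2,]    4    7    4    7    4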

Is there a good reason to use `sort` with `index.return = TRUE` instead of `order`?

sort has the argument index.return which is by default FALSE. If you set it to TRUE you get the ordering index... basically the same as when you use order.
My question
Are there cases where it makes sense to use sort with index.return = TRUE instead of order?
order simply gives the indices, whereas sort also gives the sorted values (and, with index.return=T, a list holding both):
x <- runif(10, 0, 100)
order(x)
# [1] 2 7 1 9 6 5 8 10 4 3
sort(x, index.return=T)
# $`x`
# [1] 0.08140348 0.18272011 0.23575252 0.51493537 0.64281259 0.92121388 0.93759670 0.96221375 0.97646916 0.97863369
#
# $ix
# [1] 2 7 1 9 6 5 8 10 4 3
It seems that order is a little faster for longer vectors:
x <- runif(10000000, 0, 100)
microbenchmark::microbenchmark(
  sort = sort(x, index.return = T),
  order = x[order(x)],
  times = 100
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# sort 63.48221 67.79530 78.33724 70.74215 74.10109 173.1129 100
# order 56.46055 57.18649 60.88239 58.29462 62.13086 155.5481 100
So you should probably pick sort with index.return = TRUE only if you need a list object to be returned. I can't find a case where sort beats order.
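A quick equivalence check between the two approaches (x has no ties here, so the ordering permutation is unique):
res <- sort(x, index.return = T)
identical(res$x, x[order(x)]) # TRUE: same sorted values
identical(res$ix, order(x))   # TRUE: same ordering index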
My suggestions are based on RLave's answer.
You could use the method argument, sort(x, method = "quick", index.return = TRUE), which might be a little faster than the default. Also, as a faster alternative for large vectors, you can build the same list yourself from order():
sort_order <- function(x) {
  indices <- order(x) # a sorting method can be chosen here too; the default is fine
  list("x" = x[indices], "ix" = indices)
}
Here are some benchmarks.
microbenchmark::microbenchmark(
  sort = s <- sort(x, index.return = T),
  "quick sort" = sq <- sort(x, method = "quick", index.return = T),
  "order sort" = so <- sort_order(x),
  times = 10
)
Unit: seconds
expr min lq mean median uq max neval
sort 1.493714 1.662791 1.737854 1.708502 1.887993 1.960912 10
quick sort 1.366938 1.374874 1.451778 1.444342 1.480122 1.668693 10
order sort 1.181974 1.344398 1.359209 1.369108 1.424569 1.461862 10
all.equal(so,sq)
[1] TRUE
all.equal(s,so)
[1] TRUE

How to execute a loop for a changing number of iterations in R

I have a function which takes a range, and I need to execute a while loop over that range. Below is the pseudo-code I wrote. Here I intend to read files from a sorted list; start = 4 and end = 8 would mean read files 4 through 8.
readFiles <- function(start, end) {
  i <- start
  while (i <= end) {
    # do something
    i <- i + 1
  }
}
I need to know how to do this in R. Any help is appreciated.
You can try this :
readFiles <- function(start, end) {
  for (i in start:end) {
    print(i) # example body: put the file-reading code here; it just
             # shows that the index starts at 4 and ends at 8
  }
}
readFiles(4,8)
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
As pointed out by mra68, 8:4 counts downwards, so if you don't want the function to do anything when start > end, you can add a guard:
readFiles2 <- function(start, end) {
  if (start <= end) {
    for (i in start:end) {
      print(i)
    }
  }
}
This version will not do anything for readFiles2(8, 4). With print(i) as the loop body, it is slightly faster than a while loop when start <= end, and also faster when start > end:
Unit: microseconds
expr min lq mean median uq max neval cld
readFiles(1, 10) 591.437 603.1610 668.4673 610.6850 642.007 1460.044 100 a
readFiles2(1, 10) 548.041 559.2405 640.9673 574.6385 631.333 2278.605 100 a
Unit: microseconds
expr min lq mean median uq max neval cld
readFiles(10, 1) 1.75 1.751 2.47508 2.10 2.101 23.098 100 b
readFiles2(10, 1) 1.40 1.401 1.72613 1.75 1.751 6.300 100 a
Here, readFiles2 is the if ... for solution above and readFiles is the while solution.
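The while-based readFiles used in that comparison is not shown in the answer; presumably it is the question's pseudo-code made runnable, something like:
readFiles <- function(start, end) {
  i <- start
  while (i <= end) {
    print(i)
    i <- i + 1
  }
}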

Adding maximum values from different levels to a new column in a data.frame

I have the following R problem. I ran an experiment in which I observed the speeds of some cars. I have a table of cars (where the number 1 means, for example, Porsche, 2 Volvo, and so on) and their speeds. A car could be observed more than once: for example, Porsche was observed three times, Volvo twice.
exp <- data.frame(car = c(1,1,1,2,2,3), speed = c(10,20,30,40,50,60))
I would like to add a third column in which, for every row, the maximum speed observed for that car appears. It would look like this:
exp <- data.frame(car = c(1,1,1,2,2,3), speed = c(10,20,30,40,50,60), maxSpeed = c(30,30,30,50,50,60))
The maximal observed speed for Porsche was 30, so every Porsche row gets maxSpeed = 30.
I know it should be possible with an apply/sapply function, but I have no idea how to implement it. Anyone? :)
@Arun, this is my result on a bigger sample (1,000 records). The ratio of the medians is now (actually) 0.82:
exp <- data.frame(car=sample(1:10, 1000, T),speed=rnorm(1000, 20, 5))
f1 <- function() mutate(exp, maxSpeed = ave(speed, car, FUN=max))
f2 <- function() transform(exp, maxSpeed = ave(speed, car, FUN=max))
library(microbenchmark)
library(plyr)
> microbenchmark(f1(), f2(), times=1000)
Unit: microseconds
expr min lq median uq max neval
f1() 551.321 565.112 570.565 589.9680 27866.23 1000
f2() 662.933 683.138 689.552 713.7665 28510.24 1000
The plyr documentation itself says "mutate seems to be considerably faster than transform for large data frames."
However, for this case, you're probably right. If I enlarge the sample:
> exp <- data.frame(car=sample(1:1000, 100000, T),speed=rnorm(100000, 20, 5))
> microbenchmark(f1(), f2(), times=100)
Unit: milliseconds
expr min lq median uq max neval
f1() 37.92438 39.00056 40.66607 41.18115 77.41645 100
f2() 39.47731 40.28650 43.11927 43.70779 78.34878 100
The ratio gets close to one. To be honest, I was quite sure about plyr's performance (I always rely on it in my code), hence my 'claim' in the comment. It probably performs better in other situations.
EDIT:
Using f3() from @Arun's comment:
> microbenchmark(f1(), f2(), f3(), times=100)
Unit: milliseconds
expr min lq median uq max neval
f1() 38.76050 39.57129 41.48728 42.14812 76.94338 100
f2() 40.38913 41.19767 44.12329 44.78782 79.94021 100
f3() 38.63606 39.58700 40.24272 42.04902 76.07551 100
Yep! Slightly faster... it moves less data?
Very straightforward with data.table:
library(data.table)
exp <- data.table(exp)
exp[, maxSpeed := max(speed), by = car] # := adds the column by reference, computed per car group
which gives:
exp
car speed maxSpeed
1: 1 10 30
2: 1 20 30
3: 1 30 30
4: 2 40 50
5: 2 50 50
6: 3 60 60
In base R, a one-liner with ave(), which computes the max within each car group and recycles it across that group's rows:
transform(exp, maxSpeed = ave(speed, car, FUN = max))
Another way, using the split<- replacement function:
exp$maxSpeed <- exp$speed # initialize the new column
# assign each group of maxSpeed its maximum (recycled across the group)
split(exp$maxSpeed, exp$car) <- lapply(split(exp$maxSpeed, exp$car), max)
exp
