I have a list of matrices of varying sizes. How can I quickly sum up all elements in the list of matrices?
This is my current code, but it is pretty slow. Is there a faster way?
for (i in 1: length(w)) {
w_sum <- w_sum + sum(apply(w[[i]], 1:2, function (x) x^2))
}
Matrices are just vectors under the hood, hence you can unlist, square and sum:
sum(unlist(w)^2)
#[1] 2393
Using @akrun's example data gives the same result.
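To see why unlisting works, here is a minimal sketch on a small made-up list (the matrices in `w_demo` are hypothetical, not the data from the question). It checks the vectorised one-liner against an explicit loop.

```r
# Hypothetical example data: three matrices of different sizes
w_demo <- list(matrix(1:4, nrow = 2),
               matrix(1:6, nrow = 3),
               matrix(1:2, nrow = 1))

# unlist() flattens every matrix into one long vector, so a single
# vectorised square-and-sum covers all elements at once
fast_sum <- sum(unlist(w_demo)^2)

# Reference result from an explicit loop over the list
loop_sum <- 0
for (m in w_demo) loop_sum <- loop_sum + sum(m^2)
```

Both `fast_sum` and `loop_sum` should hold the same value, since the dimensions of each matrix play no role in the sum.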
It's much quicker if you've got tonnes of matrices:
w <- rep(list(matrix(1:24,nrow=6)), 1e6)
system.time(sum(unlist(w)^2))
# user system elapsed
# 0.11 0.00 0.10
system.time(sum(vapply(w, function(x) sum(x^2), numeric(1))))
# user system elapsed
# 2.17 0.00 2.17
A plain for loop isn't dramatically slower than vapply here, either:
w_sum <- 0
system.time(for(i in seq_along(w)) { w_sum <- w_sum + sum(w[[i]]^2) } )
# user system elapsed
# 2.62 0.00 2.62
I have a very large unsorted vector A and a sorted vector B (relatively short).
A = runif(n = 1e6)
B = seq(0,1,by = 1e-3)
Now, given a direction 'forward' or 'backward', for each element in A, find the nearest element in B in that direction. E.g., for 'forward',
A2 = sapply(A, function(x) B[B>=x][1])
gives the result. However, this is too slow as sapply loops over A.
> system.time(sapply(A, function(x) B[B>=x][1]))
user system elapsed
17.93 0.00 17.93
Is there a way to do this much faster?
(It is guaranteed that min(B)<min(A) and max(B)>max(A), if this is useful)
The findInterval function solves this exact problem using binary search. Try this:
B[findInterval(A,B)+1]
A comparison:
set.seed(44)
A <- runif(n = 1e6)
B <- seq(0,1,by = 1e-3)
system.time(A2<-sapply(A, function(x) B[B>=x][1]))
# user system elapsed
# 18.058 0.000 15.606
system.time(A3<-B[findInterval(A,B)+1])
# user system elapsed
# 0.00 0.00 0.07
identical(A2,A3)
#[1] TRUE
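One caveat, sketched below on hypothetical toy vectors `A_demo` and `B_demo`: `findInterval(A, B) + 1` picks the first element of B strictly greater than each element of A, so it differs from `B[B >= x][1]` whenever an element of A coincides with an element of B. As far as I know, the `left.open = TRUE` argument of findInterval (available since R 3.3.0) closes the intervals on the right and covers that case. The 'backward' direction needs no offset at all.

```r
B_demo <- c(0, 0.25, 0.5, 0.75, 1)
A_demo <- c(0.1, 0.25, 0.6, 0.99)   # note that 0.25 is also an element of B_demo

# 'forward': smallest element of B that is >= each element of A
forward <- B_demo[findInterval(A_demo, B_demo, left.open = TRUE) + 1]

# 'backward': largest element of B that is <= each element of A
backward <- B_demo[findInterval(A_demo, B_demo)]
```

Here the tie at 0.25 maps to 0.25 in both directions, which matches what the sapply version would return.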
I would like to efficiently find all combinations of a list excluding the combination of each element to itself. For example, with a list of A,B,C,D find all combinations excluding A-A, B-B, C-C, D-D.
I can do this in what seems to be an inefficient way using this code:
x <- c("A","B","C","D")
dt <- CJ(x,x)
dt <- dt[!V1==V2]
The problem is that the third line takes about 4 times as long to run as the second line. So for a large list like my real data, lines 2 and 3 together can take a very long time.
I am using data.table 1.9.6, R 3.2.2, and R Studio on Windows 7.
Thanks so much.
Well, this is something of an improvement:
n = 1e4; x = seq(n)
# combn (variant of @Psidom's answer)
system.time({
cn = transpose(combn(x, 2, simplify=FALSE))
r = rbind( setDT(cn), rev(cn) )
})
# takes forever, so i cut it off
# op's code
system.time({
r0 = CJ(x,x)[V1 != V2]
})
# user system elapsed
# 1.69 0.63 1.50
# use indices in the final step
system.time({
r1 = CJ(x,x)[-seq(1L, .N, by=length(x)+1L)]
})
# user system elapsed
# 1.17 0.42 0.96
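The negative index in that last step relies on the layout of the cross join: with n values, the self-pairs sit at rows 1, n + 2, 2n + 3, and so on, i.e. every (n + 1)-th row. Here is a small base-R sketch of that layout, using rep() as a stand-in for CJ (all names are hypothetical):

```r
n <- 4L
v1 <- rep(seq_len(n), each = n)    # first column varies slowest, as in CJ(x, x)
v2 <- rep(seq_len(n), times = n)   # second column cycles fastest

# Rows where both columns hold the same value: 1, n + 2, 2n + 3, ...
self_rows <- seq(1L, n * n, by = n + 1L)

# Dropping those rows by position avoids comparing the two columns
pairs <- cbind(v1, v2)[-self_rows, ]
```

Removing rows by position skips the element-wise comparison `V1 != V2`, which is presumably where the saving comes from.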
And some more:
# build it manually
system.time({
xlen = length(x)
r2 = data.table(rep(x, each = xlen), V2 = x)[-seq(1L, .N, by=xlen+1L)]
})
# user system elapsed
# 3.03 0.60 2.79
# ... or ...
system.time({
xlen = length(x)
r2 = data.table(rep(x, each = xlen-1L), rep.int(x, xlen)[-seq(1L, xlen^2, by=xlen+1L)])
})
# user system elapsed
# 2.79 0.25 3.07
# build it manually special for the case of two cols
system.time({
r3 = setDT(list(x))[, .(V2 = x), by=V1][ -seq(1L, .N, by=length(x)+1L) ]
})
# user system elapsed
# 0.92 0.25 0.86
# ... or ...
system.time({
r4 = setDT(list(x))[, .(V2 = x[-.GRP]), by=V1]
})
# user system elapsed
# 0.85 0.32 1.19
# verify
identical(r0, r1) # TRUE
identical(setkey(r0, NULL), r2) # TRUE
identical(setkey(r0, NULL), r3) # TRUE
identical(setkey(r0, NULL), r4) # TRUE
Maybe you can do a little better by writing your own CJ with Rcpp. It might also be worth noting that everything is faster with integers (instead of characters):
x = rep(LETTERS, 5e2)
system.time(CJ(x,x))
# user system elapsed
# 7.06 1.81 6.61
x = rep(1:26, 5e2)
system.time(CJ(x,x))
# user system elapsed
# 3.39 0.88 2.95
So if x is a character vector, it might be best to use seq_along(x) for the combinatorial tasks and then map back to the character values like x[V1] afterwards.
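A minimal sketch of that index-then-map-back idea, written in base R rather than with CJ so that it runs without data.table (the names `x_chr`, `i`, `j` and `res` are made up for illustration):

```r
x_chr <- c("A", "B", "C", "D")
idx <- seq_along(x_chr)

# Build all ordered index pairs on cheap integers, dropping the self-pairs
i <- rep(idx, each = length(idx))
j <- rep(idx, times = length(idx))
keep <- i != j

# Map the integer indices back to the character values at the very end
res <- data.frame(V1 = x_chr[i[keep]], V2 = x_chr[j[keep]],
                  stringsAsFactors = FALSE)
```

With data.table the same pattern would build the pairs from `seq_along(x)` and then replace the columns via `x[V1]` and `x[V2]`.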
The R function
xts:::na.locf.xts
is extremely slow when used with a multicolumn xts of more than a few columns.
There is indeed a loop over the columns in the code of na.locf.xts
I am trying to find a way to avoid this loop.
Any idea?
The loop in na.locf.xts is slow because it creates a copy of the entire object for each column in the object. The loop itself isn't slow; the copies created by [.xts are slow.
There's an experimental (and therefore unexported) version of na.locf.xts on R-Forge that moves the loop over columns to C, which avoids copying the object. It's quite a bit faster for very large objects.
library(xts)
set.seed(21)
m <- replicate(20, rnorm(1e6))
is.na(m) <- sample(length(m), 1e5)
x <- xts(m, Sys.time()-1e6:1)
y <- x[1:1e5,1:3]
> # smaller objects
> system.time(a <- na.locf(y))
user system elapsed
0.008 0.000 0.008
> system.time(b <- xts:::.na.locf.xts(y))
user system elapsed
0.000 0.000 0.003
> identical(a,b)
[1] TRUE
> # larger objects
> system.time(a <- na.locf(x))
user system elapsed
1.620 1.420 3.064
> system.time(b <- xts:::.na.locf.xts(x))
user system elapsed
0.124 0.092 0.220
> identical(a,b)
[1] TRUE
timeIndex <- index(x)
x <- apply(x, 2, na.locf)
x <- as.xts(x, order.by = timeIndex)
This avoids the column-by-column data copying. Without this, when filling the nth column, you make a copy of 1 : (n - 1) columns and append the nth column to it, which becomes prohibitively slow when n is large.
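The column-wise pattern can be tried without xts on a plain matrix. The sketch below uses a hypothetical base-R helper `locf_vec()` in place of na.locf (it keeps leading NAs), so it illustrates the apply-over-columns idea rather than the xts implementation.

```r
# Hypothetical stand-in for na.locf: carry the last non-NA value forward
locf_vec <- function(v) {
  # position of the most recent non-NA value (0 if none has been seen yet)
  idx <- cummax(seq_along(v) * (!is.na(v)))
  idx[idx == 0] <- NA
  v[idx]
}

m_demo <- cbind(c(NA, 1, NA, 3, NA),
                c(5, NA, NA, 8, NA))

# Fill every column in one pass; apply() returns a plain matrix
filled <- apply(m_demo, 2, locf_vec)
```

Because apply() drops the time index, it has to be reattached afterwards, which is what the as.xts(..., order.by = timeIndex) step above does.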
What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:
sum( is.na( data ) ) > 0
But that requires examining each element, coercion, and the sum function.
As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:
x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#           expr        min         lq        mean      median         uq
#  any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#       anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
#  any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#       anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064
Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
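For a quick sanity check that the two forms agree, here is a tiny sketch on made-up vectors (`x_small` and `y_small` are hypothetical):

```r
x_small <- c(1.5, NA, 3)
y_small <- c(1.5, 2, 3)

# anyNA() gives the same answer as any(is.na()), without building
# the intermediate logical vector
has_na_x <- anyNA(x_small)
has_na_y <- anyNA(y_small)
```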
I'm thinking:
any(is.na(data))
should be slightly faster.
We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution, for two reasons:
a vectorised R solution still computes every single element of the vector expression;
if your goal is just to satisfy any(), you can abort after the first match, which is what our Rcpp sugar solution does (sugar is, in essence, some C++ template magic to make C++ expressions look more like R expressions; see this vignette for more).
So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.
Edit: The Rcpp package contains examples in the directory sugarPerformance. They show a speed-up of several thousand-fold for 'sugar-can-abort-soon' over 'R-computes-full-vector-expression' for any(), but I should add that that case does not involve is.na(), just a simple boolean expression.
One could write a for loop that stops at the first NA, but the system.time then depends on where the NA is (if there is none, it takes a long time):
set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)
nacount <- function(x) {
  # seq_along() is safe for zero-length input, unlike 1:length(x)
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      print(TRUE)
      break
    }
  }
}
system.time(
nacount(x)
)
[1] TRUE
user system elapsed
0.14 0.04 0.18
system.time(
any(is.na(x))
)
user system elapsed
0.28 0.08 0.37
system.time(
sum(is.na(x)) > 0
)
user system elapsed
0.45 0.07 0.53
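A possible middle ground, sketched here as an assumption rather than something benchmarked in this thread, is to scan the vector in chunks: each chunk is checked with vectorised code, but the scan can still stop early. The function name `any_na_chunked` and its default chunk size are hypothetical.

```r
any_na_chunked <- function(x, chunk_size = 1e5) {
  n <- length(x)
  if (n == 0) return(FALSE)
  for (start in seq(1, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1, n)
    # vectorised check within the chunk; stop at the first chunk with an NA
    if (any(is.na(x[start:end]))) return(TRUE)
  }
  FALSE
}

x_demo <- c(rep(1, 250), NA, rep(2, 250))
found <- any_na_chunked(x_demo, chunk_size = 100)
```

Whether this beats any(is.na(x)) depends on where the first NA sits and on the chunk size, so it would need timing on real data.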
Here are some actual times from my (slow) machine for some of the various methods discussed so far:
x <- runif(1e7)
x[1e4] <- NA
> system.time(sum(is.na(x)) > 0)
user system elapsed
0.065 0.001 0.065
> system.time(any(is.na(x)))
user system elapsed
0.035 0.000 0.034
> system.time(match(NA,x))
user system elapsed
1.824 0.112 1.918
> system.time(NA %in% x)
user system elapsed
1.828 0.115 1.925
> system.time(which(is.na(x) == TRUE))
user system elapsed
0.099 0.029 0.127
It's not surprising that match and %in% are similar, since %in% is implemented using match.
You can try:
d <- c(1,2,3,NA,5,3)
which(is.na(d) == TRUE, arr.ind=TRUE)
As a matter of best practices, I'm trying to determine if it's better to create a function and apply() it across a matrix, or if it's better to simply loop a matrix through the function. I tried it both ways and was surprised to find apply() is slower. The task is to take a vector and evaluate it as either being positive or negative and then return a vector with 1 if it's positive and -1 if it's negative. The mash() function loops and the squish() function is passed to the apply() function.
million <- as.matrix(rnorm(100000))
mash <- function(x){
for(i in 1:NROW(x))
if(x[i] > 0) {
x[i] <- 1
} else {
x[i] <- -1
}
return(x)
}
squish <- function(x){
if(x >0) {
return(1)
} else {
return(-1)
}
}
ptm <- proc.time()
loop_million <- mash(million)
proc.time() - ptm
ptm <- proc.time()
apply_million <- apply(million,1, squish)
proc.time() - ptm
loop_million results:
user system elapsed
0.468 0.008 0.483
apply_million results:
user system elapsed
1.401 0.021 1.423
What is the advantage to using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:
> class(apply_million)
[1] "numeric"
> class(loop_million)
[1] "matrix"
Which only deepens the mystery. The apply() function cannot accept a simple numeric vector and that's why I cast it with as.matrix() in the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector. And it returns an object of same class as that one passed to it.
The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the book keeping code needed with loops.
Lately, answers on stackoverflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.
In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast, (million > 0) * 2 - 1.
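A minimal sketch of why that expression works, on a small made-up vector `v`: the comparison yields TRUE/FALSE, which arithmetic coerces to 1/0, so multiplying by 2 and subtracting 1 maps TRUE to 1 and FALSE to -1.

```r
set.seed(1)
v <- rnorm(10)

# Logical values are coerced to 1/0, so this maps TRUE -> 1 and FALSE -> -1
signs_arith <- (v > 0) * 2 - 1

# Same result via ifelse(), for comparison
signs_ifelse <- ifelse(v > 0, 1, -1)
```

Like the mash() and squish() functions in the question, both forms map an exact zero to -1.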
As Chase said: Use the power of vectorization. You're comparing two bad solutions here.
To clarify why your apply solution is slower:
Within the for loop, you index the matrix as a plain vector, so no type conversion takes place. I'm glossing over the details here, but basically the internal calculation ignores the dimensions: they are just kept as an attribute and returned with the vector representing the matrix. To illustrate:
> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
Now, when you use apply, the matrix is split up internally into 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the result is combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
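The per-row overhead can be made visible with a small sketch that counts how often apply() invokes the function on a tiny one-column matrix (`m_small` and `n_calls` are made-up names):

```r
m_small <- as.matrix(c(-2, 0.5, 3))   # 3 x 1 matrix, like `million` but tiny

n_calls <- 0
res <- apply(m_small, 1, function(row) {
  n_calls <<- n_calls + 1             # count how often the function is invoked
  if (row > 0) 1 else -1
})
```

The function is called once per row, and the result comes back as a plain numeric vector without a dim attribute, which matches the class difference noticed in the question.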
The sapply function also first uses as.vector(unlist(...)) to convert its input to a vector, and at the end tries to simplify the answer into a suitable form. This takes time too, so sapply might also be slower here. On my machine, though, it isn't.
If apply were a solution here (and it isn't), you could compare:
> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=FALSE))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
You can use lapply or sapply on vectors if you want. However, why not use the appropriate tool for the job, in this case ifelse()?
> ptm <- proc.time()
> ifelse_million <- ifelse(million > 0,1,-1)
> proc.time() - ptm
user system elapsed
0.077 0.007 0.093
> all.equal(ifelse_million, loop_million)
[1] TRUE
And for comparison's sake, here are the two comparable runs using the for loop and sapply:
> ptm <- proc.time()
> apply_million <- sapply(million, squish)
> proc.time() - ptm
user system elapsed
0.469 0.004 0.474
> ptm <- proc.time()
> loop_million <- mash(million)
> proc.time() - ptm
user system elapsed
0.408 0.001 0.417
It is far faster in this case to do index-based replacement than either the ifelse(), the *apply() family, or the loop:
> million <- million2 <- as.matrix(rnorm(100000))
> system.time(million3 <- ifelse(million > 0, 1, -1))
user system elapsed
0.046 0.000 0.044
> system.time({million2[(want <- million2 > 0)] <- 1; million2[!want] <- -1})
user system elapsed
0.006 0.000 0.007
> all.equal(million2, million3)
[1] TRUE
It is well worth having all these tools at your fingertips. You can use the one that makes the most sense to you (as you need to understand the code months or years later) and then start to move to more optimised solutions if compute time becomes prohibitive.
Here is a better example of the speed advantage of a for loop:
for_loop <- function(x){
out <- vector(mode="numeric",length=NROW(x))
for(i in seq_along(out))
out[i] <- max(x[i,])
return(out)
}
apply_loop <- function(x){
apply(x,1,max)
}
million <- matrix(rnorm(1000000),ncol=10)
> system.time(apply_loop(million))
user system elapsed
0.57 0.00 0.56
> system.time(for_loop(million))
user system elapsed
0.32 0.00 0.33
EDIT
Version suggested by Eduardo.
max_col <- function(x){
x[cbind(seq(NROW(x)),max.col(x))]
}
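How this works: max.col() returns, for each row, the column index of the maximum, and indexing a matrix with a two-column (row, column) matrix extracts exactly those cells. Here is a small sketch on a made-up matrix. ties.method = "first" is used only to make the result deterministic, since the default breaks ties at random.

```r
m_demo <- matrix(c(1, 9, 4,
                   7, 2, 8,
                   3, 6, 5), nrow = 3, byrow = TRUE)

# Column position of each row's maximum
best_col <- max.col(m_demo, ties.method = "first")

# A two-column index matrix picks one cell per row: (1, 2), (2, 3), (3, 2)
row_max <- m_demo[cbind(seq_len(nrow(m_demo)), best_col)]
```

The result should match apply(m_demo, 1, max), but it is computed without calling an R function once per row.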
By row
> system.time(for_loop(million))
user system elapsed
0.99 0.00 1.11
> system.time(apply_loop(million))
user system elapsed
1.40 0.00 1.44
> system.time(max_col(million))
user system elapsed
0.06 0.00 0.06
By column
> system.time(for_loop(t(million)))
user system elapsed
0.05 0.00 0.05
> system.time(apply_loop(t(million)))
user system elapsed
0.07 0.00 0.07
> system.time(max_col(t(million)))
user system elapsed
0.04 0.00 0.06