Fast extraction of rows in R

I have many binary matrices from which I want to extract every possible combination of three rows into a list. I then want to sum the columns of each of the extracted row combinations.
My current method is below, but it is extremely slow.
set.seed(123)
x <- matrix(sample(0:1, 110 * 609, replace = TRUE), 110, 609)
row.combinations <- t(combn(nrow(x),3))
extracted.row.combns <- lapply(1:nrow(row.combinations), FUN = function(y) x[c(row.combinations[y,1],row.combinations[y,2],row.combinations[y,3]),])
summed.rows <- lapply(extracted.row.combns, colSums)
How could this be sped up?

Using combn() and an inline function as the FUN argument, I can run this analysis in under 5 seconds on my machine:
combn(nrow(x), 3, FUN=function(r) colSums(x[r,]), simplify=FALSE)

We can make this faster with combnPrim from gRbase.
library(gRbase)
lapply(combnPrim(nrow(x), 3, simplify = FALSE), function(r) colSums(x[r,]))
Benchmarks
system.time(x1 <- combn(nrow(x), 3, FUN=function(r) colSums(x[r,]), simplify=FALSE))
# user system elapsed
# 6.46 0.21 6.67
system.time(x2 <- lapply(combnPrim(nrow(x), 3, simplify = FALSE),
                         function(r) colSums(x[r,])))
# user system elapsed
# 4.61 0.22 4.83
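As a further, hedged alternative (my own sketch, not from the answers above): since every result is just the sum of three rows, all of them can be computed in one sparse matrix product, returning a single matrix with one row per combination instead of a list. This assumes the Matrix package is available and that roughly 1 GB for the dense result is acceptable.
library(Matrix)
set.seed(123)
x <- matrix(sample(0:1, 110 * 609, replace = TRUE), 110, 609)
row.combinations <- t(combn(nrow(x), 3))
# 0/1 indicator matrix: one row per combination, with a 1 at each of its three row positions
ind <- sparseMatrix(i = rep(seq_len(nrow(row.combinations)), 3),
                    j = as.vector(row.combinations),
                    x = 1)
summed.rows.mat <- as.matrix(ind %*% x)  # row k = colSums over combination k
all.equal(summed.rows.mat[1, ], colSums(x[row.combinations[1, ], ]))  # expected TRUE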

Related

Is there a fast way to extract elements in a list of data frames?

I'm trying to find a fast way to extract elements from a list of data frames.
To do this, I've tested the function lapply. Here is a reproducible example:
i <- 2
dat <- replicate(100000, data.frame(x=1:5000, y = 1:5000, z = 1:5000), simplify=FALSE)
system.time(test <- lapply(dat, function(y) y[i, c("x", "y")]))
# user system elapsed
# 7.69 0.00 7.73
Ideally, the elapsed time should be <= 1 second.
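One possible direction (a hedged sketch, not an answer from the original thread): most of the time is spent in the data-frame [ method, so extracting the columns directly with $ and indexing them is usually much faster, at the cost of returning a named vector per element rather than a one-row data frame. dat and i are the objects from the example above.
i <- 2
# avoids the data.frame "[" dispatch; returns c(x = ..., y = ...) for each element
test2 <- lapply(dat, function(d) c(x = d$x[i], y = d$y[i]))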

Sum every n rows of a matrix within a list

I am trying to create a matrix where each row consists of the sum of every three rows in another matrix. There are actually a bunch of these matrices in a list and I am performing the same operation on each of the elements in that list. Based on this post I was able to generate the code below. It works but it takes forever for my more complicated data set.
test <- lapply(1:1000, function(x) matrix(1:300, nrow=60))
testCons <- lapply(test, function(x) apply(x, 2, function(y) tapply(y, ceiling(seq_along(y)/3), sum)))
Does anybody have an idea of how to speed that up or simplify it?
rowsum gives an easy speed-up: it calculates the sum of rows according to a grouping variable, here an index that marks every block of three rows.
test <- lapply(1:1000, function(x) matrix(1:300, nrow=60))
system.time(
testCons <- lapply(test, function(x) apply(x, 2, function(y) tapply(y, ceiling(seq_along(y)/3), sum)))
)
# user system elapsed
# 1.672 0.004 1.678
system.time(
testCons2 <- lapply(test, function(x) rowsum(x, rep(seq_len(nrow(x) / 3), each=3)))
)
# user system elapsed
# 0.08 0.00 0.08
all.equal(testCons, testCons2)
#[1] TRUE
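To make the grouping variable concrete, a small sketch on the same toy data: every block of three consecutive rows shares one group id, and rowsum() adds the rows up per group.
x <- matrix(1:300, nrow = 60)
idx <- rep(seq_len(nrow(x) / 3), each = 3)
head(idx, 9)          # 1 1 1 2 2 2 3 3 3
dim(rowsum(x, idx))   # 20 5: one output row per block of three input rows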

Writing to a large matrix much slower than normal

Suppose I do this:
m <- matrix(0, nrow = 20, ncol = 3)
system.time(m[1, 1:3] <- c(1,1,1))
That takes 0 seconds.
Now I do this:
m <- matrix(0, nrow = 10000000, ncol = 3)
system.time(m[1, 1:3] <- c(1,1,1))
This takes about 0.47 seconds on my system.
I need to fill a matrix of around 8.5 million rows, so at 0.47 seconds per assignment this is not an option. Is there any way around this, other than creating many smaller sub-matrices and rbind-ing them later?
Thanks!
After starting a new R session:
m <- matrix(0, nrow = 10000000, ncol = 3)
system.time(m[1, 1:3] <- c(1,1,1))
# user system elapsed
# 0 0 0
n <- m
system.time(m[1, 1:3] <- c(1,1,1))
# user system elapsed
# 0.074 0.061 0.135
The first time, m is modified in place. The second time, a copy is made because m is also referred to by n.
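A small sketch (mine, not from the answer) that makes the copy visible with tracemem(); in a fresh session the first assignment typically reports nothing, while the one after n <- m prints a tracemem line showing the duplication.
m <- matrix(0, nrow = 10000000, ncol = 3)
tracemem(m)              # start tracing copies of m
m[1, 1:3] <- c(1, 1, 1)  # modified in place: no copy reported
n <- m                   # second binding to the same data
m[1, 1:3] <- c(2, 2, 2)  # now a copy is reported before the modification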
This question might be of interest. However, if you are doing a rolling regression, you should first check whether it is already implemented in some package. If you want to do this in Rcpp, you should do the whole loop in Rcpp and not assign to m 8.5 million times.
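If you do go the Rcpp route, here is a minimal sketch of that suggestion, assuming the three values per row can be computed inside the C++ loop; the function name fill_rows and the placeholder values are made up for illustration.
library(Rcpp)
cppFunction('
NumericMatrix fill_rows(int nrow) {
  NumericMatrix m(nrow, 3);
  for (int i = 0; i < nrow; ++i) {
    // placeholder values; replace with the real per-row computation
    m(i, 0) = i + 1;
    m(i, 1) = i + 1;
    m(i, 2) = i + 1;
  }
  return m;
}')
m <- fill_rows(8500000)  # the whole loop runs in C++, with no R-level copies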

Arithmetic mean on a multidimensional array on R and MATLAB: drastic difference of performances

I am working with multidimensional arrays in both R and MATLAB; these arrays have five dimensions (14.5M elements in total). I have to remove one dimension by taking the arithmetic mean along it, and I discovered an amazing difference in performance between the two programs.
MATLAB:
>> a = rand([144 73 10 6 23]);
>> tic; b = mean(a,3); toc
Elapsed time is 0.014454 seconds.
R:
> a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
> start <- Sys.time (); b = apply(a, c(1,2,4,5), mean); Sys.time () - start
Time difference of 1.229083 mins
I know that apply is slow because it is something like a general-purpose function, but I don't know how to deal with this problem, and this performance difference is a real limitation for me. I tried to find a generalization of the colMeans/rowMeans functions, but I didn't succeed.
EDIT
I'll show a small sample array:
> dim(aa)
[1] 2 4 3
> dput(aa)
structure(c(7, 8, 5, 8, 10, 11, 9, 9, 6, 12, 9, 10, 12, 10, 14,
12, 7, 9, 8, 10, 10, 9, 8, 6), .Dim = c(2L, 4L, 3L))
a_mean = apply(aa, c(2,3), mean)
> a_mean
[,1] [,2] [,3]
[1,] 7.5 9.0 8.0
[2,] 6.5 9.5 9.0
[3,] 10.5 11.0 9.5
[4,] 9.0 13.0 7.0
EDIT (2):
I discovered that applying the sum function and then dividing by the size of the removed dimension is definitely faster:
> start <- Sys.time(); aout = apply(a, c(1,2,4,5), sum); Sys.time() - start
Time difference of 5.528063 secs
In R, apply is not the right tool for the task. If you had a matrix and needed the row or column means, you would use the much much faster, vectorized rowMeans and colMeans. You can still use these for a multi-dimensional array but you need to be a little creative:
Assuming your array has n dimensions, and you want to compute means along dimension i:
use aperm to move the dimension i to the last position n
use rowMeans with dims = n - 1
Similarly, you could:
use aperm to move the dimension i to the first position
use colMeans with dims = 1
a <- array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
means.along <- function(a, i) {
  n <- length(dim(a))
  b <- aperm(a, c(seq_len(n)[-i], i))
  rowMeans(b, dims = n - 1)
}
system.time(z1 <- apply(a, c(1,2,4,5), mean))
# user system elapsed
# 25.132 0.109 25.239
system.time(z2 <- means.along(a, 3))
# user system elapsed
# 0.283 0.007 0.289
identical(z1, z2)
# [1] TRUE
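For completeness, here is a hedged sketch of the colMeans variant described above, reusing a and means.along() from the previous block; up to floating-point rounding it gives the same result.
means.along2 <- function(a, i) {
  n <- length(dim(a))
  b <- aperm(a, c(i, seq_len(n)[-i]))  # move dimension i to the front
  colMeans(b, dims = 1)                # average over that first dimension
}
all.equal(z2, means.along2(a, 3))  # expected TRUE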
mean is particularly slow because of S3 method dispatch. This is faster:
set.seed(42)
a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
system.time({b = apply(a, c(1,2,4,5), mean.default)})
# user system elapsed
#16.80 0.03 16.94
If you don't need to handle NAs you can use the internal function:
system.time({b1 = apply(a, c(1,2,4,5), function(x) .Internal(mean(x)))})
# user system elapsed
# 6.80 0.04 6.86
For comparison:
system.time({b2 = apply(a, c(1,2,4,5), function(x) sum(x)/length(x))})
# user system elapsed
# 9.05 0.01 9.08
system.time({b3 = apply(a, c(1,2,4,5), sum)
b3 = b3/dim(a)[[3]]})
# user system elapsed
# 7.44 0.03 7.47
(Note that all timings are only approximate. Proper benchmarking would require running this repeatedly, e.g., using one of the benchmarking packages. But I'm not patient enough for that right now.)
It might be possible to speed this up with an Rcpp implementation.

How to convert a huge list-of-vector to a matrix more efficiently?

I have a list of length 130,000 where each element is a character vector of length 110. I would like to convert this list into a matrix of dimension 1,430,000 x 10. How can I do it more efficiently?
My code is :
output <- NULL
for (i in 1:length(z)) {
  output <- rbind(output, matrix(z[[i]], ncol = 10, byrow = TRUE))
}
This should be equivalent to your current code, only a lot faster:
output <- matrix(unlist(z), ncol = 10, byrow = TRUE)
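A quick sanity check on toy data (zz here is hypothetical, with each element of length 110 as in the question) shows that byrow = TRUE keeps each element's 11 rows together and in order:
zz <- replicate(3, as.character(1:110), simplify = FALSE)
out <- matrix(unlist(zz), ncol = 10, byrow = TRUE)
dim(out)  # 33 10, i.e. 11 rows per list element
identical(out[1:11, ], matrix(zz[[1]], ncol = 10, byrow = TRUE))  # expected TRUE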
I think you want
output <- do.call(rbind,lapply(z,matrix,ncol=10,byrow=TRUE))
i.e. combining #BlueMagister's use of do.call(rbind,...) with an lapply statement to convert the individual list elements into 11*10 matrices ...
Benchmarks (showing #flodel's unlist solution is 5x faster than mine, and 230x faster than the original approach ...)
n <- 1000
z <- replicate(n,matrix(1:110,ncol=10,byrow=TRUE),simplify=FALSE)
library(rbenchmark)
origfn <- function(z) {
  output <- NULL
  for (i in 1:length(z))
    output <- rbind(output, matrix(z[[i]], ncol = 10, byrow = TRUE))
}
rbindfn <- function(z) do.call(rbind, lapply(z, matrix, ncol = 10, byrow = TRUE))
unlistfn <- function(z) matrix(unlist(z), ncol = 10, byrow = TRUE)
## run 100 replications of each approach with rbenchmark
benchmark(origfn(z), rbindfn(z), unlistfn(z), replications = 100,
          columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
##          test replications elapsed relative user.self sys.self
## 1   origfn(z)          100  36.467  230.804    34.834    1.540
## 2  rbindfn(z)          100   0.713    4.513     0.708    0.012
## 3 unlistfn(z)          100   0.158    1.000     0.144    0.008
If this scales appropriately (i.e. you don't run into memory problems), the full problem would take about 130*0.2 seconds = 26 seconds on a comparable machine (I did this on a 2-year-old MacBook Pro).
It would help to have sample information about your output. Recursively using rbind on bigger and bigger things is not recommended. My first guess at something that would help you:
z <- list(1:3,4:6,7:9)
do.call(rbind,z)
See a related question for more efficiency, if needed.
You can also use:
output <- as.matrix(as.data.frame(z))
The memory usage is very similar to
output <- matrix(unlist(z), ncol = 10, byrow = TRUE)
This can be verified with mem_change() from the pryr package.
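For example, a hedged sketch of that check, assuming z is the question's list (the exact numbers will differ by machine and data):
library(pryr)
mem_change(out1 <- as.matrix(as.data.frame(z)))
mem_change(out2 <- matrix(unlist(z), ncol = 10, byrow = TRUE))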
You can also use as.matrix, as below:
output <- as.matrix(z)
