apply() when the function depends on the index, and mutate() efficiency

I was wondering, suppose you want to transform some matrix, for example, and you want something like
Y[i,j] = i*j*X[i,j]
Using a for loop is a lot slower than other options, and apply() doesn't know which i and j it is using.
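For concreteness, a minimal sketch of such a double loop (illustrative data only, not actual code from my project):
# Sketch: explicit double loop computing Y[i,j] = i*j*X[i,j]
X <- matrix(runif(9), 3, 3)
Y <- matrix(NA_real_, nrow(X), ncol(X))
for (i in seq_len(nrow(X))) {
  for (j in seq_len(ncol(X))) {
    Y[i, j] <- i * j * X[i, j]
  }
}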
A solution I can think of is defining a data.frame-like object with columns i,j,X and then use mutate() to get the desired Y values.
I have two questions:
(a) Is it possible to construct the above matrix using apply() instead of a for loop? If there is, is it more efficient to construct it this way, or using the mutate() trick?
(b) Assume I have both the matrix and the data.frame representations. Which is more efficient if I want to do operations that involve the row and column indices? For example, if I want a statistic like the sum of (Y[i,j] - mean(Y))²/(i*j). I know I can construct matrices holding the row and column indices and then just use vectorized functions, but is that better than using mutate()?

Depending on how large and sparse your matrix is, the first thing to try is using the row() and col() functions to construct the row and column indices, and then using them to transform your matrix:
X <- matrix(1:9, 3, 3)
row(X) * col(X) * X
#     [,1] [,2] [,3]
#[1,]    1    8   21
#[2,]    4   20   48
#[3,]    9   36   81
Then for (Y[i,j] - mean(Y))²/(i*j), it is similar:
Y <- row(X) * col(X) * X
(Y - mean(Y)) ^ 2 / (row(Y) * col(Y))
#          [,1]       [,2]       [,3]
#[1,] 592.11111 150.222222   6.259259
#[2,] 227.55556   7.111111  85.629630
#[3,]  88.92593  18.962963 344.308642
This is a fully vectorized approach, requiring no apply() call or for loop, but it does need additional memory for the index matrices.
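For comparison, the data.frame + mutate() route mentioned in the question might look roughly like this (a sketch, assuming the dplyr package; column names i, j, x are illustrative):
library(dplyr)
# One row per matrix cell, in column-major order
df <- data.frame(i = as.vector(row(X)),
                 j = as.vector(col(X)),
                 x = as.vector(X))
df <- df %>% mutate(y = i * j * x)
# The statistic from part (b)
df %>% summarise(stat = sum((y - mean(y))^2 / (i * j)))
# Back to a matrix if needed (column-major order is preserved)
matrix(df$y, nrow = nrow(X))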

I think outer may be faster than row(X) * col(X).
# Define dimensions
n.rows <- n.cols <- 1000
# Define matrix
X <- matrix(runif(n.rows * n.cols), ncol = n.cols)
# Psidom's approach
rowcol.method <- function(X){row(X) * col(X) * X}
# Approach using outer
outer.method <- function(X){outer(1:nrow(X), 1:ncol(X)) * X}
# Benchmark library
library(microbenchmark)
# Test
microbenchmark(
  rowcol.method(X),
  outer.method(X)
)
Results:
Unit: milliseconds
             expr       min        lq      mean    median        uq      max neval cld
 rowcol.method(X) 20.895870 21.154815 23.795695 21.612485 22.584323 62.50660   100   b
  outer.method(X)  5.608577  5.729724  6.883643  5.836526  5.977156 50.12508   100  a
Compare output:
identical(rowcol.method(X), outer.method(X))
[1] TRUE
Similarly for the other calculation, although there was a crazy outlier for the outer approach (221.66718 ms).
# Define matrix
Y <- row(X) * col(X) * X
# Psidom's approach
rowcol.method.Y <- function(Y) {(Y - mean(Y)) ^ 2 / (row(Y) * col(Y))}
# Approach using outer
outer.method.Y <- function(Y) {(Y - mean(Y)) ^ 2 / outer(1:nrow(Y), 1:ncol(Y))}
# Test
microbenchmark(
  rowcol.method.Y(Y),
  outer.method.Y(Y)
)
Results:
Unit: milliseconds
               expr      min       lq     mean   median       uq       max neval cld
 rowcol.method.Y(Y) 27.94405 30.18635 34.63551 33.32627 37.06467  46.58983   100   b
  outer.method.Y(Y) 11.27064 12.66349 18.77192 15.66756 18.18864 221.66718   100  a
Compare output:
identical(rowcol.method.Y(Y), outer.method.Y(Y))
[1] TRUE
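Regarding part (a) of the question, an index-aware apply-family construction is also possible, for example working column by column (a sketch; it is usually not faster than the vectorized forms above):
# Build Y column by column: column j gets factor i * j
apply.method <- function(X) {
  vapply(seq_len(ncol(X)),
         function(j) seq_len(nrow(X)) * j * X[, j],
         numeric(nrow(X)))
}
all.equal(apply.method(X), outer.method(X))  # should return TRUE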

Related

r - Parallel allocation of a matrix [duplicate]

How to do R multiplication with Nx1 1xM for Matrix NxM?

I want to do a simple column (Nx1) times row (1xM) multiplication, resulting in (NxM) matrix.
Code where I create a row by sequence, and column by transposing a similar sequence
row1 <- seq(1:6)
col1 <- t(seq(1:6))
col1 * row1
Output, which indicates that R performed element-wise multiplication (with recycling) rather than a matrix product:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    9   16   25   36
Expected output: NxM matrix.
OS: Debian 8.5
Linux kernel: 4.6 backports
Hardware: Asus Zenbook UX303UA
In this case using outer would be a more natural choice
outer(1:6, 1:6)
In general for two numerical vectors x and y, the matrix rank-1 operation can be computed as
outer(x, y)
If you want to resort to real matrix multiplication routines, use tcrossprod:
tcrossprod(x, y)
If either of x and y is a matrix (i.e. has a dim attribute), use as.numeric to cast it to a plain vector first.
Using the general matrix multiplication operator "%*%" is not recommended for this. But if you want to, make sure the dimensions are conformable: x must be a one-column matrix and y a one-row matrix, so that x %*% y works.
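A minimal sketch of that conformable version (with throwaway length-6 vectors):
x <- runif(6); y <- runif(6)
dim(matrix(x, ncol = 1) %*% matrix(y, nrow = 1))
# [1] 6 6    (an N x 1 matrix times a 1 x M matrix gives N x M)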
Can you say anything about efficiency?
A matrix rank-1 operation is known to be memory-bound, so gc() is called after every replicate to force garbage collection and release memory from the heap (otherwise your system may stall):
x <- runif(500)
y <- runif(500)
xx <- matrix(x, ncol = 1)
yy <- matrix(y, nrow = 1)
system.time(replicate(200, {outer(x,y); gc();}))
#    user  system elapsed
#   4.484   0.324   4.837
system.time(replicate(200, {tcrossprod(x,y); gc();}))
#    user  system elapsed
#   4.320   0.324   4.653
system.time(replicate(200, {xx %*% yy; gc();}))
#    user  system elapsed
#   4.372   0.324   4.708
In terms of performance, they are all very alike.
Follow-up
When I came back I saw another answer with a different benchmark. Well, the thing is, it depends on the problem size. With a small example you cannot eliminate the function interpretation / calling overhead for all three functions. If you do
x <- y <- runif(500)
microbenchmark(tcrossprod(x,y), x %*% t(y), outer(x,y), times = 200)
you will see roughly identical performance again.
# Unit: milliseconds
#              expr     min      lq     mean  median      uq      max neval cld
#  tcrossprod(x, y) 2.09644 2.42466 3.402483 2.60424 3.94238 35.52176   200   a
#        x %*% t(y) 2.22520 2.55678 3.707261 2.66722 4.05046 37.11660   200   a
#       outer(x, y) 2.08496 2.55424 3.695660 2.69512 4.08938 35.41044   200   a
Here's a comparison of the execution speed for the three methods when the vectors being used are of length 100. The fastest is tcrossprod, with x%*%t(y) taking 17% longer and outer(x,y) taking 45% longer (in median time).
In the table, neval is the number of times the function was evaluated to get the benchmark scores.
> x <- runif(100,0,100)
> y <- runif(100,0,100)
> microbenchmark(tcrossprod(x,y), x%*%t(y), outer(x,y), times=5000)
Unit: microseconds
             expr    min      lq     mean  median      uq       max neval
 tcrossprod(x, y) 11.404 16.6140 50.42392 17.7300 18.7555  5590.103  5000
       x %*% t(y) 13.878 19.4315 48.80170 20.5405 21.7310  4459.517  5000
      outer(x, y) 19.238 24.0810 72.05250 25.3595 26.8920 89861.855  5000
To get a plot of the benchmark results, use autoplot() from ggplot2:
library("ggplot2")
bench <- microbenchmark(tcrossprod(x,y), x%*%t(y), outer(x,y), times=5000)
autoplot(bench)
Edit: The performance depends on the size of x and y, and of course the machine running the code. I originally did the benchmark with vectors of length 100 because that's what Masi asked about. However, it appears the three methods have very similar performance for larger vectors. For vectors of length 1000, the median times of the three methods are within 5% of each other on my machine.
> x <- runif(1000)
> y <- runif(1000)
> microbenchmark(tcrossprod(x,y),x%*%t(y),outer(x,y),times=2000)
Unit: milliseconds
             expr      min       lq     mean   median       uq       max neval
 tcrossprod(x, y) 1.870282 2.030541 4.721175 2.916133 4.482346  75.77459  2000
       x %*% t(y) 1.861947 2.067908 4.921061 3.067670 4.527197 105.60500  2000
      outer(x, y) 1.886348 2.078958 5.114886 3.033927 4.556067  93.93450  2000
An easy way to look at this is to turn your vectors into matrices
row1.mat = matrix(row1)
col1.mat = matrix(col1)
and then use dim to see the dimension of the matrices:
dim(row1.mat)
dim(col1.mat)
For the product to work here you need a 6x1 matrix multiplied by a 1x6 matrix, so you need to transpose col1.mat using t(col1.mat).
And as you might know, the matrix product operator is %*%:
row1.mat %*% t(col1.mat)
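A quick shape check (sketch):
dim(row1.mat)                  # [1] 6 1
dim(t(col1.mat))               # [1] 1 6
dim(row1.mat %*% t(col1.mat))  # [1] 6 6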
Comparison of this method to others
library("microbenchmark")
x <- runif(1000)
y <- runif(1000)
xx = matrix(x)
yy = matrix(y)
microbenchmark(tcrossprod(x,y),x%*%t(y),outer(x,y), xx %*% t(yy), times=2000)
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
 tcrossprod(x, y) 2.829099 3.243785 6.015880 4.801640 5.040636 77.87932  2000
       x %*% t(y) 2.847175 3.251414 5.942841 4.810261 5.049474 86.53374  2000
      outer(x, y) 2.886059 3.277811 5.983455 4.788054 5.074997 96.12442  2000
     xx %*% t(yy) 2.868185 3.255833 6.126183 4.699884 5.056234 87.80024  2000

Create a sequence from vectors with start and end positions

Given two separate vectors of equal length: f.start and f.end, I would like to construct a sequence (by 1), going from f.start[1]:f.end[1] to f.start[2]:f.end[2], ..., to f.start[n]:f.end[n].
Here is an example with just 6 rows.
     f.start  f.end
[1,]   45739 122538
[2,]  125469 202268
[3,]  203563 280362
[4,]  281657 358456
[5,]  359751 436550
[6,]  437845 514644
Crudely, a loop can do it, but is extremely slow for larger datasets (rows>2000).
f.start<-c(45739,125469,203563,281657,359751,437845)
f.end<-c(122538,202268,280362,358456,436550,514644)
f.ind <- f.start[1]:f.end[1]
for (i in 2:length(f.start)) {
  f.ind.temp <- f.start[i]:f.end[i]
  f.ind <- c(f.ind, f.ind.temp)
}
I suspect this can be done with apply(), but I have not worked out how to include two separate arguments in apply, and would appreciate some guidance.
You can try mapply or Map, which iterate over your two vectors simultaneously. You need to provide the function as the first argument:
vec1 = c(1,33,50)
vec2 = c(10,34,56)
unlist(Map(':',vec1, vec2))
# [1] 1 2 3 4 5 6 7 8 9 10 33 34 50 51 52 53 54 55 56
Just replace vec1 and vec2 with f.start and f.end, provided that all(f.start <= f.end) holds.
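Applied to the vectors in the question (a sketch):
f.start <- c(45739, 125469, 203563, 281657, 359751, 437845)
f.end   <- c(122538, 202268, 280362, 358456, 436550, 514644)
f.ind   <- unlist(Map(':', f.start, f.end))
head(f.ind)
# [1] 45739 45740 45741 45742 45743 45744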
Your loop is going to be slow as you are growing the vector f.ind. You will also get an increase in speed if you pre-allocate the length of the output vector.
# Some data (of length 3000)
set.seed(1)
f.start <- sample(1:10000, 3000)
f.end <- f.start + sample(1:200, 3000, TRUE)
# Functions
op <- function(L = 1) {
  f.ind <- vector("list", L)
  for (i in 1:length(f.start)) {
    f.ind[[i]] <- f.start[i]:f.end[i]
  }
  unlist(f.ind)
}
op2 <- function() unlist(lapply(seq(f.start), function(x) f.start[x]:f.end[x]))
col <- function() unlist(mapply(':',f.start, f.end))
# check output
all.equal(op(), op2())
all.equal(op(), col())
A few benchmarks
library(microbenchmark)
# Look at the effect of pre-allocating
microbenchmark(op(L=1), op(L=1000), op(L=3000), times=500)
# Unit: milliseconds
#          expr       min        lq     mean    median        uq       max neval cld
#     op(L = 1) 46.760416 48.741080 52.29038 49.636864 50.661506 113.08303   500   c
#  op(L = 1000) 41.644123 43.965891 46.20380 44.633016 45.739895  94.88560   500  b
#  op(L = 3000)  7.629882  8.098691 10.10698  8.338387  9.963558  60.74152   500 a
# Compare methods - the loop actually performs okay
# I left the original loop out
microbenchmark(op(L=3000), op2(), col(), times=500)
# Unit: milliseconds
#          expr      min       lq      mean   median        uq      max neval cld
#  op(L = 3000) 7.778643 8.123136 10.119464 8.367720 11.402463 62.35632   500   b
#         op2() 6.461926 6.762977  8.619154 6.995233 10.028825 57.55236   500  a
#         col() 6.656154 6.910272  8.735241 7.137500  9.935935 58.37279   500  a
So a loop should perform okay speed-wise, but of course the Colonel's code is a lot cleaner. The *apply functions here won't really speed up the calculation, but they do offer tidier code and remove the need for pre-allocation.
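As a further option (this assumes R >= 4.0.0, where sequence() gained the from and by arguments), the whole thing can be done in one vectorized base call:
# R >= 4.0.0 only: sequence() with a vectorized 'from'
f.ind <- sequence(f.end - f.start + 1, from = f.start)
all.equal(f.ind, col())  # should return TRUE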

Vectorization while using which() function in R

I have 3 vectors and I want to apply the which() function to each of them separately.
I'm trying to find the largest index of the values less than some given number.
How can I do this with vectorization?
My 3 vectors (they may have different lengths):
vec1 <- c(1,2,3,4,5)
vec2 <- c(11,12,13)
vec3 <- c(1,2,3,4,5,6,7,8)
How can I vectorize it?
max(which(vec1<3))
max(which(vec2<12.3))
max(which(vec3<5.7))
The expected result is:
2
2
5
One way to get a speedup would be to use Rcpp to search for elements smaller than your cutoff, starting from the right side of the vector and moving left. You can return as soon as you find the element that meets your criteria, which means that if your target is near the right side of the vector you might avoid looking at most of the vector's elements (meanwhile, which looks at all vector elements and max looks at all values returned by which). The speedup would be largest for long vectors where the target element is close to the end.
library(Rcpp)
rightmost.small <- cppFunction(
  'double rightmostSmall(NumericVector x, const double cutoff) {
     for (int i = x.size() - 1; i >= 0; --i) {
       if (x[i] < cutoff) return i + 1;  // 1-index
     }
     return 0;  // None found
   }')
rightmost.small(vec1, 3)
# [1] 2
rightmost.small(vec2, 12.3)
# [1] 2
rightmost.small(vec3, 5.7)
# [1] 5
Let's look at the performance for a vector where we expect this to give us a big speedup:
set.seed(144)
vec.large <- rnorm(1000000)
all.equal(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# [1] TRUE
library(microbenchmark)
microbenchmark(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# Unit: microseconds
#                            expr      min       lq        mean    median        uq       max neval
#      max(which(vec.large < -1)) 4912.016 8097.290 12816.36406 9189.0685 9883.9775 60405.585   100
#  rightmost.small(vec.large, -1)    1.643    2.476     8.54274    8.8915   12.8375    58.152   100
For this vector of length 1 million, we see a speedup of about 1000x using the Rcpp code.
This speedup should carry directly over to the case where you have many vectors stored in a list; you can use @JoshO'Brien's mapply code and observe a speedup when you switch from max(which(...)) to the Rcpp code:
f <- function(v,m) max(which(v < m))
l <- list(vec.large)[rep(1, 100)]
m <- rep(-1, 100)
microbenchmark(mapply(f, l, m), mapply(rightmost.small, l, m))
Unit: microseconds
                          expr        min          lq        mean     median          uq         max neval
               mapply(f, l, m) 865287.828 907893.8505 931448.1555 918637.343 935632.0505 1133909.950   100
 mapply(rightmost.small, l, m)    253.573    281.6855    344.5437    303.094    335.1675     712.897   100
We see a 3000x speedup by using the Rcpp code here.
For the original three vectors, a plain mapply over the vectors and cutoffs also works:
l <- list(vec1, vec2, vec3)
m <- c(3, 12.3, 5.7)
f <- function(v,m) max(which(v < m))
mapply(f,l,m)
# [1] 2 2 5
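One caveat with max(which(...)) is that it returns -Inf (with a warning) when no element is below the cutoff; a small defensive wrapper (sketch):
f.safe <- function(v, m) {
  idx <- which(v < m)
  if (length(idx)) max(idx) else NA_integer_  # NA instead of -Inf when nothing matches
}
mapply(f.safe, l, m)
# [1] 2 2 5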

How to get row wise standard deviation over specific columns [duplicate]

I'd like to compute the variance for each row in a matrix. For the following matrix A
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    5    6   10
[3,]   50    7   11
[4,]    4    8   12
I would like to get
[1] 16.0000 7.0000 564.3333 16.0000
I know I can achieve this with apply(A, 1, var), but is there a faster or better way? In Octave I can do this with var(A, 0, 2), but I don't understand how the y argument of R's var() function is meant to be used.
Edit: The actual dataset of a typical chunk has around 100 rows and 500 columns. The total amount of data is around 50GB though.
You could potentially vectorize var over rows (or columns) using rowSums and rowMeans
RowVar <- function(x, ...) {
  rowSums((x - rowMeans(x, ...))^2, ...) / (dim(x)[2] - 1)
}
RowVar(A)
#[1] 16.0000 7.0000 564.3333 16.0000
Using @Richard's data, this yields:
microbenchmark(apply(m, 1, var), RowVar(m))
## Unit: milliseconds
##             expr        min         lq     median         uq        max neval
## apply(m, 1, var) 343.369091 400.924652 424.991017 478.097573 746.483601   100
##        RowVar(m)   1.766668   1.916543   2.010471   2.412872   4.834471   100
You can also create a more general function that takes a syntax similar to apply() but remains vectorized (the column-wise variance will be slower, as the matrix needs to be transposed first):
MatVar <- function(x, dim = 1, ...) {
  if (dim == 1) {
    rowSums((x - rowMeans(x, ...))^2, ...) / (dim(x)[2] - 1)
  } else if (dim == 2) {
    rowSums((t(x) - colMeans(x, ...))^2, ...) / (dim(x)[1] - 1)
  } else stop("Please enter valid dimension")
}
MatVar(A, 1)
## [1] 16.0000 7.0000 564.3333 16.0000
MatVar(A, 2)
##         V1         V2         V3
## 547.333333   1.666667   1.666667
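Since the title asks for the row-wise standard deviation, a thin wrapper over RowVar does it (sketch):
RowSD <- function(x, ...) sqrt(RowVar(x, ...))
RowSD(A)
# [1]  4.000000  2.645751 23.755701  4.000000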
This is one of the main reasons why apply() is useful. It is meant to operate on the margins of an array or matrix.
set.seed(100)
m <- matrix(sample(1e5L), 1e4L)
library(microbenchmark)
microbenchmark(apply(m, 1, var))
# Unit: milliseconds
#             expr      min       lq   median       uq      max neval
# apply(m, 1, var) 270.3746 283.9009 292.2933 298.1297 343.9531   100
Is 300 milliseconds too long to make 10,000 calculations?
