I have a simple problem. I want to multiply a 3D array by another 3D array in R without using a for-loop.
To illustrate:
Suppose I have a 1x3 matrix A:
[A1, A2, A3]
And I have a 3x3 matrix B:
[B1, B2, B3 \\
B4, B5, B6 \\
B7, B8, B9]
My main operation is A %*% B resulting in a 1x3 matrix.
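But now I want to repeat the process 10,000 times, each with a different A and B of the same dimensions as above. I can use a for-loop:
res <- matrix(NA_real_, nrow = 10000, ncol = 3)
for (i in 1:10000) {
res[i, ] <- A[, , i] %*% B[, , i] # i-th 1x3 slice of A times i-th 3x3 slice of B
}
Then I have the 10,000 results stored.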
But is there any way to achieve the same thing without using a for-loop? I am thinking of possibly a 3D array multiplication, but I am not sure how to do this in R.
Matrix A: 1 x 3 x 10000
[A1, A2, A3]
Matrix B: 3 x 3 x 10000
[B1, B2, B3
B4, B5, B6
B7, B8, B9]
Also, would vectorization help?
Can you guys please help? Thank you!
There are several ways to accomplish this with array multiplication. The price you pay is to reformat the matrices into much larger tensors with many zeros. Those are sparse, by definition, and so the principal cost is the overhead for conversion. It's actually superior to a loop by the time you have 10,000 arrays to multiply.
Let n be the number of (A,B) pairs and k=3 the dimension.
The sleekest solution seems to be to reorganize the n rows of A (an n by k matrix) into an n*k by n*k block-diagonal matrix of k by k blocks. Block i, i=1..n, contains row i of A in its top row and otherwise is zero. Multiplying this (on the right) by B (arranged as a k*n by k matrix consisting of a "stack" of n blocks of dimension k by k) computes all the individual products, depositing them at rows 1, k+1, 2k+1, ..., of the result, where they can be picked out.
f3 <- function(a, b) {
require(Matrix) # provides sparseMatrix()
n <- dim(b)[3]
k <- dim(b)[2]
i0 <- (1:n-1)*k+1
i <- rep(i0, each=k)
j <- 1:(k*n)
aa <- sparseMatrix(i, j, x=c(t(a)), dims=c(n*k, n*k))
bb <- matrix(aperm(b, c(1,3,2)), nrow=n*k)
t((aa %*% bb)[i0, ])
}
As you can see, the array operations are basic: create sparse matrices, transpose arrays (with aperm and t), and multiply. It returns its results in a k by n array (which you may transpose if you prefer), one result vector per column.
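To make the block-diagonal layout concrete, here is a minimal dense sketch for n = 2 and k = 3. It uses only base R and builds the big matrix densely, so it is for illustration only and would not scale. The names aa, bb and i0 follow f3; the toy values and the names blockResult and loopResult are made up.

```r
n <- 2; k <- 3
a <- matrix(1:6, nrow = n, byrow = TRUE)   # row i is the 1 x k matrix A_i
b <- array(1:18, dim = c(k, k, n))         # b[, , i] is the k x k matrix B_i

i0 <- (seq_len(n) - 1) * k + 1             # rows 1, k+1, ... will hold the answers
aa <- matrix(0, n * k, n * k)              # dense stand-in for the sparse matrix
aa[cbind(rep(i0, each = k), seq_len(n * k))] <- c(t(a))  # row i of a on top of block i
bb <- matrix(aperm(b, c(1, 3, 2)), nrow = n * k)         # the B_i stacked vertically

blockResult <- (aa %*% bb)[i0, ]           # one result row per (A,B) pair
loopResult <- rbind(a[1, ] %*% b[, , 1], a[2, ] %*% b[, , 2])
```

Printing aa shows the structure directly: the first row holds A_1 in columns 1 to 3, the fourth row holds A_2 in columns 4 to 6, and everything else is zero.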
As a test, here is a brute-force loop using the same array data structures.
f1 <- function(a, b) sapply(1:nrow(a), function(i) a[i,] %*% b[,,i])
We may apply these solutions to the same input and compare the results:
#
# Create random matrices for testing.
#
k <- 3
n <- 1e6 # Number of (a,B) pairs
a <- matrix(runif(k*n), ncol=k)
b <- array(runif(k^2*n), dim=c(k,k,n))
system.time(c1 <- f1(a,b)) # 4+ seconds
system.time(c3 <- f3(a,b)) # 2/3 second
mean((c1-c3)^2) # Want around 10^-32 or less
The results aren't completely equal, but their mean squared difference is less than 10^-32, showing they can be considered the same up to floating point roundoff error.
The array-oriented procedure f3 is initially slower than the looping procedure f1, but catches up by the time n is 10,000. After that it's about twice as fast or better (on this machine; YMMV). Both algorithms should scale linearly in n (and the timing suggests they do, at least out to n=10,000,000).
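For small k there may also be a middle road that needs no sparse matrices: loop over the k result columns instead of the n pairs. This is a sketch of a different technique from f3, not a replacement for it, and it has not been timed here. The function name batchProduct is made up for illustration, and it assumes a is n by k and b is k by k by n as above.

```r
# Result[i, j] = sum over r of a[i, r] * b[r, j, i].
# For a fixed j, b[, j, ] is k x n, so its transpose lines up with a (n x k).
batchProduct <- function(a, b) {
  k <- dim(b)[2]
  out <- matrix(0, nrow(a), k)
  for (j in seq_len(k)) {
    out[, j] <- rowSums(a * t(b[, j, ]))   # elementwise product, summed over r
  }
  out                                       # n x k, one result per row
}

set.seed(1)
n <- 50; k <- 3
a <- matrix(runif(k * n), ncol = k)
b <- array(runif(k^2 * n), dim = c(k, k, n))
viaColumns <- batchProduct(a, b)
viaLoop <- t(sapply(seq_len(n), function(i) a[i, ] %*% b[, , i]))
```

The loop here runs k times rather than n times, which is the whole point when n is large and k is tiny.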
If your A and B are lists, you can use mapply():
> nn <- 1e1
> set.seed(1)
> A <- replicate(nn,matrix(rnorm(3),nrow=1),simplify=FALSE)
> B <- replicate(nn,matrix(rnorm(9),nrow=3),simplify=FALSE)
> head(mapply("%*%",A,B,SIMPLIFY=FALSE),3)
[[1]]
[,1] [,2] [,3]
[1,] -1.193976 0.1275999 -0.6831007
[[2]]
[,1] [,2] [,3]
[1,] 1.371143 1.860379 -1.639078
[[3]]
[,1] [,2] [,3]
[1,] 0.8250047 -0.6967286 1.949236
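If the data start out as arrays rather than lists, they can be split into lists first. A minimal sketch, assuming A is n by k and B is k by k by n (the names Alist, Blist and res are illustrative only):

```r
set.seed(1)
n <- 5; k <- 3
A <- matrix(rnorm(n * k), nrow = n)
B <- array(rnorm(k * k * n), dim = c(k, k, n))

# Split the arrays into lists of 1 x k and k x k matrices.
Alist <- lapply(seq_len(n), function(i) A[i, , drop = FALSE])
Blist <- lapply(seq_len(n), function(i) B[, , i])

# Multiply pairwise, then bind the 1 x k results into an n x k matrix.
res <- do.call(rbind, mapply("%*%", Alist, Blist, SIMPLIFY = FALSE))
```

Note that mapply() still loops internally, so this is mainly a convenience rather than a speed-up.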
The for-loop is more efficient than you think
Your problem of multiplying n (A,B) pairs is not equivalent to tensor multiplication in the usual sense, although whuber has presented a very neat way of turning it into a single matrix multiplication by placing the rows of A in a sparse block-diagonal matrix and stacking the Bs underneath one another.
You have said you want to avoid a for-loop, but the for-loop approach is actually very competitive when programmed efficiently, and I would suggest that you reconsider it.
I will use the same notation as whuber, with A of dimension n x k and B of dimension k x k x n, for example:
n <- 1e4
k <- 3
A <- array(rnorm(k*n),c(n,k))
B <- array(rnorm(k*k*n),c(k,k,n))
A simple and efficient for-loop solution would go like this
justAForLoop <- function(A,B) {
n <- nrow(A)
for (i in 1:n) A[i,] <- A[i,] %*% B[,,i]
A
}
producing an n x k matrix of results.
I have modified whuber's f3 function to load the Matrix package, otherwise the sparseMatrix function is unavailable. My version of f3 is very slightly faster than the original because I've eliminated the last matrix transpose before returning the result.
With this modification, it returns identical numerical results to justAForLoop.
f3 <- function(a, b) {
require(Matrix)
n <- dim(b)[3]
k <- dim(b)[2]
i0 <- (1:n-1)*k+1
i <- rep(i0, each=k)
j <- 1:(k*n)
aa <- sparseMatrix(i, j, x=c(t(a)), dims=c(n*k, n*k))
bb <- matrix(aperm(b, c(1,3,2)), nrow=n*k)
(aa %*% bb)[i0, ]
}
Now I rerun whuber's simulation in a fresh R session:
> k <- 3
> n <- 1e6
> a <- matrix(runif(k*n), ncol=k)
> b <- array(runif(k^2*n), dim=c(k,k,n))
>
> system.time(c1 <- f1(a,b))
user system elapsed
3.40 0.09 3.50
> system.time(c3 <- f3(a,b))
Loading required package: Matrix
user system elapsed
1.06 0.24 1.30
> system.time(c4 <- justAForLoop(a,b))
user system elapsed
1.27 0.00 1.26
The for-loop approach is actually the fastest by a narrow margin. It is very much faster than f1, which relies on sapply. (My machine is a Windows 10 PC with 32Gb RAM running R 3.6.0).
If I run all three methods a second time, then f3 becomes the fastest because this time the Matrix package is already in the search path and doesn't have to be reloaded:
> system.time(c1 <- f1(a,b))
user system elapsed
3.23 0.04 3.26
> system.time(c3 <- f3(a,b))
user system elapsed
0.33 0.20 0.53
> system.time(c4 <- justAForLoop(a,b))
user system elapsed
1.28 0.01 1.30
However f3 uses more RAM than the for-loop. On my PC, I can run justAForLoop successfully with n=1e8 whereas f1 and f3 both run out of memory and fail.
Summary
A direct for-loop approach is much more efficient than sapply.
For your problem with n=10,000 matrix multiplications, running the for-loop is simple and efficient, taking less than 0.02 seconds. By contrast, merely loading the package with sparse matrix functions takes about 2/3 of a second.
For n between 1 and 10 million, whuber's sparse matrix solution starts to outperform the for-loop, especially if the Matrix package is already loaded.
The for-loop uses the least RAM of the three methods. For n at 100 million on my PC with 32Gb RAM, only the for-loop approach works.
I am looking for an efficient way of computing the Kronecker product of two large matrices. I have tried using the function kronecker() as follows:
I = diag(700)
data = replicate(15, rnorm(120))
test = kronecker(I,data)
However, it takes a long time to execute and then gives the following error:
Error: cannot allocate vector of size 6.8 Gb
As long as you use Matrix::Diagonal to construct your diagonal matrix, you'll automatically get your test object constructed as a sparse matrix:
library(Matrix)
I=Diagonal(700)
data = replicate(15,rnorm(120))
system.time(test <- kronecker(I,data))
## user system elapsed
## 0.600 0.044 0.671
dim(test)
## [1] 84000 10500
format(object.size(test),"Mb")
## [1] "19.2 Mb"
If you are computing kron(I,A)*v where v is a vector, you can do this using vec(A*V), where V reshapes v into a matrix. This uses the more general rule that vec(ABC)=kron(C',A)*vec(B). It avoids forming the Kronecker product and uses far fewer operations to perform the computation.
Note that V may need to be transposed depending on how matrix storage is handled (columns versus rows).
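A small sketch of that identity in R, with made-up dimensions (A is m by p and the identity is q by q). R stores matrices column by column, so under that assumption no transpose of V is needed:

```r
set.seed(1)
m <- 4; p <- 3; q <- 5
A <- matrix(rnorm(m * p), m, p)
v <- rnorm(p * q)

# Direct route: form the (q*m) x (q*p) Kronecker product explicitly.
direct <- as.vector(kronecker(diag(q), A) %*% v)

# Identity route: reshape v into a p x q matrix V and take vec(A %*% V).
V <- matrix(v, nrow = p, ncol = q)
viaVec <- as.vector(A %*% V)
```

The second route only ever touches an m by p and a p by q matrix, which is why it stays cheap when q is in the hundreds.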
In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a,b are vectors of length 90 Million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
rows = names, columns = features. values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- apply( Psub , 2 , sort ) # sort each column independently
But I am given 10,000 distinct subsets of names (each subset is between 20% and 90% of the total). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- apply( P , 2 , order ) # row order within each column
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
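One way the precomputed-order idea might look end to end, assuming P is a numeric matrix with row names. The helper sortedSubset and the toy data are hypothetical, and the linear-index step only stands in for the asker's own convert_to_linear_index, whose actual code is not shown.

```r
set.seed(1)
P <- matrix(rnorm(24), nrow = 6, dimnames = list(letters[1:6], NULL))

# Done once: row order within each column, turned into linear indices into P.
ord <- apply(P, 2, order)
lin <- as.vector(ord + rep((seq_len(ncol(P)) - 1L) * nrow(P), each = nrow(P)))
Psorted <- P[lin]   # every column sorted, laid out column after column

# Done per subset: permute the row mask into sorted order and filter.
sortedSubset <- function(myNames) {
  keep <- rownames(P) %in% myNames
  mask <- rep(keep, ncol(P))[lin]           # the a[b] step from the question
  matrix(Psorted[mask], nrow = sum(keep))   # one sorted column per feature
}

Q <- sortedSubset(c("b", "d", "e"))
```

The per-subset work is then one permuted lookup and one logical filter, with no sorting.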
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approximately 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions per element, which means you're looking at roughly 1 billion instructions in total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32-bit ints, so the array a will take up about 360 MB (90 million times 4 bytes), which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
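The four-bytes-per-logical figure is easy to check. A quick sketch (the variable names are illustrative, and object sizes include a small header, so they are not exact multiples of the length):

```r
# A logical vector costs 4 bytes per element, the same as an integer vector.
bytesLogical <- as.numeric(object.size(logical(1e6)))
bytesInteger <- as.numeric(object.size(integer(1e6)))

bytesPerElement <- bytesLogical / 1e6   # close to 4
fullSizeMB <- 90e6 * 4 / 1e6            # size in MB of 90 million logicals
```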
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
I have a very tall integer matrix (mat) and a sparse binary vector (v) of equal row length. I want to find the minimum value in all columns of mat where v==1.
Here are several possible solutions:
mat <- matrix(as.integer(runif(100000*100,0,2^31)),nrow=100000,ncol=100)
v<-(rbinom(100000,1,.01))
a<-apply(v*mat,2, function(x) min(x[x>0]))
b<-apply(mat,2,function(x) min(x[v==1]))
c<-sapply(subset(data.frame(mat),v==1), min)
These all work fine, and on my machine (an admittedly older, slower MacBook) solution c seems fastest. But if I have a function that feeds unique sets of v, the computation time scales linearly with the number of sets. So a large number of unique sets (>10,000) will take hours to process.
Any ideas on how to do such an operation faster, or is this as fast as I can go?
I guess that subsetting first and then calling apply gains a lot, given that v is almost always 0:
system.time(b<-apply(mat[as.logical(v),],2, min))
# user system elapsed
# 0.012 0.000 0.013
system.time(a<-apply(v*mat,2, function(x) min(x[x>0])))
# user system elapsed
# 0.628 0.019 0.649
identical(a,b)
#[1] TRUE
I also dropped the x[x>0], since it appears that mat is always greater than 0.
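Wrapped up for repeated use with many different v, the same idea might look like the sketch below. The function name colMinWhere and the toy data are made up; drop = FALSE is added on the assumption that some v may select a single row, in which case the subset would otherwise collapse to a vector and apply would fail.

```r
colMinWhere <- function(mat, v) {
  rows <- which(v == 1)                      # the few selected rows
  apply(mat[rows, , drop = FALSE], 2, min)   # column minima over those rows only
}

set.seed(42)
mat <- matrix(sample.int(1000L, 200 * 5, replace = TRUE), nrow = 200)
v <- integer(200)
v[c(3, 50, 120)] <- 1L                       # three selected rows

minsSubset <- colMinWhere(mat, v)
minsFull <- apply(mat, 2, function(x) min(x[v == 1]))

vSingle <- integer(200)
vSingle[7] <- 1L                             # edge case: exactly one selected row
minsSingle <- colMinWhere(mat, vSingle)
```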