Here's what I want to do: I have two matrices A and B of dimensions N x k1 and N x k2. I now want to pointwise multiply each column of the matrix A with B.
Implementation one does this in a for loop.
To speed this up, I considered vectorizing the entire operation, but it turns out that vectorization (as I have implemented it, via Kronecker products) did not improve my runtime for larger problems.
Does anyone have a suggestion for implementing this operation differently, with runtime in mind?
The code below starts with a small example, then implements a loop-based and vectorized solution, then benchmarks on a larger problem.
# toy example:
N <- 5
k1 <- 2
k2 <- 3
A <- matrix(rnorm(N*k1), N, k1)
B <- matrix(rnorm(N*k2), N, k2)
colmat_prod <- function(x, y){
k2 <- ncol(y)
k1 <- ncol(x)
res <- array(NA, c(nrow(x), k2, k1)) # take the row count from x instead of the global N
for(i in 1:k1){
res[, ,i] <- x[,i] * y
}
res
}
colmat_prod_vec <- function(x, y){
k1 <- ncol(x)
res_vec <- c(x) * (rep(1, k1) %x% y)
res_vec
}
colmat_prod(A, B)
colmat_prod_vec(A, B)
# > colmat_prod(A, B)
# , , 1
#
# [,1] [,2] [,3]
# [1,] 1.95468879 0.55206339 0.24713400
# [2,] -0.02678564 -0.03762645 -0.03144102
# [3,] 0.30964437 0.26912771 -0.49451656
# [4,] -1.40719543 0.77245522 -0.47236888
# [5,] -1.71485558 0.98348809 0.16569915
#
# , , 2
#
# [,1] [,2] [,3]
# [1,] 1.60358991 0.45290242 0.20274409
# [2,] -0.21009808 -0.29513001 -0.24661348
# [3,] 0.04069121 0.03536681 -0.06498577
# [4,] -2.89562745 1.58950383 -0.97200734
# [5,] -1.59504293 0.91477425 0.15412217
#
# > colmat_prod_vec(A, B)
# [,1] [,2] [,3]
# [1,] 1.95468879 0.55206339 0.24713400
# [2,] -0.02678564 -0.03762645 -0.03144102
# [3,] 0.30964437 0.26912771 -0.49451656
# [4,] -1.40719543 0.77245522 -0.47236888
# [5,] -1.71485558 0.98348809 0.16569915
# [6,] 1.60358991 0.45290242 0.20274409
# [7,] -0.21009808 -0.29513001 -0.24661348
# [8,] 0.04069121 0.03536681 -0.06498577
# [9,] -2.89562745 1.58950383 -0.97200734
# [10,] -1.59504293 0.91477425 0.15412217
# speed:
N <- 10000
k1 <- 1000
k2 <- 9
A1 <- matrix(rnorm(N*k1), N, k1)
B1 <- matrix(rnorm(N*k2), N, k2)
library(microbenchmark)
microbenchmark(colmat_prod(A1, B1),
colmat_prod_vec(A1, B1),
times = 10)
#Unit: seconds
#expr min lq mean median uq max neval
#colmat_prod(A1, B1) 1.981737 2.179122 2.769812 2.32343 2.680407 4.96276 10
#colmat_prod_vec(A1, B1) 9.779629 9.955576 10.291264 10.21356 10.380702 11.70494 10
You can try apply(A, 2, '*', B); to get the same result as colmat_prod, use array(apply(A, 2, '*', B), c(dim(B), ncol(A))):
identical(array(apply(A, 2, '*', B), c(dim(B), ncol(A))), colmat_prod(A, B))
#[1] TRUE
Another option is to use rep for the columns of A:
array(A[,rep(seq_len(ncol(A)), each=ncol(B))] * as.vector(B), c(dim(B), ncol(A)))
Timings:
library(microbenchmark)
microbenchmark(colmat_prod(A1, B1),
colmat_prod_vec(A1, B1),
array(apply(A1, 2, '*', B1), c(dim(B1), ncol(A1))),
array(A1[,rep(seq_len(ncol(A1)), each=ncol(B1))] * as.vector(B1), c(dim(B1), ncol(A1))),
times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# colmat_prod(A1, B1) 831.5437 857.0305 910.5694 878.6842 999.5354 1025.0915 10 c
# colmat_prod_vec(A1, B1) 981.9241 1010.9482 1174.1700 1162.7004 1319.3478 1444.6158 10 d
# array(apply(A1, 2, "*", B1), c(dim(B1), ncol(A1))) 716.1469 725.7862 765.4987 732.2520 789.3843 907.4417 10 b
# array(A1[, rep(seq_len(ncol(A1)), each = ncol(B1))] * as.vector(B1), c(dim(B1), ncol(A1))) 404.8460 406.2848 430.4043 428.2685 458.9400 462.0634 10 a
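To see why the rep-based version lines up with the loop, here is a small sketch of the layout. The sizes and the names A_rep and slices_match are only for illustration and are not part of the answer above; the point is that each column of A is repeated ncol(B) times while as.vector(B) is recycled across those repeats.

```r
# Small check of the rep()-based layout (sizes chosen only for illustration).
set.seed(1)
A <- matrix(rnorm(5 * 2), 5, 2)
B <- matrix(rnorm(5 * 3), 5, 3)

# Repeat each column of A ncol(B) times: A1 A1 A1 A2 A2 A2
A_rep <- A[, rep(seq_len(ncol(A)), each = ncol(B))]

# as.vector(B) has N * k2 elements and is recycled: B1 B2 B3 B1 B2 B3
res <- array(A_rep * as.vector(B), c(dim(B), ncol(A)))

# Slice i of the array should equal A[, i] * B
slices_match <- vapply(
  seq_len(ncol(A)),
  function(i) isTRUE(all.equal(res[, , i], A[, i] * B)),
  logical(1)
)
slices_match
```

The intermediate A_rep is as large as the result itself, so memory use is roughly double that of the loop version; whether that matters depends on N, k1 and k2.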
I need to write a function in R that receives as input an integer n > 1 and generates an output matrix P, where P_{i,j} = min(i, j) for i, j = 1, ..., n. This function must not use for or while loops.
So far I have tried with the following code.
mat <- function(n){
m <- matrix(0,nrow = n,ncol = n)
if(row(m) >= col(m)){
col(m)
}
else{
row(m)
}
}
I know that with if conditions on row(m) and col(m) I should be able to walk over the matrix, but I don't know how to put the minimum of row(m) and col(m) into position (i,j). I know the conditions above won't achieve that, but it is the closest I've come so far.
An example is the following.
If n=3, then the result should be:
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 2 2
[3,] 1 2 3
Try pmin, row and col
f1 <- function(n = 3) {
mat <- matrix(nrow = n, ncol = n)
pmin(row(mat), col(mat))
}
f1()
# [,1] [,2] [,3]
#[1,] 1 1 1
#[2,] 1 2 2
#[3,] 1 2 3
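A short sketch of why pmin is the right tool here, as far as I can tell: min() collapses everything to a single number, while pmin() works element by element. The ifelse() line is just a vectorized version of the if/else idea from the question; the names by_pmin and by_ifelse are made up for this example. A plain if() does not work because it expects a single TRUE/FALSE, not a whole matrix of them.

```r
m <- matrix(nrow = 3, ncol = 3)

# min() reduces both index matrices to one overall minimum
overall <- min(row(m), col(m))

# pmin() takes the element-wise ("parallel") minimum and keeps the matrix shape
by_pmin <- pmin(row(m), col(m))

# ifelse() is the vectorized counterpart of the if/else attempt
by_ifelse <- ifelse(row(m) >= col(m), col(m), row(m))

overall
by_pmin
by_ifelse
```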
Or use outer and pmin, which is more efficient:
f2 <- function(n = 3) {
idx <- sequence(n)
outer(idx, idx, pmin)
}
benchmark
library(microbenchmark)
n <- 10000
b <- microbenchmark(
f1 = f1(n),
f2 = f2(n),
times = 10
)
library(ggplot2)
autoplot(b)
b
#Unit: seconds
# expr min lq mean median uq max neval cld
# f1 5.554471 5.908210 5.924173 5.950610 5.996274 6.058502 10 b
# f2 1.272793 1.298099 1.354428 1.309208 1.464950 1.495362 10 a
How to extract the row and column of the element in use when using apply function? For example, say I want to apply a function for each element of the matrix where row and column number of the selected element are also variables in the function. A simple reproducible example is given below
mymatrix <- matrix(1:12, nrow=3, ncol=4)
I want a function which does the following
apply(mymatrix, c(1,2), function (x) sum(x, row_number, col_number))
where row_number and col_number are the row and column number of the selected element in mymatrix. Note that my function is more complicated than sum, so a robust solution is appreciated.
I'm not entirely sure what you're trying to do but I would use a for loop here.
Pre-allocate the return matrix and this will be very fast
ret <- mymatrix
for (i in 1:nrow(mymatrix))
for (j in 1:ncol(mymatrix))
ret[i, j] <- sum(mymatrix[i, j], i, j)
# [,1] [,2] [,3] [,4]
#[1,] 3 7 11 15
#[2,] 5 9 13 17
#[3,] 7 11 15 19
Benchmark analysis 1
I was curious so I ran a microbenchmark analysis to compare methods; I used a bigger 200x300 matrix.
mymatrix <- matrix(1:600, nrow = 200, ncol = 300)
library(microbenchmark)
res <- microbenchmark(
for_loop = {
ret <- mymatrix
for (i in 1:nrow(mymatrix))
for (j in 1:ncol(mymatrix))
ret[i, j] <- sum(mymatrix[i, j], i, j)
},
expand_grid_mapply = {
newResult<- mymatrix
grid1 <- expand.grid(1:nrow(mymatrix),1:ncol(mymatrix))
newResult[]<-
mapply(function(row_number, col_number){ sum(mymatrix[row_number, col_number], row_number, col_number) },row_number = grid1$Var1, col_number = grid1$Var2 )
},
expand_grid_apply = {
newResult<- mymatrix
grid1 <- expand.grid(1:nrow(mymatrix),1:ncol(mymatrix))
newResult[]<-
apply(grid1, 1, function(x){ sum(mymatrix[x[1], x[2]], x[1], x[2]) })
},
double_sapply = {
sapply(1:ncol(mymatrix), function (x) sapply(1:nrow(mymatrix), function (y) sum(mymatrix[y,x],x,y)))
}
)
res
#Unit: milliseconds
# expr min lq mean median uq max
# for_loop 41.42098 52.72281 56.86675 56.38992 59.1444 82.89455
# expand_grid_mapply 126.98982 161.79123 183.04251 182.80331 196.1476 332.94854
# expand_grid_apply 295.73234 354.11661 375.39308 375.39932 391.6888 562.59317
# double_sapply 91.80607 111.29787 120.66075 120.37219 126.0292 230.85411
library(ggplot2)
autoplot(res)
Benchmark analysis 2 (with expand.grid outside of microbenchmark)
grid1 <- expand.grid(1:nrow(mymatrix),1:ncol(mymatrix))
res <- microbenchmark(
for_loop = {
ret <- mymatrix
for (i in 1:nrow(mymatrix))
for (j in 1:ncol(mymatrix))
ret[i, j] <- sum(mymatrix[i, j], i, j)
},
expand_grid_mapply = {
newResult<- mymatrix
newResult[]<-
mapply(function(row_number, col_number){ sum(mymatrix[row_number, col_number], row_number, col_number) },row_number = grid1$Var1, col_number = grid1$Var2 )
},
expand_grid_apply = {
newResult<- mymatrix
newResult[]<-
apply(grid1, 1, function(x){ sum(mymatrix[x[1], x[2]], x[1], x[2]) })
}
)
res
#Unit: milliseconds
# expr min lq mean median uq max
# for_loop 39.65599 54.52077 60.87034 59.19354 66.64983 95.7890
# expand_grid_mapply 130.33573 167.68201 194.39764 186.82411 209.33490 400.9273
# expand_grid_apply 296.51983 373.41923 405.19549 403.36825 427.41728 597.6937
That's not how apply works: you cannot access the current index (row and column) from inside any of the apply-family functions (lapply, sapply, vapply, mapply, apply).
You will have to create the row and column indices before applying; see ?expand.grid.
mymatrix <- matrix(1:12, nrow=3, ncol=4)
newResult<- mymatrix
grid1 <- expand.grid(1:nrow(mymatrix),1:ncol(mymatrix))
newResult[]<-
mapply(function(row_number, col_number){ sum(mymatrix[row_number, col_number], row_number, col_number) },row_number = grid1$Var1, col_number = grid1$Var2 )
newResult
# [,1] [,2] [,3] [,4]
#[1,] 3 7 11 15
#[2,] 5 9 13 17
#[3,] 7 11 15 19
If you want to use apply
newResult[]<-
apply(grid1, 1, function(x){ sum(mymatrix[x[1], x[2]], x[1], x[2]) })
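As a possible variation on the same idea, the index matrices can also come from row() and col() instead of expand.grid. This is only a sketch: cell_fun is a hypothetical stand-in for the real, more complicated function, and it assumes that function takes the element value and its two indices as separate arguments.

```r
mymatrix <- matrix(1:12, nrow = 3, ncol = 4)

# Hypothetical per-element function: value, row index, column index
cell_fun <- function(x, i, j) sum(x, i, j)

# row() and col() give matrices of indices with the same shape as mymatrix,
# so mapply walks all three in the same column-major order
res <- mymatrix
res[] <- mapply(cell_fun, mymatrix, row(mymatrix), col(mymatrix))
res
```

This still calls the function once per element, so it should not be expected to beat the pre-allocated for loop; it mainly saves building the data frame of indices.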
Here is my idea, using the outer() function.
The third argument FUN can be any two-argument function.
mymatrix <- matrix(1:12, nrow = 3, ncol = 4)
nr <- nrow(mymatrix)
nc <- ncol(mymatrix)
mymatrix + outer(1:nr, 1:nc, FUN = "+")
[,1] [,2] [,3] [,4]
[1,] 3 7 11 15
[2,] 5 9 13 17
[3,] 7 11 15 19
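One caveat worth sketching: outer() calls FUN once with two long vectors, so FUN has to be vectorized. If the real function only handles one (row, column) pair at a time, wrapping it in Vectorize() is one way around that. The function cell_fun below is a made-up placeholder for such a function.

```r
mymatrix <- matrix(1:12, nrow = 3, ncol = 4)

# Hypothetical function that only works for a single (i, j) pair
cell_fun <- function(i, j) sum(mymatrix[i, j], i, j)

# Vectorize() wraps it with mapply so outer() can pass whole index vectors
res <- outer(seq_len(nrow(mymatrix)), seq_len(ncol(mymatrix)),
             Vectorize(cell_fun))
res
```

Vectorize() loops internally, so the speed advantage of the "+" version is lost; the wrapper only keeps the outer() interface.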
With @Maurits Evers' benchmark code:
Unit: microseconds
expr min lq mean median uq max
for_loop 19963.203 22427.1630 25308.168 23811.855 25017.031 158341.678
outer 848.247 949.3515 1054.944 1011.457 1059.217 1463.956
In addition, I tried to complete your original idea with apply(X, c(1,2), function (x)):
(It's a little slower than the other answers.)
mymatrix <- matrix(1:12, nrow = 3, ncol = 4)
n <- 1 # n = index of data
nr <- nrow(mymatrix)
apply(mymatrix, c(1,2), function (x) {
row_number <- (n-1) %% nr + 1 # convert n to row number
col_number <- (n-1) %/% nr + 1 # convert n to column number
res <- sum(x, row_number, col_number)
n <<- n + 1
return(res)
})
[,1] [,2] [,3] [,4]
[1,] 3 7 11 15
[2,] 5 9 13 17
[3,] 7 11 15 19
Does anyone have a tip on how to speed up the code below, in particular by avoiding the for loops?
J <- 10000
I <- 10000
Y <- matrix(0,J,I)
X <- runif(I,0,1)
P <- runif(I,0,1)
Z <- matrix(runif(n = J*I,0,1),J,I)
K <- matrix(runif(n = J*I,0,1),J,I)
for(j in 1:J){
for (i in 1:I){
Y[j,i] <- X[i]^(Z[j,i])*P[i]^(K[j,i])
}
}
Thanks!
I think t(X^t(Z)*P^t(K)) would give the same result and be much faster. Here is a reproducible example with a 5 x 5 matrix and a performance evaluation.
set.seed(543)
### Original Code
J <- 5
I <- 5
Y <- matrix(0,J,I)
X <- runif(I,0,1)
P <- runif(I,0,1)
Z <- matrix(runif(n = J*I,0,1),J,I)
K <- matrix(runif(n = J*I,0,1),J,I)
for(j in 1:J){
for (i in 1:I){
Y[j,i] <- X[i]^(Z[j,i])*P[i]^(K[j,i])
}
}
# View the result
Y
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.8244760 0.7717289 0.3884273 0.30937614 0.6807137
# [2,] 0.8878758 0.3761184 0.2819624 0.08388850 0.5299624
# [3,] 0.9559749 0.7813653 0.2048310 0.05117558 0.4069641
# [4,] 0.9317235 0.6614524 0.1619824 0.08777542 0.3037913
# [5,] 0.9507279 0.5434549 0.3950076 0.08050582 0.3244810
### A solution without for loop
Y2 <- t(X^t(Z)*P^t(K))
# View the result
Y2
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.8244760 0.7717289 0.3884273 0.30937614 0.6807137
# [2,] 0.8878758 0.3761184 0.2819624 0.08388850 0.5299624
# [3,] 0.9559749 0.7813653 0.2048310 0.05117558 0.4069641
# [4,] 0.9317235 0.6614524 0.1619824 0.08777542 0.3037913
# [5,] 0.9507279 0.5434549 0.3950076 0.08050582 0.3244810
identical(Y, Y2)
# [1] TRUE
### Performance evaluation
library(microbenchmark)
perf <- microbenchmark(
m1 = { Y <- matrix(0,J,I)
for(j in 1:J){
for (i in 1:I){
Y[j,i] <- X[i]^(Z[j,i])*P[i]^(K[j,i])
}
}},
m2 = {Y2 <- t(X^t(Z)*P^t(K))},
times = 100L
)
# View the result
perf
# Unit: microseconds
# expr min lq mean median uq max neval cld
# m1 3649.287 3858.250 4107.31032 3932.017 4112.965 6240.644 100 b
# m2 13.365 14.907 21.66753 15.422 26.731 60.658 100 a
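The transposes are needed because R recycles a vector down the columns of a matrix, while X and P are indexed by column here. A sketch of an equivalent formulation, assuming it is acceptable to expand X and P into full J x I matrices first (the names Xm, Pm, Y_loop, Y_t and Y_b are only for this example, and J differs from I on purpose to show that the matrix need not be square):

```r
set.seed(543)
J <- 4
I <- 3
X <- runif(I)
P <- runif(I)
Z <- matrix(runif(J * I), J, I)
K <- matrix(runif(J * I), J, I)

# Reference: the original double loop
Y_loop <- matrix(0, J, I)
for (j in seq_len(J)) {
  for (i in seq_len(I)) {
    Y_loop[j, i] <- X[i]^Z[j, i] * P[i]^K[j, i]
  }
}

# Transpose trick: X is recycled down the columns of t(Z), which have length I
Y_t <- t(X^t(Z) * P^t(K))

# Same thing without transposes: spread X and P along the rows explicitly
Xm <- matrix(X, J, I, byrow = TRUE)
Pm <- matrix(P, J, I, byrow = TRUE)
Y_b <- Xm^Z * Pm^K

all.equal(Y_loop, Y_t)
all.equal(Y_loop, Y_b)
```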
I have the following vectors
> X <- c(1,1,3,4)
> a <- c(1,1,2,2)
> b <- c(2,1,4,3)
> c <- c(2,1,4,6)
I want to compare each element of X with the corresponding elements of a, b and c, and finally I need a class assigned to each row of X. For example:
The first element of X is 1 and it has a match in the corresponding element of vector a, so I need to assign the class '1-1' (no matter which vector the match came from).
The second element of X is 1 and it also has a match (in fact three), so again the class is '1-1'.
The third element of X is 3 and it doesn't have a match, so I should look for the next integer value, which is 4, and there is a 4 (in b and c). So the class should be '3-4'.
The fourth element of X is 4 and it doesn't have a match. There is also no 5 (the next integer), so it should look for the previous integer, which is 3, and there is a 3. So the class should be '4-3'.
Actually I have thousands of rows for each vector and I have to do this for each row. Any suggestions for doing this in a less complicated way? I would prefer to use base R functions.
Based on rbatt's comment and answer I realized my original answer was quite lacking. Here's a redo...
match_nearest <- function( x, table )
{
dist <- x - table
tgt <- which( dist < 0, arr.ind=TRUE, useNames=FALSE )
dist[tgt] <- abs( dist[tgt] + .5 )
table[ cbind( seq_along(x), max.col( -dist, ties.method="first" ) ) ]
}
X <- c(1,1,3,4)
a <- c(1,1,2,2)
b <- c(2,1,4,3)
c <- c(2,1,4,6)
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
## [1] "1-1" "1-1" "3-4" "4-3"
Compared to the original answer and rbatt's we find neither was correct!
set.seed(1)
X <- rbinom(n=1E4, size=10, prob=0.5)
a <- rbinom(n=1E4, size=10, prob=0.5)
b <- rbinom(n=1E4, size=10, prob=0.5)
c <- rbinom(n=1E4, size=10, prob=0.5)
T <- current_solution(X,a,b,c)
R <- rbatt_solution(X,a,b,c)
all.equal( T, R )
## [1] "195 string mismatches"
# Look at mismatched rows...
mismatch <- head( which( T != R ) )
cbind(X,a,b,c)[mismatch,]
## X a b c
## [1,] 4 6 3 3
## [2,] 5 7 4 7
## [3,] 5 8 3 9
## [4,] 5 7 7 4
## [5,] 4 6 3 7
## [6,] 5 7 4 2
T[mismatch]
## [1] "4-3" "5-4" "5-3" "5-4" "4-3" "5-4"
R[mismatch]
## [1] "4-6" "5-7" "5-8" "5-7" "4-6" "5-7"
and needlessly slow...
library(microbenchmark)
bm <- microbenchmark( current_solution(X,a,b,c),
previous_solution(X,a,b,c),
rbatt_solution(X,a,b,c) )
print(bm, order="median")
## Unit: milliseconds
## expr min lq median uq max neval
## current_solution(X, a, b, c) 7.088 7.298 7.996 8.268 38.25 100
## rbatt_solution(X, a, b, c) 33.920 38.236 46.524 53.441 85.50 100
## previous_solution(X, a, b, c) 83.082 93.869 101.997 115.961 135.98 100
Looks like the current_solution is getting it right; but without an expected output ...
Here are the functions...
current_solution <- function(X,a,b,c) {
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
}
# DO NOT USE... it is wrong!
previous_solution <- function(X,a,b,c) {
dat <- rbind(X,a,b,c)
v <- apply(dat,2, function(v) {
v2 <- v[1] - v
v2[v2<0] <- abs( v2[v2<0]) - 1
v[ which.min( v2[-1] ) + 1 ]
})
paste("X", v, sep="-")
}
# DO NOT USE... it is wrong!
rbatt_solution <- function(X,a,b,c) {
mat <- cbind(X,a,b,c)
diff.signed <- mat[,"X"]-mat[,c("a","b","c")]
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5
min.ind <- apply(diff.break, 1, which.min)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2)
match.value <- mat[,c("a","b","c")][ind.array]
ref.class <- paste(X, match.value, sep="-")
ref.class
}
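Since there is no expected output to compare against, one way to gain some confidence in match_nearest is a slow row-by-row reference. The sketch below assumes the rule is "smallest absolute distance, with ties going to the higher value", which is my reading of the question and not something the asker confirmed; nearest_ref is a name invented for this check.

```r
# match_nearest as defined above, repeated so the block runs on its own
match_nearest <- function(x, table) {
  dist <- x - table
  tgt <- which(dist < 0, arr.ind = TRUE, useNames = FALSE)
  dist[tgt] <- abs(dist[tgt] + .5)
  table[cbind(seq_along(x), max.col(-dist, ties.method = "first"))]
}

# Hypothetical slow reference: handle one row at a time
nearest_ref <- function(x, table) {
  vapply(seq_along(x), function(r) {
    v <- table[r, ]
    # absolute distance, minus 0.5 for candidates above x so they win ties
    d <- abs(v - x[r]) - 0.5 * (v > x[r])
    as.numeric(v[[which.min(d)]])
  }, numeric(1))
}

X <- c(1, 1, 3, 4)
a <- c(1, 1, 2, 2)
b <- c(2, 1, 4, 3)
c <- c(2, 1, 4, 6)

fast <- match_nearest(X, cbind(a, b, c))
slow <- nearest_ref(X, cbind(a, b, c))
paste(X, fast, sep = "-")
all(fast == slow)
```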
This solution should provide the output you want. Also, it is ~ 3x faster than Thell's solution, because the differences are vectorized and are not calculated row-wise with apply.
I compare times for the two approaches below. Note that if you want the "class" as another column in a data.frame, just uncomment the last line of my function. I commented it out to make the calculation times between the two answers more comparable (creating a data.frame is quite slow).
# Example data from Thell, plus 1 more
X1 <- c(1,1,3,4,7,1, 5)
a1 <- c(1,1,2,2,2,2, 9)
b1 <- c(2,1,4,3,3,3, 3)
c1 <- c(2,1,4,6,6,6, 7)
# Random example data, much larger
# X1 <- rbinom(n=1E4, size=10, prob=0.5)
# a1 <- rbinom(n=1E4, size=10, prob=0.5)
# b1 <- rbinom(n=1E4, size=10, prob=0.5)
# c1 <- rbinom(n=1E4, size=10, prob=0.5)
My answer:
rbTest <- function(){
mat <- cbind(X1,a1,b1,c1)
diff.signed <- mat[,"X1"]-mat[,c("a1","b1","c1")] # differences (with sign)
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5 # add 0.5 to positive differences (match below X) and subtract 0.5 from negative ones (match above X), so ties go to the higher integer
min.ind <- apply(diff.break, 1, which.min) # index of smallest difference (prefer larger integers when there is a tie)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2) # array index format
match.value <- mat[,c("a1","b1","c1")][ind.array] # value of the smallest difference (value of the match)
ref.class <- paste(X1, match.value, sep="-") # the 'class' in the format 'ref-match'
ref.class
# data.frame(class=ref.class, mat)
}
Thell's answer:
thTest <- function(){
dat <- rbind(X1,a1,b1,c1)
apply(dat,2, function(v) {
# Get distance
v2 <- v[1] - v
# Prefer values >= v[1]
v2[v2<0] <- abs( v2[v2<0]) - 1
# Obtain and return nearest v excluding v[1]
v[ which.min( v2[-1] ) + 1 ]
})
}
Benchmark on large matrix (10,000 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: milliseconds
# expr min lq median uq max neval
# rbTest() 47.95451 52.01729 59.36161 71.94076 103.1314 100
# thTest() 167.49798 180.69627 195.02828 204.19916 315.0610 100
Benchmark on small matrix (7 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: microseconds
# expr min lq median uq max neval
# rbTest() 108.299 112.3550 115.4225 119.4630 146.722 100
# thTest() 147.727 152.2015 155.9005 159.3115 235.898 100
Example output (small matrix):
# > rbTest()
# [1] "1-1" "1-1" "3-4" "4-3" "7-6" "1-2" "5-7"
# > thTest()
# [1] 1 1 4 3 6 2 7
Is there a way to get the result of diag() for a matrix without using the built-in function or iteration?
M<-matrix(1:9, ncol=3) # make a matrix
q5b<-function(M){ #function
}
I know that M[1,1], M[2,2], and M[3,3] will give me the same output as diag(M). However, I can't think of a way to do this without a for loop.
My thought process was I should have a condition where row index == column index in the Matrix then print that value. I appreciate any suggestions.
You can use the functions row and col to find the indices where the column number is identical to the row number:
row(M) == col(M)
# [,1] [,2] [,3]
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE TRUE
M[row(M) == col(M)]
# [1] 1 5 9
Just subset based on another matrix:
> diag(M)
[1] 1 5 9
> M[matrix(rep(sequence(ncol(M)), 2), ncol = 2)]
[1] 1 5 9
The above would run into a problem with a non-square matrix, so we modify it as below.
As for your function, one answer for question 5b could be:
q5b <- function(M) {
A <- sequence(ncol(M))[sequence(min(nrow(M), ncol(M)))]
M[cbind(A, A)]
}
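Another possibility, sketched here as an alternative and not as what the answer above does, is plain linear indexing: in column-major storage the diagonal elements sit nrow(M) + 1 positions apart. The name diag_linear is made up for this example.

```r
# Hypothetical helper: pick the diagonal by linear (column-major) position
diag_linear <- function(M) {
  n <- min(dim(M))  # length of the diagonal, also for non-square matrices
  M[seq(1, by = nrow(M) + 1, length.out = n)]
}

M <- matrix(1:9, ncol = 3)
diag_linear(M)

# Non-square cases
wide <- matrix(1:6, nrow = 2)  # 2 x 3
tall <- matrix(1:6, nrow = 3)  # 3 x 2
diag_linear(wide)
diag_linear(tall)
```

This avoids building the two-column index matrix, though I have not benchmarked it against the versions above.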
Update: Benchmarks are always fun
library(microbenchmark)
fun1 <- function(M) diag(M)
fun2 <- function(M) M[row(M) == col(M)]
fun3 <- function(M) {
A <- sequence(ncol(M))[sequence(min(nrow(M), ncol(M)))]
M[cbind(A, A)]
}
set.seed(1)
M <- matrix(rnorm(1000*1000), ncol = 1000)
microbenchmark(fun1(M), fun2(M), fun3(M), times = 100)
# Unit: microseconds
# expr min lq median uq max neval
# fun1(M) 4654.825 4747.408 4822.8865 4912.690 5877.866 100
# fun2(M) 53270.266 54813.606 55059.0695 55749.062 200384.531 100
# fun3(M) 66.284 82.321 118.8835 129.361 191.155 100