row-wise differences between two large matrices in R

I would like to ask for advice on how to speed up the following operation.
I have two matrices, say A and B, each with n rows and 3 columns. For every row vector of A I want to compute its difference with every row vector of B, i.e. the pairwise differences between all row vectors of the two matrices, which gives an n*n matrix. I then want to apply a function to each element of that matrix, the biharm() function in the example below. For small matrices this is no problem, but I need to apply the operation to very large matrices, such as 1000*3. In the sigm() function I wrote for this, I first initialize the result matrix and then fill it with two nested for loops, which is slow for large matrices. Does anyone have an idea of how to speed this up? I think apply() should help, but I cannot figure out the correct way to use it. Below is a fully reproducible example. Thanks in advance for any advice. Best, Paolo.
biharm <- function(vec1, vec2) {
  reso <- norm(as.matrix(vec1) - as.matrix(vec2), type = "F")^2 *
    log(norm(as.matrix(vec1) - as.matrix(vec2), type = "F"))
  reso
}
sigm <- function(mat1, mat2 = NULL) {
  tt <- mat1
  if (is.null(mat2)) { yy <- mat1 } else { yy <- mat2 }
  k <- nrow(yy)
  m <- ncol(yy)
  SGMr <- matrix(rep(0, k^2), ncol = k)
  for (i in 1:k) {
    for (j in 1:k) {
      SGMr[i, j] <- biharm(yy[i, ], tt[j, ])
    }
  }
  SGMr <- replace(SGMr, which(SGMr == "NaN", arr.ind = TRUE), 0)
  return(SGMr)
}
### small matrices example:
A<-matrix(rnorm(30),ncol=3)
B<-matrix(rnorm(30),ncol=3)
sigm(A,B)
### large matrices example:
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
sigm(A,B)

This is about 8 times faster on my system.
biharm.new <- function(vec1, vec2) {
  n <- sqrt(sum((vec1 - vec2)^2))
  n^2 * log(n)
}
sigm.new <- function(mat1, mat2 = NULL) {
  tt <- mat1
  if (is.null(mat2)) { yy <- mat1 } else { yy <- mat2 }
  SGMr <- apply(tt, 1, function(t) apply(yy, 1, biharm.new, t))
  replace(SGMr, which(SGMr == "NaN", arr.ind = TRUE), 0)
}
### large matrices example:
set.seed(1)
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
system.time(result.1<-sigm(A,B))
# user system elapsed
# 6.13 0.00 6.13
system.time(result.2<-sigm.new(A,B))
# user system elapsed
# 0.81 0.00 0.81
all.equal(result.1,result.2)
# [1] TRUE
The use of apply(...) results in about a 3-fold improvement. The rest comes from optimizing biharm(...) - since you are calling it 90,000 times (300 x 300 rows), it pays to make it as efficient as possible.
Note that, for a vector, the Frobenius norm is just the Euclidean norm, so if that is what you really want, use sqrt(sum(x^2)) rather than converting to matrices and calling norm(...). The former is much faster.
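Going a step further (this is a sketch of my own, not part of the answer above): because biharm() only depends on the squared Euclidean distance, the whole n*n matrix can be built with matrix algebra and no explicit looping, using ||d||^2 * log(||d||) = 0.5 * ||d||^2 * log(||d||^2). The name vec_sigm is mine.
vec_sigm <- function(mat1, mat2 = NULL) {
  tt <- mat1
  yy <- if (is.null(mat2)) mat1 else mat2
  # squared Euclidean distance between every row of yy and every row of tt
  D2 <- outer(rowSums(yy^2), rowSums(tt^2), "+") - 2 * yy %*% t(tt)
  D2 <- pmax(D2, 0)              # guard against tiny negative round-off
  S <- 0.5 * D2 * log(D2)
  S[!is.finite(S)] <- 0          # 0 * log(0) gives NaN; set to 0 as before
  S
}
all.equal(vec_sigm(A, B), sigm.new(A, B))   # should be TRUE up to round-off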

How about this:
set.seed(1)
foo<-matrix(runif(30),nc=3)
bar<-matrix(runif(30),nc=3)
sapply(1:10,function(j) sapply(1:10,function(k) biharm(bar[k,],foo[j,])) )
EDIT -- basically same as jhoward's "sigm.new" without the error checking. Clearly biharm.new is a winner.
microbenchmark(carl(foo,bar), jhoward(foo,bar), times=3)
Unit: milliseconds
              expr       min       lq    median       uq      max neval
    carl(foo, bar) 5846.8273 6071.364 6295.8999 6322.425 6348.951     3
 jhoward(foo, bar)  891.5734  934.550  977.5267 1008.388 1039.248     3

Related

R lookup time for very long vector

In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a,b are vectors of length 90 Million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k row, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset is several 20% to 90% of the total). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply( P , order )
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at "big O" of 1 billion instructions total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
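A hedged sketch of that last idea, assuming the permutation really is close to the identity and can be analysed once up front (the variable names are mine):
set.seed(1)
n <- 1e6
a <- sample(c(TRUE, FALSE), n, replace = TRUE)
b <- seq_len(n)
moved <- sample(n, 1000)            # only a few positions actually move
b[moved] <- sample(moved)           # b is now a near-identity permutation
chg <- which(b != seq_len(n))       # precompute once, reuse for every new a
a2 <- a
a2[chg] <- a[b[chg]]                # update only the positions that move
identical(a2, a[b])                 # TRUE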
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743

R: Faster Minimum Column Value Over Multiple Matrices?

I have a very tall integer matrix (mat) and a sparse binary vector (v) of equal row length. I want to find the minimum value in all columns of mat where v==1.
Here are several possible solutions:
mat <- matrix(as.integer(runif(100000*100,0,2^31)),nrow=100000,ncol=100)
v<-(rbinom(100000,1,.01))
a<-apply(v*mat,2, function(x) min(x[x>0]))
b<-apply(mat,2,function(x) min(x[v==1]))
c<-sapply(subset(data.frame(mat),v==1), min)
These all work fine, and on my machine solution c seems fastest (an admittedly older, slower MacBook). But if I have a function that feeds unique sets of v, the computation time scales linearly with the number of sets. So a large number of unique sets (>10,000) will take hours to process.
Any ideas on how to do such an operation faster, or is this as fast as I can go?
I guess that subsetting and then calling apply gains a lot, given that v is almost always 0:
system.time(b<-apply(mat[as.logical(v),],2, min))
# user system elapsed
# 0.012 0.000 0.013
system.time(a<-apply(v*mat,2, function(x) min(x[x>0])))
# user system elapsed
# 0.628 0.019 0.649
identical(a,b)
#[1] TRUE
I also dropped the x[x>0] part, since it appears that mat is always greater than 0.
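If adding a package is acceptable, matrixStats has a compiled column minimum; a hedged sketch (matrixStats is not part of the answer above):
library(matrixStats)
b2 <- colMins(mat[as.logical(v), , drop = FALSE])  # drop=FALSE keeps a matrix even if only one row matches
all.equal(as.numeric(b2), as.numeric(b))           # same result as the apply() version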

multiplying every nth row of a matrix by a number using sapply and for loop

I have a 100 * 100 matrix and I want to multiply every tenth row by 10.
I know that it would be easily doable by writing something like:
tenthrows<-seq(from=10,to=100,by=10)
m[tenthrows,]<-m[tenthrows,]*10
However, I've been asked to do this using a "for loop" and the "sapply/lapply" functions.
I could do this by using only for loop like:
for (i in seq(from = 10 , to = 100, by = 10)) {m[i,]<-m[i,]*10 }
But I am not sure how to also incorporate sapply/lapply for doing this; any ideas?
Many thanks in advance.
Generate a matrix:
m0 <- m <- matrix(1:1e4,100,100)
One way to do it:
tenthrows <- seq(1,100,by=10)
m[tenthrows,] <- m[tenthrows,]*10
Doing it with a for loop would be less efficient. I can't think of a reason you would want to do it that way, unless (a) this is homework, (b) this is self-study, (c) there is some context you haven't told us about. If one of those is true, could you let us know?
Incorporating #baptiste's answer:
m2 <- m0 * c(10,rep(1,9))
all.equal(m,m2) ## TRUE
This works because R stores matrices columnwise, so that multiplying every 10th element is equivalent to multiplying every 10th row.
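A quick check of the recycling logic (my addition, assuming the 100-row matrix above):
mult <- rep_len(c(10, rep(1, 9)), nrow(m0))   # one column's worth of multipliers
which(mult == 10)
## [1]  1 11 21 31 41 51 61 71 81 91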
Another possibility:
m3 <- sweep(m0,MARGIN=1,FUN="*",STAT=c(10,rep(1,9)))
all.equal(m,m3) ## TRUE
Note that the difference between this and the original comment (which used c(rep(1,9),10)) comes down to how you interpret "every 10th row": the original comment multiplied rows 10, 20, 30, ..., whereas my answer multiplies rows 1, 11, 21, 31, ...
library("rbenchmark")
benchmark(m[tenthrows,] <- m[tenthrows,]*10,
m0*c(10,rep(1,9)),
sweep(m0,MARGIN=1,FUN="*",STAT=c(10,rep(1,9))),
replications=1000)
## test replications elapsed relative
## 2 m0 * c(10, rep(1, 9)) 1000 0.109 1.000
## 1 m[tenthrows, ] <- m[tenthrows, ] * 10 1000 0.125 1.147
The rep approach is (slightly) faster. (sweep, not shown, is even slower, although it's perhaps the most principled approach -- it's the one that is the closest to a direct translation of "multiply every 10th row by 10". It's also worth keeping in mind that even the slowest approach took 0.25 seconds for 1000 replications of this task.)
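For reference, a sketch of what the sapply form asked about in the question might look like (not from the answers above, and no faster than direct indexing); here it multiplies rows 10, 20, ..., 100:
tenthrows10 <- seq(from = 10, to = 100, by = 10)
m4 <- m0
m4[tenthrows10, ] <- t(sapply(tenthrows10, function(i) m0[i, ] * 10))
m5 <- m0
m5[tenthrows10, ] <- m5[tenthrows10, ] * 10       # reference: direct indexing
all.equal(m4, m5)  ## TRUE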

Finding equal matrix lines

I have a matrix in which each row represents a data point (it's a nxp matrix with n p-dimensional points), and I need to find if there is a pair of equal points.
With only two points, I could just apply dist, but as the number of points increases, so does the number of comparisons I need to do with dist (as I'm comparing only two points at a time).
So, as I'm starting to use big matrices, I need a quick way to find if any two points are equal (or if there are two equal rows in this matrix).
Any suggestions?
Edit: as I don't need to return the indices of the equal rows (I just need to verify whether any two are equal), I guess I could create a matrix with no duplicated rows and just compare its number of rows with that of the original matrix. What do you think?
Use the unique function, which is specifically set up to allow you to check for either unique rows or columns in a matrix. Or, depending on whether you want to save the reduced matrix or not, you could use duplicated as juba pointed out.
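For example (a minimal sketch; M stands in for your n x p matrix of points):
M <- matrix(sample(1:3, 30, replace = TRUE), nrow = 10)
any(duplicated(M))           # TRUE if some pair of rows is identical
nrow(unique(M)) < nrow(M)    # the unique()-based check from the question's edit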
If the matrix is large, consider using data tables.
library(data.table)
n <- 1e6
set.seed(1)
df <- data.frame(x.1 = round(runif(n, 0, 100)),
                 x.2 = round(runif(n, 0, 100)),
                 x.3 = round(runif(n, 0, 100)),
                 x.4 = round(runif(n, 0, 100)))
dt <- data.table(df)
system.time(df.dupe <- duplicated(df))
# user system elapsed
# 16.55 0.01 16.60
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 9.79 0.05 9.83
setkeyv(dt,colnames(dt))
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 0.08 0.00 0.07
So without keys, data tables are about 40% faster. With keys they are about 160X faster. Of course you have to create the keys (a sort), which takes about 10 sec, so if you're only doing this once it's better to use the unkeyed data table.

Efficiencies for nested for loop

I've created the following code that nests a for loop inside of a for loop in R. It is a simulation to calculate Power. I've read that R isn't great for doing for loops but I was wondering if there are any efficiencies I could apply to make this run a bit faster. I'm fairly new to R as well as programming of any sort. Right now the run times I'm seeing are:
m=10 I get .17 sec
m=100 I get 3.95 sec
m=1000 I get 246.26 sec
m=2000 I get 1003.55 sec
I was hoping to set the number of times to sample, m, upwards of 100K, but I'm afraid to even set it at 10K.
Here is the code:
m = 1000        # number of times we are going to take samples
popmean = 120   # set population mean at 120
popvar = 225    # set known/established population
variance at 225
newvar = 144    # variance of new methodology
alpha = .01     # set alpha
teststatvect = matrix(nrow=m, ncol=1)   # empty vector to populate with test statistics
power = matrix(nrow=200, ncol=1)        # empty vector to populate with power
system.time(                     # not needed - using to gauge how long this takes
for (n in 1:length(power)) {     # begin for loop for different sample sizes
  for (i in 1:m) {               # begin for loop to take "m" samples
    y = rnorm(n, popmean, sqrt(newvar))   # sample of size n with mean 120 and var=144
    ts = sum((y-popmean)^2/popvar)        # calculate test statistic for each sample
    teststatvect[i] = ts                  # populate the vector of test statistics
    vecpvals = pchisq(teststatvect, n)    # calculate the pval of each statistic
    power[n] = length(which(vecpvals<=alpha))/length(vecpvals)  # populate power vector. Power is the proportion less than or equal to alpha
  }
}
)
I reorganized your code a bit and got rid of the inner loop.
Sampling one long vector of random numbers (and then collapsing it into a matrix) is much faster than repeatedly sampling short vectors. (replicate, as suggested in another answer, is nice for readability, but in this case you can do better by sampling the random numbers in one block.)
colSums is faster than summing inside a for loop or using apply.
It's just sugar (i.e. it isn't actually any more efficient), but you can use mean(pvals<=alpha) in place of sum(pvals<=alpha)/length(pvals).
I defined a function to return the power for a specified set of parameters (including sample size), then used sapply to range over the vector of sizes (not faster than a for loop, but cleaner and maybe easier to generalize).
Code:
powfun <- function(ssize = 100,
                   m = 1000,          ## samples per trial
                   popmean = 120,     ## pop mean
                   popvar = 225,      ## known/established pop variance
                   newvar = 144,      ## variance of new methodology
                   alpha = 0.01,
                   sampchisq = FALSE) ## sample directly from chi-squared distrib?
{
  if (!sampchisq) {
    ymat <- matrix(rnorm(ssize*m, popmean, sd = sqrt(newvar)), ncol = m)
    ts <- colSums((ymat - popmean)^2/popvar)   ## test statistic
  } else {
    ts <- rchisq(m, df = ssize)*newvar/popvar
  }
  pvals <- pchisq(ts, df = ssize)  ## pval
  mean(pvals <= alpha)             ## power
}
Do you really need the power for every integer value of sample size, or would a more widely spaced set of sample sizes be OK? (If you need exact values, interpolation would probably be pretty accurate.)
ssizevec <- seq(10,250,by=5)
set.seed(101)
system.time(powvec <- sapply(ssizevec,powfun,m=5000)) ## 13 secs elapsed
This is reasonably fast and might get you up to m=1e5 if you needed, but I'm not quite sure why you need results that are that precise -- the power curve is reasonably smooth with m=5000 ...
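If power at every integer sample size really is needed, the interpolation mentioned above is a one-liner (a sketch; pow_interp is a name of my own):
pow_interp <- splinefun(ssizevec, powvec)   # smooth interpolator for the coarse grid
pow_all <- pow_interp(10:250)               # power at every integer sample size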
If you're impatiently waiting for long simulations, you can also get a progress bar to print by replacing sapply(ssizevec,powfun,m=5000) with library(plyr); aaply(ssizevec,.margins=1,powfun,.progress="text",m=5000)
Finally, I think you can speed the whole thing up a lot by sampling chi-squared values directly, or by doing an analytical power calculation (!). I think that rchisq(m,df=ssize)*newvar/popvar is equivalent to the first two lines of the loop, and you might even be able to do a numerical computation on the chi-squared densities directly ...
system.time(powvec2 <- sapply(ssizevec,powfun,m=5000,sampchisq=TRUE))
## 0.24 seconds elapsed
(I just tried this out, sampling m=1e5 at every value of sample size from 1 to 200 ... it takes 24 seconds ... but I still think it might be unnecessary.)
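The analytical calculation hinted at above can also be written down directly; a sketch of my own (power_exact is my name), using the fact that under powfun()'s model ts = (newvar/popvar) * X with X ~ chi-squared(ssize) and the test rejects when ts <= qchisq(alpha, ssize):
power_exact <- function(ssize, popvar = 225, newvar = 144, alpha = 0.01) {
  # P(ts <= qchisq(alpha, ssize)) with ts = (newvar/popvar) * chi-squared(ssize)
  pchisq(qchisq(alpha, df = ssize) * popvar / newvar, df = ssize)
}
powvec3 <- power_exact(ssizevec)   # no simulation noise, essentially instantaneous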
A picture:
par(bty="l",las=1)
plot(ssizevec, powvec, type = "l", xlab = "sample size", ylab = "power",
     xlim = c(0, 250), ylim = c(0, 1))
lines(ssizevec,powvec2,col="red")
In general, you want as far as possible to take advantage of vectorization, not so much for speed as readability/comprehension.
Why is the write to power[n] inside the inner loop (and, I guess, the calculation of vecpvals as well)? Shouldn't those be in the outer loop, after the inner loop executes? You may also want to move the calculation of the square root outside both loops.
Why are teststatvect and power initialized as matrices (which are explicitly two-dimensional arrays) and not as vectors (or rather, as one-dimensional arrays, using array)? Is variance at 225 just the end of the comment from the previous line? You may want to check your formatting. (Is this homework?)
For what it looks like you're trying to do here, you may want to take advantage of the very handy function replicate, perhaps by writing a specific function to call it on.
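A minimal sketch of that replicate() idea (one_power is a name I'm introducing; the parameter values are the question's):
one_power <- function(n, m = 1000, popmean = 120, popvar = 225,
                      newvar = 144, alpha = 0.01) {
  ts <- replicate(m, sum((rnorm(n, popmean, sqrt(newvar)) - popmean)^2 / popvar))
  mean(pchisq(ts, df = n) <= alpha)   # proportion of p-values at or below alpha
}
one_power(50, m = 2000)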
