I have a very tall integer matrix (mat) and a sparse binary vector (v) whose length equals the number of rows of mat. I want to find the minimum value in each column of mat, restricted to the rows where v==1.
Here are several possible solutions:
mat <- matrix(as.integer(runif(100000*100,0,2^31)),nrow=100000,ncol=100)
v<-(rbinom(100000,1,.01))
a<-apply(v*mat,2, function(x) min(x[x>0]))
b<-apply(mat,2,function(x) min(x[v==1]))
c<-sapply(subset(data.frame(mat),v==1), min)
These all work fine, and on my machine solution c seems fastest (on an admittedly older, slower MacBook). But if I have a function that feeds in unique sets of v, the computation time scales linearly with the number of sets, so a large number of unique sets (>10,000) will take hours to process.
Any ideas on how to do such an operation faster, or is this as fast as I can go?
I guess that subsetting first and then calling apply gains a lot, given that v is almost always 0:
system.time(b<-apply(mat[as.logical(v),],2, min))
# user system elapsed
# 0.012 0.000 0.013
system.time(a<-apply(v*mat,2, function(x) min(x[x>0])))
# user system elapsed
# 0.628 0.019 0.649
identical(a,b)
#[1] TRUE
I also dropped the x[x>0] part, since it appears that mat is always greater than 0.
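If the same mat is reused for many different v vectors, one way to keep the per-set cost down is simply to loop the subset-then-apply version over a list of v's. A sketch (the vlist name and sizes are illustrative, not from the question):
vlist <- replicate(100, rbinom(100000, 1, .01), simplify = FALSE)  # hypothetical sets of v
res <- vapply(vlist,
              function(v) apply(mat[as.logical(v), , drop = FALSE], 2, min),
              numeric(ncol(mat)))
# res is an ncol(mat) x length(vlist) matrix of per-column minima
If the matrixStats package is available, matrixStats::colMins(mat[as.logical(v), , drop = FALSE]) may shave a little more per set, though I have not benchmarked it here.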
Bottleneck in my code:
a <- a[b]
where:
a and b are vectors of length 90 million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k row, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset covering somewhere between 20% and 90% of the rows). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply( P , order )
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approximately 10^8 values to go through. Each pass through the internal loop needs to increment the iterator, fetch the index b[i] from memory, look up a[b[i]], and then store that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions per element, which means you're looking at roughly a billion instructions in total, so there's a clock-rate limit on how fast this can go.
Also, R stores logical values as 32-bit ints, so the array a will take up about 400 MB, which doesn't fit into cache. So if b is a more or less random permutation, you're going to miss the cache regularly (on most lookups into a, in fact). Again, I'm not an expert, but I would think the cache misses are the bottleneck here, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you avoid going through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
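To illustrate that last idea: if successive b vectors differ in only a few positions, you can reuse the previous result and re-index just the changed entries. A sketch (all names here are illustrative, and the sizes are scaled down):
set.seed(1)
n       <- 1e6                                # smaller than 9e7, for illustration
a       <- sample(c(TRUE, FALSE), n, replace = TRUE)
b_old   <- sample(n)
b_new   <- b_old
moved   <- sample(n, 100)                     # pretend only 100 positions change
b_new[moved] <- b_old[rev(moved)]             # still a permutation
res     <- a[b_old]                           # full pass, done once
changed <- which(b_new != b_old)              # small set of positions
res[changed] <- a[b_new[changed]]             # update only those entries
identical(res, a[b_new])                      # TRUE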
On edit, here are some numbers. I have an AMD with a clock speed of 2.8 GHz. It takes 3.4 seconds with a random permutation (i.e. lots of cache misses) and about 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks down into 0.6 seconds of execution time and 0.1 seconds of system time, presumably to allocate the new array. So it does appear that cache misses are the issue. Optimized C code might shave something like 0.2 or 0.3 seconds off that base time, but if the permutation is random, it won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
I have a 100 x 100 matrix and I want to multiply every tenth row by 10.
I know that it is easily doable by writing something like:
tenthrows<-seq(from=10,to=100,by=10)
m[tenthrows,]<-m[tenthrows,]*10
However, I've been asked to do this using a "for loop" and the "sapply/lapply" functions.
I can do it with only a for loop, like:
for (i in seq(from = 10 , to = 100, by = 10)) {m[i,]<-m[i,]*10 }
But I am not sure how to incorporate sapply/lapply as well. Any ideas?
Many thanks in advance
Generate a matrix:
m0 <- m <- matrix(1:1e4,100,100)
One way to do it:
tenthrows <- seq(1,100,by=10)
m[tenthrows,] <- m[tenthrows,]*10
Doing it with a for loop would be less efficient. I can't think of a reason you would want to do it that way, unless (a) this is homework, (b) this is self-study, or (c) there is some context you haven't told us about. If one of those is true, could you let us know?
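That said, if the exercise really does require an *apply function, here is a sketch of one way to shoehorn it in (it is no more efficient than the vectorized version above):
m4 <- m0
m4[tenthrows, ] <- t(sapply(tenthrows, function(i) m0[i, ] * 10))
all.equal(m, m4)  ## should be TRUE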
Incorporating #baptiste's answer:
m2 <- m0 * c(10,rep(1,9))
all.equal(m,m2) ## TRUE
This works because R stores matrices column-wise: the length-10 vector is recycled down each 100-element column, so multiplying every 10th element is equivalent to multiplying every 10th row (this relies on the number of rows being a multiple of 10).
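A tiny toy example of the recycling (a 4 x 3 matrix, multiplying every 2nd row):
mm <- matrix(1:12, nrow = 4)
mm * c(10, 1)   # the length-2 vector recycles down each column, so rows 1 and 3 are multiplied by 10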
Another possibility:
m3 <- sweep(m0, MARGIN=1, FUN="*", STATS=c(10,rep(1,9)))
all.equal(m,m3) ## TRUE
Note that the difference from the original comment (which used c(rep(1,9),10)) depends on how you interpret "every 10th row": the original comment multiplies rows 10, 20, 30, ..., whereas my answer multiplies rows 1, 11, 21, 31, ...
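If you want the original interpretation (rows 10, 20, ..., 100), shifting the vector does it:
m5 <- m0 * c(rep(1, 9), 10)   # multiplies rows 10, 20, ..., 100 by 10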
library("rbenchmark")
benchmark(m[tenthrows,] <- m[tenthrows,]*10,
          m0*c(10,rep(1,9)),
          sweep(m0, MARGIN=1, FUN="*", STATS=c(10,rep(1,9))),
          replications=1000)
##                                     test replications elapsed relative
## 2                  m0 * c(10, rep(1, 9))         1000   0.109    1.000
## 1 m[tenthrows, ] <- m[tenthrows, ] * 10          1000   0.125    1.147
The rep approach is (slightly) faster. (sweep, not shown, is even slower, although it's perhaps the most principled approach: it's the one closest to a direct translation of "multiply every 10th row by 10".) It's also worth keeping in mind that even the slowest approach took about 0.25 seconds for 1000 replications of this task.
I would like to ask for advice on how to speed up the following operation.
I have two matrices, say A and B, each with n rows and 3 columns. For each row vector of A I want to compute its difference from each row vector of B, i.e. the pairwise differences between all row vectors of the two matrices, and then apply a function (the biharm() function in the example below) to each pair, giving an n*n result matrix. For small matrices I have no problems, but I need to apply this operation to very large matrices, such as 1000*3. In the sigm() function I wrote to do this, I first initialize the result matrix and then fill it with two nested for loops, which is slow for large matrices. Does anyone have an idea how to speed this up? I think apply() could help, but I cannot figure out the correct way to use it. Below is a fully reproducible example. Thanks in advance for any advice. Best, Paolo.
biharm <- function(vec1, vec2) {
  reso <- norm(as.matrix(vec1) - as.matrix(vec2), type = "F")^2 *
    log(norm(as.matrix(vec1) - as.matrix(vec2), type = "F"))
  reso
}
sigm <- function(mat1, mat2 = NULL) {
  tt <- mat1
  if (is.null(mat2)) { yy <- mat1 } else { yy <- mat2 }
  k <- nrow(yy)
  m <- ncol(yy)
  SGMr <- matrix(rep(0, k^2), ncol = k)
  for (i in 1:k) {
    for (j in 1:k) {
      SGMr[i, j] <- biharm(yy[i, ], tt[j, ])
    }
  }
  SGMr <- replace(SGMr, which(SGMr == "NaN", arr.ind = TRUE), 0)
  return(SGMr)
}
### small matrices example:
A<-matrix(rnorm(30),ncol=3)
B<-matrix(rnorm(30),ncol=3)
sigm(A,B)
### large matrices example:
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
sigm(A,B)
This is about 8 times faster on my system.
biharm.new <- function(vec1, vec2) {
  n <- sqrt(sum((vec1 - vec2)^2))
  n^2 * log(n)
}
sigm.new <- function(mat1, mat2 = NULL) {
  tt <- mat1
  if (is.null(mat2)) { yy <- mat1 } else { yy <- mat2 }
  SGMr <- apply(tt, 1, function(t) apply(yy, 1, biharm.new, t))
  replace(SGMr, which(SGMr == "NaN", arr.ind = TRUE), 0)
}
### large matrices example:
set.seed(1)
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
system.time(result.1<-sigm(A,B))
# user system elapsed
# 6.13 0.00 6.13
system.time(result.2<-sigm.new(A,B))
# user system elapsed
# 0.81 0.00 0.81
all.equal(result.1,result.2)
# [1] TRUE
The use of apply(...) accounts for about a 3-fold improvement. The rest comes from optimizing biharm(...): since it is called 90,000 times in this example (300 x 300 row pairs), it pays to make it as efficient as possible.
Note that for a vector the Frobenius norm is just the Euclidean norm, so if that is what you really want, use sqrt(sum(x^2)) rather than converting to matrices and calling norm(...); the former is much faster.
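If this still isn't fast enough, one could go further and avoid calling biharm on each pair altogether by computing all pairwise squared distances in one matrix expression. A sketch (the sigm.vec name is mine, not from the question; it keeps the same NaN-to-0 convention, and results should match sigm.new up to floating-point noise):
sigm.vec <- function(mat1, mat2 = NULL) {
  tt <- mat1
  yy <- if (is.null(mat2)) mat1 else mat2
  # squared Euclidean distance between every row of yy and every row of tt
  D2 <- outer(rowSums(yy^2), rowSums(tt^2), "+") - 2 * yy %*% t(tt)
  D2[D2 < 0] <- 0                # guard against tiny negative round-off
  S <- 0.5 * D2 * log(D2)        # d^2 * log(d) = 0.5 * d^2 * log(d^2)
  S[!is.finite(S)] <- 0          # 0 * log(0) gives NaN; map it to 0 as before
  S
}
The work is pushed into a single matrix multiplication, which should be considerably faster than tens of thousands of R-level function calls.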
How about this:
set.seed(1)
foo<-matrix(runif(30),nc=3)
bar<-matrix(runif(30),nc=3)
sapply(1:10,function(j) sapply(1:10,function(k) biharm(bar[k,],foo[j,])) )
EDIT: basically the same as jhoward's sigm.new, without the NULL-default and NaN handling. Clearly biharm.new is the winner.
microbenchmark(carl(foo,bar), jhoward(foo,bar), times=3)
Unit: milliseconds
              expr       min        lq    median       uq      max neval
    carl(foo, bar) 5846.8273 6071.364 6295.8999 6322.425 6348.951     3
 jhoward(foo, bar)  891.5734  934.550  977.5267 1008.388 1039.248     3
I have a matrix in which each row represents a data point (it's an n x p matrix with n p-dimensional points), and I need to find out whether there is a pair of equal points.
With only two points I could just apply dist, but as the number of points increases, so does the number of comparisons I need to do with dist (as I'm comparing only two points at a time).
So, as I'm starting to use big matrices, I need a quick way to find out whether any two points are equal (i.e. whether there are two equal rows in this matrix).
Any suggestions?
Edit: since I don't need to return the indices of the equal rows (I just need to verify whether any two are equal), I guess I could create a matrix with no duplicated rows and just compare its number of rows with that of the original matrix. What do you think?
Use the unique function, which is specifically set up to allow you to check for either unique rows or columns in a matrix. Or, depending on whether you want to save the reduced matrix or not, you could use duplicated as juba pointed out.
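For example, a minimal sketch (assuming the matrix is called m):
any(duplicated(m))          # TRUE if some row appears more than once
nrow(unique(m)) < nrow(m)   # the same check via unique(), as suggested in the edit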
If the matrix is large, consider using data tables.
library(data.table)
n <- 1e6
set.seed(1)
df <- data.frame(x.1 = round(runif(n, 0, 100)),
                 x.2 = round(runif(n, 0, 100)),
                 x.3 = round(runif(n, 0, 100)),
                 x.4 = round(runif(n, 0, 100)))
dt <- data.table(df)
system.time(df.dupe <- duplicated(df))
# user system elapsed
# 16.55 0.01 16.60
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 9.79 0.05 9.83
setkeyv(dt,colnames(dt))
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 0.08 0.00 0.07
So without keys, data tables are about 40% faster. With keys they are about 160x faster. Of course you have to create the keys (a sort), which takes about 10 seconds, so if you're only doing this once it's better to use the unkeyed data table.
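A small aside: if all you need is a yes/no answer, base R's anyDuplicated() is a shortcut for any(duplicated(.)) that returns 0 when there is no duplicated row and the index of the first duplicate otherwise; I haven't benchmarked it against the keyed data.table here.
anyDuplicated(df) > 0   # TRUE if the data frame contains at least one duplicated row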
I have a large correlation matrix, 62589 x 62589. I've binarised the matrix above a certain threshold, which I've done with no problems, but I'm slightly confused by the significant difference in basic calculation time.
The first time I did this: number of 1's: 425,491; number of 0's: 3,916,957,430.
The sum of these two numbers is 62589^2, implying that the matrix is truly binarised. I saved it as an Rdata object (31 Mb). Performing a basic calculation on the matrix takes ~3.5 minutes:
fooB <- foo %*% foo
The second time, with a lower threshold: number of 1's: 30,384,683; number of 0's: 3,886,998,238. The sum of these is again 62589^2, so the matrix is again truly binarised. The Rdata object is 84 Mb. Performing the same multiplication step as above is still running after an hour.
Should the increased number of 1's in the second matrix increase the file size and processing time so drastically?
Thanks for reading
Edit: the final time for the same calculation on the second matrix was 65 minutes.
Edit2: calling is() on the matrix returns: matrix, array, structure, vector.
Here is a reproducible example that may help, showing memory size and processing times for binary sparse matrices from the Matrix package:
n <- 62589
N1 <- 425491
require(Matrix)
foo <- sparseMatrix(i=sample(n, N1, TRUE), j=sample(n, N1, TRUE), dims=c(n, n))
print(object.size(foo), units="Mb")
#1.9 Mb
sum(foo)
#[1] 425464
(Note that the sampling may give some duplicates in pairs (i,j), thus the number above is slightly less than N1.)
Squaring:
system.time(fooB <- foo %*% foo)
# user system elapsed
# 0.39 0.03 0.42
print(object.size(fooB), units="Mb")
#11.3 Mb
sum(fooB)
#[1] 2892234
Cubing:
system.time(fooC <- fooB %*% foo)
# user system elapsed
# 2.74 0.11 2.87
print(object.size(fooC), units="Mb")
#75 Mb
sum(fooC)
#[1] 19610641
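If the binarised matrix already exists in dense form, one possible route (assuming the dense and sparse copies fit in memory together) is to coerce it to a sparse representation before multiplying. A small-scale sketch:
m <- matrix(rbinom(1e6, 1, 0.001), nrow = 1000)   # small dense 0/1 matrix for illustration
foo2 <- Matrix(m, sparse = TRUE)                  # coerce to a sparse Matrix class
print(object.size(foo2), units = "Mb")
system.time(foo2 %*% foo2)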