Finding equal matrix lines - r

I have a matrix in which each row represents a data point (it's an n x p matrix with n p-dimensional points), and I need to find out whether any two points are equal.
With only two points, I could just apply dist, but as the number of points increases, so does the number of comparisons I need to make with dist (as I'm comparing only two points at a time).
So, as I'm starting to use big matrices, I need a quick way to find if any two points are equal (or if there are two equal rows in this matrix).
Any suggestions?
Edit: as I don't need to return the numbers of the equal rows (I just need to verify if any two are equal), I guess I could create a matrix with no duplicated lines and just compare the number of lines between this matrix and the original matrix. What do you think?

Use the unique function, which is specifically set up to allow you to check for either unique rows or columns in a matrix. Or, depending on whether you want to save the reduced matrix or not, you could use duplicated as juba pointed out.
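As a minimal sketch of both suggestions (the toy matrix m is made up for illustration), any of the following checks reports whether a matrix contains at least two equal rows:

```r
# Toy matrix: rows 1 and 3 are equal (illustrative data, not from the question)
m <- rbind(c(1, 2, 3),
           c(4, 5, 6),
           c(1, 2, 3))

# Option 1: compare row counts before and after dropping duplicate rows
has_dup_unique <- nrow(unique(m)) < nrow(m)

# Option 2: duplicated() flags every row that repeats an earlier row
has_dup_flag <- any(duplicated(m))

# Option 3: anyDuplicated() returns the index of the first repeated row, or 0
has_dup_any <- anyDuplicated(m) > 0
```

All three compare rows for exact equality, so values that differ only by floating-point rounding count as different rows.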

If the matrix is large, consider using data tables.
library(data.table)
n <- 1e6
set.seed(1)
df <- data.frame(x.1=round(runif(n,0,100)),
x.2=round(runif(n,0,100)),
x.3=round(runif(n,0,100)),
x.4=round(runif(n,0,100)))
dt <- data.table(df)
system.time(df.dupe <- duplicated(df))
# user system elapsed
# 16.55 0.01 16.60
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 9.79 0.05 9.83
setkeyv(dt,colnames(dt))
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 0.08 0.00 0.07
So without keys, data tables are about 40% faster. With keys they are about 160X faster. Of course you have to create the keys (sort), which takes about 10 sec, so if you're only doing this once, it's better to use the unkeyed data table.

Related

R lookup time for very long vector

In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a,b are vectors of length 90 Million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k row, 5k columns).
rows = names, columns = features. values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset contains 20% to 90% of all names). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply( P , order )
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at "big O" of 1 billion instructions total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
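The four-bytes-per-element figure can be checked directly. A quick sketch (90e6 is the vector length from the question; the variable names are only illustrative):

```r
# Size of a logical vector with one million elements, in bytes
bytes_per_million <- as.numeric(object.size(logical(1e6)))
bytes_per_element <- bytes_per_million / 1e6   # about 4, plus a small fixed header

# Approximate size of a 90-million-element logical vector, in megabytes
approx_mb <- 90e6 * 4 / 1e6
```

That works out to roughly 360 MB, in line with the estimate above and far larger than a typical CPU cache.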
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
>
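For the problem described under Context, one option that may keep the working set small is to handle one column at a time, so that each lookup touches only nrow(P) values rather than the full 90-million-element vector. The following is a rough sketch on toy data, not a benchmarked solution: P, my_names, ord, sorted_all and sorted_subset are illustrative names, and whether this is actually faster on the real data is an assumption that would need timing.

```r
set.seed(1)
# Toy stand-in for the real matrix: 6 names (rows) x 4 features (columns)
P <- matrix(rnorm(24), nrow = 6, dimnames = list(letters[1:6], NULL))
my_names <- c("a", "c", "f")          # hypothetical subset of names

# Precompute once: per-column ordering and the fully sorted columns
ord        <- apply(P, 2, order)
sorted_all <- unname(apply(P, 2, sort))

# Per subset: permute the membership mask into each column's sorted order,
# then keep only the sorted values whose row belongs to the subset
keep <- rownames(P) %in% my_names
sorted_subset <- function(j) sorted_all[keep[ord[, j]], j]
Q <- sapply(seq_len(ncol(P)), sorted_subset)
```

Each column of Q holds the sorted values of the selected rows, without re-sorting for every subset.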

monthly means with apply for multidimensional arrays

I want to compute the mean over the third dimension of a multidimensional array. As this dimension represents time, I want to compute monthly means. For that, I tried to use apply, but I am not sure where the problem is. Let's say my data is as follows:
#Creating a sample
m <-array(1:12, dim=c(20,4,36))
#number of months
months <- seq(1:12)
#Compute the mean over each month (dimension of the result should be [20,4,12])
monmean <- apply(m,1:2,function(x) for(i in 1:12) mean(x[,,months==i],na.rm=TRUE))
Any idea??
Thanks in advance
I think I understand what you're after. This is actually slightly more complex than it may seem, because months are not regular periods of time; they vary in number of days, and February varies between years due to leap years. Thus a simple regular logical or numeric index vector will not be sufficient to calculate this result precisely. You need to take into account the exact dates that are covered by the z-dimension of your array.
Solution 1
What you can do is separately compute a date vector that identifies the dates that correspond to each z-index of your array. Within the apply() call for each z-line, you can then call strftime() to extract the months for each such date, and group by that month value using tapply() to take monthly mean()s. Here's how it could be done:
set.seed(1);
R <- 48;
C <- 39;
Z <- 3653;
N <- R*C*Z;
a1 <- array(rnorm(N,10,2),c(R,C,Z));
dates <- seq(as.Date('2000-01-01'),as.Date('2009-12-31'),1);
a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
Here's a demo showing a few specific proofs of correctness:
for (r in sample(1:nrow(a2),2)) for (c in sample(1:ncol(a2),2)) for (m in sample(1:dim(a2)[3],2)) cat(sprintf('[%02d,%02d,%3s] %f %f\n',r,c,month.abb[m],mean(a1[r,c,strftime(dates,'%m')==sprintf('%02d',m)]),a2[r,c,m]));
## [14,05,Aug] 10.030313 10.030313
## [14,05,Apr] 10.200982 10.200982
## [14,25,Jan] 9.957879 9.957879
## [14,25,Apr] 10.185447 10.185447
## [26,34,Oct] 10.056931 10.056931
## [26,34,Nov] 9.876327 9.876327
## [26,17,Apr] 10.005423 10.005423
## [26,17,Sep] 10.009785 10.009785
Notes
I randomly chose a date range of 2000-01-01 to 2009-12-31 because it covers a 10 year period during which (due to leap years) there were exactly 3653 days, but obviously you should be sure to use whatever dates are actually covered by your real data.
As you can see, you were on the right track by calling apply() with 1:2 as the margins, because that allows you to operate independently on each z-line, such that you can group that z-line by month and compute the mean for each month along that z-line.
Unfortunately, apply() has an annoying habit of returning the result in a different transposition than people generally expect. For two-dimensional usages, this is normally solved with a simple call to t(), but since we're working in three dimensions here, we need to call aperm() to fix the dimension order.
Since the dates I chose begin with January and advance through the months in calendar order, the means in the result will end up being ordered by calendar month. IOW, z-indexes 1:12 in a2 correspond to months Jan-Dec. If your dates do not begin with January, then this solution should still work, but you'll have to be careful about the correspondence between z-indexes and months in the result. For example, my "proof of correctness" code assumed that indexes 1:12 corresponded to months Jan-Dec, but that wouldn't be correct if the months occurred in a different order in the input array.
Solution 2
While writing this answer I actually thought of a slightly different, and one could argue slightly better, solution. You can call tapply() just once and group by rows, then columns, and finally months. Unfortunately, tapply() doesn't seem to be designed to naturally cycle its group vectors to cover the input vector, so we have to cycle them ourselves using carefully crafted calls to rep() (using the each and times arguments carefully--and I suppose tapply() actually wouldn't even know how to do this properly for our input data), but other than that, it's fairly straightforward:
a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
Here's a proof that the result is identical to my first method (dimnames() have to be fixed first to get the identical() call to work, but that's trivial):
dimnames(a3) <- dimnames(a2);
identical(a3,a2);
## [1] TRUE
Performance
Here's some basic performance testing using system.time() to give an idea of the superiority of the second solution:
first <- function() a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
second <- function() a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
system.time({ first() });
## user system elapsed
## 3.672 0.015 3.719
system.time({ first() });
## user system elapsed
## 3.672 0.016 3.720
system.time({ second() });
## user system elapsed
## 1.797 0.344 2.135
system.time({ second() });
## user system elapsed
## 1.719 0.391 2.124
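If each z-slice already represents one whole month, as the 36-step sample array in the question suggests, the date vector is not needed and a recycled month index is enough. A minimal sketch of Solution 1 under that assumption (the series is assumed to start in January; month_id is an illustrative name):

```r
set.seed(1)
# Toy array shaped like the question's: 20 x 4 grid, 36 monthly time steps
m <- array(rnorm(20 * 4 * 36), dim = c(20, 4, 36))

# Month label (1..12) for each time step, assuming the series starts in January
month_id <- rep(1:12, length.out = dim(m)[3])

# Mean per month along the time dimension; apply() puts months first,
# so aperm() moves them back to the third dimension
monmean <- aperm(apply(m, 1:2, function(x) tapply(x, month_id, mean)), c(2, 3, 1))
```

The result has dimensions 20 x 4 x 12, and monmean[r, c, k] is the mean of the time steps that fall in month k for grid cell (r, c).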

R: Faster Minimum Column Value Over Multiple Matrices?

I have a very tall integer matrix (mat) and a sparse binary vector (v) of equal row length. I want to find the minimum value in all columns of mat where v==1.
Here are several possible solutions:
mat <- matrix(as.integer(runif(100000*100,0,2^31)),nrow=100000,ncol=100)
v<-(rbinom(100000,1,.01))
a<-apply(v*mat,2, function(x) min(x[x>0]))
b<-apply(mat,2,function(x) min(x[v==1]))
c<-sapply(subset(data.frame(mat),v==1), min)
These all work fine, and on my machine (an admittedly older, slower MacBook) solution c seems fastest. But if I have a function that feeds in unique sets of v, the computation time scales linearly with the number of sets, so a large number of unique sets (>10,000) will take hours to process.
Any ideas on how to do such an operation faster, or is this as fast as I can go?
I guess that subsetting and then calling apply gains a lot, given that v is almost always 0:
system.time(b<-apply(mat[as.logical(v),],2, min))
# user system elapsed
# 0.012 0.000 0.013
system.time(a<-apply(v*mat,2, function(x) min(x[x>0])))
# user system elapsed
# 0.628 0.019 0.649
identical(a,b)
#[1] TRUE
I also dropped the x[x>0] filter, since it appears that mat is always greater than 0.
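One caveat with the subsetting approach: if v selects exactly one row, mat[as.logical(v),] collapses to a plain vector and apply() fails. A small sketch of a guarded helper (col_min_where is a hypothetical name, and the toy data is much smaller than in the question):

```r
set.seed(1)
mat <- matrix(sample.int(1000L, 200 * 5, replace = TRUE), nrow = 200)
v   <- rbinom(200, 1, 0.05)
v[1] <- 1L                      # make sure at least one row is selected

# Column minima over the rows where v == 1.
# drop = FALSE keeps a matrix even if only one row is selected.
col_min_where <- function(mat, v) {
  rows <- which(v == 1)
  if (length(rows) == 0L) return(rep(NA_integer_, ncol(mat)))
  apply(mat[rows, , drop = FALSE], 2, min)
}
res <- col_min_where(mat, v)
```

Returning NA when no row is selected is a design choice made for this sketch; the original solutions return Inf with a warning in that case.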

row-wise differences between two large matrices in R

I would like to ask for opinions on how to speed up the following operation.
I have two matrices, say A and B, with n rows and 3 columns; I want to compute the difference between every row vector of A and every row vector of B, i.e. the pairwise differences between all row vectors of the two matrices. The result is an n*n matrix. I then want to apply a function to every element of it, the biharm() function that I wrote in the example. This works fine for small matrices, but I need to apply the operation to very large matrices such as 1000*3. In the sigm() function that I wrote to do this, I first initialize the result matrix (SGMr) and then run two nested for loops. However, this is slow for large matrices. Does anyone have an idea on how to speed this up? I thought of using apply() but I cannot figure out the correct way. Below is a fully reproducible example. Thanks in advance for any advice. Best, Paolo.
biharm<-function(vec1,vec2){
reso<-norm(as.matrix(vec1)-as.matrix(vec2),type="F")^2*log(norm((as.matrix(vec1)-as.matrix(vec2)),type="F"))
reso
}
sigm<-function(mat1,mat2=NULL){
tt<-mat1
if(is.null(mat2)){yy<-mat1}else{yy<-mat2}
k<-nrow(yy)
m<-ncol(yy)
SGMr<-matrix(rep(0,k^2),ncol=k)
for(i in 1:k){
for(j in 1: k){
SGMr[i,j]<-biharm(yy[i,],tt[j,])
}}
SGMr<-replace(SGMr,which(SGMr=="NaN",arr.ind=T),0)
return(SGMr)}
### small matrices example:
A<-matrix(rnorm(30),ncol=3)
B<-matrix(rnorm(30),ncol=3)
sigm(A,B)
### large matrices example:
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
sigm(A,B)
This is about 8 times faster on my system.
biharm.new <- function(vec1,vec2){
n <- sqrt(sum((vec1-vec2)^2))
n^2*log(n)
}
sigm.new<-function(mat1,mat2=NULL){
tt<-mat1
if(is.null(mat2)){yy<-mat1}else{yy<-mat2}
SGMr <- apply(tt,1,function(t)apply(yy,1,biharm.new,t))
replace(SGMr,which(SGMr=="NaN",arr.ind=T),0)
}
### large matrices example:
set.seed(1)
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
system.time(result.1<-sigm(A,B))
# user system elapsed
# 6.13 0.00 6.13
system.time(result.2<-sigm.new(A,B))
# user system elapsed
# 0.81 0.00 0.81
all.equal(result.1,result.2)
# [1] TRUE
The use of apply(...) results in about a 3-fold improvement. The rest comes from optimizing biharm(...): since you are calling it 90,000 times here (once for each of the 300 x 300 row pairs), it pays to make it as efficient as possible.
Note that the Frobenius norm of a vector (a one-column matrix) is just its Euclidean norm, so if that is what you really want, use sqrt(sum(x^2)) rather than converting to matrices and using norm(...). The former is much faster.
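Going one step further, the loops can be removed entirely. The sketch below is not from the answers here; it is an alternative that assumes the squared Euclidean distance can be expanded as |b - a|^2 = |b|^2 + |a|^2 - 2 * b.a, and it uses n^2 * log(n) = 0.5 * d2 * log(d2) with d2 = n^2. sigm_vec is a hypothetical name, and the result agrees with sigm() only up to floating-point rounding.

```r
# Fully vectorized alternative: all pairwise squared distances in one step
sigm_vec <- function(mat1, mat2 = NULL) {
  tt <- mat1
  yy <- if (is.null(mat2)) mat1 else mat2
  # d2[i, j] = squared distance between row i of yy and row j of tt
  d2 <- outer(rowSums(yy^2), rowSums(tt^2), "+") - 2 * tcrossprod(yy, tt)
  d2 <- pmax(d2, 0)                 # clamp tiny negative rounding errors
  res <- 0.5 * d2 * log(d2)         # 0 * log(0) gives NaN
  res[is.nan(res)] <- 0             # same NaN -> 0 convention as sigm()
  res
}

set.seed(1)
A <- matrix(rnorm(30), ncol = 3)
B <- matrix(rnorm(30), ncol = 3)
S <- sigm_vec(A, B)
```

As in sigm(), rows of the result correspond to rows of the second matrix and columns to rows of the first.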
How about this:
set.seed(1)
foo<-matrix(runif(30),nc=3)
bar<-matrix(runif(30),nc=3)
sapply(1:10,function(j) sapply(1:10,function(k) biharm(bar[k,],foo[j,])) )
EDIT -- basically same as jhoward's "sigm.new" without the error checking. Clearly biharm.new is a winner.
microbenchmark(carl(foo,bar),jhoward(foo,bar),times=3)
Unit: milliseconds
expr min lq median uq max neval
carl(foo, bar) 5846.8273 6071.364 6295.8999 6322.425 6348.951 3
jhoward(foo, bar) 891.5734 934.550 977.5267 1008.388 1039.248 3

Efficiencies for nested for loop

I've created the following code that nests a for loop inside of a for loop in R. It is a simulation to calculate Power. I've read that R isn't great for doing for loops but I was wondering if there are any efficiencies I could apply to make this run a bit faster. I'm fairly new to R as well as programming of any sort. Right now the run times I'm seeing are:
m=10 I get .17 sec
m=100 I get 3.95 sec
m=1000 I get 246.26 sec
m=2000 I get 1003.55 sec
I was hoping to set the number of times to sample, m, upwards of 100K but I'm afraid to even set this at 10K
Here is the code:
m = 1000 # number of times we are going to take samples
popmean=120 # set population mean at 120
popvar=225 # set known/established population
variance at 225
newvar=144 # variance of new methodology
alpha=.01 # set alpha
teststatvect = matrix(nrow=m,ncol=1) # empty vector to populate with test statistics
power = matrix(nrow=200,ncol=1) # empty vector to populate with power
system.time( # not needed - using to gauge how long this takes
for (n in 1:length(power)) # begin for loop for different sample sizes
for(i in 1:m){ # begin for loop to take "m" samples
y=rnorm(n,popmean,sqrt(newvar)) # sample of size n with mean 120 and var=144
ts=sum((y-popmean)^2/popvar) # calculate test statistic for each sample
teststatvect[i]=ts # loop and populate the vector to hold test statistics
vecpvals=pchisq(teststatvect,n) # calculate the pval of each statistic
power[n]=length(which(vecpvals<=alpha))/length(vecpvals) # loop to populate power vector. Power is the proportion less than or equal to alpha
}
}
)
I reorganized your code a bit and got rid of the inner loop.
Sampling one long vector of random numbers (and then collapsing it into a matrix) is much faster than repeatedly sampling short vectors (replicate, as suggested in another answer, is nice for readability, but in this case you can do better by sampling random numbers in a block)
colSums is faster than summing inside a for loop or using apply.
it's just sugar (i.e. it isn't actually any more efficient), but you can use mean(pvals<=alpha) in place of sum(pvals<=alpha)/length(pvals)
I defined a function to return the power for a specified set of parameters (including sample size), then used sapply to range over the vector of sizes (not faster than a for loop, but cleaner and maybe easier to generalize).
Code:
powfun <- function(ssize=100,
m=1000, ## samples per trial
popmean=120, ## pop mean
popvar=225, ## known/established pop variance
newvar=144, ## variance of new methodology
alpha=0.01,
sampchisq=FALSE) ## sample directly from chi-squared distrib?
{
if (!sampchisq) {
ymat <- matrix(rnorm(ssize*m,popmean,sd=sqrt(newvar)),ncol=m)
ts <- colSums((ymat-popmean)^2/popvar) ## test statistic
} else {
ts <- rchisq(m,df=ssize)*newvar/popvar
}
pvals <- pchisq(ts,df=ssize) ## pval
mean(pvals<=alpha) ## power
}
Do you really need the power for every integer value of sample size, or would a more widely spaced sample be OK (if you need exact values, interpolation would probably be pretty accurate)?
ssizevec <- seq(10,250,by=5)
set.seed(101)
system.time(powvec <- sapply(ssizevec,powfun,m=5000)) ## 13 secs elapsed
This is reasonably fast and might get you up to m=1e5 if you needed, but I'm not quite sure why you need results that are that precise -- the power curve is reasonably smooth with m=5000 ...
If you're impatiently waiting for long simulations, you can also get a progress bar to print by replacing sapply(ssizevec,powfun,m=5000) with library(plyr); aaply(ssizevec,.margins=1,powfun,.progress="text",m=5000)
Finally, I think you can speed the whole thing up a lot by sampling chi-squared values directly, or by doing an analytical power calculation (!). I think that rchisq(m,df=ssize)*newvar/popvar is equivalent to the first two lines of the loop, and you might even be able to do a numerical computation on the chi-squared densities directly ...
system.time(powvec2 <- sapply(ssizevec,powfun,m=5000,sampchisq=TRUE))
## 0.24 seconds elapsed
(I just tried this out, sampling m=1e5 at every value of sample size from 1 to 200 ... it takes 24 seconds ... but I still think it might be unnecessary.)
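The analytical calculation hinted at above might look like the following sketch. It assumes, as in powfun, that the test statistic is distributed as (newvar/popvar) times a chi-squared variable with ssize degrees of freedom; power_exact and powvec3 are hypothetical names.

```r
# Analytical power, assuming ts = (newvar / popvar) * X with X ~ chi-squared(ssize).
# The simulation counts pchisq(ts, ssize) <= alpha, i.e. ts <= qchisq(alpha, ssize).
power_exact <- function(ssize, popvar = 225, newvar = 144, alpha = 0.01) {
  crit <- qchisq(alpha, df = ssize)            # threshold on the test statistic
  pchisq(crit * popvar / newvar, df = ssize)   # P(X <= crit * popvar / newvar)
}

ssizevec <- seq(10, 250, by = 5)
powvec3 <- power_exact(ssizevec)
```

Because qchisq and pchisq are vectorized, this evaluates the whole power curve at once with no sampling noise, and it can be overlaid on the plot with lines(ssizevec, powvec3).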
A picture:
par(bty="l",las=1)
plot(ssizevec,powvec,type="l",xlab="sample size",ylab="power",
xlim=c(0,250),ylim=c(0,1))
lines(ssizevec,powvec2,col="red")
In general, you want as far as possible to take advantage of vectorization, not so much for speed as readability/comprehension.
Why is the write to power[n] inside the inner loop (and, I guess, the calculation of vecpvals as well)? Shouldn't that be in the outer loop after the inner loop executes? You may want to move the calculation of the square root outside both loops.
Why are teststatvect and power initialized as matrices (which are explicitly two dimensional arrays) and not vectors (or rather, as one dimensional arrays, using array)? Is "variance at 225" just the end of the comment from the previous line? You may want to check formatting. (Is this homework?)
For what it looks like you're trying to do here, you may want to take advantage of the very handy function replicate, perhaps by writing a specific function to call it on.
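A minimal sketch of the replicate idea, using the parameter values from the question (one_stat, stats and power_n are illustrative names, and n = 50 is an arbitrary sample size chosen for the example):

```r
# One simulated test statistic for a sample of size n
one_stat <- function(n, popmean = 120, popvar = 225, newvar = 144) {
  y <- rnorm(n, popmean, sqrt(newvar))
  sum((y - popmean)^2 / popvar)
}

set.seed(1)
n <- 50; m <- 1000; alpha <- 0.01
stats <- replicate(m, one_stat(n))           # m independent test statistics
power_n <- mean(pchisq(stats, n) <= alpha)   # estimated power at sample size n
```

This replaces the inner loop and the preallocated teststatvect; the p-values and the power are computed once per sample size rather than on every iteration.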
