multiplying every nth row of a matrix by a number using sapply and for loop - r

I have a 100 * 100 matrix and I want to multiply every tenth row by 10.
I know that it would be easily doable by writing something like:
tenthrows<-seq(from=10,to=100,by=10)
m[tenthrows,]<-m[tenthrows,]*10
However, I've been asked to do this using a "for" loop and the "sapply/lapply" functions.
I could do this by using only for loop like:
for (i in seq(from = 10 , to = 100, by = 10)) {m[i,]<-m[i,]*10 }
But I am not sure how to also incorporate sapply/lapply for doing so. Any idea?
Many thanks in advance

Generate a matrix:
m0 <- m <- matrix(1:1e4,100,100)
One way to do it:
tenthrows <- seq(1,100,by=10)
m[tenthrows,] <- m[tenthrows,]*10
Doing it with a for loop would be less efficient. I can't think of a reason you would want to do it that way, unless (a) this is homework, (b) this is self-study, (c) there is some context you haven't told us about. If one of those is true, could you let us know?
Incorporating #baptiste's answer:
m2 <- m0 * c(10,rep(1,9))
all.equal(m,m2) ## TRUE
This works because R stores matrices columnwise, so that multiplying every 10th element is equivalent to multiplying every 10th row.
Another possibility:
m3 <- sweep(m0, MARGIN = 1, FUN = "*", STATS = c(10, rep(1, 9)))
all.equal(m,m3) ## TRUE
Note that the difference between this and the original comment (c(rep(1,9),10)) depends on how you interpret "every 10th row": the original comment multiplied rows 10, 20, 30, ..., whereas my answer multiplies rows 1, 11, 21, 31, ...
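For example, if the other reading is the one you want, here is a small hedged check (m4 is just an illustrative name):
m4 <- m0 * c(rep(1, 9), 10)
all.equal(m4[10, ], m0[10, ] * 10) ## TRUE (rows 10, 20, ..., 100 are scaled)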
library("rbenchmark")
benchmark(m[tenthrows,] <- m[tenthrows,]*10,
m0*c(10,rep(1,9)),
sweep(m0, MARGIN = 1, FUN = "*", STATS = c(10, rep(1, 9))),
replications=1000)
##                                    test replications elapsed relative
## 2                 m0 * c(10, rep(1, 9))         1000   0.109    1.000
## 1 m[tenthrows, ] <- m[tenthrows, ] * 10         1000   0.125    1.147
The rep approach is (slightly) faster. (sweep, not shown, is even slower, although it's perhaps the most principled approach -- it's the one that is the closest to a direct translation of "multiply every 10th row by 10". It's also worth keeping in mind that even the slowest approach took 0.25 seconds for 1000 replications of this task.)
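As for the sapply/lapply part of the question: there is no efficiency advantage, but here is a hedged sketch of one way to write it (m5 is just an illustrative name; it builds a new matrix row by row rather than modifying in place):
tenthrows <- seq(1, 100, by = 10)
m5 <- t(sapply(seq_len(nrow(m0)), function(i)
  if (i %in% tenthrows) m0[i, ] * 10 else m0[i, ]))
all.equal(m5, m0 * c(10, rep(1, 9))) ## TRUE
Note the t(): sapply returns each processed row as a column, so the result has to be transposed back.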

Related

R lookup time for very long vector

In the R programming language, the bottleneck in my code is:
a <- a[b]
where:
a and b are vectors of length 90 million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names, ]
Q <- apply(Psub, 2, sort)
But I am given 10,000 distinct subsets of names (each subset is anywhere from 20% to 90% of the total). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- apply(P, 2, order)
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- matrix(a, nrow = length(myNames))
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at "big O" of 1 billion instructions total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
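For instance, a hedged sketch of that last idea, assuming you can identify the positions that actually move (names here are mine; note that computing changed itself costs a full pass, so this only pays off if you can derive or reuse it cheaply):
changed <- which(b != seq_along(b))  # positions where the permutation actually moves something
a2 <- a
a2[changed] <- a[b[changed]]         # untouched positions already hold the right value
## identical(a2, a[b]) should be TRUE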
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
>

monthly means with apply for multidimensional arrays

I want to compute the mean over the third dimension of a multidimensional array. As this dimension represents time, I want to compute monthly means. For that I tried to use apply, but I am not sure where the problem is. Let's say my data is the following:
#Creating a sample
m <-array(1:12, dim=c(20,4,36))
#number of months
months <- seq(1:12)
#Compute the mean over each month (dimension of the result should be [20,4,12])
monmean <- apply(m,1:2,function(x) for(i in 1:12) mean(x[,,months==i],na.rm=TRUE))
Any idea??
Thanks in advance
I think I understand what you're after. This is actually slightly more complex than it may seem, because months are not regular periods of time; they vary in number of days, and February varies between years due to leap years. Thus a simple regular logical or numeric index vector will not be sufficient to calculate this result precisely. You need to take into account the exact dates that are covered by the z-dimension of your array.
Solution 1
What you can do is separately compute a date vector that identifies the dates that correspond to each z-index of your array. Within the apply() call for each z-line, you can then call strftime() to extract the months for each such date, and group by that month value using tapply() to take monthly mean()s. Here's how it could be done:
set.seed(1);
R <- 48;
C <- 39;
Z <- 3653;
N <- R*C*Z;
a1 <- array(rnorm(N,10,2),c(R,C,Z));
dates <- seq(as.Date('2000-01-01'),as.Date('2009-12-31'),1);
a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
Here's a demo showing a few specific proofs of correctness:
for (r in sample(1:nrow(a2),2))
  for (c in sample(1:ncol(a2),2))
    for (m in sample(1:dim(a2)[3],2))
      cat(sprintf('[%02d,%02d,%3s] %f %f\n', r, c, month.abb[m],
                  mean(a1[r,c,strftime(dates,'%m')==sprintf('%02d',m)]),
                  a2[r,c,m]));
## [14,05,Aug] 10.030313 10.030313
## [14,05,Apr] 10.200982 10.200982
## [14,25,Jan] 9.957879 9.957879
## [14,25,Apr] 10.185447 10.185447
## [26,34,Oct] 10.056931 10.056931
## [26,34,Nov] 9.876327 9.876327
## [26,17,Apr] 10.005423 10.005423
## [26,17,Sep] 10.009785 10.009785
Notes
I randomly chose a date range of 2000-01-01 to 2009-12-31 because it covers a 10 year period during which (due to leap years) there were exactly 3653 days, but obviously you should be sure to use whatever dates are actually covered by your real data.
As you can see, you were on the right track by calling apply() with 1:2 as the margins, because that allows you to operate independently on each z-line, such that you can group that z-line by month and compute the mean for each month along that z-line.
Unfortunately, apply() has an annoying habit of returning the result in a different transposition than people generally expect. For two-dimensional usages, this is normally solved with a simple call to t(), but since we're working in three dimensions here, we need to call aperm() to fix the dimension order.
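For concreteness, here is a small hedged illustration of that transposition (it just recomputes the same intermediate result to show the dimensions; tmp is an illustrative name):
tmp <- apply(a1, 1:2, function(x) tapply(x, strftime(dates,'%m'), mean));
dim(tmp);                   ## 12 48 39 (the months land on the first dimension)
dim(aperm(tmp, c(2,3,1)));  ## 48 39 12 (rows, columns, months, as intended)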
Since the dates I chose begin with January and advance through the months in calendar order, the means in the result will end up being ordered by calendar month. IOW, z-indexes 1:12 in a2 correspond to months Jan-Dec. If your dates do not begin with January, then this solution should still work, but you'll have to be careful about the correspondence between z-indexes and months in the result. For example, my "proof of correctness" code assumed that indexes 1:12 corresponded to months Jan-Dec, but that wouldn't be correct if the months occurred in a different order in the input array.
Solution 2
While writing this answer I actually thought of a slightly different, and one could argue slightly better, solution. You can call tapply() just once and group by rows, then columns, and finally months. Unfortunately, tapply() doesn't seem to be designed to naturally cycle its group vectors to cover the input vector, so we have to cycle them ourselves using carefully crafted calls to rep() (using the each and times arguments carefully--and I suppose tapply() actually wouldn't even know how to do this properly for our input data), but other than that, it's fairly straightforward:
a3 <- tapply(a1, list(rep(1:R, C*Z),
                      rep(1:C, each=R, times=Z),
                      rep(strftime(dates,'%m'), each=R*C)), mean);
Here's a proof that the result is identical to my first method (dimnames() have to be fixed first to get the identical() call to work, but that's trivial):
dimnames(a3) <- dimnames(a2);
identical(a3,a2);
## [1] TRUE
Performance
Here's some basic performance testing using system.time() to give an idea of the superiority of the second solution:
first <- function() a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
second <- function() a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
system.time({ first() });
## user system elapsed
## 3.672 0.015 3.719
system.time({ first() });
## user system elapsed
## 3.672 0.016 3.720
system.time({ second() });
## user system elapsed
## 1.797 0.344 2.135
system.time({ second() });
## user system elapsed
## 1.719 0.391 2.124

R: Faster Minimum Column Value Over Multiple Matrices?

I have a very tall integer matrix (mat) and a sparse binary vector (v) whose length equals the number of rows of mat. I want to find, for each column of mat, the minimum value over the rows where v==1.
Here are several possible solutions:
mat <- matrix(as.integer(runif(100000*100,0,2^31)),nrow=100000,ncol=100)
v<-(rbinom(100000,1,.01))
a<-apply(v*mat,2, function(x) min(x[x>0]))
b<-apply(mat,2,function(x) min(x[v==1]))
c<-sapply(subset(data.frame(mat),v==1), min)
These all work fine, and on my machine solution c seems fastest (an admittedly older, slower MacBook). But if I have a function that feeds unique sets of v, the computation time scales linearly with the number of sets. So a large number of unique sets (>10,000) will take hours to process.
Any ideas on how to do such an operation faster, or is this as fast as I can go?
I guess that subsetting and then calling apply gains a lot, given that v is almost always 0:
system.time(b<-apply(mat[as.logical(v),],2, min))
# user system elapsed
# 0.012 0.000 0.013
system.time(a<-apply(v*mat,2, function(x) min(x[x>0])))
# user system elapsed
# 0.628 0.019 0.649
identical(a,b)
#[1] TRUE
I also dropped the x[x>0], since it appears that mat is always greater than 0.
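If the matrixStats package is available, its colMins() avoids the per-column anonymous function entirely and may be faster still; a hedged sketch, not benchmarked here (b2 is an illustrative name):
library(matrixStats)
b2 <- colMins(mat[as.logical(v), , drop = FALSE])
all.equal(b, b2) ## should be TRUE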

row-wise differences between two large matrices in R

I would like to ask for advice on how to speed up the following operation.
I have two matrices, say A and B, with n rows and 3 columns; for every row vector of A I want to compute its difference from every row vector of B, i.e. the pairwise differences between all row vectors of the two matrices. The resulting matrix is then an n*n matrix. Then I want to apply a function to each element of this matrix, the biharm() function that I wrote in the example. The problem is that, while for small matrices I have no problems, I need to apply this operation to very large matrices such as 1000*3. In the sigm() function, which I wrote to do that, I first initialize the result matrix and then use two nested for loops. However, this is slow for large matrices. Does anyone have an idea of how to speed this up? I think apply() is the way, but I cannot figure out the correct approach. Below is a fully reproducible example. Thanks in advance for any advice. Best, Paolo.
biharm <- function(vec1, vec2){
  reso <- norm(as.matrix(vec1)-as.matrix(vec2), type="F")^2 * log(norm((as.matrix(vec1)-as.matrix(vec2)), type="F"))
  reso
}
sigm <- function(mat1, mat2=NULL){
  tt <- mat1
  if(is.null(mat2)){yy <- mat1} else {yy <- mat2}
  k <- nrow(yy)
  m <- ncol(yy)
  SGMr <- matrix(rep(0, k^2), ncol=k)
  for(i in 1:k){
    for(j in 1:k){
      SGMr[i,j] <- biharm(yy[i,], tt[j,])
    }
  }
  SGMr <- replace(SGMr, which(SGMr=="NaN", arr.ind=T), 0)
  return(SGMr)
}
### small matrices example:
A<-matrix(rnorm(30),ncol=3)
B<-matrix(rnorm(30),ncol=3)
sigm(A,B)
### large matrices example:
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
sigm(A,B)
This is about 8 times faster on my system.
biharm.new <- function(vec1, vec2){
  n <- sqrt(sum((vec1-vec2)^2))
  n^2*log(n)
}
sigm.new <- function(mat1, mat2=NULL){
  tt <- mat1
  if(is.null(mat2)){yy <- mat1} else {yy <- mat2}
  SGMr <- apply(tt, 1, function(t) apply(yy, 1, biharm.new, t))
  replace(SGMr, which(SGMr=="NaN", arr.ind=T), 0)
}
### large matrices example:
set.seed(1)
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
system.time(result.1<-sigm(A,B))
# user system elapsed
# 6.13 0.00 6.13
system.time(result.2<-sigm.new(A,B))
# user system elapsed
# 0.81 0.00 0.81
all.equal(result.1,result.2)
# [1] TRUE
The use of apply(...) results in about a 3-fold improvement. The rest comes from optimizing biharm(...): since you are calling it 90,000 times (300 rows in each matrix), it pays to make it as efficient as possible.
Note that the Frobenius norm of a vector is just the Euclidean norm, so if that is what you really want, use sqrt(sum(x^2)) rather than converting to matrices and using norm(...). The former is much faster.
How about this:
set.seed(1)
foo<-matrix(runif(30),nc=3)
bar<-matrix(runif(30),nc=3)
sapply(1:10,function(j) sapply(1:10,function(k) biharm(bar[k,],foo[j,])) )
EDIT -- basically the same as jhoward's "sigm.new" without the error checking. Clearly biharm.new is a winner.
microbenchmark(carl(foo,bar),jhoward(foo,bar),times=3)
Unit: milliseconds
              expr       min       lq    median       uq      max neval
    carl(foo, bar) 5846.8273 6071.364 6295.8999 6322.425 6348.951     3
 jhoward(foo, bar)  891.5734  934.550  977.5267 1008.388 1039.248     3
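For completeness, a hedged sketch of a fully vectorized variant (mine, not from either answer) that drops apply() altogether by using the identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b; it should reproduce sigm() up to floating-point rounding:
sigm.vec <- function(mat1, mat2 = NULL){
  yy <- if(is.null(mat2)) mat1 else mat2
  tt <- mat1
  ## squared Euclidean distance between every row of yy and every row of tt
  d2 <- outer(rowSums(yy^2), rowSums(tt^2), "+") - 2 * tcrossprod(yy, tt)
  d2[d2 < 0] <- 0              # guard against tiny negative values from rounding
  res <- 0.5 * d2 * log(d2)    # n^2 * log(n) with n = sqrt(d2)
  res[!is.finite(res)] <- 0    # d2 == 0 gives NaN, replaced by 0 as in sigm()
  res
}
## all.equal(sigm(A,B), sigm.vec(A,B)) should be TRUE (up to rounding)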

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
V1 V2 V3
1 47286730 47286725 0.830
2 47286740 47286791 0.065
3 47286750 47286806 -0.165
4 47288371 47288427 0.760
5 47288841 47288890 0.285
6 47288896 47288945 0.225
What I am trying to accomplish is, for each number in coordvec, to find the average of V3 over the rows of subscores in which V1 and V2 encompass that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for(i in 1:length(coordvec)){
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
queryHits subjectHits
<integer> <integer>
1 19 685913
2 35 929424
3 46 1130191
4 52 37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord=start(q)[queryHits(olaps)],
score=scores[subjectHits(olaps)])
res1 <- dt[,mean(score), by=coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster:
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of your subscores meet your boolean condition, I'm not sure if this does exactly what you're looking for, but you can use one of the apply family and a function.
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
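A hedged sketch of that mclapply suggestion, assuming a Unix-alike (mclapply relies on forking, which is not available on Windows) and using myfun from above; mc.cores = 4 is just a placeholder for however many cores you have:
library(parallel)
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))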
