Monthly means with apply for multidimensional arrays - R

I want to compute the mean over the third dimension of a multidimensional array. As this dimension represents time, I want to compute monthly means. I tried to use apply for that, but I am not sure where the problem is. Let's say my data is as follows:
#Creating a sample
m <-array(1:12, dim=c(20,4,36))
#number of months
months <- seq(1:12)
#Compute the mean over each month (the result should have dimensions [20,4,12])
monmean <- apply(m,1:2,function(x) for(i in 1:12) mean(x[,,months==i],na.rm=TRUE))
Any idea??
Thanks in advance

I think I understand what you're after. This is actually slightly more complex than it may seem, because months are not regular periods of time; they vary in number of days, and February varies between years due to leap years. Thus a simple regular logical or numeric index vector will not be sufficient to calculate this result precisely. You need to take into account the exact dates that are covered by the z-dimension of your array.
Solution 1
What you can do is separately compute a date vector that identifies the dates that correspond to each z-index of your array. Within the apply() call for each z-line, you can then call strftime() to extract the months for each such date, and group by that month value using tapply() to take monthly mean()s. Here's how it could be done:
set.seed(1);
R <- 48;
C <- 39;
Z <- 3653;
N <- R*C*Z;
a1 <- array(rnorm(N,10,2),c(R,C,Z));
dates <- seq(as.Date('2000-01-01'),as.Date('2009-12-31'),1);
a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
Here's a demo showing a few specific proofs of correctness:
for (r in sample(1:nrow(a2),2))
    for (c in sample(1:ncol(a2),2))
        for (m in sample(1:dim(a2)[3],2))
            cat(sprintf('[%02d,%02d,%3s] %f %f\n',r,c,month.abb[m],
                mean(a1[r,c,strftime(dates,'%m')==sprintf('%02d',m)]),a2[r,c,m]));
## [14,05,Aug] 10.030313 10.030313
## [14,05,Apr] 10.200982 10.200982
## [14,25,Jan] 9.957879 9.957879
## [14,25,Apr] 10.185447 10.185447
## [26,34,Oct] 10.056931 10.056931
## [26,34,Nov] 9.876327 9.876327
## [26,17,Apr] 10.005423 10.005423
## [26,17,Sep] 10.009785 10.009785
Notes
I chose the date range 2000-01-01 to 2009-12-31 because it covers a 10-year period during which (due to leap years) there were exactly 3653 days, but obviously you should be sure to use whatever dates are actually covered by your real data.
As you can see, you were on the right track by calling apply() with 1:2 as the margins, because that allows you to operate independently on each z-line, such that you can group that z-line by month and compute the mean for each month along that z-line.
Unfortunately, apply() has an annoying habit of returning the result in a different transposition than people generally expect. For two-dimensional usages, this is normally solved with a simple call to t(), but since we're working in three dimensions here, we need to call aperm() to fix the dimension order.
Since the dates I chose begin with January and advance through the months in calendar order, the means in the result end up ordered by calendar month. In other words, z-indexes 1:12 in a2 correspond to months Jan-Dec. If your dates do not begin with January, this solution should still work, but you'll have to be careful about the correspondence between z-indexes and months in the result. For example, my "proof of correctness" code assumed that indexes 1:12 correspond to months Jan-Dec, which wouldn't be correct if the months occurred in a different order in the input array. A quick sanity check of the dimension order and month labels is shown below.
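This check is not part of the original answer; it just reuses the objects defined above to confirm why the aperm() is needed and which month each z-index corresponds to:
tmp <- apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean));
dim(tmp);            ## 12 48 39 -- the month dimension comes first, hence the aperm()
dim(a2);             ## 48 39 12
dimnames(a2)[[3]];   ## "01" "02" ... "12" -- month labels in the order tapply() grouped them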
Solution 2
While writing this answer I thought of a slightly different, and arguably slightly better, solution: call tapply() just once and group by row, then column, and finally month. Unfortunately, tapply() is not designed to cycle its grouping vectors to cover the input vector on its own (and it wouldn't know how to do so correctly for this data anyway), so we have to build them ourselves with carefully crafted calls to rep(), using the each and times arguments. Other than that, it's fairly straightforward:
a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
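To see how those rep() calls line up with R's column-major storage of the array (rows cycle fastest, then columns, then the z-dimension), here is a toy illustration with made-up sizes, not taken from the original answer:
nr <- 2; nc <- 3; nz <- 4;
rep(1:nr, nc*nz);               ## row index cycles fastest:  1 2 1 2 1 2 ...
rep(1:nc, each=nr, times=nz);   ## column index cycles next:  1 1 2 2 3 3 1 1 ...
rep(1:nz, each=nr*nc);          ## z index cycles slowest:    1 1 1 1 1 1 2 2 ...
In the real call, strftime(dates,'%m') plays the role of the z index.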
Here's a proof that the result is identical to my first method (dimnames() have to be fixed first to get the identical() call to work, but that's trivial):
dimnames(a3) <- dimnames(a2);
identical(a3,a2);
## [1] TRUE
Performance
Here's some basic performance testing using system.time() to give an idea of the superiority of the second solution:
first <- function() a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
second <- function() a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
system.time({ first() });
## user system elapsed
## 3.672 0.015 3.719
system.time({ first() });
## user system elapsed
## 3.672 0.016 3.720
system.time({ second() });
## user system elapsed
## 1.797 0.344 2.135
system.time({ second() });
## user system elapsed
## 1.719 0.391 2.124
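A fair share of the gap is probably the repeated strftime() call: the first solution recomputes strftime(dates,'%m') once per z-line (48*39 = 1872 times), whereas the second computes it once. A variant of the first solution with the month labels hoisted out (a sketch, not benchmarked above) would likely narrow the difference:
mo <- strftime(dates,'%m');
first2 <- function() aperm(apply(a1,1:2,function(x) tapply(x,mo,mean)),c(2,3,1));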

Related

Count duration of value in vector in R

I am trying to count the length of occurrences of a value in a vector such as
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
Actual vectors are longer than this, and are time-based. What I would like would be an output for 4 that tells me it occurred for 12 time steps (before the vector changes to 6) and then for 3 time steps (not that it occurred 15 times total).
Currently my ideas for doing this are pretty inefficient (a loop that looks at the vector element by element and stops when it no longer equals the value I specified). Can anyone recommend a more efficient method?
x <- with(rle(q), data.frame(values, lengths)) will pull the information that you want (courtesy of d.b. in the comments).
From the R Documentation: rle is used to "Compute the lengths and values of runs of equal values in a vector – or the reverse operation."
y <- x[x$values == 4, ] will subset the data frame to include only the value of interest (4). You can then see clearly that 4 ran for 12 times and then later for 3.
Modifying the code will let you check whatever value you want.
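Putting those two lines together into a runnable snippet, using the q from the question (output shown as comments):
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
x <- with(rle(q), data.frame(values, lengths))
x[x$values == 4, ]
##   values lengths
## 2      4      12
## 5      4       3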

R lookup time for very long vector

In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a, b are vectors of length 90 million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset is between 20% and 90% of the total). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply( P , order )
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like it's on the order of 5-10 instructions per element, which means you're looking at roughly a billion instructions total, so there's a clock-rate limit on how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743

Finding equal matrix lines

I have a matrix in which each row represents a data point (it's an n x p matrix with n p-dimensional points), and I need to find whether there is a pair of equal points.
With only two points, I could just apply dist, but as the number of points increases, so does the number of comparisons I need to do with dist (as I'm comparing only two points at a time).
So, as I'm starting to use big matrices, I need a quick way to find if any two points are equal (or if there are two equal rows in this matrix).
Any suggestions?
Edit: as I don't need to return the row numbers of the equal rows (I just need to verify whether any two are equal), I guess I could create a matrix with no duplicated lines and just compare the number of lines between this matrix and the original matrix. What do you think?
Use the unique function, which is specifically set up to allow you to check for either unique rows or columns in a matrix. Or, depending on whether you want to save the reduced matrix or not, you could use duplicated as juba pointed out.
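As a minimal sketch (assuming your points are in a matrix called m; the name is just for illustration), either of these gives the yes/no answer directly:
any(duplicated(m))           ## TRUE if some row occurs more than once
nrow(unique(m)) < nrow(m)    ## the same check via the de-duplicated matrix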
If the matrix is large, consider using data tables.
library(data.table)
n <- 1e6
set.seed(1)
df <- data.frame(x.1=round(runif(n,0,100)),
                 x.2=round(runif(n,0,100)),
                 x.3=round(runif(n,0,100)),
                 x.4=round(runif(n,0,100)))
dt <- data.table(df)
system.time(df.dupe <- duplicated(df))
# user system elapsed
# 16.55 0.01 16.60
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 9.79 0.05 9.83
setkeyv(dt,colnames(dt))
system.time(dt.dupe <- duplicated(dt))
# user system elapsed
# 0.08 0.00 0.07
So without keys, data tables are about 40% faster. With keys, they are about 160x faster. Of course you have to create the keys (sort), which takes about 10 seconds, so if you're only doing this once, it's better to use the unkeyed data table.
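As a small aside (not benchmarked here), if all you need is the yes/no answer, anyDuplicated() returns the index of the first duplicated row (or 0 if there is none), so it can stop early:
anyDuplicated(dt) > 0   ## TRUE if the table contains at least one duplicated row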

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
        V1       V2     V3
1 47286730 47286725  0.830
2 47286740 47286791  0.065
3 47286750 47286806 -0.165
4 47288371 47288427  0.760
5 47288841 47288890  0.285
6 47288896 47288945  0.225
What I am trying to accomplish is, for each number in coordvec, to find the average of V3 over the rows of subscores whose V1 and V2 encompass that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for(i in 1:length(coordvec)){
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
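Since the question asks for rows whose [V1, V2] interval encompasses the coordinate, the type="within" variant (query contained in subject) may be the relevant refinement, keeping in mind that IRanges endpoints are inclusive whereas the question uses strict inequalities. A possible variant, not timed here:
olaps_within <- findOverlaps(q, s, type="within")
The output below refers to the default type="any" call.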
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
    queryHits subjectHits
    <integer>   <integer>
  1        19      685913
  2        35      929424
  3        46     1130191
  4        52       37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord=start(q)[queryHits(olaps)],
                 score=scores[subjectHits(olaps)])
res1 <- dt[,mean(score), by=coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster:
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of your subscores rows meet your boolean condition, I'm not sure whether this does exactly what you're looking for, but you can use one of the apply family with a helper function.
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
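For instance, here is a hedged sketch of parallelizing the sapply() call above with mclapply() (the core count is illustrative, and mc.cores > 1 is not available on Windows):
library(parallel)
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))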

Efficiencies for nested for loop

I've created the following code that nests a for loop inside of a for loop in R. It is a simulation to calculate Power. I've read that R isn't great for doing for loops but I was wondering if there are any efficiencies I could apply to make this run a bit faster. I'm fairly new to R as well as programming of any sort. Right now the run times I'm seeing are:
m=10 I get .17 sec
m=100 I get 3.95 sec
m=1000 I get 246.26 sec
m=2000 I get 1003.55 sec
I was hoping to set the number of times to sample, m, upwards of 100K, but I'm afraid to even set this at 10K.
Here is the code:
m = 1000     # number of times we are going to take samples
popmean=120  # set population mean at 120
popvar=225   # set known/established population
variance at 225
newvar=144   # variance of new methodology
alpha=.01    # set alpha
teststatvect = matrix(nrow=m,ncol=1)  # empty vector to populate with test statistics
power = matrix(nrow=200,ncol=1)       # empty vector to populate with power
system.time(                # not needed - using to gauge how long this takes
for (n in 1:length(power)){ # begin for loop for different sample sizes
  for(i in 1:m){            # begin for loop to take "m" samples
    y=rnorm(n,popmean,sqrt(newvar))  # sample of size n with mean 120 and var=144
    ts=sum((y-popmean)^2/popvar)     # calculate test statistic for each sample
    teststatvect[i]=ts               # loop and populate the vector to hold test statistics
    vecpvals=pchisq(teststatvect,n)  # calculate the pval of each statistic
    power[n]=length(which(vecpvals<=alpha))/length(vecpvals) # power is the proportion less than or equal to alpha
  }
}
)
I reorganized your code a bit and got rid of the inner loop.
Sampling one long vector of random numbers (and then collapsing it into a matrix) is much faster than repeatedly sampling short vectors (replicate, as suggested in another answer, is nice for readability, but in this case you can do better by sampling random numbers in a block)
colSums is faster than summing inside a for loop or using apply.
it's just sugar (i.e. it isn't actually any more efficient), but you can use mean(pvals<=alpha) in place of sum(pvals<=alpha)/length(pvals)
I defined a function to return the power for a specified set of parameters (including sample size), then used sapply to range over the vector of sizes (not faster than a for loop, but cleaner and maybe easier to generalize).
Code:
powfun <- function(ssize=100,
                   m=1000,        ## samples per trial
                   popmean=120,   ## pop mean
                   popvar=225,    ## known/established pop variance
                   newvar=144,    ## variance of new methodology
                   alpha=0.01,
                   sampchisq=FALSE)  ## sample directly from chi-squared distrib?
{
  if (!sampchisq) {
    ymat <- matrix(rnorm(ssize*m,popmean,sd=sqrt(newvar)),ncol=m)
    ts <- colSums((ymat-popmean)^2/popvar)  ## test statistic
  } else {
    ts <- rchisq(m,df=ssize)*newvar/popvar
  }
  pvals <- pchisq(ts,df=ssize)  ## pval
  mean(pvals<=alpha)            ## power
}
Do you really need the power for every integer value of sample size, or would a more widely spaced sample be OK? (If you need exact values, interpolation would probably be pretty accurate.)
ssizevec <- seq(10,250,by=5)
set.seed(101)
system.time(powvec <- sapply(ssizevec,powfun,m=5000)) ## 13 secs elapsed
This is reasonably fast and might get you up to m=1e5 if you needed, but I'm not quite sure why you need results that are that precise -- the power curve is reasonably smooth with m=5000 ...
If you're impatiently waiting for long simulations, you can also get a progress bar to print by replacing sapply(ssizevec,powfun,m=5000) with library(plyr); aaply(ssizevec,.margins=1,powfun,.progress="text",m=5000)
Finally, I think you can speed the whole up a lot by sampling chi-squared values directly, or by doing an analytical power calculation (!). I think that rchisq(m,df=ssize)*newvar/popvar is equivalent to the first two lines of the loop, and you might even be able to do a numerical computation on the chi-squared densities directly ...
system.time(powvec2 <- sapply(ssizevec,powfun,m=5000,sampchisq=TRUE))
## 0.24 seconds elapsed
(I just tried this out, sampling m=1e5 at every value of sample size from 1 to 200 ... it takes 24 seconds ... but I still think it might be unnecessary.)
A picture:
par(bty="l",las=1)
plot(ssizevec,powvec,type="l",xlab="sample size",ylab="power",
     xlim=c(0,250),ylim=c(0,1))
lines(ssizevec,powvec2,col="red")
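For what it's worth, the analytical power calculation hinted at above reduces to a one-liner here: the test statistic is (newvar/popvar) times a chi-squared(n) variable, and the test rejects when pchisq(ts, n) <= alpha, i.e. when ts <= qchisq(alpha, n). A sketch, not from the original answer (powfun_exact is a made-up name):
powfun_exact <- function(ssize, popvar=225, newvar=144, alpha=0.01)
  pchisq(qchisq(alpha, df=ssize) * popvar/newvar, df=ssize)
lines(ssizevec, powfun_exact(ssizevec), col="blue", lty=2)  ## should track the simulated curves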
In general, you want as far as possible to take advantage of vectorization, not so much for speed as readability/comprehension.
Why is the write to power[n] inside the inner loop (and, I guess, the calculation of vecpvals as well)? Shouldn't those be in the outer loop, after the inner loop executes? You may also want to move the calculation of the square root outside both loops.
Why are teststatvect and power initialized as matrices (which are explicitly two-dimensional arrays) and not as vectors (or rather, as one-dimensional arrays, using array)? Is variance at 225 just the end of the comment from the previous line? You may want to check the formatting. (Is this homework?)
For what it looks like you're trying to do here, you may want to take advantage of the very handy function replicate, perhaps by writing a specific function to call it on.
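A minimal sketch of that replicate() idea, wrapping the body of the question's inner loop (the function name and argument defaults are just illustrative):
sim_power <- function(n, m=1000, popmean=120, popvar=225, newvar=144, alpha=0.01) {
  ts <- replicate(m, {
    y <- rnorm(n, popmean, sqrt(newvar))   ## one sample of size n
    sum((y - popmean)^2 / popvar)          ## its test statistic
  })
  mean(pchisq(ts, df=n) <= alpha)          ## proportion of p-values <= alpha
}
sim_power(50)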
