Aggregate rows in a large matrix by rowname - r

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem occurs when using ddply). Is there a more efficient way to come up with the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know they are stored in a sparse matrix format.

Here's a solution using by and colSums, but requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9

There is now an aggregate function in Matrix.utils. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider one of these other functions only if you were working with a Matrix, especially a sparse one, or if you were performing an aggregation other than sum.
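For reference, here is what the rowsum() route looks like on the toy matrix from the question; rowsum() is base R and gives, for each group of rows, the column sums (a minimal sketch, reconstructing the example matrix by hand):
M <- matrix(c(1,1,2,0, 2,3,4,2, 3,0,1,2, 4,2,5,2), nrow=4, byrow=TRUE,
            dimnames=list(c(1,1,2,3), c("a","b","c","d")))
rowsum(M, group=rownames(M))
##   a b c d
## 1 3 4 6 2
## 2 3 0 1 2
## 3 4 2 5 2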

The answer by James works as expected, but is quite slow for large matrices. Here is a version that avoids creating new objects:
combineByRow <- function(m) {
    m <- m[order(rownames(m)), ]
    ## keep track of previous row name
    prev <- rownames(m)[1]
    i.start <- 1
    i.end <- 1
    ## cache the rownames -- profiling shows that it takes
    ## forever to look at them
    m.rownames <- rownames(m)
    stopifnot(all(!is.na(m.rownames)))
    ## go through matrix in a loop, as we need to combine some unknown
    ## set of rows
    for (i in 2:(1+nrow(m))) {
        curr <- m.rownames[i]
        ## if we found a new row name (or are at the end of the matrix),
        ## combine all rows and mark invalid rows
        if (prev != curr || is.na(curr)) {
            if (i.start < i.end) {
                m[i.start,] <- apply(m[i.start:i.end,], 2, max)
                m.rownames[(1+i.start):i.end] <- NA
            }
            prev <- curr
            i.start <- i
        } else {
            i.end <- i
        }
    }
    m[which(!is.na(m.rownames)),]
}
Testing shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)

Related

Calculating standard deviation of variables in a large list in R

I have a large list that contains 1000 lists of the same variables and same length.
My goal is to calculate mean, standard deviation, and standard error of all lists within the large list.
I have calculated the mean of the variables using Reduce(), but I couldn't figure out how to do the same for the standard deviation.
My list looks something like this:
large.list <- vector('list', 1000)
for (i in 1:1000) {
large.list[[i]] <- as.data.frame(matrix(c(1:4), ncol=2))
}
large.list
[[1]]
V1 V2
1 1 3
2 2 4
[[2]]
V1 V2
1 1 3
2 2 4
[[3]]
V1 V2
1 1 3
2 2 4
......
[[1000]]
V1 V2
1 1 3
2 2 4
To calculate mean, I do:
list.mean <- Reduce("+", large.list) / length(large.list)
list.mean
V1 V2
1 1 3
2 2 4
This is an overly simplified version of my large list, but how can I calculate the list-wide standard deviation and standard error like I did for the mean?
Thank you very much in advance!
If you stay with Reduce(), you have to do a little bit of statistics:
var(x) = E(x^2) - (E(x))^2
Note that you already have E(x) as list.mean. Getting E(x^2) is also straightforward:
list.squared.mean <- Reduce("+", lapply(large.list, "^", 2)) / length(large.list)
Then variance is:
list.variance <- list.squared.mean - list.mean^2
Standard deviation is just
list.sd <- sqrt(list.variance)
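The question also asks for the standard error; if by that you mean the standard error of the mean across the 1000 lists, it is just sd/sqrt(n) (note that the identity above gives the population variance, i.e. a divide-by-n estimate, whereas sd() in the solutions below divides by n-1, so the results will differ very slightly):
list.se <- list.sd / sqrt(length(large.list))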
However, a much more efficient solution is to use tapply()
vec <- unlist(large.list, use.names = FALSE)
DIM <- dim(large.list[[1]])
n <- length(large.list)
list.mean <- tapply(vec, rep(1:prod(DIM),times = n), mean)
attr(list.mean, "dim") <- DIM
list.mean <- as.data.frame(list.mean)
list.sd <- tapply(vec, rep(1:prod(DIM),times = n), sd)
attr(list.sd, "dim") <- DIM
list.sd <- as.data.frame(list.sd)
If I may suggest an alternative, you could transform the list into a 3-dimensional array, and then use apply() to produce the output.
Here's how to transform the list (assuming dimensional regularity):
m <- do.call(cbind,lapply(large.list,as.matrix));
m <- array(m,c(nrow(m),ncol(m)/length(large.list),length(large.list)));
And here's how to use apply() on the array:
apply(m,1:2,mean);
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
apply(m,1:2,sd);
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
Here is a solution based on reshaping the list into a data.table. We are basically extracting the value at each index i from every sub-list to create a single vector.
library(data.table)
ll <- unlist(large.list)
DX <- data.table(V1 = ll[c(T,F,F,F)],
                 V2 = ll[c(F,T,F,F)],
                 V3 = ll[c(F,F,T,F)],
                 V4 = ll[c(F,F,F,T)])
Then all calculations are straightforward:
mm <- DX[,lapply(.SD,mean)]
sdd <- DX[,lapply(.SD,sd)]

Efficiently save indices of nonzero matrix elements to a file

I need to save the indices of a matrix's nonzero elements to a file. This works very well for small matrices, storing the row indices of the non-zero elements in a and the column indices in b:
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2);
a <- row(X)[which(!X == 0)]
b <- col(X)[which(!X == 0)]
But the matrix is huge, and I need to find an efficient way to save the indices to a txt file, so that I have a[1] b[1] (new line) a[2] b[2] and so on. Any suggestions?
The package Matrix has a great solution for extremely large matrices. The sparseMatrix object can be summarized into a data.frame where i and j are your indices and x is the value:
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2);
a <- row(X)[which(!X == 0)]
b <- col(X)[which(!X == 0)]
library(Matrix)
Y <- Matrix(X, sparse = TRUE)
(res <- summary(Y))
2 x 3 sparse Matrix of class "dgCMatrix", with 4 entries
i j x
1 1 1 1
2 2 1 4
3 1 3 3
4 2 3 5
class(res)
[1] "sparseSummary" "data.frame"
You can then subset to get just i and j:
res[, c("i", "j")]
i j
1 1 1
2 2 1
3 1 3
4 2 3
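To get the indices into a text file as asked, you can then pass those two columns straight to write.table (a sketch; pick whatever separator and file name you need):
write.table(res[, c("i", "j")], "file.txt", row.names = FALSE, col.names = FALSE)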
You can grab the rows and columns of all non-zero locations using which with parameter arr.ind=TRUE, writing the result to a file with write.table:
write.table(which(X != 0, arr.ind=TRUE), "file.txt", row.names=F, col.names=F)
This yields space-separated output of the pairs of elements in the specified file:
1 1
2 1
1 3
2 3
Using which with arr.ind=TRUE saves a few scans through your input matrix compared to the code posted in your question, so it should be a bit quicker at calculating the data to output. You can see this with a benchmark for a larger matrix (1000 x 1000, with 1% density):
set.seed(144)
bigX <- matrix(sample(c(rep(0, 99), 1), 1000000, replace=T), nrow=1000)
OP <- function(X) cbind(row(X)[which(!X == 0)], col(X)[which(!X == 0)])
josilber <- function(X) which(X != 0, arr.ind=TRUE)
library(microbenchmark)
microbenchmark(OP(bigX), josilber(bigX))
# Unit: milliseconds
# expr min lq mean median uq max neval
# OP(bigX) 20.513535 23.014517 36.463423 25.354250 59.130520 65.50304 100
# josilber(bigX) 3.873165 4.281624 6.741824 5.250777 6.998415 45.02542 100
In this case we see about a 5x speedup in computing the non-zero rows and columns. Depending on the density and size of your matrix the output operation (write.table) might instead be the bottleneck, in which case there may not be too much benefit to this approach.

faster way to compare rows in a data frame

Consider the data frame below. I want to compare each row with rows below and then take the rows that are equal in more than 3 values.
I wrote the code below, but it is very slow if you have a large data frame.
How could I do that faster?
data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9), nrow=5, byrow=T))
rownames(data)<-c("sample_1","sample_2","sample_3","sample_4","sample_5")
>data
V1 V2 V3 V4 V5
sample_1 10 11 10 13 9
sample_2 10 11 10 14 9
sample_3 10 10 8 12 9
sample_4 10 11 10 13 9
sample_5 13 13 10 13 9
output <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
for(i in 1:nrow(data)) {
    sample <- data[i, ]
    for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
        matches <- 0
        for(V in 1:ncol(data)) {
            if(data[j,V] == sample[,V]) {
                matches <- matches + 1
            }
        }
        if(matches > 3) {
            duplicate <- data[j, ]
            pair <- cbind(rownames(sample), rownames(duplicate), matches)
            output[dfrow, ] <- pair
            dfrow <- dfrow + 1
        }
    }
}
>output
sample duplicate matches
1 sample_1 sample_2 4
2 sample_1 sample_4 5
3 sample_2 sample_4 4
Here is an Rcpp solution. However, if the result matrix gets too big (i.e., there are too many hits), this will throw an error. I run the loops twice, first to get the necessary size of the result matrix and then to fill it. There is probably a better possibility. Also, obviously, this will only work with integers. If your matrix is numeric, you'll have to deal with floating point precision.
library(Rcpp)
library(inline)
#C++ code:
body <- '
const IntegerMatrix M(as<IntegerMatrix>(MM));
const int m=M.ncol(), n=M.nrow();
long count1;
int count2;
count1 = 0;
for (int i=0; i<(n-1); i++)
{
    for (int j=(i+1); j<n; j++)
    {
        count2 = 0;
        for (int k=0; k<m; k++) {
            if (M(i,k)==M(j,k)) count2++;
        }
        if (count2>3) count1++;
    }
}
IntegerMatrix R(count1,3);
count1 = 0;
for (int i=0; i<(n-1); i++)
{
    for (int j=(i+1); j<n; j++)
    {
        count2 = 0;
        for (int k=0; k<m; k++) {
            if (M(i,k)==M(j,k)) count2++;
        }
        if (count2>3) {
            count1++;
            R(count1-1,0) = i+1;
            R(count1-1,1) = j+1;
            R(count1-1,2) = count2;
        }
    }
}
return wrap(R);
'
fun <- cxxfunction(signature(MM = "matrix"),
body,plugin="Rcpp")
#with your data
fun(as.matrix(data))
# [,1] [,2] [,3]
# [1,] 1 2 4
# [2,] 1 4 5
# [3,] 2 4 4
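If you would rather see row names than integer indices, you can map the result back afterwards; a small sketch, assuming the result has been stored in res:
res <- fun(as.matrix(data))
data.frame(sample    = rownames(data)[res[,1]],
           duplicate = rownames(data)[res[,2]],
           matches   = res[,3])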
#Benchmarks
set.seed(42)
mat1 <- matrix(sample(1:10,250*26,TRUE),ncol=26)
mat2 <- matrix(sample(1:10,2500*26,TRUE),ncol=26)
mat3 <- matrix(sample(1:10,10000*26,TRUE),ncol=26)
mat4 <- matrix(sample(1:10,25000*26,TRUE),ncol=26)
library(microbenchmark)
microbenchmark(
fun(mat1),
fun(mat2),
fun(mat3),
fun(mat4),
times=3
)
# Unit: milliseconds
# expr min lq median uq max neval
# fun(mat1) 2.675568 2.689586 2.703603 2.732487 2.761371 3
# fun(mat2) 272.600480 274.680815 276.761151 276.796217 276.831282 3
# fun(mat3) 4623.875203 4643.634249 4663.393296 4708.067638 4752.741979 3
# fun(mat4) 29041.878164 29047.151348 29052.424532 29235.839275 29419.254017 3
EDIT: Not sure what I was thinking last night when I subtracted rows, considering I could've directly tested for equality. Removed that unnecessary step from the code below.
Here is one approach that may either be slightly clever or poorly thought out... but hopefully the former. The idea is that instead of doing a series of comparisons row-by-row you can instead perform some vectorized operations by subtracting the row from the rest of the data frame and then looking at the number of elements that are equal to zero. Here is a simple implementation of the approach:
> library(data.table)
> data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9), nrow=5, byrow=T))
> rownames(data)<-c("sample_1","sample_2","sample_3","sample_4","sample_5")
>
> findMatch <- function(i,n){
+ tmp <- colSums(t(data[-(1:i),]) == unlist(data[i,]))
+ tmp <- tmp[tmp > n]
+ if(length(tmp) > 0) return(data.table(sample=rownames(data)[i],duplicate=names(tmp),match=tmp))
+ return(NULL)
+ }
>
> system.time(tab <- rbindlist(lapply(1:(nrow(data)-1),findMatch,n=3)))
user system elapsed
0.003 0.000 0.003
> tab
sample duplicate match
1: sample_1 sample_2 4
2: sample_1 sample_4 5
3: sample_2 sample_4 4
EDIT: Here is version 2, which uses matrices and pre-transposes the data so you only need to do that once. It should scale better to your example with a non-trivial amount of data.
library(data.table)
data <- matrix(round(runif(26*250000,0,25)),ncol=26)
tdata <- t(data)
findMatch <- function(i,n){
    ## drop = FALSE keeps the matrix shape even when only one row remains
    tmp <- colSums(tdata[, -(1:i), drop = FALSE] == data[i,])
    j <- which(tmp > n)
    ## j counts from the first remaining row, so offset by i to recover the original row
    if(length(j) > 0) return(data.table(sample=i,duplicate=j+i,match=tmp[j]))
    return(NULL)
}
tab <- rbindlist(lapply(1:(nrow(data)-1),findMatch,n=3))
I ran that on my machine for a bit and got through the first 1500 iterations of a full 250,000 x 26 matrix in under 15 minutes, requiring 600 MB of memory. Since previous iterations do not impact future iterations, you could certainly chunk this into parts and run them separately if needed (see the sketch below).
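A minimal sketch of that chunking idea (the chunk size of 1000 is arbitrary, and each chunk's result could just as well be written to disk instead of kept in memory):
idx <- 1:(nrow(data) - 1)
chunks <- split(idx, ceiling(seq_along(idx) / 1000))
tab <- rbindlist(lapply(chunks, function(ii)
    rbindlist(lapply(ii, findMatch, n = 3))))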
This is not a complete answer, just a quick workaround that comes to mind: use matrices instead of a data.frame (data frames are quite slow, tbh). Matrices are quite fast in R, and doing at least some of the operations on them and only attaching the vector of row names at the end results in a significant speed increase.
Just a quick demo:
data <- matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9), nrow=5, byrow=T)
rownames(data) <- c("sample_1","sample_2","sample_3","sample_4","sample_5")
mu<-c("sample_1","sample_2","sample_3","sample_4","sample_5")
t=proc.time()
tab <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
for(i in 1:nrow(data)) {
    sample <- data[i, ]
    for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
        matches <- 0
        for(V in 1:ncol(data)) {
            if(data[j,V] == sample[V]) {
                matches <- matches + 1
            }
        }
        if(matches > 3) {
            duplicate <- data[j, ]
            pair <- cbind(mu[i], mu[j], matches)
            tab[dfrow, ] <- pair
            dfrow <- dfrow + 1
        }
    }
}
proc.time()-t
On average, on my machine, this yields
user system elapsed
0.00 0.06 0.06
while with your original data.frame code I get
user system elapsed
0.02 0.06 0.08
I'm not sure whether there's anything quicker than matrices. You can also play around with parallelisation, but for loops like this, C++ code inlining (package Rcpp) is quite often used.
library(data.table)
#creating the data
dt <- data.table(read.table(textConnection(
"Sample V1 V2 V3 V4 V5
sample_1 10 11 10 13 9
sample_2 10 11 10 14 9
sample_3 10 10 8 12 9
sample_4 10 11 10 13 9
sample_5 13 13 10 13 9"), header= TRUE))
# some constants which will be used frequently
nr = nrow(dt)
nc = ncol(dt)-1
#list into which we will insert the no. of matches for each sample
#for example's sake, i still suggest you write output to a file possibly
totalmatches <- vector(mode = "list", length = (nr-1))
#looping over each sample
for ( i in 1:(nr-1))
{
    # all combinations of i with i+1 to nr
    samplematch <- cbind(dt[i],dt[(i+1):nr])
    # renaming the comparison sample columns
    setnames(samplematch,append(colnames(dt),paste0(colnames(dt),"2")))
    #calculating number of matches
    samplematch[,noofmatches := 0]
    for (j in 1:nc)
    {
        samplematch[,noofmatches := noofmatches+1*(get(paste0("V",j)) == get(paste0("V",j,"2")))]
    }
    # removing individual value columns and matches < 3
    samplematch <- samplematch[noofmatches >= 3,list(Sample,Sample2,noofmatches)]
    # adding to the list
    totalmatches[[i]] <- samplematch
}
The output -
rbindlist(totalmatches)
Sample Sample2 noofmatches
1: sample_1 sample_2 4
2: sample_1 sample_4 5
3: sample_1 sample_5 3
4: sample_2 sample_4 4
5: sample_4 sample_5 3
The matrix approach still seems to perform better, though; this method clocked:
user system elapsed
0.17 0.01 0.19
Everything that has been said in the comments is very valid; in particular, I also don't necessarily think R is the best place to do this. That said, this works a lot quicker for me than what you've posted on a much larger dataset (~9.7 sec vs. unfinished after two minutes):
data <- matrix(sample(1:30, 10000, replace=TRUE), ncol=5)
#Pre-prepare
#(the original post did not show how output was pre-allocated; this is one plausible way)
output <- matrix(NA_real_, nrow = choose(nrow(data), 2), ncol = 3)
x <- 1
#Loop
for(i in seq(nrow(data)-2)){
    #Find the number of matches on that row
    sums <- apply(data[seq(from=-1,to=-i),], 1, function(x) sum(x==data[i,]))
    #Find how many are greater than/equal to 3
    matches <- which(sums >= 3)
    #Prepare output
    output[seq(from=x, length.out=length(matches)),1] <- rep(i, length(matches))
    output[seq(from=x, length.out=length(matches)),2] <- matches
    output[seq(from=x, length.out=length(matches)),3] <- sums[matches]
    #Alter the counter of how many we've made...
    x <- x + length(matches)
}
#Cleanup output
output <- output[!is.na(output[,1]),]
...I'm fairly certain my weird x variable and the assignment of output could be improved/turned into an apply-type problem, but it's late and I'm tired! Good luck!
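For what it's worth, folding that loop into lapply could look roughly like this (a sketch along the same lines, not benchmarked; it keeps the >= 3 threshold used above, builds the output with do.call(rbind, ...) instead of the counter, and additionally maps the match positions back to original row numbers):
res <- lapply(seq(nrow(data) - 1), function(i) {
    ## number of matches of row i against each later row
    sums <- apply(data[-(1:i), , drop = FALSE], 1, function(r) sum(r == data[i, ]))
    keep <- which(sums >= 3)
    ## the remaining rows start at i + 1, so offset the positions by i
    if (length(keep)) cbind(i, i + keep, sums[keep]) else NULL
})
output <- do.call(rbind, res)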
Well, I took a stab at it, the following code runs about 3 times faster than the original.
f <- function(ind, mydf){
    res <- NULL
    matches <- colSums(t(mydf[-(1:ind),])==mydf[ind,])
    Ndups <- sum(matches > 3)
    if(Ndups > 0){
        res <- data.frame(sample=rep(ind,Ndups),duplicate=which(matches > 3),
                          matches= matches[matches > 3],stringsAsFactors = F)
        rownames(res) <- NULL
        return(as.matrix(res))
    }
    return(res)
}
f(1,mydf=as.matrix(data))
f(2,mydf=as.matrix(data))
system.time(
    for(i in 1:1000){
        tab <- NULL
        for(j in 1:(dim(data)[1]-1))
            tab <- rbind(tab,f(j,mydf=as.matrix(data)))
    }
)/1000
tab
Assuming that all the entries in your dataset are of the same mode (numeric), turn it into a matrix. By transposing, you can take advantage of how == can be vectorized.
data <- as.matrix(data)
data <- t(data)
output <- lapply(seq_len(ncol(data) - 1), function(x) {
    tmp <- data[,x] == data[, (x+1):ncol(data)]
    n_matches <- {
        if (x == ncol(data) - 1) {
            setNames(sum(tmp),colnames(data)[ncol(data)])
        } else {
            colSums(tmp)
        }
    }
    good_matches <- n_matches[n_matches >= 3]
})
The big question is how to output the results. As it stands I have your data in a list. I would think that this is the least memory-intensive way of storing your data.
[[1]]
sample_2 sample_4 sample_5
4 5 3
[[2]]
sample_4
4
[[3]]
named numeric(0)
[[4]]
sample_5
3
If you want a data frame as output, then you'll want to tweak the return value of the function within lapply. Perhaps add this as the last line of the function:
return(data.frame(
    sample = colnames(data)[x],
    duplicate = names(good_matches),
    noofmatches = good_matches,
    stringsAsFactors = FALSE))
And then use:
newoutput <- do.call(rbind, output)
## or, using plyr
# require(plyr)
# newoutput <- rbind.fill(output)

does the by( ) function make a growing list

Does the by function make a list that grows one element at a time?
I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:
> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
col1 col2 f
1 1 10000001 0
2 2 10000002 0
3 3 10000003 0
4 4 10000004 0
5 5 10000005 5
6 6 10000006 5
Now, a tapply on one of the columns takes a reasonable amount of time:
> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs
But if I do this:
z = by(x[, 1], x[, "f"], mean)
That doesn't finish anywhere near the same time (I gave up after a minute).
Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is the better way to do this?
by is slower than tapply because it is wrapping tapply.
Let's take a look at some benchmarks: tapply in this situation is more than 3x faster than using by
UPDATED to include @Roland's great recommendation:
library(rbenchmark)
library(data.table)
dt <- data.table(x,key="f")
using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
#------------------------#
# RESULTS #
#------------------------#
# COMPARING tapply VS by #
#-----------------------------------
# test elapsed relative
# 1 using.tapply 2.453 1.000
# 2 using.by 8.889 3.624
# COMPARING data.table VS tapply VS by #
#------------------------------------------#
# test elapsed relative
# 2 using.dtable 0.168 1.000
# 1 using.tapply 2.396 14.262
# 3 using.by 8.566 50.988
If x$f is a factor, the gap in efficiency between tapply and by is even greater!
Notice, though, that both improve relative to the non-factor inputs, while data.table remains approximately the same or gets slightly worse:
x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x,key="f")
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
# test elapsed relative
# 2 using.dtable 0.175 1.000
# 1 using.tapply 1.803 10.303
# 3 using.by 7.854 44.880
As for the why, the short answer is in the documentation itself.
?by :
Description
Function by is an object-oriented wrapper for tapply applied to data frames.
Let's take a look at the source for by (or more specifically, by.data.frame):
by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE)
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)),
        data)
    attr(ans, "call") <- match.call()
    class(ans) <- "by"
    ans
}
We see immediately that there is still a call to tapply, plus a lot of extras (including calls to deparse(substitute(.)) and eval(substitute(.)), both of which are relatively slow). Therefore it makes sense that tapply will be relatively faster than a similar call to by.
Regarding a better way to do this: With 4M rows you should use data.table.
library(data.table)
dt <- data.table(x,key="f")
dt[,mean(col1),by=key(dt)]
dt[,list(mean1=mean(col1),mean2=mean(col2)),by=key(dt)]
dt[,lapply(.SD,mean),by=key(dt)]

splitting up ranges

Say I have some ranges represented by start coordinates start <- c(1,2,3) and end coordinates end <- c(4,5,4); ranges <- data.frame(start, end). How can I split this up into intervals of length one?
i.e. I want
this
starts ends
1 1 4
2 2 5
3 3 4
to be transformed into this:
starts ends
1 1 2 |
2 3 4 <-end of original first interval
3 2 3 |
4 4 5 <-end of original second interval
5 3 4 <-end of original third interval
Right now I have a for loop iterating through the list and creating a sequence that goes from start to end, but this loop takes a very long time to execute for long lists of ranges.
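The kind of loop described there is presumably something like the following hypothetical reconstruction, which grows a data frame inside the loop and is therefore slow:
## hypothetical reconstruction of the slow loop, not the asker's actual code
newRanges <- data.frame(start = integer(0), end = integer(0))
for (i in seq_len(nrow(ranges))) {
    s <- seq(ranges$start[i], ranges$end[i])
    newRanges <- rbind(newRanges,
                       data.frame(start = s[c(TRUE, FALSE)],
                                  end   = s[c(FALSE, TRUE)]))
}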
Here's one way. It's a "glorified for-loop" in the disguise of lapply on a sequence.
# Your sample data
ranges<-data.frame(start=c(1,2,3),end=c(4,5,4))
# Extract the start/end columns
start <- ranges$start
end <- ranges$end
# Calculate result data
res <- lapply(seq_along(start), function(i) start[i]+seq(0, end[i]-start[i]))
# Make it into a data.frame by way of a matrix (which has a byrow argument)
newRanges <- as.data.frame( matrix(unlist(res), ncol=2, byrow=TRUE, dimnames=list(NULL, names(ranges))) )
Which gives the correct result:
> newRanges
start end
1 1 2
2 3 4
3 2 3
4 4 5
5 3 4
And then time it on a bigger problem:
n <- 1e5
start <- sample(10, n, replace=TRUE)
end <- start + sample( 3, n, replace=TRUE)*2-1
system.time( newRanges <- as.data.frame( matrix(unlist(lapply(seq_along(start), function(i) start[i]+seq(0, end[i]-start[i]))), ncol=2, byrow=TRUE) ) )
This takes about 1.6 seconds on my machine. Good enough?
...The trick is to work on the vectors directly instead of on the data.frame. And then build the data.frame at the end.
Update: @Ellipsis... commented that lapply is no better than a for-loop. Let's see:
system.time( a <- unlist(lapply(seq_along(start), function(i) start[i]+seq(0, end[i]-start[i]))) ) # 1.6 secs
system.time( b <- {
    res <- vector('list', length(start))
    for (i in seq_along(start)) {
        res[[i]] <- start[i]+seq(0, end[i]-start[i])
    }
    unlist(res)
}) # 1.8 secs
So, not only is the for-loop about 12% slower in this case, it is also much more verbose...
UPDATE AGAIN!
@Martin Morgan suggested using Map, and it is indeed the fastest solution yet - faster than do.call in my other answer. Also, by using seq.int my first solution is also much faster:
# do.call solution: 0.46 secs
system.time( matrix(do.call('c', lapply(seq_along(start), function(i) call(':', start[i], end[i]))), ncol=2, byrow=TRUE) )
# lapply solution: 0.42 secs
system.time( matrix(unlist(lapply(seq_along(start), function(i) start[[i]]+seq.int(0L, end[[i]]-start[[i]]))), ncol=2, byrow=TRUE) )
# Map solution: 0.26 secs
system.time( matrix(unlist(Map(seq.int, start, end)), ncol=2, byrow=TRUE) )
You could try creating text for the vectors, parse-ing and eval-uating and then using a matrix to create the data.frame:
txt <- paste("c(",paste(ranges$start,ranges$end,sep=":",collapse=","),")",sep="")
> txt
[1] "c(1:4,2:5,3:4)"
vec <- eval(parse(text=txt))
> vec
[1] 1 2 3 4 2 3 4 5 3 4
mat <- matrix(vec,ncol=2,byrow=T)
> data.frame(mat)
X1 X2
1 1 2
2 3 4
3 2 3
4 4 5
5 3 4
Here's another answer based on @James' great solution. It avoids paste and parse and is a little bit faster:
vec <- do.call('c', lapply(seq_along(start), function(i) call(':', start[i], end[i])))
mat <- matrix(vec,ncol=2,byrow=T)
Timing it:
set.seed(42)
n <- 1e5
start <- sample(10, n, replace=TRUE)
end <- start + sample( 3, n, replace=TRUE)*2-1
# @James' code: 6.64 secs
system.time({
    for(i in 1:10) {
        txt <- paste("c(",paste(start,end,sep=":",collapse=","),")",sep="")
        vec <- eval(parse(text=txt))
        mat <- matrix(vec,ncol=2,byrow=T)
    }
})
# My variant: 5.17 secs
system.time({
    for(i in 1:10) {
        vec <- do.call('c', lapply(seq_along(start), function(i) call(':', start[i], end[i])))
        mat <- matrix(vec,ncol=2,byrow=T)
    }
})
