I am trying to create a matrix in which each row is the sum of every three rows of another matrix. There are actually a bunch of these matrices in a list, and I am performing the same operation on each element of that list. Based on this post, I was able to generate the code below. It works, but it takes forever on my more complicated data set.
test <- lapply(1:1000, function(x) matrix(1:300, nrow = 60))
testCons <- lapply(test, function(x) apply(x, 2, function(y) tapply(y, ceiling(seq_along(y)/3), sum)))
Does anybody have an idea of how to speed that up or simplify it?
rowsum gives an easy speed-up: it sums the rows of a matrix according to a grouping variable, which here is an index assigning every three rows to the same group.
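For a single 60-row matrix, the grouping index looks like this (a small illustration of the argument passed to rowsum(); m and grp are just names used here, not objects from the answer):
m <- matrix(1:300, nrow = 60)
grp <- rep(seq_len(nrow(m) / 3), each = 3)  # 1 1 1 2 2 2 3 3 3 ...
head(rowsum(m, grp), 2)  # first two rows of the three-row sums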
test <- lapply(1:1000, function(x) matrix(1:300, nrow=60))
system.time(
  testCons <- lapply(test, function(x) apply(x, 2, function(y) tapply(y, ceiling(seq_along(y)/3), sum)))
)
# user system elapsed
# 1.672 0.004 1.678
system.time(
  testCons2 <- lapply(test, function(x) rowsum(x, rep(seq_len(nrow(x) / 3), each = 3)))
)
# user system elapsed
# 0.08 0.00 0.08
all.equal(testCons, testCons2)
#[1] TRUE
I have existing code that calculates a concordance value for a data frame/matrix. It's basically the number of rows where all the values are the same, divided by the total number of rows.
...
concordance <- new[complete.cases(new), ]  # removes rows with NAs
TF <- apply(concordance, 1, function(x) if (length(unique(x)) > 1) FALSE else TRUE)
# outputs a vector of TRUE/FALSE: is the row concordant?
numF <- table(TF)["TRUE"]  # gets the number of TRUEs
concValue <- numF / NROW(TF)  # TRUE / total
...
Above is what I have now. It runs OK, but I was wondering whether there is any way to make it faster.
Edit: The dimensions of the object are variable, but the number of columns is typically 2-6 and there are typically 1,000,000+ rows. This is part of a package I'm developing, so the input data is variable.
Because the number of rows is much larger than the number of columns, it makes sense to loop over the columns instead, dropping along the way the rows that contain more than one distinct value:
propIdentical <- function(Mat) {
  nrowInit <- nrow(Mat)
  for (i in 1:(ncol(Mat) - 1)) {
    if (!nrow(Mat)) break  # stop if the matrix has no rows left
    else {
      # check which elements of column i and column i + 1 are equal:
      equals <- Mat[, i] == Mat[, i + 1]
      # remove all other rows from the matrix
      Mat <- Mat[equals, , drop = FALSE]
    }
  }
  return(nrow(Mat) / nrowInit)
}
Some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, replace = TRUE, size = 3 * 10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = FALSE and if(!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame, make sure to turn it into a matrix first.
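For example (a minimal sketch; df is a hypothetical data frame, not an object from the question):
df <- data.frame(a = c(1, 2, 3), b = c(1, 5, 3))
propIdentical(as.matrix(df))  # rows 1 and 3 are concordant
# [1] 0.6666667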
I'm trying to find a fast way to extract elements from a list of data frames.
To do this, I've tested the function lapply. Here is a reproducible example:
i <- 2
dat <- replicate(100000, data.frame(x=1:5000, y = 1:5000, z = 1:5000), simplify=FALSE)
system.time(test <- lapply(dat, function(y) y[i, c("x", "y")]))
user system elapsed
7.69 0.00 7.73
Ideally, the elapsed time should be <= 1 second.
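Much of that time is spent in [.data.frame method dispatch rather than in the extraction itself. One possible direction (a sketch, not a benchmarked answer from this thread; test2 is just a hypothetical name, and the result is a list of named vectors rather than one-row data frames) is to index the underlying columns directly:
# subset the columns as plain vectors to avoid the overhead of [.data.frame
system.time(test2 <- lapply(dat, function(y) c(x = y$x[i], y = y$y[i])))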
I have many binary matrices from which I want to extract every possible combination of three rows into a list. I then want to sum the columns of each of the extracted row combinations.
My current method is below, but it is extremely slow.
set.seed(123)
x <- matrix(sample(0:1, 110 * 609, replace = TRUE), 110, 609)
row.combinations <- t(combn(nrow(x), 3))
extracted.row.combns <- lapply(1:nrow(row.combinations), FUN = function(y)
  x[c(row.combinations[y, 1], row.combinations[y, 2], row.combinations[y, 3]), ])
summed.rows <- lapply(extracted.row.combns, colSums)
How could this be sped up?
Using combn() with an inline function supplied as the FUN argument, I can run this analysis in under 5 seconds on my current machine:
combn(nrow(x), 3, FUN=function(r) colSums(x[r,]), simplify=FALSE)
We can make this faster with combnPrim() from the gRbase package.
library(gRbase)
lapply(combnPrim(nrow(x), 3, simplify = FALSE), function(r) colSums(x[r,]))
Benchmarks
system.time(x1 <- combn(nrow(x), 3, FUN=function(r) colSums(x[r,]), simplify=FALSE))
# user system elapsed
# 6.46 0.21 6.67
system.time(x2 <- lapply(combnPrim(nrow(x), 3, simplify = FALSE),
function(r) colSums(x[r,])))
# user system elapsed
# 4.61 0.22 4.83
If I have a data frame as such:
df = data.frame(matrix(rnorm(100), 5000, 100))
I can use the following function to get every combination of three-term medians row-wise:
median_df = t(apply(df, 1, combn, 3, median))
The problem is, this function will take several hours to run. The culprit is median(), which takes about ten times longer to run than max() or min().
How can I speed this function up, possibly by writing a faster version of median() or working with the original data differently?
Update:
If I run the above code but only for df[,1:10] as such:
median_df = t(apply(df[,1:10], 1, combn, 3, median))
takes 29 seconds
fastMedian_df = t(apply(df[,1:10], 1, combn, 3, fastMedian))
from the package ccaPP takes 6.5 seconds
max_df = t(apply(df[,1:10], 1, combn, 3, max))
takes 2.5 seconds
So we see a significant improvement with fastMedian(). Can we still do better?
One approach to speed things up would be to note that the median of three numbers is their sum minus their max minus their min. This means we can vectorize our median calculations by handling each triple of columns once (performing the median for all rows in the same calculation) instead of handling it once for each row.
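As a quick sanity check of that identity (not part of the original answer):
x <- c(2, 9, 5)
sum(x) - max(x) - min(x)  # 5, the same as median(x)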
set.seed(144)
# Fully random matrix
df = matrix(rnorm(50000), 5000, 10)
original <- function(df) t(apply(df, 1, combn, 3, median))
josilber <- function(df) {
  combos <- combn(seq_len(ncol(df)), 3)
  apply(combos, 2, function(x) {
    rowSums(df[, x]) -
      pmin(df[, x[1]], df[, x[2]], df[, x[3]]) -
      pmax(df[, x[1]], df[, x[2]], df[, x[3]])
  })
}
system.time(res.josilber <- josilber(df))
# user system elapsed
# 0.117 0.009 0.149
system.time(res.original <- original(df))
# user system elapsed
# 15.107 1.864 16.960
all.equal(res.josilber, res.original)
# [1] TRUE
The vectorization yields a 110x speedup when there are 10 columns and 5000 rows. Unfortunately I do not have a machine with enough memory to store the 808.5 million numbers in the output for your full example.
You could speed this up further by implementing an Rcpp function that takes as input the vector representation of a matrix (i.e. the vector obtained by reading the matrix down its columns) along with the number of rows, and returns the median of each column. The function relies heavily on std::nth_element, which is asymptotically linear in the number of elements you're taking the median of. (Note that I don't average the middle two values when I take the median of an even-length vector; I instead take the upper of the two.)
library(Rcpp)
cppFunction(
  "NumericVector vectorizedMedian(NumericVector x, int chunkSize) {
     const int n = x.size() / chunkSize;
     std::vector<double> input = Rcpp::as<std::vector<double> >(x);
     NumericVector res(n);
     for (int i = 0; i < n; ++i) {
       // partially sort each chunk so its middle element lands in place
       std::nth_element(input.begin() + i*chunkSize,
                        input.begin() + i*chunkSize + chunkSize/2,
                        input.begin() + (i+1)*chunkSize);
       res[i] = input[i*chunkSize + chunkSize/2];
     }
     return res;
   }")
Now we just invoke this function instead of using rowSums, pmin and pmax:
josilber.rcpp <- function(df) {
combos <- combn(seq_len(ncol(df)), 3)
apply(combos, 2, function(x) vectorizedMedian(as.vector(t(df[,x])), 3))
}
system.time(josilber.rcpp(df))
# user system elapsed
# 0.049 0.008 0.081
all.equal(josilber(df), josilber.rcpp(df))
# [1] TRUE
In total we therefore get a 210x speedup: 110x comes from switching from a non-vectorized to a vectorized application of median, and the remaining roughly 2x comes from switching from the rowSums/pmin/pmax way of computing the median in a vectorized way to the Rcpp-based approach.
This is a code snippet from the recommenderlab package that takes a matrix of ratings and returns the top 5 elements for each user:
reclist <- apply(ratings, MARGIN=2, FUN=function(x)
head(order(x, decreasing=TRUE, na.last=NA), 5))
For a large matrix (>10K columns) it takes too long to run. Is there any way to rewrite it to be more efficient, maybe by using the dplyr or data.table package? Writing C++ code is not an option for me.
An answer with data.table and base R
# 10000 column dummy matrix
cols <- 10000
mat <- matrix(rnorm(100*cols), ncol=cols)
With data.table:
library(data.table)
dt1 <- data.table(mat)
# sort every column, return first 5 rows
dt1[, lapply(.SD, sort, decreasing = TRUE)][1:5]
system.time(dt1[, lapply(.SD, sort, decreasing = TRUE)][1:5])
result:
user system elapsed
2.904 0.013 2.916
In plain old base R, it's actually faster! (Thanks for the comment, Arun.)
system.time(head(apply(mat, 2, sort, decreasing = TRUE), 5))
user system elapsed
0.473 0.002 0.475
However, both are faster than the original code sample from the question, according to system.time():
system.time(
  apply(mat, MARGIN = 2, FUN = function(x) {
    head(order(x, decreasing = TRUE, na.last = NA), 5)
  }))
user system elapsed
3.063 0.031 3.094
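Note that the sort-based versions above return the top 5 values, whereas the original order() call returns the top 5 row indices. If the indices are what you need, a data.table approach along these lines may help (a sketch, not benchmarked here; dt and top5 are hypothetical names):
library(data.table)
# long format: one (column, row, value) triple per cell of the matrix
dt <- data.table(col = rep(seq_len(ncol(mat)), each = nrow(mat)),
                 row = rep(seq_len(nrow(mat)), times = ncol(mat)),
                 val = as.vector(mat))
# drop NAs, order by value descending, then take the first 5 row indices per column
top5 <- dt[!is.na(val)][order(-val), head(row, 5), keyby = col]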