EDITED to provide reproducible results.
It wasn't clear initially, but I need the results to take into account NAs in the raw data (df).
I initially wrote my code with for loops to get the proof of concept working, but now I need to speed things up. It runs in ~2-3 minutes with the for loops, but when I rewrote it using apply(), it wasn't any faster. I thought apply() should be the vectorized solution and therefore faster, but maybe my whole premise is incorrect? (I am not new to R, but computational speed is not usually an issue for me.)
I am working with 1000+ cases and ~100 variables and need to perform 5000+ simulations with the data (turning on and off different conditions).
Starting definitions & sample data:
cases = 1000
variables = 100
simulations = 5000
df <- as.data.frame(array(rnorm(cases * variables, 0, 5), dim=c(cases, variables)))
montecarlo <- matrix(rbinom(simulations * variables, 1, 20/variables), simulations, variables)
montecarlo[montecarlo==0] <- NA
calc <- array(,dim=c(cases, variables, simulations))
interim <- array(,dim=c(cases, variables, simulations))
results <- array(,dim=c(variables, simulations))
for (j in 1:simulations) {
  calc[,,j] <- exp(t(t(df) * as.numeric(montecarlo[j,])))
}
For loop version:
for (j in 1:simulations) {
  interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
  results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
Apply() version:
interim <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results <- as.data.frame(t(apply(interim, c(1,3), sum))) # aggregates results
I am open to any suggestions to speed things up and/or a reason why the apply() version is not any faster. Thank you!
Generally speaking: for-loops aren’t inherently slow. If you’re preallocating output (i.e. not growing a vector, which causes repeated copies), they’re pretty much comparable to *apply() functions in speed. The overhead in iterating comes from repeatedly calling R functions from C code; the advantage of using functionals is simply clarity. Here’s an example with a for-loop wrapped in a function:
foo <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
x <- replicate(10000, rnorm(30), simplify = FALSE)
bench::mark(foo(x, mean), lapply(x, mean))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 foo(x, mean) 36.9ms 38.4ms 26.1 157.3KB 52.3
#> 2 lapply(x, mean) 42.6ms 44.9ms 22.3 78.2KB 100.
The way to get speed improvements in these cases is to move all of the calculations into compiled code.
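For instance, here is a minimal sketch (my own illustration, not part of the original answer) of what that looks like for a single simulation slice, replacing the row-wise apply() with rowSums() and colSums(), which run in compiled code:
# Sketch for one slice j of calc, assuming the objects defined in the question
j <- 1
m <- calc[,,j]
row_totals <- rowSums(m, na.rm = TRUE)         # compiled C loop over rows
interim_j <- m / row_totals                    # divides row i by row_totals[i]
results_j <- colSums(interim_j, na.rm = TRUE)  # aggregates, also compiled
Looping that over j (or wrapping it in vapply()) keeps the heavy per-element work in compiled code.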
That said, there may well be other optimizations for your specific problem. You might want to provide a reproducible example, and ask a new question on Code Review about performance improvements.
Created on 2019-09-02 by the reprex package (v0.3.0.9000)
As @Mikko Marttila noted, the apply() family does not guarantee faster code. Using sweep() and aperm(), the code below is about 3x faster for a 1,000 x 100 x 70 array (i.e., only 7 million elements).
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
Or, with slightly less performance but code more similar to what you originally had:
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
Performance:
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 510.9131 514.9030 537.0344 518.2491 524.5709 705.4087 10
apply_OP 446.0352 458.4940 491.6710 500.1995 523.1843 533.9654 10
sweep_rowSums 225.5855 233.2632 252.6149 240.7245 284.1517 292.3476 10
sweep_aperm 136.2519 140.8912 163.7498 154.6984 191.5337 217.8015 10
Data
cases = 1000
variables = 100
simulations = 70
set.seed(123)
calc <- array(sample(cases * variables * simulations), dim=c(cases, variables, simulations))
interim <- array(,dim=c(cases, variables, simulations))
results <- array(,dim=c(variables, simulations))
# Original Loop
for (j in seq_len(simulations)) {
  interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
  results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
# original apply
interim2 <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results2 <- apply(interim2, c(1,3), sum) # aggregates results
# using sweep
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
#using sweep and aperm
# interim4 <- sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/')
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
# all.equal() only compares its first two arguments, so check pairwise
all.equal(results4, results3)
all.equal(results3, results2)
all.equal(results2, results)
library(microbenchmark)
microbenchmark(
  for_loop = {
    for (j in seq_len(simulations)) {
      interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
      results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
    }
  },
  apply_OP = {
    interim2 <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
    results2 <- apply(interim2, c(1,3), sum) # aggregates results
  },
  sweep_rowSums = {
    interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
    results3 <- apply(interim3, c(2,3), sum, na.rm = T)
  },
  sweep_aperm = {
    results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
  },
  times = 10
)
I would like to vectorize the following code for more efficient processing. I need to take the product of columns by row (i.e. rowProds), but the number of columns to include in the product needs to be a function of another input.
If possible I would prefer this be done using Base R, but I'm open to and appreciate any suggestions.
This can easily be done using a loop or the apply family with a user-defined function, but these are not fast enough to meet my needs.
# Generate some data
mat <- data.frame(X = 1:5)
for (i in 1:5) {
  set.seed(i)
  mat[1 + i] <- runif(5)
}
# Via a for loop
for (i in 1:nrow(mat)) {
  mat$calc[i] <- prod(mat[match(mat$X[i], mat$X), 2:(i + 1)])
}
mat
# Via a function with mapply
rowprodfun <- function(X) {
  myprod <- prod(mat[match(X, mat$X), 2:(X + 1)])
  return(myprod)
}
mat$calc <- mapply(rowprodfun, mat$X)
mat
mat$calc
# [1] 0.265508663 0.261370165 0.126427355 0.013874517 0.009758232
Both methods above result in the same "calc" column. I just need a more efficient way to generate this column.
One option would be to convert the upper triangle elements to NA and then use rowProds from matrixStats
library(matrixStats)
rowProds(as.matrix(mat[-1] * NA^upper.tri(mat[-1])), na.rm = TRUE)
#[1] 0.265508663 0.261370165 0.126427355 0.013874517 0.009758232
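To see why this works (a note of mine, not part of the original answer): NA^FALSE is 1 and NA^TRUE is NA, so NA^upper.tri(m) builds a mask of 1s with NAs above the diagonal, and rowProds(..., na.rm = TRUE) then skips those positions.
NA^upper.tri(matrix(0, 3, 3))
#      [,1] [,2] [,3]
# [1,]    1   NA   NA
# [2,]    1    1   NA
# [3,]    1    1    1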
The use of upper.tri suggested by @akrun was super helpful. The final piece was to convert the data.frame to a matrix with as.matrix() before doing the element-wise multiplication.
rowProds(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), na.rm = T) resulted in the most efficient calculation.
apply(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), 1, prod, na.rm = T) was nearly as efficient if trying to accomplish in base R.
library(microbenchmark)
library(matrixStats)
library(ggplot2)
Y <- microbenchmark(
  for.loop = for (i in 1:nrow(mat)) {prod(mat[match(mat$X[i], mat$X), 2:(i + 1)])},
  mapply.fun = mapply(rowprodfun, mat$X),
  rowProds = rowProds(as.matrix(mat[-1] * NA ^ upper.tri(mat[-1])), na.rm = T),
  rowProds.matrix = rowProds(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), na.rm = T),
  apply = apply(mat[-1] * NA ^ upper.tri(mat[-1]), 1, prod, na.rm = T),
  apply.matrix = apply(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), 1, prod, na.rm = T)
)
> Y
Unit: microseconds
expr min lq mean median uq max neval
for.loop 4094.869 4305.5590 5682.2124 4479.8125 5193.8190 50361.025 100
mapply.fun 542.962 577.6995 1036.9821 599.2220 658.1245 32426.296 100
rowProds 518.419 553.9120 654.2657 597.5225 637.1690 2434.267 100
rowProds.matrix 99.304 116.1065 144.9313 128.0010 153.8650 516.909 100
apply 547.493 580.1540 686.2317 628.2955 703.0565 1215.812 100
apply.matrix 117.051 136.6845 158.3808 144.9920 156.5075 339.068 100
Apologies for another 'apply to rows of data.table' question however I have not found a solution in any of the other answers.
I have a data.table with >2 million rows and >5000 columns. I would like to keep the first two columns and collapse the remaining columns by some summaries.
Example: Translate this ...
   keep1 keep2         c        d
1:     a     A 568.62060 599.4427
2:     b     B 815.63027 728.9226
To this ...
   keep1 keep2     mean   median
1:     a     A 584.0316 584.0316
2:     b     B 772.2765 772.2765
Currently my solution keeps all rows and is not as fast as I had hoped.
library(data.table)
x = data.table(keep1=letters[1:5], keep2=LETTERS[1:5], c=runif(5, 1, 1000), d=runif(5, 1, 1000))
stats = function(x) list(mean(x), median(x))
x[, c("mean", "median") := stats(unlist(.SD)),
  by = seq_len(nrow(x)), .SDcols = 3:ncol(x)]
I have two questions:
Is there a way to prevent outputting all columns and instead only output column1, column2 and the summaries I have made, as in the example?
Is there a quicker way to do this?
EDIT:
To give some context to the speed problem, here is the timing on a million-row, ~50-column table compared to base R; it is almost 3 times slower. I assume this is because I am returning the whole table back instead of only the columns I want, so I am hoping for a solution to this.
library(data.table)
ids = function(n) sample(LETTERS, n, rep=T)
nums = function(n) runif(n, 1, 1000)
x = data.table(keep1=ids(1e6), keep2=ids(1e6), replicate(50, nums(1e6)))
stats = function(x) c(mean(x), median(x))
ss = Sys.time()
y = x[,c("mean", "median") := stats(unlist(.SD)),
by = seq_len(nrow(x)), .SDcols = 3:ncol(x)]
Sys.time() - ss # Time difference of 1.408833 mins
ss = Sys.time()
y = cbind(x[,1:2], t(apply(x[,3:ncol(x)], 1, function(x) c(mean(x), median(x)))))
Sys.time() - ss # Time difference of 40.196 secs
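One direction for the problem described in the edit, as a minimal sketch rather than a tested answer: build a result with just the two key columns plus the summaries, using the vectorized rowMeans() and matrixStats::rowMedians() on the numeric columns instead of a by-row := call. The use of matrixStats here is my assumption, not something from the original question.
library(data.table)
library(matrixStats)
num <- as.matrix(x[, 3:ncol(x), with = FALSE])  # numeric columns as a matrix
y <- x[, .(keep1, keep2)]                       # keep only the id columns
y[, mean := rowMeans(num)]                      # vectorized row means
y[, median := rowMedians(num)]                  # compiled row medians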
So let's say I have a vector
a <- rnorm(6000)
I want to calculate the mean of the 1st value to the 60th, then again calculate the mean of the 61st value to the 120th, and so forth. So basically I want to calculate the mean of every 60 values, giving me 100 means from that vector. I know I can do a for loop, but I'd like to know if there is a better way to do this?
I would use
colMeans(matrix(a, 60))
.colMeans(a, 60, length(a) / 60) # more efficient (without reshaping to matrix)
Enhancement on user adunaic's request
This only works if there are exactly 60 x 100 data points. If there is an incomplete chunk of 60 at the end, this does not work correctly (matrix() recycles values with a warning, so the last mean is wrong). It would be good to have a general solution for others looking at this problem for ideas.
BinMean <- function (vec, every, na.rm = FALSE) {
  n <- length(vec)
  x <- .colMeans(vec, every, n %/% every, na.rm)
  r <- n %% every
  if (r) x <- c(x, mean.default(vec[(n - r + 1):n], na.rm = na.rm))
  x
}
a <- 1:103
BinMean(a, every = 10)
# [1] 5.5 15.5 25.5 35.5 45.5 55.5 65.5 75.5 85.5 95.5 102.0
Alternative solution with group-by operation (less efficient)
BinMean2 <- function (vec, every, na.rm = FALSE) {
  grp <- as.integer(ceiling(seq_along(vec) / every))
  grp <- structure(grp, class = "factor",
                   levels = as.character(seq_len(grp[length(grp)])))
  lst <- .Internal(split(vec, grp))
  unlist(lapply(lst, mean.default, na.rm = na.rm), use.names = FALSE)
}
Speed
library(microbenchmark)
a <- runif(1e+4)
microbenchmark(BinMean(a, 100), BinMean2(a, 100))
#Unit: microseconds
# expr min lq mean median uq max
# BinMean(a, 100) 40.400 42.1095 54.21286 48.3915 57.6555 205.702
# BinMean2(a, 100) 1216.823 1335.7920 1758.90267 1434.9090 1563.1535 21467.542
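As a usage example (my addition), applying BinMean() to a vector like the one in the question gives the 100 means asked for:
a <- rnorm(6000)
m <- BinMean(a, every = 60)
length(m)  # 100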
I recommend sapply:
a <- rnorm(6000)
seq <- seq(1, length(a), 60)
a_mean <- sapply(seq, function(i) {mean(a[i:(i+59)])})
Another option is to use tapply by creating a grouping variable.
Grouping variable could be created in two ways :
1) Using rep
tapply(a, rep(seq_along(a), each = n, length.out = length(a)), mean)
2) Using gl
tapply(a, gl(length(a)/n, n), mean)
If we convert the vector to a dataframe/tibble, we can use the same logic to calculate the mean
aggregate(a~gl(length(a)/n, n), data.frame(a), mean)
OR with dplyr
library(dplyr)
tibble::tibble(a) %>%
  group_by(group = gl(length(a)/n, n)) %>%
  summarise(mean_val = mean(a))
data
set.seed(1234)
a <- rnorm(6000)
n <- 60
I have a similar problem to the one explained in this question. As in that question, I have a data frame with 3 columns (id, group, value). I want to take n samples with replacement from each group and produce a smaller data frame with n samples from each group.
However, I am doing hundreds of subsamples in a simulation code, and the solution based on ddply is too slow to be used in my code. I tried to write simpler code to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code. I am wondering if it can be improved for performance.
#Producing example DataFrame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(id = 1:dfsize, group = rep(1:groupsize,each = ceiling(dfsize/groupsize))[1:dfsize], junkdata = sample(1:10000, size =dfsize))
#Main function for subsampling
sample.from.group <- function(df, dfgroup, size, replace){
  outputsize <- 1
  newdf <- df # assuming a sample cannot be larger than the original
  uniquegroups <- unique(dfgroup)
  for (uniquegroup in uniquegroups){
    dataforgroup <- which(dfgroup==uniquegroup)
    mysubsample <- df[sample(dataforgroup, size, replace),]
    sizeofsample <- nrow(mysubsample)
    newdf[outputsize:(outputsize+sizeofsample-1), ] <- mysubsample
    outputsize <- outputsize + sizeofsample
  }
  return(newdf[1:(outputsize-1),])
}
#Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)
Here are two plyr-based solutions:
library(plyr)
dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
  id = seq_len(dfsize),
  group = rep(1:groupsize, length = dfsize),
  junkdata = sample(1:10000, size = dfsize))
sample_by_group_1 <- function(df, dfgroup, size, replace) {
  ddply(df, dfgroup, function(x) {
    x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
  })
}
sample_by_group_2 <- function(df, dfgroup, size, replace) {
  idx <- split_indices(df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
library(microbenchmark)
microbenchmark(
  ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
  plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
# expr min lq median uq max neval
# ddply 4488 4723 5059 5360 36606 100
# plyr 443 487 507 536 31343 100
The second approach is much faster because it does the subsetting in a single step; if you can figure out how to do the subsetting in one step, it's usually an easy way to get better performance.
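A base-R sketch of the same one-step idea (my own illustration, reusing the testdf defined above; not part of the original answer):
idx <- split(seq_len(nrow(testdf)), testdf$group)               # row indices per group
rows <- unlist(lapply(idx, sample, size = 100, replace = TRUE), # sample within each group
               use.names = FALSE)
out <- testdf[rows, , drop = FALSE]                             # subset once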
I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN = function(x) {
  sample(which(test.frame.1$group==x), 100, TRUE)
})
out <- test.frame.1[z,]
out