EDITED to provide reproducible results.
It wasn't clear initially, but I need the results to take into account NAs in the raw data (df).
I initially wrote my code with for loops to get the proof of concept working, but now I need to speed things up. It runs in ~2-3 minutes with the for loops, but when I rewrote it using apply(), it wasn't any faster. I thought apply() was the vectorized solution and would therefore be faster, but maybe my whole premise is incorrect? (I am not new to R, but computational speed is not usually an issue for me.)
I am working with 1000+ cases and ~100 variables and need to perform 5000+ simulations with the data (turning on and off different conditions).
Starting definitions & sample data:
cases = 1000
variables = 100
simulations = 5000
df <- as.data.frame(array(rnorm(cases * variables, 0, 5), dim=c(cases, variables)))
montecarlo <- matrix(rbinom(simulations * variables, 1, 20/variables), simulations, variables)
montecarlo[montecarlo==0] <- NA
calc <- array(NA, dim = c(cases, variables, simulations))
interim <- array(NA, dim = c(cases, variables, simulations))
results <- array(NA, dim = c(variables, simulations))
for (j in 1:simulations) {
  calc[,,j] <- exp(t(t(df) * as.numeric(montecarlo[j,])))
}
For loop version:
for (j in 1:simulations) {
  interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
  results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
Apply() version:
interim <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results <- as.data.frame(t(apply(interim, c(1,3), sum))) # aggregates results
I am open to any suggestions to speed things up and/or a reason why the apply() version is not any faster. Thank you!
Generally speaking: for-loops aren’t inherently slow. If you’re preallocating output (i.e. not growing a vector, causing multiple copies), they’re pretty much comparable to *apply() functions in speed. The overhead in iterating comes from repeatedly calling R functions from C code; and the advantage of using functionals is simply clarity. Here’s an example with a for-loop wrapped in a function:
foo <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
x <- replicate(10000, rnorm(30), simplify = FALSE)
bench::mark(foo(x, mean), lapply(x, mean))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 foo(x, mean) 36.9ms 38.4ms 26.1 157.3KB 52.3
#> 2 lapply(x, mean) 42.6ms 44.9ms 22.3 78.2KB 100.
The way to get speed improvements in these cases is to move all of the calculations into compiled code.
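For example (a rough sketch of my own using the same x as above, not part of the original benchmark): because every element of x here has the same length, the whole computation can be collapsed into a single colMeans() call, whose loop runs entirely in compiled code:
# Collapse the list into a 30 x 10000 matrix and take column means in C.
# check = FALSE because colMeans() returns a numeric vector, not a list.
xm <- matrix(unlist(x, use.names = FALSE), nrow = 30)
bench::mark(
  loop     = foo(x, mean),
  lapply   = lapply(x, mean),
  colMeans = colMeans(xm),
  check = FALSE
)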
That said, there may well be other optimizations for your specific problem. You might want to provide a reproducible example, and ask a new question on Code Review about performance improvements.
Created on 2019-09-02 by the reprex package (v0.3.0.9000)
As @Mikko Marttila noted, the apply() family does not guarantee faster code. Using sweep() and aperm(), the code below is about 3x faster for a 1,000 x 100 x 70 array (i.e., only 7 million elements).
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
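To unpack the one-liner (my own check, not from the original answer): aperm() reorders the array to variables x cases x simulations, so colSums() over its first dimension gives the per-case, per-simulation totals that sweep() then divides by along margins c(1, 3):
# The inner expression is a cases x simulations matrix of per-case totals
# (summing over the variables dimension).
rs <- colSums(aperm(calc, c(2, 1, 3)), na.rm = TRUE)
dim(rs)  # cases x simulations
all.equal(rs, apply(calc, c(1, 3), sum, na.rm = TRUE))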
Or, for slightly less performance but code more similar to what you originally had:
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
Performance:
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 510.9131 514.9030 537.0344 518.2491 524.5709 705.4087 10
apply_OP 446.0352 458.4940 491.6710 500.1995 523.1843 533.9654 10
sweep_rowSums 225.5855 233.2632 252.6149 240.7245 284.1517 292.3476 10
sweep_aperm 136.2519 140.8912 163.7498 154.6984 191.5337 217.8015 10
Data
cases = 1000
variables = 100
simulations = 70
set.seed(123)
calc <- array(sample(cases * variables * simulations), dim = c(cases, variables, simulations))
interim <- array(NA, dim = c(cases, variables, simulations))
results <- array(NA, dim = c(variables, simulations))
# Original Loop
for (j in seq_len(simulations)) {
  interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
  results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
# original apply
interim2 <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results2 <- apply(interim2, c(1,3), sum) # aggregates results
# using sweep
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
#using sweep and aperm
# interim4 <- sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/')
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
all.equal(results2, results)
all.equal(results3, results)
all.equal(results4, results)
library(microbenchmark)
microbenchmark(
  for_loop = {
    for (j in seq_len(simulations)) {
      interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
      results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
    }
  },
  apply_OP = {
    interim2 <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
    results2 <- apply(interim2, c(1,3), sum) # aggregates results
  },
  sweep_rowSums = {
    interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
    results3 <- apply(interim3, c(2,3), sum, na.rm = T)
  },
  sweep_aperm = {
    results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
  },
  times = 10
)
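Another variant that may be worth timing (my own sketch, not included in the benchmarks above): loop over the simulation slices with apply(calc, 3, ...) and let R's recycling rule do the per-row division within each slice, so the heavy work stays vectorized:
# Dividing a cases x variables matrix by its length-cases row-sum vector
# recycles column-wise, i.e. each row is rescaled by its own sum; colSums()
# then aggregates per variable, giving a variables x simulations matrix.
results5 <- apply(calc, 3, function(m) colSums(m / rowSums(m, na.rm = TRUE), na.rm = TRUE))
all.equal(results5, results3)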
I would like to vectorize the following code for more efficient processing. I need to take the product of columns by row (i.e. rowProds), but the number of columns to include in the product needs to be a function of another input.
If possible I would prefer this be done in base R, but I'm open to and appreciate any suggestions.
This can easily be done using a loop or the apply family with a user-defined function, but these are not fast enough to meet my needs.
# Generate some data
mat <- data.frame(X = 1:5)
for (i in 1:5) {
  set.seed(i)
  mat[1 + i] <- runif(5)
}
# Via a for loop
for (i in 1:nrow(mat)) {
  mat$calc[i] <- prod(mat[match(mat$X[i], mat$X), 2:(i + 1)])
}
mat
# Via a function with mapply
rowprodfun <- function(X) {
  myprod <- prod(mat[match(X, mat$X), 2:(X + 1)])
  return(myprod)
}
mat$calc <- mapply(rowprodfun, mat$X)
mat
mat$calc
# [1] 0.265508663 0.261370165 0.126427355 0.013874517 0.009758232
Both methods above result in the same "calc" column. I just need a more efficient way to generate this column.
One option would be to convert the upper triangle elements to NA and then use rowProds from matrixStats
library(matrixStats)
rowProds(as.matrix(mat[-1] * NA^upper.tri(mat[-1])), na.rm = TRUE)
#[1] 0.265508663 0.261370165 0.126427355 0.013874517 0.009758232
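In case the masking trick isn't obvious (a small illustration of my own on a generic 3 x 3 matrix): NA^0 is 1 and NA^1 is NA, so NA^upper.tri() leaves the lower triangle and diagonal untouched and blanks out everything above the diagonal:
m <- matrix(1:9, nrow = 3)
NA ^ upper.tri(m)
#      [,1] [,2] [,3]
# [1,]    1   NA   NA
# [2,]    1    1   NA
# [3,]    1    1    1
m * NA ^ upper.tri(m)  # upper-triangle values become NA, then dropped by na.rm = TRUE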
The use of upper.tri suggested by @akrun was super helpful. The final piece was to convert the data.frame to a matrix with as.matrix() before doing the element-wise multiplication.
rowProds(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), na.rm = T) resulted in the most efficient calculation.
apply(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), 1, prod, na.rm = T) was nearly as efficient when staying in base R.
library(microbenchmark)
library(matrixStats)
library(ggplot2)
Y <- microbenchmark(
  for.loop = for (i in 1:nrow(mat)) {prod(mat[match(mat$X[i], mat$X), 2:(i + 1)])},
  mapply.fun = mapply(rowprodfun, mat$X),
  rowProds = rowProds(as.matrix(mat[-1] * NA ^ upper.tri(mat[-1])), na.rm = T),
  rowProds.matrix = rowProds(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), na.rm = T),
  apply = apply(mat[-1] * NA ^ upper.tri(mat[-1]), 1, prod, na.rm = T),
  apply.matrix = apply(as.matrix(mat[-1]) * NA ^ upper.tri(mat[-1]), 1, prod, na.rm = T)
)
> Y
Unit: microseconds
expr min lq mean median uq max neval
for.loop 4094.869 4305.5590 5682.2124 4479.8125 5193.8190 50361.025 100
mapply.fun 542.962 577.6995 1036.9821 599.2220 658.1245 32426.296 100
rowProds 518.419 553.9120 654.2657 597.5225 637.1690 2434.267 100
rowProds.matrix 99.304 116.1065 144.9313 128.0010 153.8650 516.909 100
apply 547.493 580.1540 686.2317 628.2955 703.0565 1215.812 100
apply.matrix 117.051 136.6845 158.3808 144.9920 156.5075 339.068 100
So let's say I have a vector
a <- rnorm(6000)
I want to calculate the mean of the 1st value to the 60th, then the mean of the 61st value to the 120th, and so forth. So basically I want to calculate the mean of every 60 values, giving me 100 means from that vector. I know I can do a for loop, but I'd like to know if there is a better way to do this?
I would use
colMeans(matrix(a, 60))
.colMeans(a, 60, length(a) / 60) # more efficient (without reshaping to matrix)
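To see why the reshape works (a tiny illustration of my own, not from the answer): matrix() fills column by column, so each column of matrix(a, 60) holds one consecutive block of 60 values, and colMeans() returns one mean per block:
a_small <- 1:120
colMeans(matrix(a_small, 60))
# [1] 30.5 90.5   (mean of 1:60, mean of 61:120)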
Enhancement in response to user adunaic's request:
This only works if there are 60 x 100 data points. If you have an incomplete block of 60 at the end, then this errors. It would be good to have a general solution for others looking at this problem for ideas.
BinMean <- function (vec, every, na.rm = FALSE) {
  n <- length(vec)
  x <- .colMeans(vec, every, n %/% every, na.rm)
  r <- n %% every
  if (r) x <- c(x, mean.default(vec[(n - r + 1):n], na.rm = na.rm))
  x
}
a <- 1:103
BinMean(a, every = 10)
# [1] 5.5 15.5 25.5 35.5 45.5 55.5 65.5 75.5 85.5 95.5 102.0
Alternative solution with group-by operation (less efficient)
BinMean2 <- function (vec, every, na.rm = FALSE) {
  grp <- as.integer(ceiling(seq_along(vec) / every))
  grp <- structure(grp, class = "factor",
                   levels = as.character(seq_len(grp[length(grp)])))
  lst <- .Internal(split(vec, grp))
  unlist(lapply(lst, mean.default, na.rm = na.rm), use.names = FALSE)
}
Speed
library(microbenchmark)
a <- runif(1e+4)
microbenchmark(BinMean(a, 100), BinMean2(a, 100))
#Unit: microseconds
# expr min lq mean median uq max
# BinMean(a, 100) 40.400 42.1095 54.21286 48.3915 57.6555 205.702
# BinMean2(a, 100) 1216.823 1335.7920 1758.90267 1434.9090 1563.1535 21467.542
I recommend sapply:
a <- rnorm(6000)
seq <- seq(1, length(a), 60)
a_mean <- sapply(seq, function(i) {mean(a[i:(i+59)])})
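If length(a) is not an exact multiple of 60, the last window runs past the end of the vector and picks up NAs; a small guard (my own sketch) keeps the indices in range:
# clamp the upper index of the final window to length(a)
a_mean <- sapply(seq, function(i) mean(a[i:min(i + 59, length(a))]))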
Another option is to use tapply by creating a grouping variable.
The grouping variable can be created in two ways:
1) Using rep
tapply(a, rep(seq_along(a), each = n, length.out = length(a)), mean)
2) Using gl
tapply(a, gl(length(a)/n, n), mean)
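A quick look at what those grouping vectors contain (a small illustration of my own, with n = 2 and a length-6 vector): both give one label per block of n elements.
rep(seq_along(1:6), each = 2, length.out = 6)  # 1 1 2 2 3 3
gl(6 / 2, 2)                                   # factor with levels 1 2 3: 1 1 2 2 3 3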
If we convert the vector to a data.frame/tibble, we can use the same logic and calculate the mean:
aggregate(a~gl(length(a)/n, n), data.frame(a), mean)
Or with dplyr:
library(dplyr)
tibble::tibble(a) %>%
  group_by(group = gl(length(a)/n, n)) %>%
  summarise(mean_val = mean(a))
data
set.seed(1234)
a <- rnorm(6000)
n <- 60
Based on the question More efficient means of creating a corpus and DTM, I've prepared my own method for building a Term Document Matrix from a large corpus, which (I hope) does not require Terms x Documents memory.
# Required packages: dplyr (group_by, tally, %>%), slam (simple_triplet_matrix)
# and tm (as.TermDocumentMatrix, weightTf).
library(dplyr)
library(slam)
library(tm)

sparseTDM <- function(vc) {
  id = unlist(lapply(vc, function(x) { x$meta$id }))
  content = unlist(lapply(vc, function(x) { x$content }))
  out = strsplit(content, "\\s", perl = T)
  names(out) = id
  lev.terms = sort(unique(unlist(out)))
  lev.docs = id
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )
  v2 = lapply(
    seq_along(v1),
    function(i, x, n) {
      rep(i, length(x[[i]]))
    },
    x = v1,
    n = names(v1)
  )
  stm = data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()
  tmp = simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )
  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
It slows down at the calculation of v1. It had been running for 30 minutes when I stopped it.
I've prepared a small example:
b = paste0("string", 1:200000)
a = sample(b,80)
library(microbenchmark)
microbenchmark(
  lapply(
    list(a = a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)
Results are:
Unit: milliseconds
expr min lq mean median uq max neval
... 25.80961 28.79981 31.59974 30.79836 33.02461 98.02512 100
id and content each have 126,522 elements and lev.terms has 155,591 elements, so it looks like I stopped processing too early. Since ultimately I'll be working with ~6M documents, I need to ask... Is there any way to speed up this fragment of code?
For now I've sped it up by replacing
sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
with
ind = which(lev %in% x)
cnt = as.integer(factor(x, levels = lev[ind], ordered = TRUE))
sort(ind[cnt])
Now the timings are:
expr min lq mean median uq max neval
... 5.248479 6.202161 6.892609 6.501382 7.313061 10.17205 100
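A further simplification that may be worth trying (my own untested sketch, not part of the timings above): with a fixed levels vector, as.integer(factor(x, levels = lev)) is essentially the lookup that match() performs, so the v1 step inside sparseTDM() can collapse to a single match() call:
# match() returns the position of each element of x in lev (NA when absent),
# which is the same integer coding factor() produces; sort() then drops the NAs.
v1 = lapply(out, function(x, lev) sort(match(x, lev)), lev = lev.terms)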
I went through many iterations of solving this problem when creating quanteda::dfm() (see the GitHub repo here), and the fastest solution by far involves using the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix, like this:
require(data.table)
require(Matrix)
dfm_quanteda <- function(x) {
  docIndex <- 1:length(x)
  if (is.null(names(x)))
    names(docIndex) <- factor(paste("text", 1:length(x), sep = "")) else
    names(docIndex) <- names(x)
  alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                          features = unlist(x, use.names = FALSE))
  alltokens <- alltokens[features != ""]  # if there are any "blank" features
  alltokens[, "n" := 1L]
  alltokens <- alltokens[, by = list(docIndex, features), sum(n)]
  uniqueFeatures <- unique(alltokens$features)
  uniqueFeatures <- sort(uniqueFeatures)
  featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                             features = uniqueFeatures)
  setkey(alltokens, features)
  setkey(featureTable, features)
  alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
  alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]
  sparseMatrix(i = alltokens$docIndex,
               j = alltokens$featureIndex,
               x = alltokens$V1,
               dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
That's just a snippet, of course, but the full source code is easily found in the GitHub repo (dfm-main.R).
I also encourage you to use the full dfm() from the package. You can install it from CRAN, or install the development version with:
devtools::install_github("kbenoit/quanteda")
Then try it on your texts to see how it performs.
Have you tried experimenting with the sort method (algorithm) and specifying quicksort or shell sort?
Something like:
sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "shell")
or:
sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "quick")
Also, you might try using some intermediate variables to evaluate the nested functions, in case the sort algorithm is re-executing those steps again and again:
foo <- factor(x, levels = lev, ordered = TRUE)
bar <- as.integer(foo)
sort(bar, method = "quick")
or
sort(bar)
Good luck!
I have a data frame that holds columns of numbers. I'd like to compute the range across these columns by row and create a new column that contains this range...
tool1 tool2 tool3 range
1 34 12 33
NA 19 23 4
It has to be able to handle NAs too, by just ignoring them.
How could this be done?
I've decided to expand on this, because operating on rows in R is always a pain. So I've decided to compare base R against two very efficient packages, data.table and dplyr.
(I'm not a dplyr expert, so if someone wants to modify my answer, please do)
Note:
Your case isn't a classic case of operating on rows, because it can be solved using the vectorized pmax and pmin, which we won't always be able to use.
So let's create somewhat bigger data than in your example:
n <- 1e4
set.seed(123)
df <- data.frame(tool1 = sample(100, n, replace = T),
tool2 = sample(100, n, replace = T),
tool3 = sample(100, n, replace = T))
Loading the necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
Defining the functions
apply1 <- function(y) apply(y, 1, function(x) max(x, na.rm = T) - min(x, na.rm = T))
apply2 <- function(y) apply(y, 1, function(x) diff(range(x, na.rm = T)))
trans <- function(y) transform(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
DTfunc <- function(y) setDT(y)[, range := pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3)]
DTfunc2 <- function(y) set(y, j = "range", value = with(y, pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))) # Thanks to @Arun for this
dplyrfunc <- function(y) mutate(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
df2 <- as.data.table(df) # This is in order to avoid overriding df by `setDT` during benchmarking
Running some benchmarks
microbenchmark(apply1(df), apply2(df), trans(df), DTfunc(df2), DTfunc2(df2), dplyrfunc(df), times = 100)
Unit: microseconds
expr min lq median uq max neval
apply1(df) 37221.513 40699.3790 44103.3495 46777.305 94845.463 100
apply2(df) 262440.581 278239.6460 287478.4710 297301.116 343962.869 100
trans(df) 1088.799 1178.3355 1234.9940 1287.503 1965.328 100
DTfunc(df2) 2068.750 2221.8075 2317.5680 2400.400 5935.883 100
DTfunc2(df2) 903.981 959.0435 986.3355 1026.395 1235.951 100
dplyrfunc(df) 1040.280 1118.9635 1159.9815 1200.680 1509.189 100
The second data.table approach seems to be the most efficient. Base R's transform and dplyr's mutate are both about the same, and more efficient than the first data.table approach because of the overhead of calling [.data.table.
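For the more general row-wise case, where no pmax()/pmin() pair exists, one further option to benchmark (my own suggestion, not part of the comparison above) is matrixStats, whose row-wise functions run in compiled code and accept na.rm:
library(matrixStats)
m <- as.matrix(df[c("tool1", "tool2", "tool3")])
rowMaxs(m, na.rm = TRUE) - rowMins(m, na.rm = TRUE)  # row-wise range, NAs ignored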
I have a similar problem to the one explained in this question. Similar to that question, I have a data frame that has 3 columns (id, group, value). I want to take n samples with replacement from each group and produce a smaller data frame with n samples from each group.
However, I am doing hundreds of subsamples in simulation code, and the solution based on ddply is too slow to use in my code. I tried to rewrite a simpler version to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code; I am wondering if it can be improved for performance.
#Producing example DataFrame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(id = 1:dfsize, group = rep(1:groupsize,each = ceiling(dfsize/groupsize))[1:dfsize], junkdata = sample(1:10000, size =dfsize))
#Main function for subsampling
sample.from.group <- function(df, dfgroup, size, replace) {
  outputsize <- 1
  newdf <- df  # assuming a sample cannot be larger than the original
  uniquegroups <- unique(dfgroup)
  for (uniquegroup in uniquegroups) {
    dataforgroup <- which(dfgroup == uniquegroup)
    mysubsample <- df[sample(dataforgroup, size, replace), ]
    sizeofsample <- nrow(mysubsample)
    newdf[outputsize:(outputsize + sizeofsample - 1), ] <- mysubsample
    outputsize <- outputsize + sizeofsample
  }
  return(newdf[1:(outputsize - 1), ])
}
#Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)
Here are two plyr-based solutions:
library(plyr)
dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
id = seq_len(dfsize),
group = rep(1:groupsize, length = dfsize),
junkdata = sample(1:10000, size = dfsize))
sample_by_group_1 <- function(df, dfgroup, size, replace) {
  ddply(df, dfgroup, function(x) {
    x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
  })
}
sample_by_group_2 <- function(df, dfgroup, size, replace) {
  idx <- split_indices(df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
library(microbenchmark)
microbenchmark(
  ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
  plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
# expr min lq median uq max neval
# ddply 4488 4723 5059 5360 36606 100
# plyr 443 487 507 536 31343 100
The second approach is much faster because it does the subsetting in a single step - if you can figure out how to do the subsetting in one step, it's usually an easy way to get better performance.
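A base-R version of the same single-subset idea (a sketch of my own, not from the answer above) replaces plyr::split_indices() with split() on the row indices:
sample_by_group_base <- function(df, dfgroup, size, replace) {
  # split the row indices by group, sample within each group,
  # then subset the data frame once with the combined index vector
  idx <- split(seq_len(nrow(df)), df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
sample_by_group_base(testdf, "group", 100, replace = TRUE)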
I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN = function(x) {
  sample(which(test.frame.1$group == x), 100, TRUE)
})
out <- test.frame.1[z,]
out