I've been seeing a lot of comments among data scientists online about how for loops are not advisable. However, I recently found myself in a situation where using one was helpful. I would like to know if there is a better alternative for the following process (and why the alternative would be better):
I needed to run a series of repeated-measures ANOVAs and approached the problem similarly to the reproducible example you see below.
[I am aware that there are other issues regarding running multiple ANOVA models and that there are other options for these sorts of analyses, but for now I'd simply like to hear about the use of a for loop.]
As an example, here are four repeated-measures ANOVA models - four dependent variables, each measured on three occasions:
set.seed(1976)
code <- seq(1:60)
time <- rep(c(0,1,2), each = 20)
DV1 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat <- data.frame(code, time, DV1, DV2, DV3, DV4)
outANOVA <- list()
for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[[i]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                               data = dat))
}
outANOVA
You could write it this way; it's more compact:
outANOVA <-
  lapply(dat, function(y)
    summary(aov(y ~ factor(time) + Error(factor(code)), data = dat)))
for loops are not necessarily slower than apply functions, but many people find them less easy to read. To some extent it is a matter of taste.
The real crime is to use a for loop when a vectorized function is available. These vectorized functions usually contain for loops written in C that are much faster (or call functions that do).
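To make the contrast concrete, here is a minimal sketch (my addition, for illustration) comparing an explicit R-level loop with the equivalent vectorized call:
n <- 10
squares_loop <- numeric(n)  # pre-allocated output
for (i in seq_len(n)) squares_loop[i] <- i^2
squares_vec <- (1:n)^2  # `^` is vectorized; the loop runs in C
identical(squares_loop, squares_vec)  # TRUE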
Notice that in this case we also avoid creating a global variable y and that we don't have to initialize the list outANOVA.
Another point, taken directly from the relevant post For loops in R and computational speed (answer by Glen_b):
For loops in R are not always slower than other approaches, like apply - but there's one huge bugbear: never grow an array inside a loop.
Instead, make your arrays full-size before you loop and then fill them up.
In your case you're growing outANOVA; for big loops this could become problematic.
Here is a microbenchmark of different methods on a simple example:
n <- 100000
microbenchmark::microbenchmark(
preallocated_vec = {x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}},
preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
incremented_vec = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
incremented_list = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
sapply = sapply(1:n, function(i) i^2),
lapply = lapply(1:n, function(i) i^2),
times=20)
# Unit: milliseconds
# expr min lq mean median uq max neval
# preallocated_vec 9.784237 10.100880 10.686141 10.367717 10.755598 12.839584 20
# preallocated_vec2 9.953877 10.315044 10.979043 10.514266 11.792158 12.789175 20
# incremented_vec 74.511906 79.318298 81.277439 81.640597 83.344403 85.982590 20
# preallocated_list 10.680134 11.197962 12.382082 11.416352 13.528562 18.620355 20
# incremented_list 196.759920 201.418857 212.716685 203.485940 205.441188 393.522857 20
# sapply 6.557739 6.729191 7.244242 7.063643 7.186044 9.098730 20
# lapply 6.019838 6.298750 6.835941 6.571775 6.844650 8.812273 20
For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.
Below is a microbenchmark::microbenchmark of your solution as presented in the OP, Moody's solution as presented in his post, and a third solution of mine with even more vectorization (a triple-nested lapply).
Microbenchmark
set.seed(1976); code = seq(1:60); time = rep(c(0,1,2), each = 20);
DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2)); DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2)); DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2)); DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat = data.frame(code, time, DV1, DV2, DV3, DV4)
library(microbenchmark)
microbenchmark(
`Peter Miksza` = {
outANOVA1 = list()
for (i in names(dat)) {
y = dat[[i]]
outANOVA1[[i]] = summary(aov(y ~ factor(time) + Error(factor(code)),
data = dat))
}},
Moody_Mudskipper = {
outANOVA2 =
lapply(dat,function(y)
summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
},
`catastrophic_failure` = {
outANOVA3 =
lapply(lapply(lapply(dat, function(y) y ~ factor(time) + Error(factor(code))), aov, data = dat), summary)
},
times = 1000L)
Results
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448 1000 b
# Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811 1000 a
# catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666 68.87781 1000 a
Disabling JIT compilation by running compiler::setCompilerOptions(optimize = 0) and compiler::enableJIT(0), the following results ensue as well:
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731 1000 a
# Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284 1000 a
# catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036 1000 a
Conclusion
As alluded to in Dirk's comment, there isn't a meaningful difference in performance, but readability is greatly impaired by the vectorization.
On growing lists
Experimenting with Moody's solution, it seems growing lists can be a bad idea if the resulting list is moderately long. Also, using byte-compiled functions directly can provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application, though.
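For completeness, here is a minimal sketch of what pre-allocation would look like for the ANOVA loop (my addition; restricting the loop to the DV columns is an assumption on my part, since the original iterates over every column of dat):
dvs <- grep("^DV", names(dat), value = TRUE)
outANOVA <- vector("list", length(dvs))  # full-size list up front
names(outANOVA) <- dvs
for (i in dvs) {
  outANOVA[[i]] <- summary(aov(dat[[i]] ~ factor(time) + Error(factor(code)), data = dat))
}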
Related
I have a problem where the idea is to find the unique elements of an extremely large matrix, apply some function (in this example it is a sine function, but it can be an arbitrary function, even a trained neural net) to each unique element, and then replace the results back in the original matrix.
I have the following R code snippet:
nrows <- 28000
ncols <- 3000
x <- matrix(round(runif(nrows*ncols, 1, 5), 5), nrow=nrows, ncol=ncols)
u <- unique(as.vector(x))
uindex <- seq(from=1, to=length(u), by=1)
ut <- sin(u)
for (hh in uindex) {
x[x == u[hh]] <- ut[hh]
}
In the above, code-snippet, the portion
for (hh in uindex) {
x[x == u[hh]] <- ut[hh]
}
takes forever to finish for a matrix of dimension on the order of 1e4 x 1e3. How can I optimize the for loop?
Please note that this is just a minimal working example for Stack Overflow only. Hence, please refrain from telling me that I can just do x = sin(x). My point is the for loop.
To expand on my comment regarding avoiding the for loop entirely...
As OP knows, using loops in R is generally slow and if there's an alternative, it's likely to be faster. One answer that avoids the for loop has already been provided. Here is another, with benchmarking.
To start, turn the OP's code into a function.
withForLoop <- function(nrows=28000, ncols=3000) {
x <- matrix(round(runif(nrows*ncols, 1, 5), 5), nrow=nrows, ncol=ncols)
u <- unique(as.vector(x))
uindex <- seq(from=1, to=length(u), by=1)
ut <- sin(u)
for (hh in uindex) {
x[x == u[hh]] <- ut[hh]
}
}
and benchmark it
library(microbenchmark)
microbenchmark(withForLoop, times=100)
Unit: nanoseconds
expr min lq mean median uq max neval
withForLoop 47 49 83.46 49 55 3226 100
So we are trying to beat a median time of 49 nanoseconds. (You could, of course, pick any other summary statistic as your target metric.)
Now rewrite the code using the tidyverse. OP's code starts with a matrix, converts it to a vector and then manipulates the vector. It's not clear if the matrix needs to be recovered. Assume it does, so provide the means to recover it, but - for consistency with OP's code - don't do the recovery.
library(tidyverse)
withTidyverse <- function(nrows=28000, ncols=3000) {
x <- tibble() %>%
expand(
Row=1:nrows,
Col=1:ncols
) %>%
mutate(
Random=round(runif(nrow(.), 1, 5), 5),
Sin=sin(Random)
)
}
microbenchmark(withTidyverse, times=100)
Unit: nanoseconds
expr min lq mean median uq max neval
withTidyverse1 41 42 52.21 42.5 43 964 100
So that's reduced the median execution time to 42.5 nanoseconds on my machine. That's a saving of just over 13%.
Because sin is a relatively quick function I've not bothered to search for unique values and replace each unique value in a batch. I've just taken a blunderbuss approach and recalculated each value in the vector as it arises. Here's a way of taking the more sophisticated, "replace unique values in batches" approach:
withTidyverse2 <- function(nrows=28000, ncols=3000) {
x <- tibble() %>%
expand(
Row=1:nrows,
Col=1:ncols
) %>%
mutate(
Random=round(runif(nrow(.), 1, 5), 5)
)
y <- x %>%
distinct(Random) %>%
mutate(Sin=sin(Random))
x <- x %>%
left_join(y, by="Random")
}
microbenchmark(withTidyverse2, times=100)
Unit: nanoseconds
expr min lq mean median uq max neval
withTidyverse2 44 45 82.31 45.5 51 2543 100
So, in this specific case, the overhead of extracting the unique values and updating in batches is not worthwhile, although it is still quicker than the for loop. OP will have to investigate their actual use case.
There are, of course, lots of other ways to address the actual problem. Which one is optimal is impossible to say based on the information provided.
Here is a data.table based solution:
nrows <- 28000
ncols <- 3000
x <- round(runif(nrows*ncols, 1, 5), 5)
u <- unique(as.vector(x))
uindex = seq(from = 1, to = length(u), by = 1)
library(data.table)
dt.x <- data.table(x, idx = seq_along(x))  # keep an index: merge() sorts by the join key
dt.u <- data.table(u)
dt.u[, ut := sin(u)]
dt.res <- merge(dt.x, dt.u, by.x = "x", by.y = "u", all.x = TRUE)
setorder(dt.res, idx)  # restore the original element order before reshaping
ut <- dt.res[, ut]
output <- matrix(ut, nrow=nrows, ncol=ncols)
The main idea here is to work with vectors and tables, and convert to the matrix only at the very end.
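As a quick sanity check (my addition, assuming the order-preserving version above), the reshaped output should match the directly vectorized computation:
all.equal(output, matrix(sin(x), nrow = nrows, ncol = ncols))  # should be TRUE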
You could use match:
nrows <- 28000
ncols <- 3000
x <- matrix(round(runif(nrows*ncols, 1, 5), 5), nrow=nrows, ncol=ncols)
u <- unique(as.vector(x))
ut <- sin(u)
i <- match(x, u)
x[] <- ut[i]  # assign into x[] so the matrix dimensions are preserved
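A tiny self-contained check of the match() approach (my addition, not part of the answer above):
set.seed(1)
m <- matrix(round(runif(6, 1, 5), 5), nrow = 2)  # small stand-in for x
m_orig <- m
u_small <- unique(as.vector(m))
m[] <- sin(u_small)[match(m, u_small)]  # look up each element's precomputed value
all.equal(m, sin(m_orig))  # TRUE: every element replaced, dimensions kept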
EDITED to provide reproducible results
It wasn't clear initially, but I need the results to take into account NAs in the raw data (df).
I initially wrote my code with for loops to get the proof of concept working, but now I need to speed things up. It runs in ~2-3 minutes with the for loops, but when I re-wrote it using apply(), it wasn't any faster. I thought apply() should be the vectorized solution and therefore faster, but maybe my whole premise is incorrect? (I am not new to R, but computational speed is not usually an issue for me.)
I am working with 1000+ cases and ~100 variables and need to perform 5000+ simulations with the data (turning on and off different conditions).
Starting definitions & sample data:
cases = 1000
variables = 100
simulations = 5000
df <- as.data.frame(array(rnorm(cases * variables, 0, 5), dim=c(cases, variables)))
montecarlo <- matrix(rbinom(simulations * variables, 1, 20/variables), simulations, variables)
montecarlo[montecarlo==0] <- NA
calc <- array(,dim=c(cases, variables, simulations))
interim <- array(,dim=c(cases, variables, simulations))
results <- array(,dim=c(variables, simulations))
for (j in 1:simulations) {
calc[,,j] <- exp(t(t(df) * as.numeric(montecarlo[j,])))
}
For loop version:
for (j in 1:simulations) {
interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
Apply() version:
interim <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results <- as.data.frame(t(apply(interim, c(1,3), sum))) # aggregates results
I am open to any suggestions to speed things up and/or a reason why the apply() version is not any faster. Thank you!
Generally speaking: for-loops aren’t inherently slow. If you’re preallocating output (i.e. not growing a vector, causing multiple copies), they’re pretty much comparable to *apply() functions in speed. The overhead in iterating comes from repeatedly calling R functions from C code; and the advantage of using functionals is simply clarity. Here’s an example with a for-loop wrapped in a function:
foo <- function(x, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], ...)
}
out
}
x <- replicate(10000, rnorm(30), simplify = FALSE)
bench::mark(foo(x, mean), lapply(x, mean))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 foo(x, mean) 36.9ms 38.4ms 26.1 157.3KB 52.3
#> 2 lapply(x, mean) 42.6ms 44.9ms 22.3 78.2KB 100.
The way to get speed improvements in these cases is to move all of the calculations into compiled code.
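For illustration only, here is a hedged sketch of what moving the loop into compiled code could look like with Rcpp (my addition, not from the original answer; mean_cpp is a made-up name and a working C++ toolchain is assumed):
library(Rcpp)
cppFunction("
NumericVector mean_cpp(List x) {
  int n = x.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    NumericVector xi = x[i];
    out[i] = mean(xi);  // Rcpp sugar mean(), evaluated in compiled code
  }
  return out;
}")
bench::mark(foo(x, mean), lapply(x, mean), as.list(mean_cpp(x)),
            check = FALSE)  # check = FALSE: tiny floating-point differences are possible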
That said, there may well be other optimizations for your specific problem. You might want to provide a reproducible example, and ask a new question on Code Review about performance improvements.
Created on 2019-09-02 by the reprex package (v0.3.0.9000)
As @Mikko Marttila noted, the apply() family does not guarantee faster code. Using sweep() and aperm(), the code below is about 3x faster for a 1,000 x 100 x 70 array (i.e., only 7 million elements).
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
Or, for slightly less performance but something closer to what you originally had:
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
Performance:
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 510.9131 514.9030 537.0344 518.2491 524.5709 705.4087 10
apply_OP 446.0352 458.4940 491.6710 500.1995 523.1843 533.9654 10
sweep_rowSums 225.5855 233.2632 252.6149 240.7245 284.1517 292.3476 10
sweep_aperm 136.2519 140.8912 163.7498 154.6984 191.5337 217.8015 10
Data
cases = 1000
variables = 100
simulations = 70
set.seed(123)
calc <- array(sample(cases *variables * simulations),dim=c(cases, variables, simulations))
interim <- array(,dim=c(cases, variables, simulations))
results <- array(,dim=c(variables, simulations))
# Original Loop
for (j in seq_len(simulations)) {
interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
# original apply
interim2 <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results2 <- apply(interim2, c(1,3), sum) # aggregates results
# using sweep
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
#using sweep and aperm
# interim4 <- sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/')
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
# all.equal() compares only two objects at a time, so check pairwise
all.equal(results4, results3)
all.equal(results3, results2)
all.equal(results2, results)
library(microbenchmark)
microbenchmark(
for_loop = {
for (j in seq_len(simulations)) {
interim[,,j] <- t(apply(calc[,,j], 1, function(x) x/sum(x, na.rm = TRUE))) # re-share
results[,j] <- apply(interim[,,j], 2, sum) # aggregates results
}
}
,
apply_OP = {
interim2 <- apply(calc, c(1,3), function(x) x/sum(x, na.rm = TRUE)) # re-share
results2 <- apply(interim2, c(1,3), sum) # aggregates results
}
,
sweep_rowSums = {
interim3 <- sweep(calc, c(1,3), apply(calc, 3, rowSums, na.rm = T), FUN = '/')
results3 <- apply(interim3, c(2,3), sum, na.rm = T)
}
,
sweep_aperm = {
results4 <- colSums(sweep(calc, c(1,3), colSums(aperm(calc, c(2,1,3)), na.rm = T), FUN = '/'), na.rm = T)
}
, times = 10
)
I'm trying to divide each row of a dataframe by a number stored in a second mapping dataframe.
for(g in rownames(data_table)){
print(g)
data_table[g,] <- data_table[g,]/mapping[g,2]
}
However, this is incredibly slow, each row takes almost 1-2 seconds to run. I know iteration is usually not the best way to do things in R, but I don't know how else to do it. Is there any way I can speed up the runtime?
Try this:
sweep(data_table, 1, mapping[[2]], "/")
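A small self-contained check (my addition, with toy data rather than the OP's) that sweep() matches the row-by-row loop:
data_table <- data.frame(a = c(2, 4, 6), b = c(10, 20, 30))
mapping <- data.frame(id = 1:3, divisor = c(2, 4, 6))
swept <- sweep(data_table, 1, mapping[[2]], "/")
looped <- data_table
for (g in seq_len(nrow(looped))) looped[g, ] <- looped[g, ] / mapping[g, 2]
all.equal(swept, looped)  # TRUE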
In terms of speed, here is a benchmark of the possibilities using the iris dataset, including your version (test is assumed here to be a numeric copy of iris, e.g. test <- iris[1:4]; that setup line isn't shown in the original):
microbenchmark::microbenchmark(
A = {
for(g in rownames(test)){
# print(g)
test[g,] <- test[g,]/test[g,2]
}
},
B = sweep(test, 1, test[[2]], "/"),
C = test / test[[2]],
times = 100
)
#Unit: microseconds
#expr min lq mean median uq max neval
#A 82374.693 83722.023 101688.1254 84582.052 147280.057 157507.892 100
#B 453.652 484.393 514.4094 513.850 539.480 623.688 100
#C 404.506 423.794 456.0063 446.101 470.675 729.205 100
You can vectorize this operation if the two data frames have the same number of rows:
dt <- data.frame(a = rnorm(100), b = rnorm(100))
mapping <- data.frame(x = rnorm(100), y = rnorm(100))
dt / mapping[,2]
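This works because arithmetic between a data frame and a vector is applied column by column, with the vector recycled down each column; when the vector's length equals the number of rows, each row ends up divided by its corresponding element. A tiny check (my addition):
dt2 <- data.frame(a = c(2, 4), b = c(6, 8))
v <- c(2, 4)
dt2 / v  # each row divided by the corresponding element of v
#   a b
# 1 1 3
# 2 1 2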
I hope this question will be a nice tutorial for beginners in R (such as me).
I am used to programming languages where loops are necessary to manipulate data, implement algorithms, etc.
Nevertheless, loops in R are slow, which becomes apparent with large data.
Fortunately, R provides built-in functions which allow iterating through elements and doing calculations in a very efficient way.
Now I'd like to avoid loops when I'm analysing data in R, so I've read about lapply, apply and other useful functions.
I'd like to compute the correlation between the first column and each other column of my data, and print the sample name, estimate and p-value in a nice table - all without a for loop.
My idea - create fake data from scratch:
surv <- c(7.1,8,4,2,0.5,5,6)
geneA_expr <- runif(n = 7, min = 1, max = 30)
geneB_expr <- runif(n = 7, min = 1, max = 30)
geneC_expr <- runif(n = 7, min = 1, max = 30)
my_data <- data.frame(surv, geneA_expr, geneB_expr, geneC_expr)
Correlation test with apply - found here on Stack Overflow and in the manual:
md_stat <- apply(my_data[,2:4], 2, cor.test, my_data$surv, method="pearson")
md_stat is a list. Now I'd like to display it nicely, but I have no idea how to do that - it's too complicated for me - so I used a for loop:
for(i in names(md_stat)){
cat(i ,md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
}
geneA_expr 0.2517658 0.5860052
geneB_expr 0.2438112 0.5982849
geneC_expr 0.8026801 0.02977544
How can I replace the above for loop with a built-in function?
Unlist every element of md_stat, then bind the outputs into a matrix:
do.call(rbind, lapply(md_stat, unlist))
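If you only want the sample name, estimate and p-value, here is a minimal sketch (my addition, not part of the answer above) that builds the table directly:
res <- data.frame(
  gene = names(md_stat),
  estimate = sapply(md_stat, function(m) unname(m$estimate)),
  p.value = sapply(md_stat, function(m) m$p.value),
  row.names = NULL
)
res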
Try this
temp <- lapply(seq_along(md_stat), function(i) {
cat(names(md_stat)[[i]], md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
})
I can think of four ways for you to do this, one of which depends on the purrr package.
You could use a loop, walk from the purrr package, lapply, or a recursive function.
library(microbenchmark)
library(purrr)
surv <- c(7.1,8,4,2,0.5,5,6)
geneA_expr <- runif(n = 7, min = 1, max = 30)
geneB_expr <- runif(n = 7, min = 1, max = 30)
geneC_expr <- runif(n = 7, min = 1, max = 30)
my_data <- data.frame(surv, geneA_expr, geneB_expr, geneC_expr)
md_stat <- apply(my_data[,2:4], 2, cor.test, my_data$surv, method="pearson")
md_loop <- function(md_stat) {
for(i in names(md_stat)){
cat(i ,md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
}
}
md_walk <- function(md_stat) {
walk(names(md_stat), function(i) {
cat(i ,md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
})
}
md_apply <- function(md_stat) {
lapply(names(md_stat), function(i) {
cat(i[[1]],md_stat[[i[[1]]]]$estimate, md_stat[[i[[1]]]]$p.value, '\n')
})
}
md_recursive <- function(md_stat) {
i <- names(md_stat)
if(length(i) < 1) {
NULL
} else {
cat(i[[1]],md_stat[[i[[1]]]]$estimate, md_stat[[i[[1]]]]$p.value, '\n')
md_recursive(tail(md_stat, -1))
}
}
md_speed <- microbenchmark(
md_loop(md_stat),
md_walk(md_stat),
md_apply(md_stat),
md_recursive(md_stat)
)
Speed comparisons
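To view the comparison, print or plot the microbenchmark object (my addition):
print(md_speed)
boxplot(md_speed)  # or ggplot2::autoplot(md_speed)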
I was hoping someone could help point me in the right direction to create a vector in R containing a defined number of randomly generated values. I am a complete newbie to R, and I have learned that the c() (concatenate) function is used for creating vectors. However, I wish to populate the vector with 50 random numbers. I do not wish to specify a range or any other conditions for the numbers.
MyVectorObject <- c(...)
Any suggestions would be greatly appreciated!
It depends on which numbers you want to generate. These are some options.
x1 <- rpois(n = 50, lambda = 10)
x2 <- runif(n = 50, min = 1, max = 10)
x3 <- sample(x = c(1, 3, 5), size = 50, replace = TRUE)
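If you truly don't care about the range or distribution, a plain standard normal draw is the simplest option (my addition):
MyVectorObject <- rnorm(50)  # 50 draws from a standard normal distribution
length(MyVectorObject)  # 50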
If we are talking about integers, you want to generate numbers in the interval [-base::.Machine$integer.max, base::.Machine$integer.max], which on my computer is [-2147483647, 2147483647].
Implementation
You can use base::sample to generate positive numbers from 1 to base::.Machine$integer.max:
random.pos <- function(N) {
int.max <- .Machine$integer.max
return(sample(int.max, N, replace=TRUE))
}
If you also want negative numbers, you can use:
random.posneg <- function(N) {
int.max <- .Machine$integer.max
random.numbers <- sample(int.max, N, replace = TRUE)
random.signs <- sample(c(1,-1), N, replace=TRUE)
return(random.numbers * random.signs)
}
NOTE: none of the functions above ever generates 0 (zero).
The best approach (in my opinion) is to use the base::runif function.
random.runif <- function(N) {
int.max <- .Machine$integer.max
int.min <- -int.max
return(as.integer(runif(N, int.min, int.max)))
}
This will be a little slower than using base::sample, but you get uniformly distributed random numbers that can include zero.
Benchmark
library(microbenchmark)
require(compiler)
random.runif <- cmpfun(random.runif)
random.pos <- cmpfun(random.pos)
random.posneg <- cmpfun(random.posneg)
N <- 500
op <- microbenchmark(
RUNIF = random.runif(N),
POS = random.pos(N),
POSNEG = random.posneg(N),
times = 10000
)
print(op)
## library(ggplot2)
## boxplot(op)
## qplot(y=time, data=op, colour=expr) + scale_y_log10()
And the results from the benchmark above:
Unit: microseconds
expr min lq mean median uq max neval
RUNIF 13.423 14.251 15.197122 14.482 14.694 2425.290 10000
POS 4.174 5.043 5.613292 5.317 5.645 2436.909 10000
POSNEG 11.673 12.845 13.383194 13.285 13.800 60.304 10000