Are there cases where it is not advantageous to use the magrittr pipe inside of R functions from the perspectives of (1) speed, and (2) ability to debug effectively?
There are advantages and disadvantages to using a pipe inside of a function. The biggest advantage is that it's easier to see what's happening within a function when you read the code. The biggest downsides are that error messages become harder to interpret and the pipe breaks some of R's rules of evaluation.
Here's an example. Let's say we want to make a pointless transformation to the mtcars dataset. Here's how we could do that with pipes...
library(tidyverse)
tidy_function <- function() {
mtcars %>%
group_by(cyl) %>%
summarise(disp = sum(disp)) %>%
mutate(disp = (disp ^ 5) / 10000000000)
}
You can clearly see what's happening at every stage, even though it's not doing anything useful. Now let's look at the same code using the Dagwood Sandwich approach...
base_function <- function() {
mutate(summarise(group_by(mtcars, cyl), disp = sum(disp)), disp = (disp^5) / 10000000000)
}
Much harder to read, even though it gives us the same result...
all.equal(tidy_function(), base_function())
# [1] TRUE
The most common way to avoid using either a pipe or a Dagwood Sandwich is to save the results of each step to an intermediate variable...
intermediate_function <- function() {
x <- mtcars
x <- group_by(x, cyl)
x <- summarise(x, disp = sum(disp))
mutate(x, disp = (disp^5) / 10000000000)
}
More readable than the last function and R will give you a little more detailed information when there's an error. Plus it obeys the traditional rules of evaluation. Again, it gives the same results as the other two functions...
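To make the debugging point a bit more concrete, here is a minimal base-R sketch (no dplyr, so it is not the code above). The names check_positive() and stepwise() are made up for this illustration. Because every step is its own statement, the captured error carries the exact call that failed.

```r
# Hypothetical helper: fails loudly when a step produces bad values.
check_positive <- function(v) {
  if (any(v <= 0)) stop("values must be positive")
  v
}

# Same shape as intermediate_function(): one statement per step.
stepwise <- function(df) {
  x <- df
  x <- aggregate(disp ~ cyl, data = x, FUN = sum)
  x$disp <- check_positive(x$disp - 1e6)  # deliberately fails on this step
  x
}

# Capture the error instead of letting it stop the script.
err <- tryCatch(stepwise(mtcars), error = function(e) e)
conditionMessage(err)        # the message passed to stop()
deparse(conditionCall(err))  # the check_positive(...) call that raised it
```

With this layout you can also drop a browser() call between any two steps and inspect x at that point.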
all.equal(tidy_function(), intermediate_function())
# [1] TRUE
You specifically asked about speed, so let's compare these three functions by running each of them 1000 times...
library(microbenchmark)
timing <-
microbenchmark(tidy_function(),
intermediate_function(),
base_function(),
times = 1000L)
timing
#Unit: milliseconds
#expr min lq mean median uq max neval cld
#tidy_function() 3.809009 4.403243 5.531429 4.800918 5.860111 23.37589 1000 a
#intermediate_function() 3.560666 4.106216 5.154006 4.519938 5.538834 21.43292 1000 a
#base_function() 3.610992 4.136850 5.519869 4.583573 5.696737 203.66175 1000 a
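Even in this trivial example, the pipe is a tiny bit slower than the other two options, but only just: the medians differ by about 0.3 ms, and the cld column puts all three functions in the same group.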
Conclusion
Feel free to use the pipe in your functions if it's the most comfortable way for you to write code. If you start running into problems or if you need your code to be as fast as humanly possible, then switch to a different paradigm.
Related
I have a dataframe with two grouping variables class and group. For each class, I have a plotting task per group.
Mostly, I have 2 levels per class and 500 levels per group.
I'm using the parallel package for parallelization and its mclapply function to iterate over the class and group levels.
I'm wondering which is the best way to write my iterations. I think I have two options:
Run parallelization for class variable.
Run parallelization for group variable.
My computer has 3 cores available to the R session; I usually reserve the 4th core for my operating system. I was wondering whether, if I parallelize over the class variable with its 2 levels, the 3rd core would never be used, so I thought it would be more efficient to parallelize over the group variable and keep all 3 cores busy. I've written some speed tests to check which way is best:
library(microbenchmark)
library(parallel)
f = function(class, group, A, B) {
mclapply(seq(class), mc.cores = A, function(z) {
mclapply(seq(group), mc.cores = B, function(c) {
ifelse(class == 1, 'plotA', 'plotB')
})
})
}
class = 2
group = 500
microbenchmark(
up = f(class, group, 3, 1),
nest = f(class, group, 1, 3),
times = 50L
)
Unit: milliseconds
expr min lq mean median uq max neval
up 6.751193 7.897118 10.89985 9.769894 12.26880 26.87811 50
nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878 50
The result says I should parallelize over the class variable and not over the group variable.
The takeaway would be that I should always write single-core functions and then call them in parallel. I think my code would be simpler this way than if I wrote nested functions with parallelization built in.
The ifelse condition is there because the code that prepares the data for plotting is more or less the same for both class levels, so I thought it would take fewer lines to write one longer function that checks which class level is in use than to split it into two shorter functions.
What is the best practice for writing this kind of code? It seems clear, but because I'm not an expert data scientist, I would like to know your working approach.
This thread is about this problem. But I think my question covers both points of view:
Code beauty and clarity
Speed performance
Thanks
You asked this a while ago but I'll attempt an answer in case anyone else was wondering the same thing. First, I like to split up my task and then loop over each part. This gives me more control over the process.
parts <- split(df, list(df$class, df$group))
mclapply(parts, some_function)
Second, distributing tasks to multiple cores carries a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Here, mclapply splits the job across however many cores you have and performs the fork once. This is much more efficient than nesting two mclapply loops.
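Here is a small self-contained sketch of the split-then-loop idea. The toy data frame and plot_one() are stand-ins I made up (the real plotting code isn't shown in the question), and the core count is an assumption you should adjust. Note that mclapply cannot use more than one core on Windows, hence the fallback.

```r
library(parallel)

# Toy data: 2 classes x 3 groups, two rows per combination.
df <- data.frame(
  class = rep(c("a", "b"), each = 6),
  group = rep(1:3, times = 4),
  value = seq_len(12)
)

# One piece per class/group combination; drop = TRUE skips empty combinations.
parts <- split(df, list(df$class, df$group), drop = TRUE)

# Hypothetical single-core worker; a real one would draw the plot.
plot_one <- function(part) sum(part$value)

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(parts, plot_one, mc.cores = cores)

names(res)  # one entry per class.group combination
```

The worker only ever sees one piece of the data, so it stays a plain single-core function and the parallelism lives in exactly one place.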
I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about within base and not dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point. I am now consistently seeing contradicting conclusions.
n = 10000
p = 100
test_matrix <- matrix(runif(n*p), n, p)
test_df <- as.data.frame(test_matrix)
benchmark <- microbenchmark(
colMeans(test_df),
colMeans(as.matrix(test_df)),
sapply(test_df, mean),
vapply(test_df, mean, 0),
colMeans(test_matrix),
apply(test_matrix, 2, mean)
)
Unit: microseconds
expr min lq mean median uq max neval
colMeans(test_df) 3099.941 3165.8290 3733.024 3241.345 3617.039 11387.090 100
colMeans(as.matrix(test_df)) 3091.634 3158.0880 3553.537 3241.345 3548.507 8531.067 100
sapply(test_df, mean) 2209.227 2267.3750 2723.176 2338.172 2602.289 10384.612 100
vapply(test_df, mean, 0) 2180.153 2228.2945 2611.982 2270.584 2514.123 7421.356 100
colMeans(test_matrix) 904.307 915.0685 1020.085 939.422 1002.667 2985.911 100
apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009 100
For a matrix, colMeans() torches apply(). That is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead associated with converting the data frame into a matrix.
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop off. Obviously this makes an assumption about the input the user pushes in, but that is not the point here.
colMeans2 <- function(myobject) {
if (typeof(myobject) == "double") {
colMeans(myobject)
} else if (typeof(myobject) == "list") {
vapply(myobject, mean, 0)
} else {
stop("what is this")
}
}
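For what it's worth, a slightly more formal version might look like the sketch below. This is only one way to write it, and col_means2 is a made-up name. It checks the class of the input rather than typeof(), so integer matrices are accepted too (the typeof() test above would reject them), and it refuses data frames with non-numeric columns instead of failing somewhere inside vapply().

```r
col_means2 <- function(obj) {
  if (is.matrix(obj) && is.numeric(obj)) {
    # Numeric matrix: colMeans() works on it directly.
    colMeans(obj)
  } else if (is.data.frame(obj) &&
             all(vapply(obj, is.numeric, logical(1)))) {
    # Data frame: loop over the columns and skip the as.matrix() conversion.
    vapply(obj, mean, numeric(1))
  } else {
    stop("`obj` must be a numeric matrix or a data frame of numeric columns")
  }
}

m  <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
df <- as.data.frame(m)

col_means2(m)   # 2 5
col_means2(df)  # 2 5, named V1 and V2
```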
For Reference
Here are two other posts I could find, both somewhat related and mentioning how colMeans() should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?
Is it possible to evaluate a codeblock consisting of multiple lines of code with microbenchmark? If so, how?
Example:
We have some numeric data in character columns:
testdata <- tibble::tibble(col1 = runif(1000), col2 = as.character(runif(1000)), col3 = as.character(runif(1000)))
Now we can try different ways of converting these.
We can directly call as.numeric on the columns:
testdata$col2 <- as.numeric(testdata$col2)
testdata$col3 <- as.numeric(testdata$col3)
We could try doing it inside a dplyr mutate:
testdata <- dplyr::mutate(testdata, col2 = as.numeric(col2),
col3 = as.numeric(col3))
Or maybe we know all columns should be numeric so we can try something less explicit that does some checking:
testdata <- dplyr::mutate_if(testdata, .predicate = is.character, .funs = as.numeric)
Now we want to compare the performance of these 3 options.
The latter 2 options are individual calls so these can easily be tested in microbenchmark, but the first option consists of two separate calls. We could wrap the two calls in a function and then evaluate that in microbenchmark, but this introduces the slight overhead of the function, so it isn't technically evaluating the solution that we have now. We could include the calls separately in the microbenchmark and then add them up afterwards, which should do fine for the mean, but for things like the min or the max this doesn't necessarily give sensible results.
The examples in the docs for microbenchmark mostly use simple individual expressions and often use a simple function to wrap code.
Is it possible to directly input multiple lines of code into microbenchmark to be evaluated together?
By wrapping multiple lines of code in {} and separating them with a ; they can be evaluated as one block in microbenchmark:
bench <- microbenchmark(separate = {as.numeric(testdata$col2); as.numeric(testdata$col3)},
mutate = dplyr::mutate(testdata, col2 = as.numeric(col2),
col3 = as.numeric(col3)),
mutateif = dplyr::mutate_if(testdata, .predicate = is.character, .funs = as.numeric))
Which gives the following results:
> bench
Unit: microseconds
expr min lq mean median uq max neval
separate 477.014 529.708 594.8982 576.4275 611.6275 1109.762 100
mutate 3410.351 3633.070 4465.0583 3876.6975 4446.0845 34298.910 100
mutateif 5118.725 5365.126 7241.5727 5637.5520 6290.7795 118874.982 100
For loops are known to be quite slow in R. I would like to know if the same is true for the while loop.
If so, is there a way to optimize a while loop in R? For example, for the for loop the apply functions do a good job, but I don't know of an analogue for the while loop.
Even Hadley in his book (Advanced R) is quite vague about how to optimize a while loop.
"For loops are know to be quite slow in R." That's simply wrong. for loops are fast. What you do inside the loop is slow (in comparison to vectorized operations). I would expect a while loop to be slower than a for loop since it needs to test a condition before each iteration. Keep in mind that R is an interpreted language, i.e., there are no compiler optimizations. Also, function calls in R are not slow per se, but still there is a lot going on during a function call and that adds up. Vectorized operations avoid repeated function calls.
It's hard to come up with a fair comparison between both loop constructs, but here we go:
library(microbenchmark)
microbenchmark(
for (i in seq_len(1e6)) i,
{i <- 1; while (i <= 1e6) {i <- i+1}},
times = 10, unit = "relative"
)
#Unit: relative
# expr min lq mean median uq max neval cld
# for (i in seq_len(1e+06)) i 1.000000 1.000000 1.00000 1.000000 1.000000 1.00000 10 a
# { i <- 1 while (i <= 1e+06) { i <- i + 1 } } 8.987293 8.994548 9.14089 9.019795 9.036116 10.07227 10 b
The while loop needs to test the condition, assign to i and call + at each iteration.
If you must use a while loop (often it can be avoided) and performance is important, the best solution is implementing it as compiled code which can be called from R. The Rcpp package makes this very easy. In some cases byte compilation as offered by the compiler package can also speed up R loops, but (well written) actual compiled code will always be faster.
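As a rough sketch of two of those options that need nothing beyond base R: replacing the while loop with a vectorized expression, and byte-compiling it with compiler::cmpfun(). The task (find the first index where a running total exceeds a limit) and all function names are made up for the example. One caveat I'm fairly sure about: R 3.4.0 and later byte-compile functions automatically via the JIT, so the explicit cmpfun() call may change little on a current installation. Benchmark it yourself before relying on it.

```r
library(compiler)

# Made-up task: first index at which the running total exceeds `limit`.
first_over_while <- function(x, limit) {
  i <- 0
  total <- 0
  while (total <= limit && i < length(x)) {
    i <- i + 1
    total <- total + x[i]
  }
  if (total > limit) i else NA_integer_
}

# Same answer without an explicit loop: one cumsum() call and one comparison.
first_over_vec <- function(x, limit) {
  which(cumsum(x) > limit)[1]
}

# Explicitly byte-compiled copy of the while version.
first_over_bc <- cmpfun(first_over_while)

x <- c(3, 1, 4, 1, 5, 9)
first_over_while(x, 8)  # 4
first_over_vec(x, 8)    # 4
first_over_bc(x, 8)     # 4
```

The vectorized version always computes the full cumulative sum, so for very long vectors where the answer comes early the loop can still win.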
I am trying to write R code which acts as a "moving window", just with memory (state). I have figured out (thanks to this question) how to apply a function to subsequent tuples of elements. For example, if I wish to write a (simple) moving average with a typical period 4, I would do the following:
mapply(myfunc, x[1:(length(x)-4)], x[2:(length(x)-3)], x[3:(length(x)-2)], x[4:(length(x)-1)])
Where myfunc is a function with 4 arguments, which calculates their mean (I cannot use mean, as it expects only 1 argument, and I don't know how to make the 4 arguments a single vector).
That's quite cumbersome, though, and if the typical period is 100, say, I am not sure how to do it.
So here's my first question: how do I generalize this?
But here's another issue: suppose I wish the applied function to be able to save state. A simple example would be to keep record of how many values it was applied on so far. Another example is the exponential moving average (EMA), which is not really a window function, but instead a function which works on single values but which keeps state (the last resulted mean).
How can I write a function which when applied to a vector, works on its values one by one, returning a vector of the same length, which is able to retain its last output every time, or save any other "state" during its calculations? In Python, for example, I'd use classes for that, but that's quite difficult in R.
Important note: I am not interested in auxiliary R packages like zoo or TTR to do the work for me. I am trying to learn R, and in any case the functions I wish to write, while having similarities with MA or EMA, are custom, and do not exist in any of these packages.
Regarding your first question,
n <- length(x)
k <- 4
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
do.call("mapply", c("myfunc", split(r, 1:k)))
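If your function can take the window as a single vector, a shorter route is to let embed() build one row per window and then work row by row. This is a sketch of that idea, not the code above. Note that it produces n - k + 1 windows, one more than the mapply call in the question.

```r
x <- c(1, 2, 3, 4, 5, 6)
k <- 4

# One row per window; each row holds the window values in reverse order.
windows <- embed(x, k)

# Simple moving average: the mean of every row.
moving_avg <- rowMeans(windows)   # 2.5 3.5 4.5

# Any custom function of a whole window works the same way, for any k.
window_range <- apply(windows, 1, function(w) max(w) - min(w))  # 3 3 3
```

If the order inside the window matters to your function, reverse each row first with rev(w).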
Regarding the second question, Reduce can be used to iterate over a vector saving state.
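A minimal sketch of what that could look like for the EMA from the question, assuming the usual recurrence new = alpha * value + (1 - alpha) * previous, seeded with the first value. The closure at the end is a second, more general way to keep state, roughly what you would use a class for in Python. The names ema and make_counter are made up here.

```r
# Reduce() carries the previous result forward; accumulate = TRUE keeps
# every intermediate value, so the output has the same length as x.
ema <- function(x, alpha) {
  Reduce(function(prev, cur) alpha * cur + (1 - alpha) * prev,
         x, accumulate = TRUE)
}

ema(c(1, 2, 3), alpha = 0.5)  # 1.00 1.50 2.25

# A closure keeps arbitrary state between calls via <<-.
make_counter <- function() {
  seen <- 0
  function(v) {
    seen <<- seen + 1  # update the variable in the enclosing environment
    seen
  }
}

counter <- make_counter()
vapply(c(10, 20, 30), counter, numeric(1))  # 1 2 3
```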
For things like this you should consider using a plain for loop:
x <- runif(10000)
k <- 100
n <- length(x)
res <- numeric(n - k)
library(microbenchmark)
microbenchmark(times=5,
for(i in k:n) res[i - k + 1] <- sum(x[i:(i + k)]),
{
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
gg <- do.call("mapply", c("sum", split(r, 1:k)))
},
flt <- filter(x, rep(1, k))
)
Produces:
Unit: milliseconds
min lq median uq max neval
for 163.5403 164.4929 165.2543 166.6315 167.0608 5
embed/mapply 1255.2833 1307.3708 1338.2748 1341.5719 1405.1210 5
filter 6.7101 6.7971 6.8073 6.8161 6.8991 5
Now, the results are not identical and I don't pretend to understand exactly what GGrothendieck is doing with embed, but generally speaking for loops are just as fast as *pply functions so long as you initialize your result vectors first. Windowed calculations don't lend themselves well to vectorization, so might as well use a for loop.
EDIT: as several have pointed out in comments, there appears to be an internally implemented function to do this (filter) that is quite a bit faster, so that seems to be the best option (though you should confirm it actually does what you want as again, the results are not exactly identical and I am not personally familiar with the function; in its default configuration it appears to do a rolling weighted sum, or sum if weights are 1, with a centered window).
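If you want to check that the approaches really agree, here is a sketch that computes the same trailing rolling sum (each window ends at position i and covers k values) three ways. It is not the benchmark code above: the indexing is written so that no window runs past the end of x, and stats::filter() is called with sides = 1 so the window trails rather than being centered. The cumsum() version is an extra variant I added.

```r
set.seed(1)
x <- runif(1000)
k <- 100
n <- length(x)

# 1. Plain for loop with a preallocated result vector.
roll_loop <- numeric(n - k + 1)
for (i in k:n) roll_loop[i - k + 1] <- sum(x[(i - k + 1):i])

# 2. stats::filter() with unit weights; sides = 1 uses only past values,
#    so the first k - 1 results are NA and are dropped here.
roll_filter <- stats::filter(x, rep(1, k), sides = 1)
roll_filter <- as.numeric(roll_filter[k:n])

# 3. Difference of cumulative sums.
cs <- c(0, cumsum(x))
roll_cumsum <- cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]

all.equal(roll_loop, roll_filter)
all.equal(roll_loop, roll_cumsum)
```

Writing stats::filter() in full avoids picking up dplyr's filter() by accident if the tidyverse is loaded.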