Evaluate multiline codeblock with microbenchmark - r

Is it possible to evaluate a codeblock consisting of multiple lines of code with microbenchmark? If so, how?
Example:
We have some numeric data in character columns:
testdata <- tibble::tibble(col1 = runif(1000), col2 = as.character(runif(1000)), col3 = as.character(runif(1000)))
Now we can try different ways of converting these.
We can directly call as.numeric on the columns:
testdata$col2 <- as.numeric(testdata$col2)
testdata$col3 <- as.numeric(testdata$col3)
We could try doing it inside a dplyr mutate:
testdata <- dplyr::mutate(testdata, col2 = as.numeric(col2),
                          col3 = as.numeric(col3))
Or maybe we know all columns should be numeric so we can try something less explicit that does some checking:
testdata <- dplyr::mutate_if(testdata, .predicate = is.character, .funs = as.numeric)
Now we want to compare the performance of these 3 options.
The latter 2 options are individual calls, so they can easily be tested in microbenchmark, but the first option consists of two separate calls. We could wrap the two calls in a function and then benchmark that, but this introduces the slight overhead of a function call, so it isn't technically evaluating the exact code we have now. We could also include the two calls as separate expressions in the microbenchmark and add the timings afterwards; that works well enough for the mean, but for statistics like the min or the max it doesn't necessarily give sensible results.
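For reference, the wrapper-function option mentioned above might look something like this (convert_separately is just an illustrative name, not anything from the original question):
# Illustrative only: the wrapper-function workaround, which adds a small
# amount of call overhead to every timing.
convert_separately <- function(df) {
  df$col2 <- as.numeric(df$col2)
  df$col3 <- as.numeric(df$col3)
  df
}
microbenchmark::microbenchmark(wrapped = convert_separately(testdata))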
The examples in the docs for microbenchmark mostly use simple individual expressions and often use a simple function to wrap code.
Is it possible to directly input multiple lines of code into microbenchmark to be evaluated together?

By wrapping multiple lines of code in {} and separating them with ;, they can be evaluated as one block in microbenchmark:
library(microbenchmark)
bench <- microbenchmark(separate = {as.numeric(testdata$col2); as.numeric(testdata$col3)},
                        mutate = dplyr::mutate(testdata, col2 = as.numeric(col2),
                                               col3 = as.numeric(col3)),
                        mutateif = dplyr::mutate_if(testdata, .predicate = is.character, .funs = as.numeric))
Which gives the following results:
> bench
Unit: microseconds
expr min lq mean median uq max neval
separate 477.014 529.708 594.8982 576.4275 611.6275 1109.762 100
mutate 3410.351 3633.070 4465.0583 3876.6975 4446.0845 34298.910 100
mutateif 5118.725 5365.126 7241.5727 5637.5520 6290.7795 118874.982 100
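The semicolons are optional: because the braced block is passed to microbenchmark as a single expression, the same thing can be written across several lines, e.g. (a sketch reusing testdata from above):
# Equivalent to the 'separate' expression above, written across lines
bench2 <- microbenchmark(separate = {
  as.numeric(testdata$col2)
  as.numeric(testdata$col3)
})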

Related

R data table: strangely poor performance in subsetting

I was under the impression that data.table is extremely well optimized, so I was quite surprised to see this:
library(data.table)
SimData <- data.table(ID = sample(1:4e5, 4e6, replace = TRUE),
                      DATE = sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
                                    4e6, replace = TRUE))
microbenchmark::microbenchmark(SimData[ID==1&DATE>="2005-01-01"])
microbenchmark::microbenchmark(SimData[ID==1][DATE>="2005-01-01"])
The two solutions are quite obviously the same, yet there is more than an order of magnitude difference in runtime. Is it possible that data.table performs so poorly with the first form? (I.e., that it can't automatically optimize this call.) Or am I overlooking something here?
The long operation is SimData[DATE>="2005-01-01"] because it returns millions of rows.
microbenchmark::microbenchmark(SimData[DATE>="2005-01-01"],SimData[ID==1])
Unit: microseconds
expr min lq mean median uq max neval
SimData[DATE >= "2005-01-01"] 32542.8 44549.55 51323.53 47529.75 50258.10 117396.3 100
SimData[ID == 1] 820.0 1043.55 1397.79 1435.15 1688.25 2302.5 100
SimData[ID == 1] is much quicker because it returns only a few rows.
When you execute SimData[ID==1&DATE>="2005-01-01"], you force both conditions to be evaluated on all rows.
With SimData[ID==1][DATE>="2005-01-01"], the quick operation is done first, and the subsequent filter is also quick because it is applied to only a few rows.
As mentioned by @jangorecki, there is room for improvement in this area.
data.table optimizes queries of the form X == x, and ID == 1 matches this form, so the first time you run this query it takes a while, but after that calling the same query is very fast. In your case the second run of SimData[ID == 1] is very fast, and the returned data.table is very small, which makes the subsequent DATE filter fast as well.
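A minimal way to see this auto-indexing at work, reusing SimData from the question in a fresh session (indices() lists any secondary indices data.table has created):
library(data.table)
# The first ID == 1 subset builds a secondary index on ID; later runs reuse it.
# Auto-indexing can be disabled with options(datatable.auto.index = FALSE).
system.time(SimData[ID == 1])  # slower: the index is created here
system.time(SimData[ID == 1])  # faster: the index is reused
indices(SimData)               # lists the automatically created index (expected: "ID")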

In R, split a vector randomly into k chunks?

I have seen many variations on the "split vector X into Y chunks in R" question on here. See, for example, here and here for just two. So, when I realized I needed to split a vector into Y chunks of random size, I was surprised to find that the randomness requirement might be "new": I couldn't find a way to do this on here.
So, here's what I've drawn up:
k.chunks = function(seq.size, n.chunks) {
  break.pts = sample(1:seq.size, n.chunks, replace = F) %>% sort() # Get a set of break points chosen from along the length of the vector without replacement, so no duplicate selections.
  groups = rep(NA, seq.size) # Set up the empty output vector.
  groups[1:break.pts[1]] = 1 # Set the first set of group affiliations because it has a unique start point of 1.
  for (i in 2:(n.chunks)) { # For all other chunks...
    groups[break.pts[i-1]:break.pts[i]] = i # Set the respective group affiliations.
  }
  groups[break.pts[n.chunks]:seq.size] = n.chunks # Set the last group affiliation because it has a unique endpoint of seq.size.
  return(groups)
}
My question is: is this inelegant or inefficient somehow? It will get called thousands of times in the code I plan to run, so efficiency is important to me. It'd be especially nice to avoid the for loop or having to set both the first and last groups "manually." My other question: are there logical inputs that could break this? I recognize that n.chunks cannot be greater than seq.size, so I mean other than that.
That should be pretty quick for smaller numbers. But here is a more concise way.
k.chunks2 = function(seq.size, n.chunks) {
  break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
  break.len <- diff(c(0, break.pts, seq.size))
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}
If you really get a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks gets very large, the proportions will get very small. But it is faster.
k.chunks3 = function(seq.size, n.chunks) {
  props <- runif(n.chunks)
  grp.props <- props / sum(props)
  chunk.size <- floor(grp.props[-n.chunks] * seq.size)
  break.len <- c(chunk.size, seq.size - sum(chunk.size))
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}
Running a benchmark, I think any of these will be fast enough (unit is microseconds).
n <- 1000
y <- 10
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: microseconds
expr min lq mean median uq max neval
k.chunks(n, y) 49.9 52.05 59.613 53.45 58.35 251.7 100
k.chunks2(n, y) 46.1 47.75 51.617 49.25 52.55 107.1 100
k.chunks3(n, y) 8.1 9.35 11.412 10.80 11.75 44.2 100
But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).
n <- 1000000
y <- 100000
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: milliseconds
expr min lq mean median uq max neval
k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038 100
k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510 69.5639 100
k.chunks3(n, y) 7.7354 8.62715 10.32754 9.07045 10.44675 58.2093 100
All said and done, I would probably use my k.chunks2() function.
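For instance, a quick usage check of k.chunks2() (a sketch; chunk sizes vary from run to run):
set.seed(42)                 # only for a reproducible illustration
g <- k.chunks2(seq.size = 20, n.chunks = 4)
g                            # group label for each of the 20 positions
table(g)                     # chunk sizes; they always sum to seq.size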
Randomness is probably inefficient, but that seems to be expected: a random split suggests the inputs themselves should also be random. So, for a desired random selection from a vector Y, it seems the effort should go into an index of Y (and of successive Ys) that is, or appears to be, random. With enough sets of Ys it could be discerned how far from completely random the indexing is, but maybe that is not material, or perhaps merely thousands of repetitions are insufficient to demonstrate it.
Nonetheless, my sense is that both inputs to sample need to be 'random' in some way, as certainty in one reduces the randomness of the other.
my_vector <- 1:100000
sample_1 <- sample(my_vector, 50, replace = FALSE)
sample_2 <- sample(my_vector, 80, replace = FALSE)
full_range <- c(1, sort(unique(c(sample_1, sample_2))), 100000)
starts <- full_range[c(TRUE, FALSE)] # odd positions; see https://stackoverflow.com/questions/33257610/how-to-return-the-elements-in-the-odd-position
ends <- full_range[c(FALSE, TRUE)]   # even positions
!unique(diff(full_range))            # TRUE flags a zero-length gap (duplicate break point)
And absent setting a seed, I think non-reproducible is as close as you get to a random selection from the Y(s). This answer just suggests an approach to indexing Y. The use of the indices thereafter might follow @Adam's approach. And, of course, I could be completely wrong about all of this. Clearer thinkers on randomness than me might well weigh in...

Code performance: apply family or optimized alternatives

I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about within base and not dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point. I am now consistently seeing contradictory conclusions.
n = 10000
p = 100
test_matrix <- matrix(runif(n*p), n, p)
test_df <- as.data.frame(test_matrix)
benchmark <- microbenchmark(
colMeans(test_df),
colMeans(as.matrix(test_df)),
sapply(test_df, mean),
vapply(test_df, mean, 0),
colMeans(test_matrix),
apply(test_matrix, 2, mean)
)
Unit: microseconds
expr min lq mean median uq max neval
colMeans(test_df) 3099.941 3165.8290 3733.024 3241.345 3617.039 11387.090 100
colMeans(as.matrix(test_df)) 3091.634 3158.0880 3553.537 3241.345 3548.507 8531.067 100
sapply(test_df, mean) 2209.227 2267.3750 2723.176 2338.172 2602.289 10384.612 100
vapply(test_df, mean, 0) 2180.153 2228.2945 2611.982 2270.584 2514.123 7421.356 100
colMeans(test_matrix) 904.307 915.0685 1020.085 939.422 1002.667 2985.911 100
apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009 100
For a matrix, colMeans() torches apply(). That is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead associated with converting the data frame back into a matrix.
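A rough way to check that claim is to time the conversion on its own (a sketch reusing test_df and test_matrix from above):
microbenchmark::microbenchmark(
  as.matrix(test_df),           # conversion alone
  colMeans(test_matrix),        # computation alone
  colMeans(as.matrix(test_df))  # conversion + computation
)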
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop off. Obviously this makes an assumption about the input the user pushes in, but that is not the point here.
colMeans2 <- function(myobject) {
  if (typeof(myobject) == "double") {
    colMeans(myobject)
  } else if (typeof(myobject) == "list") {
    vapply(myobject, mean, 0)
  } else {
    stop("what is this")
  }
}
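As a quick sanity check that the two branches agree (a sketch; the matrix here has no column names while the data-frame branch returns a named vector, so names are dropped before comparing):
# Reuses test_matrix and test_df from the benchmark above
all.equal(unname(colMeans2(test_matrix)),
          unname(colMeans2(test_df)))   # expected: TRUE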
For Reference
Here are two other posts I could find, both somewhat related and mentioning how colMeans() should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?

Build a Grid based on two input vectors

I'm trying (using R) to build a "grid" in a matrix based on two input vectors. The idea is to avoid nested loops like this:
inputVector1 = 1:4
inputVector2 = 1:4
grid = NULL
for (i in inputVector1) {
  line = NULL
  for (j in inputVector2) {
    cellValue = i + j # Instead of i+j it can be anything like taking a value in a dataframe
    line = cbind(line, cellValue)
  }
  grid = rbind(grid, line)
}
Is there a dedicated function in R to do this kind of job faster and more simply? I know there are apply family functions, but I didn't find a proper way to do it (without combining multiple apply family functions). Thank you for the help.
Loops are kind of simple and they are not necessarily slow. However, it depends on how you use them. In your code (I call your approach L.GUEGAN() for further reference), for instance, you don't exploit the fact that you know the size of the final grid, and you keep growing vectors and matrices. That slows things down. A very simple alternative would be
niceFor <- function() {
  grid <- matrix(0, nrow = length(inputVector1), ncol = length(inputVector2))
  for (i in seq_along(inputVector1))
    for (j in seq_along(inputVector2))
      grid[i, j] <- i + j
  grid
}
where the essential difference is predefining the grid object and updating its values, rather than creating new objects.
Yes, you may say that there is a dedicated function for this:
outer(inputVector1, inputVector2, `+`)
However, one needs to keep in mind that the function in the third argument must be vectorized, which is the case here. That is, addition accepts vectors:
1:2 + 3:4
# [1] 4 6
`+`(1:2, 3:4)
# [1] 4 6
However, some other functions are not vectorized. E.g.,
seq(3:4, 6:7)
# Error in seq.default(3:4, 6:7) : 'from' must be of length 1
In that case, if you use outer, take a look at ?Vectorize.
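A minimal sketch of what that looks like (seq.default is used here because seq() itself has only ... as its formal argument, which Vectorize() cannot map over):
# Vectorize() maps over the named arguments via mapply() under the hood
vseq <- Vectorize(seq.default)
vseq(from = c(3, 4), to = c(6, 9))  # a list containing 3:6 and 4:9
outer(3:5, 6:7, vseq)               # a 3 x 2 matrix whose cells are sequences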
Certain operations have even "more direct" dedicated functions. E.g., if we had
grid[i, j] <- i * j
Then you should use
inputVector1 %*% t(inputVector2)
as it would be faster and cleaner than both loops and outer.
A comparison of the three approaches mentioned before
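The timings below call L.GUEGAN() and funOuter(), which are not defined above; wrappers along these lines are assumed (hypothetical reconstructions of the question's loop and the outer() call):
library(microbenchmark)
inputVector1 <- 1:4
inputVector2 <- 1:4
L.GUEGAN <- function() {  # the question's grow-as-you-go nested loop
  grid <- NULL
  for (i in inputVector1) {
    line <- NULL
    for (j in inputVector2) {
      line <- cbind(line, i + j)
    }
    grid <- rbind(grid, line)
  }
  grid
}
funOuter <- function() outer(inputVector1, inputVector2, `+`)  # the outer() one-liner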
microbenchmark(L.GUEGAN(), niceFor(), funOuter(), times = 2000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# L.GUEGAN() 24.354 33.8645 38.933968 35.6315 40.878 295.661 2000 c
# niceFor() 4.011 4.7820 6.576742 5.4050 7.697 29.547 2000 a
# funOuter() 4.928 6.1935 8.701545 7.3085 10.619 74.449 2000 b
So the nice for loop even seems to be superior if speed matters. Notice that you could further improve it by exploiting the symmetry of your grid: you could compute only half of the matrix manually and then use those results to fill the other triangle.
Thanks to @hrbrmstr, this is what I was looking for:
outer( 1:4, 1:4, function(a,b){mapply(FUN = function(x,y){return(x+y)},a,b)} )

Magrittr pipe in R functions

Are there cases where it is not advantageous to use the magrittr pipe inside of R functions from the perspectives of (1) speed, and (2) ability to debug effectively?
There are advantages and disadvantages to using a pipe inside of a function. The biggest advantage is that it's easier to see what's happening within a function when you read the code. The biggest downsides are that error messages become harder to interpret and the pipe breaks some of R's rules of evaluation.
Here's an example. Let's say we want to make a pointless transformation to the mtcars dataset. Here's how we could do that with pipes...
library(tidyverse)
tidy_function <- function() {
  mtcars %>%
    group_by(cyl) %>%
    summarise(disp = sum(disp)) %>%
    mutate(disp = (disp^5) / 10000000000)
}
You can clearly see what's happening at every stage, even though it's not doing anything useful. Now let's look at the same code written in the Dagwood Sandwich style (nested function calls)...
base_function <- function() {
  mutate(summarise(group_by(mtcars, cyl), disp = sum(disp)), disp = (disp^5) / 10000000000)
}
Much harder to read, even though it gives us the same result...
all.equal(tidy_function(), base_function())
# [1] TRUE
The most common way to avoid using either a pipe or a Dagwood Sandwich is to save the results of each step to an intermediate variable...
intermediate_function <- function() {
  x <- mtcars
  x <- group_by(x, cyl)
  x <- summarise(x, disp = sum(disp))
  mutate(x, disp = (disp^5) / 10000000000)
}
More readable than the last function, and R will give you a little more detailed information when there's an error. Plus it obeys the traditional rules of evaluation. Again, it gives the same results as the other two functions...
all.equal(tidy_function(), intermediate_function())
# [1] TRUE
You specifically asked about speed, so let's compare these three functions by running each of them 1000 times...
library(microbenchmark)
timing <-
microbenchmark(tidy_function(),
intermediate_function(),
base_function(),
times = 1000L)
timing
#Unit: milliseconds
#expr min lq mean median uq max neval cld
#tidy_function() 3.809009 4.403243 5.531429 4.800918 5.860111 23.37589 1000 a
#intermediate_function() 3.560666 4.106216 5.154006 4.519938 5.538834 21.43292 1000 a
#base_function() 3.610992 4.136850 5.519869 4.583573 5.696737 203.66175 1000 a
Even in this trivial example, the pipe is a tiny bit slower than the other two options.
Conclusion
Feel free to use the pipe in your functions if it's the most comfortable way for you to write code. If you start running into problems or if you need your code to be as fast as humanly possible, then switch to a different paradigm.
