The Vectorize() and the apply() functions in R can often be used to accomplish the same goal. I usually prefer vectorizing a function for readability reasons, because the main calling function is related to the task at hand while sapply is not. It is also useful to Vectorize() when I am going to be using that vectorized function multiple times in my R code. For instance:
a <- 100
b <- 200
c <- 300
varnames <- c('a', 'b', 'c')
getv <- Vectorize(get)
getv(varnames)
vs
sapply(varnames, get)
However, at least on SO I rarely see examples with Vectorize() in the solution, only apply() (or one of it's siblings). Are there any efficiency issues or other legitimate concerns with Vectorize() that make apply() a better option?
Vectorize is just a wrapper for mapply. It just builds you an mapply loop for whatever function you feed it. Thus there are often easier things do to than Vectorize() it and the explicit *apply solutions end up being computationally equivalent or perhaps superior.
Also, for your specific example, you've heard of mget, right?
To add to Thomas's answer. Maybe also speed?
# install.packages(c("microbenchmark", "stringr"), dependencies = TRUE)
require(microbenchmark)
require(stringr)
Vect <- function(x) { getv <- Vectorize(get); getv(x) }
sapp <- function(x) sapply(x, get)
mgett <- function(x) mget(x)
res <- microbenchmark(Vect(varnames), sapp(varnames), mget(varnames), times = 15)
## Print results:
print(res)
Unit: microseconds
expr min lq median uq max neval
Vect(varnames) 106.752 110.3845 116.050 122.9030 246.934 15
sapp(varnames) 31.731 33.8680 36.199 36.7810 100.712 15
mget(varnames) 2.856 3.1930 3.732 4.1185 13.624 15
### Plot results:
boxplot(res)
Related
How should I understand the parallelism built into data.table objects? From the getDTthreads function documentation, it seems that shared memory parallelism is employed using OpenMP. That seems fairly low level, and I imagine that it only works for a certain subset
of overloaded functions and operators.
Or, is data.table somehow smart enough to split work for even more complicated expressions? More specifically, to parallelize a j-expression, what restrictions do I need to take into account?
Not to run too much afoul of Stack Overflow's question policy, here is an example. I often want to apply a function to each object in a huge data.table. For example,
library(data.table)
n <- 100000L
dt <- data.table(a = rnorm(n), b = rnorm(n))
dt[, c := sapply(a, function(x) paste(x, 'silly example')]
Would the sapply call in the j-expression work on chunks of column a in parallel? Or is it a plain old base R sapply, which works sequentially?
If the latter is the case, then is embedding one of R's many parallel computing frameworks inside the j-expression a good approach? For example, can I safely and efficiently call foreach, future, et al. in the j-expression?
From ?setDTthreads:
Internally parallelized code is used in the following places:
between.c - between()
cj.c - CJ()
coalesce.c - fcoalesce()
fifelse.c - fifelse()
fread.c - fread()
forder.c, fsort.c, and reorder.c - forder() and related
froll.c, frolladaptive.c, and frollR.c - froll() and family
fwrite.c - fwrite()
gsumm.c - GForce in various places, see GForce
nafill.c - nafill()
subset.c - Used in [.data.table subsetting
types.c - Internal testing usage
My understanding is that you should not expect data.table to make use of multithreading outside of the above use cases. Note that [.data.table uses multithreading for subsetting only, i.e., in i-expressions but not j-expressions. That is presumably just to speed up relational and logical operations, as in x[!is.na(a) & a > 0].
In a j-expression, sum and sapply are still just base::sum and base::sapply. You can test this with a benchmark:
library("data.table")
setDTthreads(4L)
x <- data.table(a = rnorm(2^25))
microbenchmark::microbenchmark(sum(x$a), x[, sum(a)], times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
sum(x$a) 51.61281 51.68317 51.95975 51.84204 52.09202 56.67213 1000
x[, sum(a)] 51.78759 51.89054 52.18827 52.07291 52.33486 61.11378 1000
x <- data.table(a = seq_len(1e+04L))
microbenchmark::microbenchmark(sapply(x$a, paste, "is a good number"), x[, sapply(a, paste, "is a good number")], times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(x$a, paste, "is a good number") 14.07403 15.7293 16.72879 16.31326 17.49072 45.62300 1000
x[, sapply(a, paste, "is a good number")] 14.56324 15.9375 17.03164 16.48971 17.69045 45.99823 1000
where it is clear that simply putting code into a j-expression does not improve performance.
data.table does recognize and handle certain constructs exceptionally. For instance, data.table uses its own radix-based forder instead of base::order when it sees x[order(...)]. (This feature is somewhat redundant now that users of base::order can request data.table's radix sort by passing method = "radix".) I haven't seen a "master list" of such exceptions.
As for whether using, e.g., parallel::mclapply inside of a j-expression can have performance benefits, I think the answer (as usual) depends on what you are trying to do and the scale of your data. Ultimately, you'll have to do your own benchmarks and profiling to find out. For example:
library("parallel")
cl <- makePSOCKcluster(4L)
microbenchmark::microbenchmark(x[, sapply(a, paste, "is a good number")], x[, parSapply(cl, a, paste, "is a good number")], times = 1000L)
stopCluster(cl)
Unit: milliseconds
expr min lq mean median uq max neval
x[, sapply(a, paste, "is a good number")] 14.553934 15.982681 17.105667 16.585525 17.864623 48.81276 1000
x[, parSapply(cl, a, paste, "is a good number")] 7.675487 8.426607 9.022947 8.802454 9.334532 25.67957 1000
So it is possible to see speed-up, though sometimes you pay the price in memory usage. For small enough problems, the overhead associated with R-level parallelism can definitely outweigh the performance benefits.
You'll find good thread about integrating parallel and data.table (including reasons not to) here.
I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about within base and not dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point. I am now consistently seeing contradicting conclusions.
n = 10000
p = 100
test_matrix <- matrix(runif(n*p), n, p)
test_df <- as.data.frame(test_matrix)
benchmark <- microbenchmark(
colMeans(test_df),
colMeans(as.matrix(test_df)),
sapply(test_df, mean),
vapply(test_df, mean, 0),
colMeans(test_matrix),
apply(test_matrix, 2, mean)
)
Unit: microseconds
expr min lq mean median uq max neval
colMeans(test_df) 3099.941 3165.8290 3733.024 3241.345 3617.039 11387.090 100
colMeans(as.matrix(test_df)) 3091.634 3158.0880 3553.537 3241.345 3548.507 8531.067 100
sapply(test_df, mean) 2209.227 2267.3750 2723.176 2338.172 2602.289 10384.612 100
vapply(test_df, mean, 0) 2180.153 2228.2945 2611.982 2270.584 2514.123 7421.356 100
colMeans(test_matrix) 904.307 915.0685 1020.085 939.422 1002.667 2985.911 100
apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009 100
For a matrix, colMeans() torches apply() That is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead associated with converting the data frame back into a matrix.
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop off. Obviously this makes an assumption about the input the user pushes in, but that is not the point here.
colMeans2 <- function(myobject) {
if (typeof(myobject) == "double") {
colMeans(myobject)
} else if (typeof(myobject) == "list") {
vapply(myobject, mean, 0)
} else {
stop("what is this")
}
}
For Reference
Here are two other posts I could find, both somewhat related and mentioning how colMeans() should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?
I'm trying (by using R) to build a "grid" in a matrix based on two input vectors. So, the idea is to avoid nested loop like this:
inputVector1=1:4
inputVector2=1:4
grid=NULL
for(i in inputVector1){
line=NULL
for(j in inputVector2){
cellValue=i+j # Instead of i+j it can be anything like taking a value in a dataframe
line=cbind(line,cellValue)
}
grid=rbind(grid,line)
}
Is there a dedicated function in R to do this kind of job faster and simpler ? I know there is apply family functions but I didn't found a proper way to do it (without combining multiple apply family functions). Thank you for the help.
Loops are kind of simple and they are not necessarily slow. However, it depends on how to use those loops. In your code (I call your approach L.GUEGAN(), for further reference), for instance, you don't exploit the fact that you know the size of your ultimate grid and you keep expanding vectors, matrices. That slows things down. A very simple alternative would be
niceFor <- function() {
grid <- matrix(0, nrow = length(inputVector1), ncol = length(inputVector2))
for(i in seq_along(inputVector1))
for(j in seq_along(inputVector2))
grid[i, j] <- i + j
grid
}
where the essential difference is predefining the grid object and updating its values, rather than creating new objects.
Yes, you may say that there is a dedicated function for what:
outer(inputVector1, inputVector2, `+`)
However, one needs to keep in mind that the function in the third argument needs to be vectorized, which is the case in this situation. That is, vectors are allowed when using addition
1:2 + 3:4
# [1] 4 6
`+`(1:2, 3:4)
# [1] 4 6
However, some other functions are not vectorized. E.g.,
seq(3:4, 6:7)
# Error in seq.default(3:4, 6:7) : 'from' must be of length 1
In that case, if you use outer, take a look at ?Vectorize.
Certain operations have even "more direct" dedicated functions. E.g., if we had
grid[i, j] <- i * j
Then you should use
inputVector1 %*% t(inputVector2)
as it would be faster and cleaner than both loops and outer.
A comparison of the three approaches mentioned before
microbenchmark(L.GUEGAN(), niceFor(), funOuter(), times = 2000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# L.GUEGAN() 24.354 33.8645 38.933968 35.6315 40.878 295.661 2000 c
# niceFor() 4.011 4.7820 6.576742 5.4050 7.697 29.547 2000 a
# funOuter() 4.928 6.1935 8.701545 7.3085 10.619 74.449 2000 b
So, the nice for loop seems even to be superior if speed matters. Notice that you could further improve it by exploiting symmetry of your grid: you could compute only half of the matrix manually and then use your results to fill the other triangle.
Thanks to #hrbrmstr this is what I was looking for:
outer( 1:4, 1:4, function(a,b){mapply(FUN = function(x,y){return(x+y)},a,b)} )
For loops are know to be quite slow in R. I would like to know if the same is true for while loop.
If so, is there a way to optimize while loop in R? For example for the for loop the apply functions play a good job but I don't know an analogue for the while loop.
Even Hadley in his book (Advanced R) is quite vague about how to optimize a while loop.
"For loops are know to be quite slow in R." That's simply wrong. for loops are fast. What you do inside the loop is slow (in comparison to vectorized operations). I would expect a while loop to be slower than a for loop since it needs to test a condition before each iteration. Keep in mind that R is an interpreted language, i.e., there are no compiler optimizations. Also, function calls in R are not slow per se, but still there is a lot going on during a function call and that adds up. Vectorized operations avoid repeated function calls.
It's hard to come up with a fair comparison between both loop construct, but here we go:
library(microbenchmark)
microbenchmark(
for (i in seq_len(1e6)) i,
{i <- 1; while (i <= 1e6) {i <- i+1}},
times = 10, unit = "relative"
)
#Unit: relative
# expr min lq mean median uq max neval cld
# for (i in seq_len(1e+06)) i 1.000000 1.000000 1.00000 1.000000 1.000000 1.00000 10 a
# { i <- 1 while (i <= 1e+06) { i <- i + 1 } } 8.987293 8.994548 9.14089 9.019795 9.036116 10.07227 10 b
The while loop needs to test the condition, assign to i and call + at each iteration.
If you must use a while loop (often it can be avoided) and performance is important, the best solution is implementing it as compiled code which can be called from R. The Rcpp package makes this very easy. In some cases byte compilation as offered by the compiler package can also speed up R loops, but (well written) actual compiled code will always be faster.
I am trying to write R code which acts as a "moving window", just with memory (state). I have figured out (thanks to this question) how to apply a function to subsequent tuples of elements. For example, if I wish to write a (simple) moving average with a typical period 4, I would do the following:
mapply(myfunc, x[1:(length(x)-4)], x[2:(length(x)-3)], x[3:(length(x)-2)], x[4:(length(x)-1)])
Where myfunc is a function with 4 arguments, which calculates their mean (I cannot use mean, as it expects only 1 argument, and I don't know how to make the 4 arguments a single vector).
That's quite cumbersome, though, and if the typical period is 100, say, I am not sure how to do it.
So here's my first question: how do I generalize this?
But here's another issue: suppose I wish the applied function to be able to save state. A simple example would be to keep record of how many values it was applied on so far. Another example is the exponential moving average (EMA), which is not really a window function, but instead a function which works on single values but which keeps state (the last resulted mean).
How can I write a function which when applied to a vector, works on its values one by one, returning a vector of the same length, which is able to retain its last output every time, or save any other "state" during its calculations? In Python, for example, I'd use classes for that, but that's quite difficult in R.
Important note: I am not interested in auxiliary R packages like zoo or TTR to do the work for me. I am trying to learn R, and in any case the functions I wish to write, while having similarities with MA or EMA, are custom, and do not exist in any of these packages.
Regarding your first question,
n <- length(x)
k <- 4
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
do.call("mapply", c("myfunc", split(r, 1:k)))
Regarding the second question, Reduce can be used to iterate over a vector saving state.
For things like this you should consider using a plain for loop:
x <- runif(10000)
k <- 100
n <- length(x)
res <- numeric(n - k)
library(microbenchmark)
microbenchmark(times=5,
for(i in k:n) res[i - k + 1] <- sum(vec[i:(i + k)]),
{
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
gg <- do.call("mapply", c("sum", split(r, 1:k)))
},
flt <- filter(x, rep(1, k))
)
Produces:
Unit: milliseconds
min lq median uq max neval
for 163.5403 164.4929 165.2543 166.6315 167.0608 5
embed/mapply 1255.2833 1307.3708 1338.2748 1341.5719 1405.1210 5
filter 6.7101 6.7971 6.8073 6.8161 6.8991 5
Now, the results are not identical and I don't pretend to understand exactly what GGrothendieck is doing with embed, but generally speaking for loops are just as fast as *pply functions so long as you initialize your result vectors first. Windowed calculations don't lend themselves well to vectorization, so might as well use a for loop.
EDIT: as several have pointed out in comments, there appears to be an internally implemented function to do (filter) this that is quite a bit faster, so that seems to be the best option (though you should confirm it actually does what you want as again, the results are not exactly identical and I am not personally familiar with the function; in it's default configuration it appears to do a rolling weighted sum, or sum if weights are 1, with a centered window).