Optimization R code - Rcpp - r

Besides benchmarking functions, is there any tool in R for finding the biggest bottlenecks in R code?
I am often unsure how much computational gain I will get from rewriting R code in C++. For example, in a bootstrap where each iteration has to run an optimization, I do not know whether it is worth using the GSL library to optimize a log-likelihood function, since R's optim function already calls compiled code in the stats.so file. I noticed this by running stats:::C_optim.
> stats:::C_optim
$name
[1] "optim"
$address
<pointer: 0x1cb34e0>
attr(,"class")
[1] "RegisteredNativeSymbol"
$dll
DLL name: stats
Filename: /usr/lib/R/library/stats/libs/stats.so
Dynamic lookup: FALSE
$numParameters
[1] 7
attr(,"class")
[1] "ExternalRoutine" "NativeSymbolInfo"
Looking at the body of the optim function (edit(optim)), I see that it calls efficient routines implemented in C. For example, there is:
.External2(C_optim, par, fn1, gr1, method, con, lower,
upper)
Question: Rcpp users, in your projects do you normally implement everything in C++, or do you implement a set of small C++ functions to be called from an R function?
I know it's a pretty general question, but whenever I use Rcpp I end up implementing the whole function in C++ from scratch. I feel that I'm programming more in C++ than in R, and I sometimes think I should just program directly in C++.
R has many characteristics that make the language slow for various tasks. I always try to avoid loops in favor of the apply family of functions. However, I often find R very slow. So, because I can't decide what is worth optimizing, I end up implementing everything in C++.

If you (generally) code faster in R and feel like you are writing too much C++ code, I suggest the following approach:
Implement your solution in R.
Only if the R solution is not fast enough, try to optimize it.
The first step in optimization is measuring the performance, i.e. profile your code.
Once you have identified the bottlenecks, you can improve those using better R code or compiled code.
With experience you might be able to cut some corners, i.e. know from the beginning that some things in your problem will require compiled code. But that really depends on the kind of problems you are working on.
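To make the profiling step concrete, here is a minimal sketch using base R's sampling profiler (Rprof and summaryRprof). The workload is an assumption made up for illustration: a bootstrap that refits a gamma model with optim on each resample, similar in spirit to the case in the question. The names gamma_nll and boot_gamma are hypothetical. Note that very fast code may finish before the profiler records any samples, so the workload has to run for at least a few sampling intervals.

```r
# Negative log-likelihood of a gamma sample. Parameters are on the log scale
# so the optimizer can search without constraints.
gamma_nll <- function(log_par, x) {
  -sum(dgamma(x, shape = exp(log_par[1]), rate = exp(log_par[2]), log = TRUE))
}

# Hypothetical workload: refit the model on each bootstrap resample.
# Returns an n_boot x 2 matrix of (shape, rate) estimates.
boot_gamma <- function(x, n_boot = 300) {
  t(replicate(n_boot, {
    resample <- sample(x, replace = TRUE)
    exp(optim(c(0, 0), gamma_nll, x = resample)$par)
  }))
}

set.seed(1)
x <- rgamma(200, shape = 2, rate = 1)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start the sampling profiler
estimates <- boot_gamma(x)
Rprof(NULL)                        # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                 # functions ranked by time spent in their own code
```

If most of the self time lands in compiled routines such as dgamma or the optimizer itself, rewriting the surrounding R code in C++ is unlikely to buy much; if it lands in interpreted R code, that part is the candidate for optimization.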

Related

Converting cosine distance function in R to Rcpp

I've been developing an R package for single cell RNA-seq analysis, and one of the functions I used repeatedly calculates the cosine dissimilarity matrix for a given matrix of m cells by n genes. The function I wrote is as follows:
CosineDist <- function(input = NULL) {
  if (is.null(input)) { stop("You forgot to provide an input matrix") }
  dist_mat <- as.dist(1 - input %*% t(input) / (sqrt(rowSums(input^2) %*% t(rowSums(input^2)))))
  return(dist_mat)
}
This code works fine for smaller datasets, but when I run it on anything over 20,000 rows it takes forever and then crashes my R session due to memory issues. I believe that porting this to Rcpp would make it both faster and more memory efficient (I know this is a bit of a naive belief, but my knowledge of C++ in general is limited). Finally, the output of the function, though it does not have to be a distance matrix object when returned, does need to be able to be converted to that format after its generation.
How should I go about converting this function to Rcpp and then calling it as I would any of the other functions in my package? Alternatively, is this the best way to go about solving the speed / memory problem?
Hard to help you, since as the comments pointed out you are basically searching for an Rcpp intro.
I'll try to give you some hints, which I already mentioned partly in the comments.
In general, using C/C++ can provide a great speedup (depending on the task, of course). For loop-intensive, unoptimized code I have reached speedups of 100x and more.
Since adding C++ can be complicated and sometimes cause problems, before you go this way check the following:
1. Is your R code optimized?
You can make a lot of bad choices here (e.g. loops are slow in R). Just by optimizing your R code, speedups of 10x or much more can often be easily reached.
2. Are there better implementations in other packages?
Especially for helper functions or common functionality, other packages often have these already implemented. Benchmark the different existing solutions with the 'microbenchmark' package. It is easier to just use an optimized function from another R package than to do everything on your own (the other package's implementation may already be in C++). I mostly look for mainstream and popular packages, since these are better tested and unlikely to suddenly drop from CRAN.
3. Profile your code
Take a look at exactly which parts cause the speed / memory problems. It might be that you can keep most parts in R and only write a C++ function for the critical parts. Or you may find another package that has an R function implemented in C for exactly this critical part.
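As an illustration of point 1 applied to the function in the question, here is a sketch of a pure R rewrite. It normalizes each row once and then calls tcrossprod, which avoids several of the full-size temporary matrices that the original expression creates (the outer product of the row sums, its square root, and the ratio). The name cosine_dist is hypothetical. This is not a cure for the memory problem: the result is still an n x n matrix, and for 20,000 rows a single dense double matrix of that size takes about 3.2 GB.

```r
# Cosine dissimilarity between the rows of a numeric matrix.
cosine_dist <- function(input) {
  stopifnot(is.matrix(input), is.numeric(input))
  # Dividing a matrix by a vector of length nrow(input) recycles down the
  # columns, so row i is divided by its own Euclidean norm.
  unit_rows <- input / sqrt(rowSums(input^2))
  # tcrossprod(x) computes x %*% t(x) without forming t(x) explicitly.
  as.dist(1 - tcrossprod(unit_rows))
}

# Small check against the formula used in the question.
set.seed(1)
m <- matrix(rexp(50 * 10), nrow = 50)
reference <- as.dist(1 - m %*% t(m) / sqrt(rowSums(m^2) %*% t(rowSums(m^2))))
all.equal(as.matrix(cosine_dist(m)), as.matrix(reference))
```

Because the heavy lifting here is a matrix product that already runs in compiled BLAS code, a hand-written Rcpp loop is not guaranteed to be faster; profiling both versions on realistic data is the way to find out.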
In the end, I'd say I prefer using Rcpp/C++ over C code; I think it is the easier way to go. For the Rcpp learning part, you have to go with a dedicated tutorial (and not an SO question).

Parallel programming for all R packages

Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments such as R-revolution and parallel programming packages, but they seem to have specialised functions which replace the most popular functions (linear programming etc.). However, one of the great things about R is the huge number of specialised packages which pop up every day and make complex and time-consuming analysis very easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculation and comparison and finally sort out the output. As far as I understand, you need to define which parts of a function can be run in parallel, so this is probably why most specialised R packages don't have this functionality and cannot have it unless the code is edited.
Are there any plans (or any packages) to enable all the most popular R functions to run in parallel, so that all the less popular functions that call them can also run in parallel? For example, the package difR uses the glm function for most of its functions; if the glm function were enabled to run in parallel (or re-written and then released in a new R version) for all multi-processor machines, then there would be no need to re-write the difR package, and it could run some of its most burdensome procedures in parallel on a Windows PC.
I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for those functions that can be easily parallelized: What if you have a call stack of several functions that offer parallel computation (e.g. you are bootstrapping some model fitting, the model fitting may already offer parallelization and low level linear algebra can be implicitly parallel)? You need to estimate (or choose manually) at which level explicit parallelization should be done. In addition, you possibly have implicit parallelization, so you need to trade off between these.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Using this can (depending on your system) be as easy as telling your package manager to install the optimized BLAS and R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve etc. will profit.
See e.g. Dirk Eddelbüttel's gcbd package and its vignette, and also discussions of how to use GotoBLAS2 / OpenBLAS.
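To check which BLAS an R session is actually linked against, and to get a rough feel for its speed, something like the following sketch can be used. It relies on extSoftVersion() and La_library(), which are available in base R from version 3.4.0 on; the matrix size is an arbitrary assumption and the timing will differ between machines and BLAS builds.

```r
# Report the BLAS and LAPACK libraries this R session is using.
blas_info <- c(BLAS = extSoftVersion()[["BLAS"]], LAPACK = La_library())
print(blas_info)

# Time a BLAS-backed operation. crossprod(a) computes t(a) %*% a.
set.seed(42)
n <- 500
a <- matrix(rnorm(n * n), n, n)
timing <- system.time(prod_mat <- crossprod(a))
print(timing)
```

Running the same timing before and after switching to an optimized BLAS gives a quick indication of whether the switch took effect.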
How to parallelize a certain problem is often non-trivial. Therefore, a specific implementation has to be made in each and every case, in this case for each R package. So, I do not think a general implementation of parallel processing in R will be made, or is even possible.
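For contrast with the implicit BLAS route, here is a sketch of explicit parallelization with the parallel package that ships with R. It parallelizes at the bootstrap level and leaves glm itself untouched, which is the kind of per-problem decision discussed above. The data are simulated and the function names fit_one_resample and boot_glm_coefs are hypothetical; a socket cluster is used because it also works on Windows.

```r
library(parallel)

# Worker function, defined at top level so that only its arguments
# (not a large enclosing environment) are shipped to the workers.
fit_one_resample <- function(i, d) {
  rows <- sample(nrow(d), replace = TRUE)
  coef(glm(y ~ x, data = d[rows, ], family = binomial))
}

# Run the bootstrap replicates on a small socket cluster.
boot_glm_coefs <- function(data, n_boot = 50, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = 123)  # reproducible random streams per worker
  fits <- parLapply(cl, seq_len(n_boot), fit_one_resample, d = data)
  do.call(rbind, fits)
}

# Simulated logistic regression data (an assumption for illustration).
set.seed(1)
sim <- data.frame(x = rnorm(200))
sim$y <- rbinom(200, 1, plogis(0.5 + sim$x))

coefs <- boot_glm_coefs(sim)
apply(coefs, 2, sd)  # bootstrap standard errors of intercept and slope
```

If the model fitting inside each replicate were itself parallel (for example through a multithreaded BLAS), the two levels would compete for the same cores, which is the calibration problem described above.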

Converting models in Matlab/R to C++/Java

I would like to convert an ARIMA model developed in R using the forecast library to Java code. Note that I need to implement only the forecasting part. The fitting can be done in R itself. I am going to look at the predict function and translate it to Java code. I was just wondering if anyone else had been in a similar situation before and managed to successfully use a Java library for the same.
Along similar lines, and perhaps this is a more general question without a concrete answer: what is the best way to deal with situations where model building can be done in Matlab/R but the prediction/forecasting needs to be done in Java/C++? I have been encountering this situation more and more often. I guess you have to bite the bullet and write the code yourself, and this is generally not as hard as writing the fitting/estimation yourself. Any advice on the topic would be helpful.
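To give a sense of how small the forecasting part can be once the model is fitted, here is a sketch in R for the simplest case, an AR(1) model with a mean. It uses stats::arima on the built-in lh series rather than the forecast package (an assumption made to keep the example dependency-free), and the helper forecast_ar1 is hypothetical. The recursion inside it is the piece that would have to be ported to Java or C++; models with MA terms or differencing need more state than this.

```r
# Fit in R; only the coefficients need to leave R.
fit <- arima(lh, order = c(1, 0, 0))
phi <- coef(fit)[["ar1"]]
mu  <- coef(fit)[["intercept"]]  # for arima() this is the series mean

# Point forecasts for an AR(1) model: each step pulls the previous
# value toward the mean by a factor phi.
forecast_ar1 <- function(last_value, phi, mu, horizon) {
  out <- numeric(horizon)
  prev <- last_value
  for (h in seq_len(horizon)) {
    prev <- mu + phi * (prev - mu)
    out[h] <- prev
  }
  out
}

manual    <- forecast_ar1(tail(as.numeric(lh), 1), phi, mu, horizon = 5)
reference <- as.numeric(predict(fit, n.ahead = 5)$pred)
all.equal(manual, reference, tolerance = 1e-6)
```

Comparing the hand-written recursion against predict() in R first, as done in the last line, gives a reference to test the ported code against.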
You write about 'R or Matlab' to 'C++ or Java'. This gives 2 x 2 choices which is too many degrees of freedom for my taste. So allow me to concentrate on C++ as the target.
Let's consider a simpler case: Prototyping in R, and deploying in C++. If and when the R package you use is actually implemented in C or C++, this becomes pretty easy. You "merely" need to disentangle the routine you are after from its other dependencies (header files, defines, data structures, ...) and provide it with the data and parameters needed. I have done that in the past for production systems.
Here, you talk about the forecast package. This happens to depend on the RcppArmadillo package which itself brings the nice Armadillo C++ library to R. So chances are you can in fact re-write this as a self-contained unit.
Armadillo is also interesting when you want to port Matlab to C++ as it is written to help with exactly that task in mind. I have ported some relatively extensive Matlab code to C++ and reaped a substantial speed gain.
I'm not sure whether this is possible in R, but in Matlab you can interact with your Matlab code from Java - see http://www.cs.virginia.edu/~whitehouse/matlab/JavaMatlab.html. This would enable you to leave all the forecasting code in Matlab and have e.g. an interface written in Java.
Alternatively, you might want to have predictive code written in Java so that you can produce a model and then distribute a program that uses the model without having a dependency on Matlab. The Matlab compiler may be useful here, but I've never used it.
A final simple, if messy, way of interacting between Matlab and Java would be (on Linux) to use pseudoterminals, where you would have a pty/tty pair to interface Java and Matlab. In this case you would send data from Java to Matlab, and have Matlab return the forecasting results. I expect this would also work in R, but I don't know the syntax.
In general though, reimplementing the code is a decent solution and probably quicker than learning how to interface Java and Matlab or create Matlab libraries.
Some further information on the answer given by Richante: Matlab has some really nice capabilities for interop with compiled languages such as C/C++, C#, and Java. In your particular case you might find the toolbox Matlab Builder JA to be particularly relevant. It allows you to export your Matlab code directly to Java, meaning you can directly call code that you've constructed during your model-building phase in Matlab from Java.
More information from the Mathworks here.
I am also concerned with converting "R to Java" so will speak to that part.
As Vincent Zooneykind said in his comment, the PMML library in R makes sense for model export in general, but "forecast" is not a supported library as of yet.
An alternative is to use something like https://www.opencpu.org/ to make a call to R from your Java program. It exposes the R code on an HTTP server. You can then call it with parameters as with a normal HTTP call and return what is needed, using java.net.HttpURLConnection or one of the other HTTP libraries available in Java.
Pros: Separation of concerns, no need to re-write the R code
Cons: Invoking an R server in your live process so need to make sure that is handled robustly

Lua alternative to optim()

I'm currently looking for a Lua alternative to the R programming language's optim() function. Does anyone know how to deal with this?
http://numlua.luaforge.net/ looks interesting but doesn't seem to have minimization. The most promising lead seems to be a Lua wrapper for GSL, which has a variety of multidimensional minimization algorithms included.
With derivatives
- BFGS (method="BFGS" in optim) and two conjugate gradient methods (Fletcher-Reeves and Polak-Ribiere) which are two of the three options available for method="CG" in optim.
Without derivatives
- the Nelder-Mead simplex (method="Nelder-Mead", the default in optim).
More specifically, see here for the Lua shell documentation covering minimization.
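For reference, this is how the corresponding methods are selected on the R side, which may help when mapping them to the GSL routines. The sketch uses the Rosenbrock function as a test problem (an assumption for illustration; the names rosenbrock and rosenbrock_grad are hypothetical). In optim, the conjugate gradient variant is chosen through control$type: 1 is Fletcher-Reeves and 2 is Polak-Ribiere.

```r
# Rosenbrock test function and its gradient; the minimum is at c(1, 1).
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
rosenbrock_grad <- function(p) {
  c(-2 * (1 - p[1]) - 400 * p[1] * (p[2] - p[1]^2),
    200 * (p[2] - p[1]^2))
}
start <- c(-1.2, 1)

# Without derivatives: Nelder-Mead simplex (the default method).
fit_nm <- optim(start, rosenbrock)

# With derivatives: BFGS and two conjugate gradient variants.
fit_bfgs  <- optim(start, rosenbrock, rosenbrock_grad, method = "BFGS")
fit_cg_fr <- optim(start, rosenbrock, rosenbrock_grad, method = "CG",
                   control = list(type = 1))  # Fletcher-Reeves
fit_cg_pr <- optim(start, rosenbrock, rosenbrock_grad, method = "CG",
                   control = list(type = 2))  # Polak-Ribiere

# Compare estimates and convergence codes (0 means converged).
rbind(nelder_mead = c(fit_nm$par, fit_nm$convergence),
      bfgs        = c(fit_bfgs$par, fit_bfgs$convergence),
      cg_fr       = c(fit_cg_fr$par, fit_cg_fr$convergence),
      cg_pr       = c(fit_cg_pr$par, fit_cg_pr$convergence))
```

The conjugate gradient runs may stop at the default iteration limit on this problem without having converged, which is worth keeping in mind when comparing against another implementation.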
I agree with @Zack that you should try to use existing implementations if at all possible, and that you might need a little bit more background knowledge to know which algorithms will be useful for your particular problems ...
R's implementation of optim isn't actually written in R. If you type "optim" with no parentheses at the prompt, it'll dump out the definition of the function, and you can see that after some error checking and argument shuffling it hands off to a compiled routine (an .Internal call in older versions of R, .External2(C_optim, ...) in more recent ones; coded in C and/or Fortran) to do all the real work.
So your best bet is to find a C library for mathematical optimization -- sorry, I have no recommendations -- and wrap that into Lua. I doubt anyone has written native-Lua code for this, and I would not recommend trying to code it yourself; doing mathematical optimization efficiently is still an active domain of basic research, and the best-so-far algorithms are decidedly nontrivial to implement.

Rllvm and compiler packages: R compilation

This is a fairly general question about the future of R: is there any hope of seeing a merger of compiler and Rllvm (from Omegahat), or another JIT compilation scheme for R (I know there is Ra, but it has not been updated recently)?
In my tests the speed gains from compiler are marginal for "complicated" functions...
What matters isn't how complicated a function is but what kinds of computations it performs. The compiler will make the most difference for functions dominated by interpreter overhead, such as ones that perform mostly simple operations on scalar or other small data. In cases like that I have seen a factor of 3 for artificial examples and a bit better than a factor of 2 for some production code. Functions that spend most of their time in operations implemented in native code, like linear algebra operations, will see little benefit.
This is just the first release of the compiler and it will evolve over time. LLVM is one of several possible directions we will look at, but probably not for a while. In any case, I would expect using something like LLVM to provide further improvements in cases where the current compiler already makes a difference, but not to add much in cases where it does not.
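The distinction between the two kinds of functions can be tried out with the compiler package itself. The sketch below byte-compiles a scalar loop with cmpfun and compares it with a vectorized version whose work happens in native code; the function names are hypothetical. One caveat: since R 3.4.0 the JIT is enabled by default, so on a current R the "uncompiled" loop is byte-compiled automatically after a couple of calls and the gap may not show up unless the JIT is switched off with compiler::enableJIT(0).

```r
library(compiler)

# Dominated by interpreter overhead: many small scalar operations.
scalar_sum_sq <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2  # i^2 is double, so no integer overflow
  total
}

# Spends its time in native code: one vectorized call.
vector_sum_sq <- function(n) sum(as.numeric(seq_len(n))^2)

# Explicitly byte-compiled version of the scalar loop.
scalar_sum_sq_cmp <- cmpfun(scalar_sum_sq)

n <- 1e5
results <- c(loop     = scalar_sum_sq(n),
             compiled = scalar_sum_sq_cmp(n),
             vector   = vector_sum_sq(n))
stopifnot(length(unique(results)) == 1)  # all three agree

# Rough timings; results vary by machine and R version.
system.time(for (k in 1:20) scalar_sum_sq(n))
system.time(for (k in 1:20) scalar_sum_sq_cmp(n))
system.time(for (k in 1:20) vector_sum_sq(n))
```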
(Moving from a comment to an answer ...)
This sounds more like a question for the R development mailing list. Based on my general impressions I would say "probably not". Are your complicated functions already based on heavily vectorized (and hence efficient) functions? I think a more promising direction for not-so-easily-automatically-optimized situations is the increased simplicity of embedding C++ etc. (i.e. Rcpp), inline if necessary.

Resources