Interaction between C and R

I have a couple of naive questions:
1) When I want to call C from R, I have to write some C code. But sometimes that C code needs to call a function I wrote in R myself -- can I call that R function from within the C function?
2) If 1) is feasible: if I call an R function repeatedly, say 1000 times in a loop, will moving the loop into C and calling the R function from there speed things up?

Well-put question. A quick take:
Yes, you can. It is (as so many things) possible, yet a little tedious, with the C-based API that R offers -- but (in our opinion at least) much easier with the C++ layer we put on top of it via Rcpp.
As for 2): that is the critical point. If the R code is the bottleneck, it remains the bottleneck when you call it from C or C++, as it does not matter where it is called from. What matters is its relative speed.
The rcpp-devel list (links are on the Rcpp page) has a lot of related discussions; you can also find a lot here on SO under the [rcpp] tag.
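
To make point 1 concrete, here is a minimal sketch (the names call_r_from_cpp and my_r_fun are just placeholders) of passing an R function into C++ via Rcpp and calling it there. Note that the call dispatches back through the R interpreter, which is exactly the point about bottlenecks made above:

library(Rcpp)
cppFunction('
double call_r_from_cpp(Function f, NumericVector x) {
    // f is an R function handed over from R; invoking it goes back through
    // the R interpreter, so it runs exactly as fast (or slow) as in pure R
    return as<double>(f(x));
}')

my_r_fun <- function(v) sum(v) / length(v)   # some function written in R
call_r_from_cpp(my_r_fun, c(1, 2, 3))        # returns 2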

Related

Converting cosine distance function in R to Rcpp

I've been developing an R package for single cell RNA-seq analysis, and one of the functions I used repeatedly calculates the cosine dissimilarity matrix for a given matrix of m cells by n genes. The function I wrote is as follows:
CosineDist <- function(input = NULL) {
  if (is.null(input)) { stop("You forgot to provide an input matrix") }
  dist_mat <- as.dist(1 - input %*% t(input) / (sqrt(rowSums(input^2) %*% t(rowSums(input^2)))))
  return(dist_mat)
}
This code works fine for smaller datasets, but when I run it on anything over 20,000 rows it takes forever and then crashes my R session due to memory issues. I believe that porting this to Rcpp would make it both faster and more memory efficient (I know this is a bit of a naive belief, but my knowledge of C++ in general is limited). Finally, the output of the function, though it does not have to be a distance matrix object when returned, does need to be able to be converted to that format after its generation.
How should I go about converting this function to Rcpp and then calling it as I would any of the other functions in my package? Alternatively, is this the best way to go about solving the speed / memory problem?
It is hard to help you here since, as the comments pointed out, you are basically asking for an Rcpp introduction.
I'll try to give you some hints, which I already mentioned partly in the comments.
In general, using C/C++ can provide a great speedup (depending on the task, of course). For loop-intensive, unoptimized code I have reached speedups of 100x and more.
Since adding C++ can be complicated and sometimes causes problems, check the following before you go this way:
1. Is your R code optimized?
You can make a lot of bad choices here (e.g. loops are slow in R). Just by optimizing your R code, speedups of 10x or much more can often be reached easily.
2. Are there better implementations in other packages?
Especially for helper functions or common functionality, other packages often have these implemented already. Benchmark the different existing solutions with the 'microbenchmark' package. It is easier to just use an optimized function from another R package than to do everything on your own (maybe the other package's implementation is already in C++). I mostly look for mainstream and popular packages, since these are better tested and unlikely to suddenly drop from CRAN.
3. Profile your code
Take a look at which parts exactly cause the speed / memory problems. It might be that you can keep parts in R and only create a C++ function for the critical parts. Or you may find another package with an R function that is implemented in C for exactly this critical part.
In the end I'd say I prefer using Rcpp/C++ over plain C code; I think it is the easier way to go. For the Rcpp learning part you will have to go with a dedicated tutorial (and not an SO question).
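
If, after those checks, Rcpp still looks worthwhile, here is a rough sketch (my own assumption, not tested on your data) of what the cosine dissimilarity could look like with RcppArmadillo. Note that it still materialises the full m-by-m matrix, so on its own it will not fix the memory problem:

library(Rcpp)
cppFunction(depends = "RcppArmadillo", '
arma::mat cosine_dissim(const arma::mat& X) {
    // row norms: sqrt of the row-wise sums of squares
    arma::vec norms = arma::sqrt(arma::sum(arma::square(X), 1));
    // similarity = (X X^T) divided elementwise by the outer product of the norms
    arma::mat sim = (X * X.t()) / (norms * norms.t());
    return 1.0 - sim;
}')

# d <- as.dist(cosine_dissim(input))   # same post-processing as the pure R version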

Optimization R code - Rcpp

In addition to benchmarking functions, is there any tool in R that can find the biggest bottlenecks in R code?
I am often very undecided about the computational gain I will get from rewriting R code in C++. For example, in a bootstrap where each iteration needs to do an optimization, I do not know whether it is useful to use the GSL library to optimize a log-likelihood function, since R's optim function already calls compiled code in stats.so. I noticed this by looking at stats:::C_optim.
> stats:::C_optim
$name
[1] "optim"
$address
<pointer: 0x1cb34e0>
attr(,"class")
[1] "RegisteredNativeSymbol"
$dll
DLL name: stats
Filename: /usr/lib/R/library/stats/libs/stats.so
Dynamic lookup: FALSE
$numParameters
[1] 7
attr(,"class")
[1] "ExternalRoutine" "NativeSymbolInfo"
Looking at the body of the optim function (edit(optim)), I see that it calls efficient routines implemented in C. For example, there is:
.External2(C_optim, par, fn1, gr1, method, con, lower, upper)
A question for Rcpp users: in your projects, do you normally try to implement everything in C++, or do you implement a set of small C++ functions to be used inside an R function?
I know it's a pretty general question, but whenever I use Rcpp I always end up implementing the whole function in C++ from scratch. I feel that I'm programming more in C++ than in R, and I sometimes think that I might as well program directly in C++.
R has many characteristics that make the language slow for various tasks. I always try to avoid loops in favour of the apply family of functions. Still, I often find R very slow, and because I'm undecided on what is worth optimizing, I end up implementing everything in C++.
If you (generally) code faster in R and feel like you are writing too much C++ code, I suggest the following approach:
Implement your solution in R.
Only if the R solution is not fast enough, try to optimize it.
The first step in optimization is measuring the performance, i.e. profile your code.
Once you have identified the bottlenecks, you can improve those using better R code or compiled code.
With experience you might be able to cut some corners, i.e. know from the beginning that some things in your problem will require compiled code. But that really depends on the kind of problems you are working on.
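
As a concrete starting point for the profiling step, base R ships Rprof(), and the microbenchmark package is handy for comparing candidate implementations head to head. A minimal sketch (my_bootstrap() stands in for whatever code you actually want to inspect):

Rprof("prof.out")             # start the sampling profiler
my_bootstrap()                # placeholder for the code under investigation
Rprof(NULL)                   # stop profiling
summaryRprof("prof.out")      # which functions dominate the run time?

library(microbenchmark)       # compare two candidate implementations directly
x <- runif(1e6, 0, 10)
microbenchmark(ifelse(x < 5, 0, x), replace(x, x < 5, 0), times = 50)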

Is there performance difference between NumericVector and vector<double>?

Suppose one person uses NumericVector and another uses std::vector<double> in their Rcpp code. Is there a notable difference between the two usages, especially in performance?
Generally, yes.
All of the Rcpp(11) types are "thin proxy objects" (which we talk about in several places, talks, slide decks, my book, ...) around the underlying SEXP objects. That means no copies are made when you go from R to C++, and when you go back from C++ to R.
Using standard C++ types like std::vector<T>, however, generally requires a copy.
So you should easily see a difference on some trivial test script as N increases enough.
Personally speaking, I generally like the "clean" use of C++ / STL types for code that "feels more C++-ish" but remain aware of the performance penalty. Often it does not really matter as the C++ solution is faster than what you replace in a pure R solution.
But your question was whether one dominates the other, and the answer to that is a clear yes.
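
A quick way to see this yourself (a sketch; the two functions below differ only in the interface type, and the std::vector version pays for a copy on the way in):

library(Rcpp)
cppFunction('double sum_rcpp(NumericVector x) {
    double s = 0;
    for (int i = 0; i < x.size(); i++) s += x[i];    // x is a thin proxy: no copy on entry
    return s;
}')
cppFunction('double sum_std(std::vector<double> x) {
    double s = 0;
    for (size_t i = 0; i < x.size(); i++) s += x[i]; // the data were copied into x first
    return s;
}')

library(microbenchmark)
x <- rnorm(1e7)
microbenchmark(sum_rcpp(x), sum_std(x), times = 20)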

Rllvm and compiler packages: R compilation

This is a fairly general question about the future of R: Is there any hope of seeing a merger of compiler and Rllvm (from Omegahat), or another JIT compilation scheme for R (I know there is Ra, but it has not been updated recently)?
In my tests the speed gain from the compiler is marginal for "complicated" functions...
What matters isn't how complicated a function is but what kinds of computations it performs. The compiler will make the most difference for functions dominated by interpreter overhead, such as ones that perform mostly simple operations on scalar or other small data. In cases like that I have seen a factor of 3 for artificial examples and a bit better than a factor of 2 for some production code. Functions that spend most of their time in operations implemented in native code, like linear algebra operations, will see little benefit.
This is just the first release of the compiler and it will evolve over time. LLVM is one of several possible directions we will look at, but probably not for a while. In any case, I would expect using something like LLVM to provide further improvements in cases where the current compiler already makes a difference, but not to add much in cases where it does not.
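
For reference, the compiler package is easy to try on exactly this kind of interpreter-bound code. A toy sketch (note that recent R versions enable the JIT by default, so the explicit enableJIT(0) below is only there to make the comparison visible):

library(compiler)
enableJIT(0)                  # switch off the default JIT so the difference shows up
f <- function(n) {            # scalar-heavy loop: the kind of code the compiler helps most
    s <- 0
    for (i in 1:n) s <- s + i %% 7
    s
}
fc <- cmpfun(f)               # byte-compile the function by hand
system.time(f(1e6))
system.time(fc(1e6))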
(Moving from a comment to an answer ...)
This sounds more like a question for the R development mailing list. Based on my general impressions I would say "probably not". Are your complicated functions already based on heavily vectorized (and hence efficient) functions? I think a more promising direction for not-so-easily-automatically-optimized situations is the increased simplicity of embedding C++ etc. (i.e. Rcpp), with inline if necessary.

Coding practice in R : what are the advantages and disadvantages of different styles?

The recent questions regarding the use of require versus :: raised the question about which programming styles are used when programming in R, and what their advantages/disadvantages are. Browsing through the source code or browsing on the net, you see a lot of different styles displayed.
The main trends in my code:
Heavy vectorization: I play a lot with indices (and nested indices), which sometimes results in rather obscure code but is generally a lot faster than other solutions.
e.g. x[x < 5] <- 0 instead of x <- ifelse(x < 5, 0, x)
I tend to nest functions to avoid overloading the memory with temporary objects that I need to clean up. Especially with functions manipulating large datasets this can be a real burden. e.g. y <- cbind(x, as.numeric(factor(x))) instead of tmp <- as.numeric(factor(x)); y <- cbind(x, tmp)
I write a lot of custom functions, even if I use the code only once in, e.g., an sapply. I believe it keeps things more readable without creating objects that remain lying around.
I avoid loops at all costs, as I consider vectorization to be a lot cleaner (and faster).
Yet I've noticed that opinions on this differ, and some people tend to back away from what they would call my "Perl" way of programming (or even "Lisp", with all those brackets flying around in my code -- I wouldn't go that far, though).
What do you consider good coding practice in R?
What is your programming style, and how do you see its advantages and disadvantages?
What I do will depend on why I am writing the code. If I am writing a data analysis script for my research (day job), I want something that works but that is readable and understandable months or even years later. I don't care too much about compute times. Vectorizing with lapply et al. can lead to obfuscation, which I would like to avoid.
In such cases, I would use loops for a repetitive process if lapply made me jump through hoops to construct the appropriate anonymous function for example. I would use the ifelse() in your first bullet because, to my mind at least, the intention of that call is easier to comprehend than the subset+replacement version. With my data analysis I am more concerned with getting things correct than necessarily with compute time --- there are always the weekends and nights when I'm not in the office when I can run big jobs.
For your other bullets: I would tend not to inline/nest calls unless they were very trivial. If I spell out the steps explicitly, I find the code easier to read and therefore less likely to contain bugs.
I write custom functions all the time, especially if I am going to be calling the code equivalent of the function repeatedly in a loop or similar. That way I have encapsulated the code out of the main data analysis script into its own .R file, which helps keep the intention of the analysis separate from how the analysis is done. And if the function is useful I have it for use in other projects etc.
If I am writing code for a package, I might start with the same attitude as my data analysis (familiarity) to get something I know works, and only then go for the optimisation if I want to improve compute times.
The one thing I try to avoid doing is being too clever when I code, whatever I am coding for. Ultimately I am never as clever as I think I am, and if I keep things simple I tend not to fall on my face as often as I might if I were trying to be clever.
I write functions (in standalone .R files) for various chunks of code that conceptually do one thing. This keeps things short and sweet. I found debugging somewhat easier, because traceback() gives you which function produced an error.
I too tend to avoid loops, except when it's absolutely necessary. I feel somewhat dirty if I use a for() loop. :) I try really hard to do everything vectorized or with the apply family. This is not always the best practice, especially if you need to explain the code to another person who is not as fluent in apply or vectorization.
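
A toy illustration of that trade-off (both versions give the same numbers; the vapply one is shorter and checks the result type, while the loop is arguably easier to explain to a newcomer):

xs <- split(mtcars$mpg, mtcars$cyl)       # group mpg values by number of cylinders

means <- numeric(length(xs))              # explicit for() loop
for (i in seq_along(xs)) means[i] <- mean(xs[[i]])

means2 <- vapply(xs, mean, numeric(1))    # same result via the apply family (keeps names)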
Regarding the use of require vs ::, I tend to use both. If I only need one function from a certain package I use it via ::, but if I need several functions, I load the entire package. If there's a conflict in function names between packages, I try to remember and use ::.
I try to find a function for every task I'm trying to achieve. I believe someone before me has thought of it and made a function that works better than anything I can come up with. This sometimes works, sometimes not so much.
I try to write my code so that I can understand it. This means I comment a lot and construct chunks of code so that they somehow follow the idea of what I'm trying to achieve. I often overwrite objects as the function progresses. I think this keeps the transparency of the task, especially if you're referring to these objects later in the function. I think about speed when computing time exceeds my patience. If a function takes so long to finish that I start browsing SO, I see if I can improve it.
I found out that a good syntax editor with code folding and syntax coloring (I use Eclipse + StatET) has saved me a lot of headaches.
Based on VitoshKa's post, I am adding that I use capitalizedWords (sensu Java) for function names and fullstop.delimited for variables. I see that I could have another style for function arguments.
Naming conventions are extremely important for the readability of the code. Inspired by R's S4 internal style here is what I use:
camelCase for global functions and objects (like doSomething, getXyyy, upperLimit)
functions start with a verb
not exported and helper functions always start with "."
local variables and functions are all lower case, in "_" syntax (do_something, get_xyyy). This makes it easy to distinguish local from global and therefore leads to cleaner code.
For data juggling I try to use as much SQL as possible, at least for the basic things like GROUP BY averages. I like R a lot, but sometimes it's not much fun to realize that your research strategy was not good enough to find yet another function hidden in yet another package. For my cases the SQL dialects do not differ much and the code is really transparent. Most of the time the threshold (when to switch to R syntax) is rather intuitive to discover, e.g.:
require(RMySQL)
# 'con' below is assumed to be an existing connection created with dbConnect()
# selection of variables alongside conditions in SQL is really transparent
# even if conditional variables are not part of the selection
statement <- "SELECT id, v1, v2, v3, v4, v5 FROM mytable
              WHERE this = 5
              AND that != 6"
mydf <- dbGetQuery(con, statement)
# some simple things get really tricky (at least in MySQL), but are simple in R
# e.g. the standard deviation of each table row
# (sd() no longer works on whole matrices, so use apply() row-wise)
dframe$rowsd <- apply(dframe, 1, sd)
So I consider it good practice, and really recommend using an SQL database for your data in most use cases. I am also looking into TSdbi and saving time series in a relational database, but cannot really judge that yet.

Resources