Converting cosine distance function in R to Rcpp

I've been developing an R package for single cell RNA-seq analysis, and one of the functions I used repeatedly calculates the cosine dissimilarity matrix for a given matrix of m cells by n genes. The function I wrote is as follows:
CosineDist <- function(input = NULL) {
  if (is.null(input)) { stop("You forgot to provide an input matrix") }
  dist_mat <- as.dist(1 - input %*% t(input) / (sqrt(rowSums(input^2) %*% t(rowSums(input^2)))))
  return(dist_mat)
}
This code works fine for smaller datasets, but when I run it on anything over 20,000 rows it takes forever and then crashes my R session due to memory issues. I believe that porting this to Rcpp would make it both faster and more memory efficient (I know this is a bit of a naive belief, but my knowledge of C++ in general is limited). Finally, the output of the function, though it does not have to be a distance matrix object when returned, does need to be able to be converted to that format after its generation.
How should I go about converting this function to Rcpp and then calling it as I would any of the other functions in my package? Alternatively, is this the best way to go about solving the speed / memory problem?

Hard to help you here, since, as the comments pointed out, you are basically asking for an Rcpp intro.
I'll try to give you some hints, which I already mentioned partly in the comments.
In general, using C/C++ can provide a great speedup (depending on the task, of course). For loop-intensive, unoptimized code I have seen speedups of 100x and more.
Since adding C++ can be complicated and sometimes cause problems, before you go this way check the following:
1. Is your R code optimized?
You can make a lot of bad choices here (e.g. loops are slow in R). Just by optimizing your R code, speedups of 10x or much more can often be reached easily (a sketch of an optimised R version of your cosine function follows this answer).
2. Are there better implementations in other packages?
Especially for helper functions or common functionality, other packages often have these already implemented. Benchmark different existing solutions with the 'microbenchmark' package. It is easier to just use an optimized function from another R package than to do everything on your own (and maybe the other package's implementation is already in C++). I mostly look for mainstream and popular packages, since these are better tested and unlikely to suddenly drop from CRAN.
3. Profile your code
Take a look at which parts exactly cause the speed / memory problems. It might be that you can keep most of the code in R and only write a C++ function for the critical parts. Or you may find another package that has an R function implemented in C for exactly this critical part.
In the end I'd say I prefer Rcpp/C++ over plain C code; I think it is the easier way to go. For the Rcpp learning part you will have to go with a dedicated tutorial (and not an SO question).
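To make point 1 concrete for the cosine function above, here is a sketch of a pure-R rewrite (untested, but it follows directly from the original code): tcrossprod(input) computes input %*% t(input) without materialising the transpose, and the row norms only need to be computed once.

CosineDistR <- function(input) {
  # same maths as the original, but avoids the explicit t(input)
  sim   <- tcrossprod(input)            # input %*% t(input)
  norms <- sqrt(rowSums(input^2))
  as.dist(1 - sim / outer(norms, norms))
}

If you still want the Rcpp route, a rough RcppArmadillo sketch (assuming RcppArmadillo is installed; the function name is just an example) could look like the following. Note that it still allocates the full m x m result, so it will not by itself solve the memory problem for 20,000+ rows.

library(Rcpp)

cppFunction(depends = "RcppArmadillo", '
arma::mat cosine_dissim(const arma::mat& X) {
  // row norms: sqrt of the row sums of squared entries
  arma::vec nrm = arma::sqrt(arma::sum(arma::square(X), 1));
  arma::mat sim = X * X.t();
  sim.each_col() /= nrm;       // divide row i by the norm of row i
  sim.each_row() /= nrm.t();   // divide column j by the norm of row j
  return 1.0 - sim;
}
')

# d <- as.dist(cosine_dissim(as.matrix(input)))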

Fastest way in R to compute the inverse for large matrices

I need to compute a hat matrix (as from linear regression). Standard R code would be:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
with X being a relatively large matrix (i.e. 1e5 x 100), and this line has to run thousands of times. I understand the most limiting part is the inverse computation, but the crossproducts may be time-consuming too. Is there any faster alternative for these matrix operations? I tried Rcpp and reviewed several posts, but every alternative I tested was slower. Maybe I did not write my C++ code properly, as I am not an advanced C++ programmer.
Thanks!
Chasing the code for this line by line is a little difficult because the setup of the R sources is on the complicated side. But read on; pointers below.
The important part is that the topic has been discussed many times: what happens is that R dispatches this to the BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) libraries, which contain the most efficient code known to man for this. In general, you cannot gain on it by rewriting.
One can gain performance by switching one BLAS/LAPACK implementation for another; there are many, many posts on this online too. R itself comes with the so-called 'reference BLAS', which is known to be correct but is also the slowest. You can switch to ATLAS, OpenBLAS, MKL, ... depending on your operating system; instructions on how to do so are in some of the R manuals that come with your installation.
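(Not part of the original answer, but as a quick sketch: recent R versions report which BLAS/LAPACK libraries are in use, so you can verify that a switch actually took effect.)

sessionInfo()   # recent R versions list the BLAS and LAPACK library paths here
La_version()    # version string of the LAPACK R is using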
For completeness, per file src/main/names.c the commands %*%, crossprod and tcrossprod all refer to do_matprod. This is in file src/main/array.c; it does a lot of argument checking, arranging and branching on argument types, but one path, for example, then calls
F77_CALL(dsyrk)(uplo, trans, &nc, &nr, &one, x, &nr, &zero, z, &nc
                FCONE FCONE);
which is the dsyrk routine from the BLAS. It is essentially the same for all the other paths, making this an unlikely venue for your optimisation.
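As a small illustration of that dispatch (a sketch, not part of the original answer): because crossprod(X) goes straight to the symmetric rank-k update and skips the explicit transpose, it is typically faster than the literal t(X) %*% X, which is worth knowing for the hat-matrix line above.

library(microbenchmark)
set.seed(1)
X <- matrix(rnorm(1e4 * 100), nrow = 1e4)   # smaller stand-in for the 1e5 x 100 case
microbenchmark(
  naive     = t(X) %*% X,
  crossprod = crossprod(X),
  times = 20
)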

Using all cores for R MASS::stepAIC process

I've been struggling to perform this sort of analysis and posted on the stats site about whether I was taking things in the right direction, but as I've been investigating I've also found that my lovely beefy processor (Linux OS, i7) is only actually using one of its cores. It turns out this is the default behaviour, but I have a fairly large dataset and between 40 and 50 variables to select from.
A stepAIC function that is checking various different models seems like the ideal sort of thing for parallelising, but I'm a relative newb with R and I only have sketchy notions about parallel computing.
I've taken a look at the documentation for the packages parallel and snowfall, but these seem to provide some built-in list functions for parallelisation, and I'm not sure how to morph stepAIC into a form that can be run in parallel using these packages.
Does anyone know 1) whether this is a feasible exercise, 2) how to do what I'm looking to do and can give me a sort of basic structure/list of keywords I'll need?
Thanks in advance,
Steph
I think that a process in which each step depends on the last (as in stepwise selection) is not trivial to do in parallel.
The simplest way I know to do something in parallel is:
library(doMC)
registerDoMC()
l <- foreach(i=1:X) %dopar% { fun(...) }
In my limited understanding of stepwise selection, at each step you remove a variable from the model (or add one, for forward/backward selection) and measure the fit. If the model fit is best after removing a particular variable, you keep that model, for example. In the foreach parallel loop each iteration is blind to the others, so maybe you could write your own function to perform this task, as in
http://beckmw.wordpress.com/tag/stepwise-selection/
I looked at this code, and it seems to me that you could use parallel computing with the vif_func function...
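To make the idea a bit more concrete, here is a very rough, untested sketch of a single forward-selection step evaluated in parallel with foreach; dat (the data frame), the response y and current_vars (the variables already in the model) are assumptions for illustration only.

library(doMC)
library(foreach)
registerDoMC()   # register parallel workers; pass a core count explicitly if you prefer

# one forward step: fit each candidate model on its own worker, keep the lowest AIC
candidates <- setdiff(names(dat), c("y", current_vars))
aics <- foreach(v = candidates, .combine = c) %dopar% {
  f <- reformulate(c(current_vars, v), response = "y")
  AIC(lm(f, data = dat))
}
best <- candidates[which.min(aics)]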
I also think you should check optimised code for that task, as in the package leaps
http://cran.r-project.org/web/packages/leaps/index.html
hope this helps...

CVX-esque convex optimization in R?

I need to solve (many times, for lots of data, alongside a bunch of other things) what I think boils down to a second-order cone program. It can be expressed succinctly in CVX as something like this:
cvx_begin
    variable X(2000);
    expression MX(2000);
    MX = M * X;
    minimize( norm(A * X - b) + gamma * norm(MX, 1) )
    subject to
        X >= 0
        MX((1:500) * 4 - 3) == MX((1:500) * 4 - 2)
        MX((1:500) * 4 - 1) == MX((1:500) * 4)
cvx_end
The data lengths and equality constraint patterns shown are just arbitrary values from some test data, but the general form will be much the same, with two objective terms -- one minimizing error, the other encouraging sparsity -- and a large number of equality constraints on the elements of a transformed version of the optimization variable (itself constrained to be non-negative).
This seems to work pretty nicely, much better than my previous approach, which fudges the constraints something rotten. The trouble is that everything else around this is happening in R, and it would be quite a nuisance to have to port it over to Matlab. So is doing this in R viable, and if so how?
This really boils down to two separate questions:
1) Are there any good R resources for this? As far as I can tell from the CRAN task page, the SOCP package options are CLSOCP and DWD, the latter of which includes an SOCP solver as an adjunct to its classifier. Both have similar but fairly opaque interfaces and are a bit thin on documentation and examples, which brings us to:
2) What's the best way of representing the above problem in the constraint block format used by these packages? The CVX syntax above hides a lot of tedious mucking about with extra variables and such, and I can just see myself spending weeks trying to get this right, so any tips or pointers to nudge me in the right direction would be very welcome...
You might find the R package CVXfromR useful. This lets you pass an optimization problem to CVX from R and returns the solution to R.
OK, so the short answer to this question is: there's really no very satisfactory way to handle this in R. I have ended up doing the relevant parts in Matlab with some awkward fudging between the two systems, and will probably migrate everything to Matlab eventually. (My current approach predates the answer posted by user2439686. In practice my problem would be equally awkward using CVXfromR, but it does look like a useful package in general, so I'm going to accept that answer.)
R resources for this are pretty thin on the ground, but the blog post by Vincent Zoonekynd that he mentioned in the comments is definitely worth reading.
The SOCP solver contained within the R package DWD is ported from the Matlab solver SDPT3 (minus the SDP parts), so the programmatic interface is basically the same. However, at least in my tests, it runs a lot slower and pretty much falls over on problems with a few thousand vars+constraints, whereas SDPT3 solves them in a few seconds. (I haven't done a completely fair comparison on this, because CVX does some nifty transformations on the problem to make it more efficient, while in R I'm using a pretty naive definition, but still.)
Another possible alternative, especially if you're eligible for an academic license, is to use the commercial Mosek solver, which has an R interface package Rmosek. I have yet to try this, but may give it a go at some point.
(As an aside, the other solver bundled with CVX, SeDuMi, fails completely on the same problem; the CVX authors aren't kidding when they suggest trying multiple solvers. Also, in a significant subset of cases, SDPT3 has to switch from Cholesky to LU decomposition, which makes the processing orders of magnitude slower, with only very marginal improvement in the objective compared to the pre-LU steps. I've found it worth reducing the requested precision to avoid this, but YMMV.)
There is a new alternative: CVXR, which comes from the same people.
There is a website, a paper and a github project.
Disciplined convex programming seems to be growing in popularity, judging by cvxpy (Python) and Convex.jl (Julia), again backed by the same people.
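For what it's worth, the original problem looks like it should translate fairly directly into CVXR. A rough, untested sketch (A, b, M and gamma are assumed to already exist in your workspace; the function names follow the CVXR documentation as I recall it):

library(CVXR)

X  <- Variable(2000)
MX <- M %*% X                    # affine expression, as in the CVX version

objective   <- Minimize(p_norm(A %*% X - b, 2) + gamma * p_norm(MX, 1))
constraints <- list(
  X >= 0,
  MX[(1:500) * 4 - 3] == MX[(1:500) * 4 - 2],
  MX[(1:500) * 4 - 1] == MX[(1:500) * 4]
)

result <- solve(Problem(objective, constraints))
x_hat  <- result$getValue(X)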

Interaction between C and R

I have naive questions to ask:
1) When I want to call C in R, I have to write some C code. But sometimes I have to call a function which I have written in R myself; can I call that R function from within the C function?
2) If 1) is feasible: if I call a function written in R repeatedly, say 1000 times in a loop, will it speed things up if I use C to run the loop and call that function?
Well put question. A quick take:
1) Yes, you can. It is (as with so many things) possible, yet a little tedious, with the C-based API that R offers -- but (in our opinion at least) much easier with the C++ layer we put on top via Rcpp.
2) That is the critical point: if the R code is the bottleneck, it remains the bottleneck when you call it from C or C++, as it does not matter where it is called from. What matters is its relative speed.
The rcpp-devel list (links are on Rcpp page) has a lot of related discussions; you can also find a lot here on SO under the [rcpp] tag.
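As a minimal sketch of point 1 (and a reminder of point 2), an R function can be passed into C++ as an Rcpp::Function and called from there; the names below are just for illustration, and the R function still runs at R speed.

library(Rcpp)

cppFunction('
NumericVector call_r_from_cpp(Function f, NumericVector x) {
  // Rcpp::Function wraps an R closure; calling it drops back into the
  // R evaluator, so the R code runs at its usual speed.
  return f(x);
}
')

call_r_from_cpp(function(v) v^2, c(1, 2, 3))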

Using system.time in R, getting very varied times

I have written two functions in R and I need to see which is faster, so I used system.time. However, the results are so varied I can't tell. As it's for assessed work I don't feel I can actually post the code (in case someone corrects it). Both functions call rbinom to generate multiple values, and this is the only part that isn't a simple calculation.
The function needs to be as fast as possible, but both are returning times of anywhere between 0.17 and 0.33 seconds. As the mark is 0.14 / (my function time) x 10, it's important I know the exact time.
I have left gcFirst=TRUE as recommended in the R help.
My question is why are the times so inconsistent? Is it most likely to be the functions themselves, my laptop or R?
You probably want to use one of the benchmarking packages, rbenchmark or microbenchmark, for this. And even then, variability will always enter. Benchmarking and performance testing is not the most exact science.
Also see the parts on profiling in the "Writing R Extensions" manual.
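A minimal sketch of the microbenchmark approach (f1 and f2 merely stand in for your two functions; the result is a distribution of timings rather than a single number):

library(microbenchmark)

f1 <- function() rbinom(1e4, size = 10, prob = 0.5)   # placeholder for your first function
f2 <- function() rbinom(1e4, size = 10, prob = 0.5)   # placeholder for your second function

microbenchmark(f1(), f2(), times = 100L)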
