R 2.14 byte compilation - how much performance increase - r

I saw that the latest R version supports byte compilation. What is the performance gain I can expect? And are specific tasks more positively impacted that others?

It is a good question:
The byte compiler appeared with 2.13.0 in April, and some of us run tests then (with thanks to gsk3 for the links).
I tend to look more at what Rcpp can do for us, but I now always include 'uncompiled R' and 'byte compiled R' as baselines.
I fairly often see gains by a factor of two (eg the varSims example) but also essentially no gains at all (the fibonacci example which is of course somewhat extreme).
We could do with more data. When we had the precursor byte compile in the Ra engine by Stephen Milborrow (which I had packaged up for Debian too), it was often said that both algebraic expression and loops benefited. I don't know of a clear rule for the byte compiler by Luke Tierney now in R---but it generally never seems to hurt so it may be worthwhile to get into the habit of turning it on.

Between a one- and five-fold increase:
http://radfordneal.wordpress.com/2011/05/13/speed-tests-for-r-%E2%80%94-and-a-look-at-the-compiler/
http://dirk.eddelbuettel.com/blog/2011/04/12/

Related

lineprof equivalent for Rcpp

The lineprof package in R is very useful for profiling which parts of function take up time and allocate/free memory.
Is there a lineprof() equivalent for Rcpp ?
I currently use std::chrono::steady_clock and such to get chunk timings out of an Rcpp function. Alternatives? Does Rstudio IDE provide some help here?
To supplement #Dirk's answer...
If you are working on OS X, the Time Profiler Instrument, part of Apple's Instruments set of instrumentation tools, is an excellent sampling profiler.
Just to fix ideas:
A sampling profiler lets you answer the question, what code paths does my program spend the most time executing?
A (full) cache profiler lets you answer the question, which are the most frequently executed code paths in my program?
These are different questions -- it's possible that your hottest code paths are already optimized enough that, even though the total number of instructions executed in that path is very high, the amount of time required to execute them might be relatively low.
If you want to use instruments to profile C++ code / routines used in an R package, the easiest way to go about this is:
Create a target, pointed at your R executable, with appropriate command line arguments to run whatever functions you wish to profile:
Set the command line arguments to run the code that will exercise your C++ routines -- for example, this code runs Rcpp:::test(), to instrument all of the Rcpp test code:
Click the big red Record button, and off you go!
I'll leave the rest of the instructions in understanding instruments + the timing profiler to your google-fu + the documentation, but (if you're on OS X) you should be aware of this tool.
See any decent introduction to high(er) performance computing as eg some slides from (older) presentation of my talks page which include worked examples for both KCacheGrind (part of the KDE frontend to Valgrind) as well as Google Perftools.
In a more abstract sense, you need to come to terms with the fact that C++ != R and not all tools have identical counterparts. In particular Rprof, the R profiler which several CRAN packages for profiling build on top of, is based on the fact that R is interpreted. C++ is not, so things will be different. But profiling compiled is about as old as compiling and debugging so you will find numerous tutorials.

OpenCL BLAS in julia

I am really amazed by the julia language, having implemented lots of machine learning algorithms there for my current project. Even though julia 0.2 manages to get some great results out of my 2011 MBA outperforming all other solutuions on similar linux hardware (due to vecLib blas, i suppose), I would certainly like more. I am in a process of buying radeon 5870 and would like to push my matrix operations there. I use basically only simple BLAS operations such as matmul, additios and transpositions. I use julia's compact syntax A' * B + C and would certainly like to keep it.
Is there any way (or pending milestone) I can get those basic operations execute on GPU? I have like 2500x2500 single precision matrices so I expect significant speedup.
I don't believe that GPU integration into the core of Julia is planned at this time. One of the key issues is that there is substantial overhead moving the data to and from the GPU, making a drop-in replacement for BLAS operations infeasible.
I expect that most of the progress in this area will actually come from the package ecosystem, in particular packages under the JuliaGPU organization. I see there is a CLBLAS package in there.

Parallel arithmetic on large integers

Are there any software tools for performing arithmetic on very large numbers in parallel? What I mean by parallel is that I want to use all available cores on my computer for this.
The constraints are wide open for me. I don't mind trying any language or tech.
Please and thanks.
It seems like you are either dividing really huge numbers, or are using a suboptimal algorithm. Parallelizing things to a fixed number of cores will only tweak the constants, but have no effect on the asymptotic behavior of your operation. And if you're talking about hours for a single division, asymptotic behavior is what matters most. So I suggest you first make sure sure your asymptotic complexity is as good as can be, and then start looking for ways to improve the constants, perhaps by parallelizing.
Wikipedia suggests Barrett division, and GMP has a variant of that. I'm not sure whether what you've tried so far is on a similar level, but unless you are sure that it is, I'd give GMP a try.
See also Parallel Modular Multiplication on Multi-core Processors for recent research. Haven't read into that myself, though.
The only effort I am aware of is a CUDA library called CUMP. However, the library only provides support for addition, subtraction and multiplication. Anyway, you can use multiplication to perform the division on the GPU and check if the quality of the result is enough for your particular problem.

MPI and OpenMP. Do I even have a choice?

I have a linear algebra code that I am trying get to run faster. Its a iterative algorithm with a loop and matrix vector multiplications within in.
So far, I have used MATMUL (Fortran Lib.), DGEMV, Tried writing my own MV code in OpenMP but the algorithm is doing no better in terms of scalability. Speed ups are barely 3.5 - 4 irrespective of how many processors I am allotting to it (I have tried up 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking OpenMP implementation of the code (including Matrix Vector) but has not helped. Will it help to code in MPI? I am not a pro at MPI but the ability to fine tune the message communication might help a bit but I can't be sure. Any comments?
More generally, from the literature I have read, MPI = Distributed, OpenMP = Shared but can they perform well in the others' territory? Like MPI in Shared? Will it work? Will it be better than the OpenMP implementation if done well?
You're best off just using a linear algebra package that is already well optimized for a multitcore environment and using that for your matrix-vector multiplication. The Atlas package, gotoblas (if you have a nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for intel CPUs, ACML for AMD, or VecLib for apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, just because the vector is smaller than another matrix and so there's less work; but you can still do quite well, and you'll find you get much better perforamance with these libraries than you do with anything hand-rolled unless you were already doing multi-level cache blocking.
You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.

Rllvm and compiler packages: R compilation

This is a fairly general question about the future of R: Any hope to see a merger of compilerand Rllvm (from Omegahat) or another JIT compilation scheme for R (I know there is Ra, but not updated recently)?
In my tests the speed gain from compiler are marginal for "complicated" functions...
What matters isn't how complicated a function is but what kinds of computations it performs. The compiler will make most difference for functions dominated by interpreter overhead, such as ones that perform mostly simple operations on scalar or other small data. In cases like that I have seen a factor of 3 for artificial examples and a a bit
better than a factor of 2 for some production code. Functions that spend most of their time in operations implemented in native code, like linear algebra operations, will see little benefit.
This is just the first release of the compiler and it will evolve over time. LLVM is one of several possible direction we will look at but probably not for a while. In any case, I would expect using something like LLVM to provide further improvements in cases where the current compiler already makes a difference, but not to add much in cases where it does not.
(Moving from a comment to an answer ...)
This sounds more like a question for the r development mailing list. Based on my general impressions I would say "probably not". Are your complicated functions already based on heavily vectorized (and hence efficient) functions? I think a more promising direction for not-so-easily-automatically-optimized situations is the increased simplicity of embedding C++ etc. (i.e. Rcpp), inline if necessary

Resources