"Standard" R benchmarking code? - r

I am recompiling/upgrading my R install and I want to measure performance pre/post upgrade. Is there possibly a standard script to run to time some commonly used functions and libraries? I have already installed rbenchmark, but I am just not enough of an R user to know what type of code to write to properly benchmark the new installation.

I've seen people use R-benchmark-25 as an overall test of R.
When I compile BLAS libraries, I use something like what I posted here to benchmark matrix operations from various packages.
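If you just want a quick sanity check rather than a standard suite, something like the sketch below (the matrix sizes and replication counts are arbitrary choices) exercises BLAS/LAPACK routines and a common statistical fit via rbenchmark, which you already have installed. Run the same script before and after the upgrade and compare the elapsed column:
# Rough benchmarking sketch with rbenchmark; sizes and replications are arbitrary
library(rbenchmark)
set.seed(1)
n <- 500
m <- matrix(rnorm(n * n), n, n)
df <- data.frame(y = rnorm(1e4), x = rnorm(1e4))
benchmark(
  matmul  = m %*% m,                  # BLAS-heavy matrix multiplication
  inverse = solve(m),                 # matrix inversion
  svd     = svd(m),                   # LAPACK decomposition
  lm_fit  = lm(y ~ x, data = df),     # common statistical routine
  sorting = sort(rnorm(1e6)),         # plain sorting in C
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)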

Related

Eliminating the need for packages in base R?

I know one of the reasons R is so popular is because of its amazing packages. But for data security reasons, I can't install packages on my work computer. So it got me thinking whether I could still make R do what I would typically make it do with packages using just base R, since packages are, after all, a compiled collection of functions. I am wondering if it is possible to run regression models and make charts in base R (without using, say, ggplot2, caret, etc.). Is it possible to copy the functions in these packages into base R to get the same functionality out of base R as one would if they were using the packages? Is the list of functions that are published as part of these packages available somewhere publicly by chance?
I am wondering if it is possible to run regression models and make charts in base R (without using, say, ggplot2, caret, etc.).
Yes. Before ggplot2 was invented, R was generally praised for publication-ready graphics. R comes with great plotting capabilities without ggplot2, even though the latter is definitely an improvement.
Likewise, people used R for regression decades before caret was invented. A base R installation comes with a solid set of linear and nonlinear regression methods, although, obviously, all those packages (well, most of them) have a reason to exist. It will mainly depend on what you plan to do. Many things are implemented in a base installation, many are not.
You can find lists of packages included with all binary distributions of R here: https://cran.r-project.org/doc/manuals/r-release/R-FAQ.html#Add_002don-packages-in-R
You will find that this includes not only the stats package but also lots of useful modelling packages such as MASS, splines, boot, mgcv, nlme, cluster, rpart, spatial and survival, so a large number of even quite specialized models is at hand without downloading additional packages.
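As a small illustration, here is a sketch using only base R and a built-in dataset:
# Regression and plotting with nothing but base R
fit <- lm(mpg ~ wt, data = mtcars)     # ordinary least squares from the stats package
summary(fit)                           # coefficients, R-squared, etc.
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "blue")              # add the fitted regression line
par(mfrow = c(2, 2))
plot(fit)                              # built-in diagnostic plots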
Is it possible to copy the functions in these packages into base R to get the same functionality out of base R as one would if they were using the packages?
Many packages contain just plain R code; others contain code in other languages, mostly C and C++, which needs a compiler to be built on your system. However, where the use of foreign code or packages is considered a security breach, you should refrain from that and talk to your employer.
If it is not considered a problem but they do not want to make an exception for you and your installation: I was in the same position for quite some time and simply ran R from a USB stick. If that is allowed and feasible on your system, you can download packages into that USB stick installation.
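If that route is open to you, all R needs is a user-writable library path; the path below is just a placeholder for wherever the stick is mounted:
# Sketch: keep a package library on removable media (path is hypothetical)
lib <- "E:/R/library"                        # adjust to your USB stick
dir.create(lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(lib, .libPaths()))               # put it first on the library search path
install.packages("ggplot2", lib = lib)       # installs into the stick's library
library(ggplot2, lib.loc = lib)              # load from there explicitly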

SAS IML and SAS/R interface

Does one need to have SAS/IML installed to use the SAS/R interface? Or should/could one use the SAS X command to run R and feed data to it?
If you want to actually use the SAS/R interface, then yes, you must license and have SAS/IML installed as it is specifically a feature of SAS/IML (which makes sense, as SAS/IML is SAS's matrix programming language, and R is a matrix programming language).
However, you're welcome to use R the way you describe (by submitting R programs via xcmd); you will, however, need to use a CSV file or similar to exchange data between the two programs. There are several ways to do it, so look at the different options available to see what's easiest for you.
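On the R side, the exchange can be as simple as reading the file SAS exported and writing results back for SAS to pick up; the file names and the model below are just placeholders:
# Sketch of the R half of a file-based SAS <-> R exchange (paths and model hypothetical)
dat <- read.csv("C:/temp/from_sas.csv")                  # data exported by SAS
fit <- lm(y ~ x1 + x2, data = dat)                       # whatever analysis you need R for
out <- data.frame(term = names(coef(fit)), estimate = coef(fit))
write.csv(out, "C:/temp/to_sas.csv", row.names = FALSE)  # for SAS to import afterwards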
If you're choosing between the different ways to do this, here is a list of the advantages of using IML, which serves as a nice comparison between the two. It is perhaps a biased one (Rick is the lead developer of SAS/IML), but it is sufficiently detailed about what you won't have available when running R as a separate program that it should be helpful in making the decision.

How to test and compare multiple versions of R functions for a package?

I use unit tests via testthat for many of my simpler functions in R, but where I have more complex functions combining simple functions and business logic, I'd like to test the cumulative impact via a before-and-after view for a sample of inputs. Ideally, I'd like to do this for a variety of candidate changes.
At the moment I'm:
1. Using R Markdown documents
2. Loading the package as-is
3. Getting my sample
4. Running my sample through the package as-is and outputting a table of results
5. source()-ing new copies of functions
6. Running my sample again and outputting a table of results
7. Reloading the package and source()-ing different copies of functions as required
This has proven difficult because functions that sit in the package namespace still call the package versions of their dependencies, making results unreliable unless I thoroughly reload all downstream dependencies. Additionally, the mechanism is complex to manage and difficult to reuse.
Is there a better strategy for testing candidate changes in R packages?
I've reduced this issue by creating a sub-package in my impact analysis folder that contains the amended functions.
I then use devtools::load_all to load these new function versions.
I can then compare and contrast results by accessing the originals via the namespace, e.g. myoriginalpackage:::testfunction, whilst looking at the new ones with plain testfunction.
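In practice the comparison looks roughly like the sketch below (the path, data and function names are placeholders for your own):
# Sketch: compare the installed version against the amended one loaded via load_all()
library(devtools)
load_all("impact_analysis/amendedpkg")                  # loads the amended function versions into the session
sample_input <- head(mydata, 100)                       # some representative sample of inputs
old <- myoriginalpackage:::testfunction(sample_input)   # installed/original version
new <- testfunction(sample_input)                       # version from load_all()
all.equal(old, new)                                     # or build a table of the differences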
Maybe you can replace steps 5 and 7 with "R CMD build YourPackage" on each version. Then with
install.packages("Path/To/MyPackage_1.1.tar.gz", repos = NULL, type = "source")
test the old version, and then replace 1.1 with 1.2.

How do I ensure R / Rcpp code is reproducible ("distributable")?

I've written some R code for a dissertation, relying on some external packages (e.g., plyr and reshape) and writing a couple of relatively simple inline C++ functions using inline and RcppArmadillo.
I would like to ensure it can be run "as is" on computers other than my own (Win64), for research reproducibility purposes.
My question: suppose I included code for installing the required packages, would the RcppArmadillo (and Rcpp and inline) packages be sufficient to compile the functions written with RcppArmadillo, or would the end user need to change system paths for compilation on his Windows machine? If not, is it possible/recommended to save the compiled functions on my end and include them in the R code I'm shipping?
Also, in the unlikely case that the code should be run some time later (say, a couple of years), is it sufficient to include a full R installation with the relevant packages in their current versions to make the code "future-proof"?
I hope the question is clear.
I think you mean your code to be "distributable" and "executable by someone else", which is a looser requirement. Being "reproducible" implies that the previous requirement holds, and adds the restriction that the results are identical (up to an epsilon of your choice).
And the usual answer for 'how can I let others run my R code' is to create a package.
For Rcpp-related code, we have an entire vignette devoted to building your own package with your Rcpp-using code. The vignette is called 'Rcpp-package'.
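A minimal way to get started, sketched below, is the package skeleton generator that ships with RcppArmadillo (the package name here is a placeholder); it sets up the compilation machinery so that, on Windows, end users typically only need Rtools and the packages installed:
# Sketch: generate a package skeleton already wired up for Rcpp/RcppArmadillo
library(RcppArmadillo)
RcppArmadillo::RcppArmadillo.package.skeleton("mydissertationpkg")
# Put your C++ code into src/ as .cpp files using Rcpp attributes, then build and check:
#   R CMD build mydissertationpkg
#   R CMD check mydissertationpkg_1.0.tar.gz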

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments, because I do not have a good background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operations. The options are discussed in detail in the High Performance Computing Task View on CRAN.
Update: from R version 2.14.0, add-on packages are not necessarily required, thanks to the inclusion of the parallel package as a recommended package shipped with R. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structure. That will give you an easy segue into mclapply() and even distributed computing using the same abstraction.
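For example, converting an lapply() call is often just a matter of swapping the function name (a sketch; forking, and hence mclapply() with more than one core, is not available on Windows):
# Sketch: converting a serial lapply() call to parallel mclapply()
library(parallel)                                  # ships with R >= 2.14.0
slow_task <- function(i) { Sys.sleep(0.1); i^2 }   # stand-in for real work
res_serial <- lapply(1:16, slow_task)              # one core
res_parallel <- mclapply(1:16, slow_task, mc.cores = 4)   # forked workers (not on Windows)
identical(res_serial, res_parallel)                # same results, less wall-clock time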
Things get much more difficult for operations that are not "embarrassingly parallel".
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of Oct 2011... I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. The easiest solution, in my opinion, is the package snowfall, which is based on snow and works on a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. snowfall is a wrapper around the snow package and allows you to set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few tasks that can be parallelized, for the very simple reason that you have to be able to split up the work before multicore calculation makes sense. The apply family is obviously a logical choice for this: multiple independent computations, which is crucial for multicore use. Anything else is not always that easily parallelized.
Read also this discussion on sfApply and custom functions.
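The basic snowfall workflow looks roughly like this (a sketch; the number of CPUs and the function are arbitrary):
# Sketch of the snowfall workflow on a single multi-core machine
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)          # start a local cluster with 4 workers
myfun <- function(i) sum(rnorm(1e5)^2)     # stand-in computation
sfExport("myfun")                          # make the function available on the workers
res <- sfLapply(1:100, myfun)              # parallel drop-in for lapply()
sfStop()                                   # shut the cluster down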
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix and Mac. It's open source and can be installed in a separate directory alongside an existing R (from CRAN) installation, and you can also use the popular RStudio IDE with it. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiplication/inversion, matrix decomposition, and some higher-level matrix operations, to compute in parallel and use all of the processing power available to reduce computation times.
Please check the links below:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
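A quick, informal way to see whether your BLAS is multi-threaded is to time a large matrix multiplication while watching your CPU monitor (the size is arbitrary):
# Rough check: with a threaded BLAS this should use all cores and finish much faster
n <- 4000
m <- matrix(rnorm(n * n), n, n)
system.time(m %*% m)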
As David Heffernan said, take a look at the blog of Revolution Analytics. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution. Here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread here at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family of functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multiple cores in parallel ("multiprocess" was used in older versions of future;
# "multisession" is the current cross-platform equivalent)
plan(multisession)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)
