How could I speed up my R code? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
Disclaimer
Hello everyone! I recently started programming in R. My code works, but some of my scripts take far too long to be of practical use. I hope someone can help me make this code run faster, either by optimising it or by using one of the multicore packages.
About the code
I have large datasets, each containing about 15,000 numeric values. The code takes two parameters (p, n), where p >= n, and builds subsets of the data: one overlapping window of length p per row. It applies the zyp.yuepilon function (from the zyp package) to each row of the subsets. The parameter n is then used to apply the same function to the first n columns of each subset.
The problem is that I run this code in a nested for loop (p in 10:40 and n in 10:40), so it takes an eternity to get the results, and that is just one dataset among many.
library(zyp)

sp <- function(p, n){
  data <- runif(15000, 1, 4)
  # start and end index of each overlapping window of length p
  lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
  upper <- lower + p - 1
  subsets <- matrix(nrow=length(lower), ncol=p)
  for(j in seq_along(lower)){
    subsets[j, ] = data[lower[j] : upper[j]]
  }
  # apply() returns one column of results per row of subsets
  ret <- apply(subsets, 1, zyp.yuepilon)
  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}
Benchmark results in seconds:
expr min lq median uq max neval
sp(7, 6) 92.77266 94.24901 94.53346 95.10363 95.64914 10
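For what it's worth, the loop that fills `subsets` can be replaced by a single indexing step: build a matrix of window positions with `outer()` and index `data` once. This is a minimal sketch assuming the same offset of 80 and the same window starts as in `sp()`; `make_subsets` is a hypothetical helper name. It won't touch the main cost (the roughly 15,000 `zyp.yuepilon` calls per `apply`), but it removes one R-level loop.

```r
# Build all overlapping windows of length p in one indexing step (no loop).
make_subsets <- function(data, p, offset = 80) {
  # same window start positions as in sp(): offset - p + 1, offset - p + 2, ...
  lower <- seq(offset - p + 1, by = 1, length.out = length(data) - offset - 1)
  # idx[j, k] is the position in `data` of the k-th element of the j-th window
  idx <- outer(lower, 0:(p - 1), "+")
  # data[idx] comes back in column-major order, so refill with the same shape
  matrix(data[idx], nrow = length(lower), ncol = p)
}

# usage: subsets <- make_subsets(runif(15000, 1, 4), p = 7)
```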

Here is a series of comments, rather than an answer.
Looking at the body of zyp.yuepilon (print it by typing the function name without parentheses in an R session), you can see that this function and the function zyp.sen are written in plain R code (as opposed to compiled code).
The biggest speed-up is likely to come from the Rcpp package, which makes it easy to call compiled C++ code from R. In fact, there is a small linear-model example in the question Fast LM model using Rcpp/RcppArmadillo.
I would be inclined to rewrite the two functions zyp.yuepilon and zyp.sen in C++ using Rcpp, including the loop over subset vectors (which you currently do with apply).
For general R speed-up issues see this question R loop performance, as well as the R package plyr, which may provide an entry point for taking a map-reduce type of approach to your problem.
If you want to steer clear of C++, a series of micro-optimisations would be your quickest win. To speed up the apply part of your code, you could use something like this:
library(doParallel)
library(parallel)
library(foreach)
library(zyp)
cl <- makeCluster(4)
registerDoParallel(cl)
sp_1 <- function(p=7, n=6){
  N_ob <- 15000
  off_set <- 81
  N_ob_o <- N_ob - off_set
  # random stand-in data with the same number of rows as the windowed subsets
  am <- matrix(runif(N_ob*p), ncol=p)
  subsets <- am[-(1:off_set), ]
  # one foreach task per row; zyp.yuepilon returns 11 values, so reshape to an
  # 11 x N_ob_o matrix (the same layout that apply() returns in sp())
  ret <- matrix(unlist(foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subsets[i,]), use.names=FALSE), nrow=11)
  subset_n <- subsets[, 1:n]
  ret2 <- matrix(unlist(foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subset_n[i,]), use.names=FALSE), nrow=11)
  return(list(ret, ret2))
}
sp<-function(p=7, n=6){
data <- runif(15000, 1, 4)
lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
upper <- lower + p - 1
subsets <- matrix(nrow=length(lower), ncol=p)
for(j in 1:length(lower)){
subsets[j, ] = data[lower[j] : upper[j]]
}
ret <- apply(subsets, 1, zyp.yuepilon)
subset_n <- subsets[, 1:n]
ret2 <- apply(subset_n, 1, zyp.yuepilon)
return(list(ret, ret2))
}
system.time(sp_1())
system.time(sp())
stopCluster(cl)
This gives me a speed-up of around a factor of 2, but this will depend on your platform, etc. Check out the help files for the functions and packages above, and tune the number of workers passed to makeCluster to see what works best for your platform (in the absence of any information about your particular set-up).
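A variant with fewer, larger tasks: `parallel::parApply()` by default splits the rows into one block per worker, instead of sending each row as a separate `foreach` task, which cuts the dispatch overhead. The sketch below uses a hypothetical stand-in, `row_slope()` (an OLS slope of each row against its index), so that it runs without zyp; with zyp installed on the workers you would pass `zyp::zyp.yuepilon` as `FUN` instead. Whether this beats the `foreach` version on your machine is something to measure, not assume.

```r
library(parallel)

# Hypothetical stand-in for zyp::zyp.yuepilon: OLS slope of a series against its index.
row_slope <- function(y) {
  idx <- seq_along(y)
  sum((idx - mean(idx)) * (y - mean(y))) / sum((idx - mean(idx))^2)
}

set.seed(1)
subsets <- matrix(runif(2000 * 7, 1, 4), ncol = 7)

cl <- makeCluster(2)
# parApply hands the rows to parLapply, which by default makes one chunk per
# worker, so each worker receives a single large task rather than one per row.
ret_par <- parApply(cl, subsets, 1, row_slope)
stopCluster(cl)

ret_seq <- apply(subsets, 1, row_slope)
all.equal(ret_par, ret_seq)
```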
Another route might be to use the byte-code compiler via library(compiler), to see whether the various functions can be optimised that way:
library(compiler)
library(zyp)
enableJIT(3)
zyp_comp <- cmpfun(zyp.yuepilon)
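To check whether explicit compilation actually helps, you could time a plain-R function before and after `cmpfun()`. A small sketch with a hypothetical loop-heavy function, `sum_sq_loop()`; note that since R 3.4.0 the JIT compiler is enabled by default at level 3, so the "uncompiled" version may get compiled anyway and the two timings may come out similar.

```r
library(compiler)

# Hypothetical loop-heavy function written in plain R.
sum_sq_loop <- function(x) {
  s <- 0
  for (v in x) s <- s + v^2
  s
}
sum_sq_cmp <- cmpfun(sum_sq_loop)   # explicitly byte-compiled copy

x <- runif(1e6)
system.time(r1 <- sum_sq_loop(x))
system.time(r2 <- sum_sq_cmp(x))
all.equal(r1, r2)                   # compilation must not change the result
```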

Related

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000)
{
x<-rnorm(n)
reg2<-lm(x~data$Y)
Max[i]<-max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test 10^6 or 10^9 iterations for precision reasons) and the size of Y.
Having read the following Stack Overflow thread, a first attempt was to parallelize the regression with calm():
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000)
{
x<-rnorm(n)
reg2 <- calm(cls, "x ~ test$Y, data = test")
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended up much slower (by a factor of 6) than the original, non-parallel loop. My assumption is that the threads are created and destroyed in each loop iteration, which slows the process down a lot in R.
A second attempt was to use .lm.fit(), following this Stack Overflow thread:
for(i in 1:10000)
{
x<- rnorm(n)
reg2<- .lm.fit(as.matrix(x), data$Y)
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This was much faster than the original version, so in overall processing time we now have .lm.fit() < lm() < calm().
However, we are still looking for ways to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
#Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
# one row of n normal draws per simulation
x <- matrix(rnorm(100 * n), 100)
for(i in 1 : 100)
{
  reg <- lm(x[i, ] ~ sample$AGE)
  Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we ran several tests and noted the following:
Following @Karo's contribution, we generate the matrix of normal samples outside the loop to save some execution time. We expected a noticeable improvement, but our tests show the opposite (i.e. a longer execution time). Maybe the effect reverses as the number of simulations increases.
Following @BenBolker's suggestion, we also tested fastLm(); it reduces the execution time, but its results seem to differ (by a factor of about 0.05) from those of the standard lm().
We are still struggling to reduce the execution time effectively. Following @Karo's suggestions, we will try passing a vector directly to lm() and investigate parallelization (which failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
testfun(lm.fit),
testfun(fastLmPure),
times=1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions", that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
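In the question's minimal example the predictor (AGE) is fixed and only the response changes, so this matrix trick applies there: fit every simulated response in one `.lm.fit()` call, then compute the externally studentized residuals (what `rstudent()` returns) by hand, since all fits share the same leverages. A sketch assuming an intercept plus one predictor, as in `lm(x ~ AGE)`; `max_abs_rstudent` is a hypothetical name.

```r
# Max |externally studentized residual| for many responses sharing one design.
# Y: n x nsim matrix (one simulated response per column); predictor: length n.
max_abs_rstudent <- function(Y, predictor) {
  X <- cbind(1, predictor)               # intercept + predictor, as in lm(y ~ predictor)
  fit <- .lm.fit(X, Y)                   # all responses fitted in one call
  e <- fit$residuals                     # n x nsim residual matrix
  n <- nrow(X); k <- ncol(X)
  h <- rowSums(qr.Q(qr(X))^2)            # leverages (diagonal of the hat matrix)
  rss <- colSums(e^2)                    # residual sum of squares per response
  # leave-one-out variance: (RSS - e_i^2 / (1 - h_i)) / (n - k - 1)
  s2_loo <- (matrix(rss, n, ncol(e), byrow = TRUE) - e^2 / (1 - h)) / (n - k - 1)
  r <- e / sqrt(s2_loo * (1 - h))        # studentized residuals
  apply(abs(r), 2, max)
}

# usage with the question's objects (x has one simulation per row):
# Max <- max_abs_rstudent(t(x), sample$AGE)
```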
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
for(i in 1:10000000)
{
x <- rnorm(10)
}
})
duration <- system.time({
m <- matrix(rnorm(10000000*10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row with the same amount of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time, the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind, you probably should, and most likely can, avoid the loop altogether; for instance, you can pass a vector into the lm function. If you still need it to be faster after that, you can definitely parallelise in a number of ways; it would be easier to suggest how with a working example with data.

Simulate over an xts object in R

I am looking for a function, or package, that will help me with this goal. I've looked through several packages but can't find what I am looking for:
Let's say I have an xts object with 10 columns and 250 rows.
What I want to do is run a simulation, such that I get a robust calculation of my performance metric over the period.
So, let's say that I have 250 data points. I want to run x simulations over random samples of the data, computing the Sharpe ratio with PerformanceAnalytics::SharpeRatio, with the random samples varying in length from 30 to 240, and then find the average. Keep in mind that I want to do this for every column, and I'd rather not use apply if possible. I'd also like something that processes the information quickly.
What package or functions would best serve this purpose?
Thank you!
Subsetting xts objects for the rows you want to randomly sample should be good enough performance-wise, if that is your main concern. If you want some other concrete examples, you may find it useful to look at the Monte Carlo simulation functions recently added to the R blotter package:
https://github.com/braverock/blotter/blob/master/R/mcsim.R
Your requirements are quite detailed and a little tricky to follow, but I think this example may be what you're after?
This solution does use apply functions, though, because they just make life easier. Without lapply, the code would expand quickly and distract from achieving the goal (and you risk introducing bugs with longer, messier code, which is one reason to use apply-family functions where you can).
library(quantmod)
library(PerformanceAnalytics)
# Set up the data:
syms <- c("GOOG", "FB", "TSLA", "SNAP", "MU")
getSymbols(syms)
z <- do.call(merge, lapply(syms, function(s) {
x <- get(s)
dailyReturn(Cl(x))
}))
# Here we have 250 rows, 5 columns:
z <- tail(z, 250)
colnames(z) <- paste0(syms, ".rets")
subSample <- function(x, n.sub = 40) {
# Assuming subsampling by row, preserving all returns and cross symbol dependence structure at a given timestamp
ii <- sample(1:NROW(x), size = n.sub, replace = FALSE)
# sort in order to preserve time ordering?
ii <- sort(ii)
xs <- x[ii, ]
xs
}
set.seed(5)
# test:
z2 <- subSample(z, n.sub = 40)
zShrp <- SharpeRatio(z2)[1, ]
# now run simulation:
nSteps <- seq(30, 240, by = 30)
sharpeSimulation <- function(x, n.sub) {
x <- subSample(x, n.sub)
SharpeRatio(x)[1, ]
}
res <- lapply(nSteps, FUN = sharpeSimulation, x = z)
res <- do.call(rbind, res)
resMean <- colMeans(res)
resMean
# GOOG.rets FB.rets TSLA.rets SNAP.rets MU.rets
# 0.085353854 0.059577882 0.009783841 0.026328660 0.080846592
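The loop above draws a single subsample per size. If you want several draws per size before averaging, one option is to wrap the draw in `replicate()`. This sketch uses a simple Sharpe ratio, mean / sd with a zero risk-free rate, as a stand-in for `SharpeRatio()`; `simple_sharpe` and `mc_sharpe` are hypothetical names, and it is written for a plain numeric matrix of returns.

```r
# Simple per-column Sharpe ratio: mean / sd, zero risk-free rate (a stand-in).
simple_sharpe <- function(x) colMeans(x) / apply(x, 2, sd)

# For each subsample size, draw n.sim random (time-ordered) row subsets,
# average the Sharpe ratios over the draws, then average across sizes.
mc_sharpe <- function(x, sizes = seq(30, 240, by = 30), n.sim = 100) {
  k <- NCOL(x)
  per_size <- vapply(sizes, function(size) {
    sims <- replicate(n.sim, {
      rows <- sort(sample(NROW(x), size))          # keep time ordering
      simple_sharpe(x[rows, , drop = FALSE])
    })
    rowMeans(matrix(sims, nrow = k))               # average over the draws
  }, numeric(k))
  res <- rowMeans(matrix(per_size, nrow = k))      # average over sizes
  setNames(res, colnames(x))
}

# usage: mc_sharpe(coredata(z), n.sim = 200)
```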
Do you realise that SharpeRatio uses sapply? And it's likely other performance metrics you want to use will as well. Since you seem to have something against apply (possibly all apply functions in R), this might be worth noting.

Correct way to generate parallel random booleans

Following on from this question (Cannot understand why random number algorithms give different results), I have some code simulating random booleans. Because I want to do this a lot, and fast, I want to wrap it in a function like so:
# setup external to function
number <- 5
probs <- rep(0.1, 5)
# core function
event.sim <- function(var, things){
mod.probs <- probs * var
events <- matrix(rbinom(things*number, 1, rep(mod.probs, each=things)), ncol=number, byrow=FALSE) # one probability per column
av.events <- max(rowSums(events))
return(av.events)
}
library("parallel")
cl <- makeCluster(4)
clusterExport(cl, c("event.sim", "probs", "number"))
test <- clusterMap(cl, event.sim, var=df1$var1, things=df1$things, SIMPLIFY=TRUE)
stopCluster(cl)
and parallelize it using clusterMap() from parallel. This works without problems, but I am concerned that, by executing in parallel, my booleans are no longer sufficiently "random". I can find a lot of information online about generating random numbers in parallel, but it all seems to describe generating many random numbers at once, and I can't relate that to my function, which draws relatively few random booleans each time it runs. Do I have a problem here, and do I need to do something differently?
You just need to call clusterSetRNGStream(cl) after creating your cluster and before running your function, so that each worker gets its own independent random-number stream.
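A minimal sketch of that pattern: `clusterSetRNGStream()` gives each worker its own L'Ecuyer-CMRG stream derived from one seed, so the draws are independent across workers and, with a fixed `iseed` and the same number of workers, reproducible. `draw_bools` and `run_sims` are hypothetical stand-ins for `event.sim` and your calling code.

```r
library(parallel)

# Hypothetical stand-in for event.sim(): a handful of random booleans per call.
draw_bools <- function(k) rbinom(k, 1, 0.1)

run_sims <- function(seed, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  # one independent L'Ecuyer-CMRG stream per worker, all derived from `seed`
  clusterSetRNGStream(cl, iseed = seed)
  clusterMap(cl, draw_bools, k = rep(5, 8))
}

a <- run_sims(123)
b <- run_sims(123)
identical(a, b)   # same seed and same number of workers give the same draws
```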

R package available for "Adjusted Mutual Information"? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I am searching for a package to calculate the adjusted mutual information between two clusterings. I have only found some Python code via Google. Is there any built-in R package or function that can compare clusterings via mutual information?
Here is a link
https://github.com/defleury/adjusted_mutual_information
It says "contains code for the fast & parallelized calculation of Adjusted Mutual Information (AMI), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) between clusterings in R."
For small clusterings, here are 3 functions. Function f_rez() takes as input 2 vectors whose values give the cluster of each element, and returns NMI and AMI. It takes about 30 s for 3 pairs of clusterings of length N = 11,117 on a dual-core machine, without parallelization.
f_nij <- function(v1,v2,l1,l2){ #contingency table n(i,j)=t(i,j)
m <- matrix(0,l1,l2)
for (i in 1:length(v1)){
m[v1[i],v2[i]] <- m[v1[i],v2[i]] +1
}
m
}
f_emi <- function(s1,s2,l1,l2,n){ #expected mutual information
s_emi <- 0
for(i in 1:l1){
for (j in 1:l2){
min_nij <- max(1,s1[i]+s2[j]-n)
max_nij <- min(s1[i],s2[j])
n.ij <- seq(min_nij, max_nij) #sequence of consecutive numbers
t1<- (n.ij / n) * log((n.ij * n) / (s1[i]*s2[j]))
t2 <- exp(lfactorial(s1[i]) + lfactorial(s2[j]) + lfactorial(n - s1[i]) + lfactorial(n - s2[j]) - lfactorial(n) - lfactorial(n.ij) - lfactorial(s1[i] - n.ij) - lfactorial(s2[j] - n.ij) - lfactorial(n - s1[i] - s2[j] + n.ij))
emi <- sum(t1*t2)
s_emi <- s_emi + emi
}
}
return(s_emi)
}
f_rez <- function(v1,v2){
library(infotheo)
s1 <- tabulate(v1);
s2 <- tabulate(v2);
l1 <- length(s1)
l2 <- length(s2)
N <- length(v1)
tij <- f_nij(v1,v2,l1,l2) #contingency table n(i,j)=t(i,j). this would be equivalent with table(v1,v2)
mi <- mutinformation(v1,v2) #function for Mutual Information from package infotheo
h1 <- -sum(s1*log(s1/N))/N
h2 <- -sum(s2*log(s2/N))/N
nmi <- mi/max(h1,h2) # NMI Normalized MI
emi <- f_emi(s1,s2,l1,l2,N) # EMI Expected MI
ami <- (mi-emi)/(max(h1,h2)-emi) #AMI Adjusted MI: (MI - EMI) / (max(H1, H2) - EMI)
return(c(nmi,ami))
}
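If you'd rather not depend on infotheo for the MI term, it can be computed directly from `table(v1, v2)` (the equivalence already noted in the comment in f_rez). A base-R sketch using natural logs, matching h1 and h2 in f_rez; `mi_from_table` and `entropy_nats` are hypothetical names.

```r
# Mutual information (natural log) from the contingency table n(i,j).
mi_from_table <- function(v1, v2) {
  nij <- table(v1, v2)        # contingency table n(i,j)
  N <- sum(nij)
  a <- rowSums(nij)           # cluster sizes in the first clustering
  b <- colSums(nij)           # cluster sizes in the second clustering
  nz <- nij > 0               # empty cells contribute 0 and would give log(0)
  sum((nij[nz] / N) * log(nij[nz] * N / outer(a, b)[nz]))
}

# Entropy (natural log) of a single clustering.
entropy_nats <- function(v) {
  p <- table(v) / length(v)
  -sum(p * log(p))
}
```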
I found MATLAB code for Adjusted Mutual Information (AMI) here. According to this thread on Stack Overflow, it is possible to translate an .m file into an .R file; although this isn't trivial, it is at least one way to get an AMI function for R.
The R package clue provides Normalized Mutual Information (NMI), a "less-good" version of AMI. According to the paper "A Novel Approach for Automatic Number of Clusters Detection in Microarray Data based on Consensus Clustering" by Nguyen Xuan Vinh and Julien Epps, the Adjusted Rand Index (ARI) seems to be a good substitute for AMI.
Luckily, for ARI there are a few R packages that implement this function, such as the RRand() function in the phyclust package, the randIndex() function in the flexclust package, the adjustedRandIndex() function in the mclust package, and a.rand.index2() in the clustergas package.
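If you want to check those package results, or avoid the dependency, the standard ARI formula is short enough to compute from the contingency table in base R. A sketch; `ari` is a hypothetical name, and it returns NaN in the degenerate case where both clusterings put everything in one cluster.

```r
# Adjusted Rand Index from the contingency table, base R only.
ari <- function(v1, v2) {
  nij <- table(v1, v2)
  sum_ij <- sum(choose(nij, 2))                 # pairs together in both
  sum_a <- sum(choose(rowSums(nij), 2))         # pairs together in clustering 1
  sum_b <- sum(choose(colSums(nij), 2))         # pairs together in clustering 2
  expected <- sum_a * sum_b / choose(sum(nij), 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# usage: ari(v1, v2) should agree with mclust::adjustedRandIndex(v1, v2)
```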
Hope this helps.

Is there an efficient way to parallelize mapply?

I have many rows and on every row I compute the uniroot of a non-linear function. I have a quad-core Ubuntu machine which hasn't stopped running my code for two days now. Not surprisingly, I'm looking for ways to speed things up ;-)
After some research, I noticed that only one core is currently used and parallelization is the thing to do. Digging deeper, I came to the conclusion (maybe incorrectly?) that the package foreach isn't really meant for my problem because too much overhead is produced (see, for example, SO). A good alternative seems to be multicore for Unix machines. In particular, the pvec function seems to be the most efficient one after I checked the help page.
However, if I understand it correctly, this function only takes one vector and splits it up accordingly. I need a function that can be parallelized but takes multiple vectors (or a data.frame instead), just like the mapply function does. Is there anything out there that I missed?
Here is a small example of what I want to do: (Note that I include a plyr example here because it can be an alternative to the base mapply function and it has a parallel option. However, it is slower in my implementation and, internally, it calls foreach to parallelize, so I think it won't help. Is that correct?)
library(plyr)
library(foreach)
n <- 10000
df <- data.frame(P = rnorm(n, mean=100, sd=10),
B0 = rnorm(n, mean=40, sd=5),
CF1 = rnorm(n, mean=30, sd=10),
CF2 = rnorm(n, mean=30, sd=5),
CF3 = rnorm(n, mean=90, sd=8))
get_uniroot <- function(P, B0, CF1, CF2, CF3) {
uniroot(function(x) {-P + B0 + CF1/x + CF2/x^2 + CF3/x^3},
lower = 1,
upper = 10,
tol = 0.00001)$root
}
system.time(x1 <- mapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3))
#user system elapsed
#0.91 0.00 0.90
system.time(x2 <- mdply(df, get_uniroot))
#user system elapsed
#5.85 0.00 5.85
system.time(x3 <- foreach(P=df$P, B0=df$B0, CF1=df$CF1, CF2=df$CF2, CF3=df$CF3, .combine = "c") %do% {
get_uniroot(P, B0, CF1, CF2, CF3)})
#user system elapsed
# 10.30 0.00 10.36
all.equal(x1, x2$V1) #TRUE
all.equal(x1, x3) #TRUE
I also tried to implement Ryan Thompson's function chunkapply from the SO link above (I only removed the doMC part, because I couldn't install it; his example still works after that change), but I didn't get it to work. However, since it uses foreach, I thought the same arguments mentioned above apply, so I didn't try for too long.
#chunkapply(get_uniroot, list(P=df$P, B0=df$B0, CF1=df$CF1, CF2=df$CF2, CF3=df$CF3))
#Error in { : task 1 failed - "invalid function value in 'zeroin'"
PS: I know that I could just increase tol to reduce the number of steps that are necessary to find a uniroot. However, I already set tol as big as possible.
I'd use the parallel package that has been built into R since version 2.14 and work with matrices. You could then simply use mclapply like this:
library(parallel)
dfm <- as.matrix(df)
result <- mclapply(seq_len(nrow(dfm)),
function(x) do.call(get_uniroot,as.list(dfm[x,])),
mc.cores=4L
)
unlist(result)
This is basically doing the same mapply does, but in a parallel way.
But...
Mind you, parallelization always carries some overhead as well. As I explained in the question you link to, going parallel only pays off if your inner function takes significantly longer than the overhead involved. In your case, your uniroot function works pretty fast. You might then consider cutting your data frame into bigger chunks and combining mapply and mclapply. A possible way to do this is:
library(parallel)
ncores <- 4
# cut points nrow, ..., nrow/4, 0 (in decreasing order)
id <- floor(
  quantile(0:nrow(df),
           1-(0:ncores)/ncores
  )
)
# each row of idm holds the (start - 1, end) pair of one chunk
idm <- embed(id,2)
mapply_uniroot <- function(id){
tmp <- df[(id[1]+1):id[2],]
mapply(get_uniroot, tmp$P, tmp$B0, tmp$CF1, tmp$CF2, tmp$CF3)
}
result <-mclapply(nrow(idm):1,
function(x) mapply_uniroot(idm[x,]),
mc.cores=ncores)
final <- unlist(result)
This might need some tweaking, but it essentially breaks your df into exactly as many pieces as there are cores and runs mapply on each core. To show that this works:
> x1 <- mapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3)
> all.equal(final,x1)
[1] TRUE
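A slightly simpler way to get one contiguous chunk per core is `parallel::splitIndices()`, which does the same job as the `quantile()`/`embed()` construction above. A sketch reusing the question's setup with a smaller n and a fixed seed; it falls back to one core on Windows, because `mclapply` relies on forking.

```r
library(parallel)

# Same setup as in the question (smaller n, fixed seed).
set.seed(1)
n <- 1000
df <- data.frame(P = rnorm(n, mean = 100, sd = 10),
                 B0 = rnorm(n, mean = 40, sd = 5),
                 CF1 = rnorm(n, mean = 30, sd = 10),
                 CF2 = rnorm(n, mean = 30, sd = 5),
                 CF3 = rnorm(n, mean = 90, sd = 8))
get_uniroot <- function(P, B0, CF1, CF2, CF3) {
  uniroot(function(x) {-P + B0 + CF1/x + CF2/x^2 + CF3/x^3},
          lower = 1, upper = 10, tol = 0.00001)$root
}

# mclapply forks, which Windows does not support, so use 1 core there.
ncores <- if (.Platform$OS.type == "windows") 1L else 2L
# splitIndices() returns ncores contiguous blocks of row indices, in order
chunks <- splitIndices(nrow(df), ncores)
res <- mclapply(chunks, function(ii) {
  with(df[ii, ], mapply(get_uniroot, P, B0, CF1, CF2, CF3))
}, mc.cores = ncores)
final <- unlist(res)
```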
It's an old topic, but FYI: there is now parallel::mcmapply (see its documentation). Don't forget to set mc.cores; I usually use mc.cores=parallel::detectCores()-1 to leave one CPU free for OS operations.
x4 <- mcmapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3,mc.cores=parallel::detectCores()-1)
This isn't exactly a best-practice suggestion, but a considerable speed-up can be had by finding the roots for all parameter sets in a 'vectorized' fashion. For instance,
bisect <-
function(f, interval, ..., lower=min(interval), upper=max(interval),
f.lower=f(lower, ...), f.upper=f(upper, ...), maxiter=20)
{
nrow <- length(f.lower)
bounds <- matrix(c(lower, upper), nrow, 2, byrow=TRUE)
for (i in seq_len(maxiter)) {
## move lower or upper bound to mid-point, preserving opposite signs
mid <- rowSums(bounds) / 2
updt <- ifelse(f(mid, ...) > 0, 0L, nrow) + seq_len(nrow)
bounds[updt] <- mid
}
rowSums(bounds) / 2
}
and then
> system.time(x2 <- with(df, {
+     PB0 <- B0 - P   # P and B0 enter the equation only as B0 - P
+     f <- function(x, PB0, CF1, CF2, CF3)
+         PB0 + CF1/x + CF2/x^2 + CF3/x^3
+     bisect(f, c(1, 10), PB0, CF1, CF2, CF3)
+ }))
user system elapsed
0.180 0.000 0.181
> range(x1 - x2)
[1] -6.282406e-06 6.658593e-06
versus about 1.3 s for applying uniroot to each row separately. This also combined P and B0 into a single value ahead of time, since that is how they enter the equation.
The bounds on the final value are roughly +/- diff(interval) * (.5 ^ maxiter). A fancier implementation would replace bisection with linear or quadratic interpolation (as in the reference cited in ?uniroot), but then uniform, efficient convergence (and error handling in all cases) would be trickier to arrange.
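As a quick sanity check on that bound, you can invert it to get the number of iterations needed for a given tolerance. With the interval c(1, 10) and the tol = 0.00001 used in the question, it gives 20, which is the default maxiter in bisect() above; `bisect_iters` is a hypothetical helper.

```r
# Smallest maxiter such that diff(interval) * 0.5^maxiter <= tol
bisect_iters <- function(interval, tol) {
  ceiling(log2(diff(range(interval)) / tol))
}

bisect_iters(c(1, 10), 1e-5)   # 20
diff(c(1, 10)) * 0.5^20        # about 8.6e-06
```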
