Condensing a for-loop in R - r

I have the following code in R
library(mvtnorm)
m = matrix(rnorm(2000000),nrow=200)
A = matrix(rnorm(40000),ncol=200)
A = A%*%t(A)
C = array(A,c(200,200,10000))
B = 10000
S = 100
postpred = array(NA,c(200,S,B))
for(i in 1:B){
postpred[,,i] = t(rmvnorm(S,m[,i],C[,,i],method="svd"))
}
but this code is extremely slow because I have to loop 10,000 times while also simulating from the multivariate normal 100 times and m and C can be very large as well. So what I would like to do is be able to calculate postpred outside of a loop. I have tried using the apply function but to no avail. Any help or suggestions greatly appreciated.

Others have pointed out that apply (and similar functions) won't help you much in your case, and they are right.
For what it is worth, I checked whether you would gain any performance by compiling your code. Here is a little benchmark that I made with your problem (I reduced the size of the matrices, because otherwise I cannot run them):
library(mvtnorm)
func = function()
{
  m = matrix(rnorm(200000),nrow=100)
  A = matrix(rnorm(10000),ncol=100)
  A = A%*%t(A)
  C = array(A,c(100,100,1000))
  B = 1000
  S = 10
  postpred = array(NA,c(100,S,B)) # first dimension must match nrow(m)
  for(i in 1:B){
    postpred[,,i] = t(rmvnorm(S,m[,i],C[,,i],method="svd"))
  }
}
require(compiler)
func_compiled <- cmpfun(func)
require(microbenchmark)
microbenchmark(func_compiled(), func(), times=10) # grab a coffee, this takes some time
The results show that compiling won't give you any advantage:
Unit: seconds
expr min lq median uq max neval
slow_func_compiled() 9.938632 10.12269 10.18237 10.48215 15.43299 10
slow_func() 9.969320 10.07676 10.21916 15.44664 15.66109 10
(this could have been expected, as the library mvtnorm should be already compiled)
Overall, you have only two ways left to optimize your code in R:
use smaller numbers (if acceptable)
parallelize your code

As Josillber says, vectorisation (the apply family of functions) ain't going to do much for you; it really is a bit of an R myth that it gives significant speed improvements.
I suggest you look at the parallel options: there is mclapply in the parallel package, and there is the snow package. Read more here http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
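One more option follows from the example as posted: array(A, c(200,200,10000)) recycles A, so every slice C[,,i] is the same matrix. If that also holds for the real data (an assumption, the question says C can be large but not whether it varies), the covariance can be factored once and all draws made with a single matrix product. The sketch below uses only base R and a Cholesky factor instead of mvtnorm's method="svd"; the small sizes d, B and S are illustrative.

```r
set.seed(123)
d <- 4; B <- 6; S <- 5                  # small illustrative sizes

m <- matrix(rnorm(d * B), nrow = d)     # one mean vector per column
A <- matrix(rnorm(d * d), ncol = d)
A <- A %*% t(A)                         # shared covariance (assumed identical for all i)

L <- t(chol(A))                         # lower factor, L %*% t(L) equals A
Z <- matrix(rnorm(d * S * B), nrow = d) # standard normal draws, one column per sample
noise <- L %*% Z                        # every column is N(0, A)

# Add the matching mean to each block of S columns, then reshape to d x S x B
postpred <- array(noise + m[, rep(seq_len(B), each = S)], c(d, S, B))
```

If the covariance really differs per slice, the factorisation has to be repeated for each i and this shortcut does not apply.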

Related

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000)
{
x<-rnorm(n)
reg2<-lm(x~data$Y)
Max[i]<-max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test over 10^6 or 10^9 iterations values for precision issues) and the size of Y.
Having read the following Stack topic, a very first attempt was to try optimizing the whole thing using parallel regression (with calm()):
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000)
{
x<-rnorm(n)
reg2 <- calm(cls, "x ~ test$Y, data = test")
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended up much slower (by a factor of 6) than the original, non-parallel loop. My assumption is that threads are created and destroyed in each loop iteration, which slows the process down a lot in R.
A second attempt was to use .lm.fit(), according to this Stack topic:
for(i in 1:10000)
{
x<- rnorm(n)
reg2<- .lm.fit(as.matrix(x), data$Y)
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
It resulted in much faster processing than the original version, so that we now have .lm.fit() < lm() < calm() in terms of overall processing time.
However, we are still looking for options to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
#Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)
for(i in 1 : 100)
{
reg <- lm(x[i, ] ~ sample$AGE)
Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we made several tests and noted the following:
Following @Karo's contribution, we generated the matrix of normal samples outside the loop to save some execution time. We expected a noticeable impact, but our test runs indicate that doing so produces the opposite result (i.e. a longer execution time). Maybe the effect reverses when increasing the number of simulations.
Following @BenBolker's suggestion, we also tested fastLm(), and it reduces the execution time, but the results seem to differ (by a factor of 0.05) from those of the typical lm().
We are still struggling with effectively reducing the execution time. Following @Karo's suggestions, we will try to pass a vector directly to lm() and investigate parallelization (which failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
testfun(lm.fit),
testfun(fastLmPure),
times=1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
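A minimal sketch of the matrix-response idea, assuming a fixed design matrix x and 20 simulated responses stored as the columns of Y (the sizes and the names fit_all, res_all and stat_all are illustrative, not from the question):

```r
set.seed(1)
n <- 50
x <- cbind(1, rnorm(n))            # fixed design: intercept plus one predictor
Y <- matrix(rnorm(n * 20), n)      # 20 independent responses, one per column

fit_all <- .lm.fit(x, Y)           # one call fits all 20 regressions
res_all <- fit_all$residuals       # n x 20 matrix, one column per response

# Same statistic as testfun(), computed column by column
stat_all <- apply(res_all, 2, function(r) max(abs(r)) / sd(r))

# Cross-check the first column against a separate single-response fit
fit_one <- .lm.fit(x, Y[, 1])
```

Each column of res_all should match what a separate call on that column returns, so only the QR decomposition of x is shared.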
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
for(i in 1:10000000)
{
x <- rnorm(10)
}
})
duration <- system.time({
m <- matrix(rnorm(10000000*10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row, with the same number of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time: the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind, you probably should, and most likely can, avoid the loop altogether; you can for instance pass a vector into the lm function. If you still need to be faster after that, you can definitely parallelise in a number of ways; it would be easier to suggest how with a working example of the data.
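As a rough sketch of what avoiding the loop could look like, assume the model used with .lm.fit(as.matrix(x), Y) elsewhere in this thread: a single predictor and no intercept. In that special case the slope has a closed form, so all simulated predictors can be handled at once as columns of a matrix X. The sizes and names below are illustrative, and the shortcut does not carry over to rstudent() or to models with an intercept without further work.

```r
set.seed(2)
n <- 100; nsim <- 200
Y <- rnorm(n)                          # fixed response
X <- matrix(rnorm(n * nsim), nrow = n) # one simulated predictor per column

# No-intercept least squares slope for every column at once
beta <- colSums(X * Y) / colSums(X^2)

# Residuals of all regressions: column j is Y - X[, j] * beta[j]
R <- Y - sweep(X, 2, beta, "*")

# max |residual| / sd(residual) for each simulated predictor
stat <- apply(abs(R), 2, max) / apply(R, 2, sd)

# Check the first column against .lm.fit
r1 <- .lm.fit(X[, 1, drop = FALSE], Y)$residuals
check <- max(abs(r1)) / sd(r1)
```

The whole X matrix has to fit in memory, so for very large numbers of simulations it may be worth processing it in blocks.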

How to use doParallel for calculating distance between zipcodes in R?

I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the package geosphere to calculate Vincenty Ellipsoid distance between the zip codes but it is taking a massive amount of time for my dataset. What can be a fast way to implement this?
What I tried
library(tidyverse)
library(geosphere)
zipdata <- select(fulldata,originlat,originlong,destlat,destlong)
## Very basic approach
for(i in seq_len(nrow(zipdata))){
zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
c(zipdata$destlat[i],zipdata$destlong[i]),
fun=distVincentyEllipsoid)
}
## Tidyverse approach
zipdata <- zipdata%>%
mutate(dist2 = distm(cbind(originlat,originlong), cbind(destlat,destlong),
fun = distHaversine))
Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on a smaller test data without any luck,
library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)
test <- select(head(fulldata,n=1000),originlat,originlong,destlat,destlong)
foreach(i = seq_len(nrow(test))) %dopar% {
library(geosphere)
zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
c(zipdata$destlat[i],zipdata$destlong[i]),
fun=distVincentyEllipsoid)
}
stopCluster(cl)
Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?
Edit: Benchmarks from (some) replies
## benchmark
library(microbenchmark)
zipsamp <- sample_n(zip,size=1000000)
microbenchmark(
dave = {
# Dave2e
zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong,zipsamp$patlat),
cbind(zipsamp$faclong,zipsamp$faclat))
},
geohav = {
zipsamp$dist2 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
paired = T,measure = "haversine")
},
geovin = {
zipsamp$dist3 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
paired = T,measure = "vincenty")
},
geocheap = {
zipsamp$dist4 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
paired = T,measure = "cheap")
}
,unit = "s",times = 100)
# Unit: seconds
# expr min lq mean median uq max neval cld
# dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556 100 d
# geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521 100 b
# geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889 100 c
# geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961 100 a
A simple all.equal test showed that for my dataset the haversine method is equal to the vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.
R is a vectorized language, thus the function will operate over all of the elements in the vectors. Since you are calculating the distance between the original and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000x the performance of the loop.
Also, using distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should improve the performance.
Without any sample data this snippet is untested.
library(geosphere)
zipdata <- select(fulldata,originlat,originlong,destlat,destlong)
## Very basic approach
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
cbind(zipdata$destlong, zipdata$destlat))
Note: For most of the geosphere functions to work correctly, the proper order is: longitude first then latitude.
The reason the tidyverse approach listed above is slow is the distm function is calculating the distance between every origin and destination which would result in a 2 million by 2 million element matrix.
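To make the difference between paired and all-pairs distances concrete without needing geosphere, here is a small base-R sketch. haversine_m is a hypothetical helper written for this illustration (it is not a geosphere function), and the coordinates are made up; the radius default is the one geosphere uses for distHaversine.

```r
# Hypothetical helper: great-circle distance in metres, vectorised over its inputs
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

origin_lon <- c(-73.99, -87.63, -118.24)
origin_lat <- c(40.73, 41.88, 34.05)
dest_lon   <- c(-71.06, -95.37, -122.42)
dest_lat   <- c(42.36, 29.76, 37.77)

# Paired: one distance per row, n values in total
paired <- haversine_m(origin_lon, origin_lat, dest_lon, dest_lat)

# All pairs (what distm computes): an n x n matrix, of which only the diagonal is wanted
all_pairs <- outer(seq_along(origin_lon), seq_along(dest_lon),
                   function(i, j) haversine_m(origin_lon[i], origin_lat[i],
                                              dest_lon[j], dest_lat[j]))
```

For n rows the paired call does n evaluations while the all-pairs call does n squared, which is why the distm version cannot finish on millions of rows.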
I used @SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my datasets. I found it to be significantly faster than the geosphere package in every test (I have added one of them to my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances below 100 km. See the geodist vignette for more information. Given that some of my distances were greater than 100 km, I settled on the Vincenty ellipsoid measure.
If you are going to use geosphere, I would use either a fast approximate method like distHaversine or the still fast and very precise distGeo method. (The distVincenty* functions are mainly implemented for curiosity.)

(local/nodal and global) Efficiency using igraph shortest.paths function

I am trying to calculate local efficiency of a graph using shortest.paths of igraph package.
The local efficiency of a vertex v, by definition, is the "global efficiency" computed among all direct neighbors of v (Latora & Marchiori, 2001).
I came up with the code below for global and local efficiency. However, the latter is including the target vertex in its calculation. And in the paper above they say the target vertex has to be taken out.
#Global Efficiency (average inverse shortest paths between all u--v vertices)
eff<-1/(shortest.paths(my.graph))
eff[!is.finite(eff)]<-0
gl.eff<-mean(eff,na.rm=TRUE)
#Mean local efficiency (global efficiency for each node)
gn<-graph.neighborhood(my.graph,1) #list with subgraphs of directly connected graphs
names(gn)<-colnames(my.corr.matrix)
local.eff<-numeric(length(gn))
for (i in 1:length(gn)){
gn[[i]]<-gn[[i]] - vertex(V(gn[[i]])[grep(names(gn[i]),V(gn[[i]]))]) #doesn't match
eff.gn<-1/(shortest.paths(gn[[i]]))
eff.gn[!is.finite(eff.gn)]<-0
eff.gn<-mean(eff.gn,na.rm=TRUE)
local.eff[i]<-eff.gn
}
mean.local.eff<-mean(local.eff, na.rm=TRUE)
I am trying to match the list name (each element of the list is a subgraph) with the name of the vertex inside that subgraph. I am trying to use 'grep()', but haven't been able to get it right. Could someone give me a hand on that?
Thanks in advance,
I have already written a function to do this that is many times faster than what you've written. See if the following will suit your needs. For smaller graphs (or if you are using Windows), you will possibly want to replace simplify2array(mclapply(nodes, with sapply(nodes,, and then of course remove the argument mc.cores=detectCores(). However this really helps performance on large graphs.
You can see the code at the following link:
Local efficiency code
EDIT: Included some benchmark info (where the function f is yours, and g is what I pasted above). This was done on a laptop with 4 cores @ 2.10 GHz (Intel i3-2310m).
g.rand <- sample_gnp(100, .1)
V(g.rand)$degree <- degree(g.rand)
compare <- microbenchmark(f(g.rand), g(g.rand), times=1e2)
compare
Unit: milliseconds
expr min lq mean median uq max neval cld
f(g.rand) 476.9853 4097.2202 4544.720 4539.911 4895.020 9346.873 100 b
g(g.rand) 299.2696 329.6629 1319.377 1114.054 2314.304 3003.966 100 a
In case that someone needs the local efficiency in python, here is my code for that:
Python version
import numpy as np
from igraph import *
np.seterr(divide='ignore')

def nodal_eff(g):
    """
    This function calculates the nodal efficiency of a weighted graph object.
    Created by: Loukas Serafeim (seralouk), Nov 2017

    Args:
        g: A igraph Graph() object.
    Returns:
        The nodal efficiency of each node of the graph
    """
    sp = g.shortest_paths_dijkstra(weights = g.es["weight"])
    sp = np.asarray(sp)
    temp = 1.0 / sp
    np.fill_diagonal(temp, 0)
    N = temp.shape[0]
    ne = ( 1.0 / (N - 1)) * np.apply_along_axis(sum, 0, temp)
    return ne
I know that my title doesn't fully account for what my question really was. So I'm answering it myself since I just got it to work.
#Mean local efficiency (global efficiency for each node)
gn<-graph.neighborhood(my.graph,1) #list with subgraphs of directly connected graphs
names(gn)<-V(my.graph)$name
local.eff<-numeric(length(gn))
for (i in 1:length(gn)){
gn[[i]]<-gn[[i]] - vertex(V(gn[[i]])[match(names(gn[i]),V(gn[[i]])$name)]) #MATCHES
eff.gn<-1/(shortest.paths(gn[[i]]))
eff.gn[!is.finite(eff.gn)]<-0
eff.gn<-mean(eff.gn,na.rm=TRUE)
local.eff[i]<-eff.gn
}
local.eff[local.eff %in% NaN]<-NA
mean.local.eff<-mean(local.eff, na.rm=TRUE)
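One detail worth checking in the R versions above: mean(eff) averages over all n^2 entries of the inverse-distance matrix, including the zeroed diagonal, while the Python version divides by N - 1, i.e. it averages over the other vertices only. The sketch below works on a plain distance matrix such as the one shortest.paths() returns; efficiency_from_distances is a hypothetical helper name, and the 3-vertex path graph is a made-up example.

```r
# Hypothetical helper: efficiency from a matrix of shortest-path lengths
efficiency_from_distances <- function(d) {
  n <- nrow(d)
  if (n < 2) return(NA_real_)       # undefined for fewer than two vertices
  inv <- 1 / d
  inv[!is.finite(inv)] <- 0         # unreachable pairs contribute 0
  diag(inv) <- 0                    # a vertex is not paired with itself
  sum(inv) / (n * (n - 1))          # average over ordered pairs u != v
}

# Shortest-path lengths for the path graph a - b - c
d <- matrix(c(0, 1, 2,
              1, 0, 1,
              2, 1, 0), nrow = 3, byrow = TRUE)

eff_pairs <- efficiency_from_distances(d)   # 5/6

# What mean() over the full matrix gives instead
inv_all <- 1 / d
inv_all[!is.finite(inv_all)] <- 0
eff_mean <- mean(inv_all)                   # 5/9
```

The two versions differ by the factor (n - 1) / n, which matters most for the small neighbourhood subgraphs used in local efficiency.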

How could I speed up my R code? [closed]

Disclaimer
Hello everyone! I recently started programming in R. My code works just fine, but some of it takes way too long to be put to good use. I hope someone can help me make this code run faster, either by optimising the code or by using one of the multicore packages.
About the code
I have large datasets containing about 15000 numeric values each. The code takes two parameters (p, n), where p >= n, and makes subsets of the data. It applies the zyp.yuepilon function (from the zyp package) to each row of the subsets. Then the parameter n is used to apply the same function to an n-sized subset.
Problem is I run this code in a nested for loop: p in 10:40 and n in 10:40 so it takes an eternity to get the results, and it's just one dataset among many others.
sp <- function(p, n){
library(zyp)
data <- runif(15000, 1, 4)
lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
upper <- lower + p - 1
subsets <- matrix(nrow=length(lower), ncol=p)
for(j in 1:length(lower)){
subsets[j, ] = data[lower[j] : upper[j]]
}
ret <- apply(subsets, 1, zyp.yuepilon)
subset_n <- subsets[, 1:n]
ret2 <- apply(subset_n, 1, zyp.yuepilon)
return(list(ret, ret2))
}
Benchmark results in seconds:
expr min lq median uq max neval
sp(7, 6) 92.77266 94.24901 94.53346 95.10363 95.64914 10
Here is a series of comments, rather than an answer.
Looking at the zyp.yuepilon function body (print it by calling the function name without parentheses in an R session), you can see that this function and the function zyp.sen are written in plain R code (as opposed to compiled code).
The biggest speed-up is likely to come from the Rcpp package, which facilitates calling (compiled) C++ code from within R. In fact, there is a small linear model example here: Fast LM model using Rcpp/RcppArmadillo.
I would be inclined to rewrite the two functions zyp.yuepilon and zyp.sen in C++ using Rcpp, including the loop over subset vectors (for which you are currently using apply).
For general R speed-up issues see this question R loop performance, as well as the R package plyr, which may provide an entry point for taking a map-reduce type of approach to your problem.
If you want to steer clear of C++, then a series of micro-optimisations would be your quickest win. To speed up the apply aspect of your code, you could use something like this
library(doParallel)
library(parallel)
library(foreach)
library(zyp)
cl<-makeCluster(4)
registerDoParallel(cl)
sp_1<-function(p=7, n=6){
N_ob=15000;
off_set=81;
N_ob_o=N_ob-off_set;
am<-matrix(runif(N_ob*p),ncol=p);
subsets<-am[-(1:off_set),];
ret=matrix(unlist( foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subsets[i,]),use.names=FALSE),nrow=11);
subset_n <- subsets[, 1:n]
ret2=matrix(unlist( foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subset_n[i,]),use.names=FALSE),nrow=11);
return(list(ret, ret2))
}
sp<-function(p=7, n=6){
data <- runif(15000, 1, 4)
lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
upper <- lower + p - 1
subsets <- matrix(nrow=length(lower), ncol=p)
for(j in 1:length(lower)){
subsets[j, ] = data[lower[j] : upper[j]]
}
ret <- apply(subsets, 1, zyp.yuepilon)
subset_n <- subsets[, 1:n]
ret2 <- apply(subset_n, 1, zyp.yuepilon)
return(list(ret, ret2))
}
system.time(sp_1())
system.time(sp())
This gives me a speed-up of around a factor of 2, but this will depend on your platform, etc. Check out the help files for the functions and packages above, and tune the number of workers passed to makeCluster to see what works best for your platform (in the absence of any information about your particular set-up).
Another route might be to make use of the byte-code compiler via library(compiler), to see if the various functions can be optimised this way.
library(compiler)
enableJIT(3);
zyp_comp<-cmpfun(zyp.yuepilon);
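A further micro-optimisation, not taken from the answers above: the loop in sp() that fills subsets row by row can be replaced by one indexed lookup. The sketch below uses base R only and builds the same sliding windows from a matrix of indices; p = 12 is an arbitrary illustrative value, and the zyp calls are left out so that it runs without the package.

```r
set.seed(42)
data <- runif(15000, 1, 4)
p <- 12                                   # illustrative window length

lower <- seq(80 - p + 1, by = 1, length.out = length(data) - 81)
upper <- lower + p - 1

# Original approach: fill the matrix one row at a time
subsets_loop <- matrix(nrow = length(lower), ncol = p)
for (j in seq_along(lower)) {
  subsets_loop[j, ] <- data[lower[j]:upper[j]]
}

# Vectorised approach: row j of idx holds lower[j], lower[j] + 1, ..., upper[j]
idx <- outer(lower, 0:(p - 1), "+")
subsets_vec <- matrix(data[idx], nrow = length(lower), ncol = p)
```

This only removes the cost of building subsets; the time spent inside zyp.yuepilon itself is unaffected and is likely to remain the dominant part.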

Running multiple regressions is there a smart way to vectorize

I have written what I believe to be a semi-quick ols-regression function
ols32 <- function (y, x,Ridge=1.1) {
xrd<-crossprod(x)
xry<-crossprod(x, y)
diag(xrd)<-Ridge*diag(xrd)
solve(xrd,xry)
}
Now I want to apply this to the following
(vapply(1:la, function(J)
ME %*% ols32((nza[,J]),(cbind(nzaf1[,J],nzaf2[,J],nza[,-J],MOMF)))
[(la+2):(2*la+1)],FUN.VALUE=0))
Where nza, nzaf1, nzaf2 and MOMF are 500x50 matrices, la=50, and ME is a vector of length 50.
So what I actually do is I do a regression but only use the beta-coefficients from MOMF which I multiply by ME.
nza.mat<-matrix(rnorm(500*200),ncol=200)
nza<-nza.mat[,1:50]
nzaf2<-nza.mat[,101:150]
MOMF<-nza.mat[,151:200]
nzaf1<-nza.mat[,51:100]
ME<-nza.mat[500,151:200]
Is there an immediate way of making things faster, or do I need to use something like RcppEigen?
Tks P
So I came up with a slightly faster way of solving this by rewriting my ols-function so that it calculates the two crossproducts only once for a whole matrix. The new function looks like this:
ols.quick <- function (y, x, ME, Ridge=1.1) {
la<-dim(y)[2]
XX.cross<-crossprod(x)
XY.cross<-crossprod(x, y)
diag(XX.cross)<-Ridge*diag(XX.cross)
betas<-sapply(1:la, function(J){
idx<-c((1:la)[-J],la+J,2*la+J,(3*la+1):(4*la));
solve(XX.cross[idx,idx],XY.cross[idx,J])},simplify=T)
ME%*%betas[(la+2):(2*la+1),]
}
where
y=nza (500x50) and x=cbind(nza,nzaf1,nzaf2,MOMF) (500x200)
This solves the problem about 3.5 times faster.
microbenchmark(ols.quick(nza,nza.mat,ME),
vapply(1:la, function(J) ME%*%ols32(nza[,J],(cbind(nzaf1[,J],nzaf2[,J],nza[,-J],MOMF)))
[(la+2):(2*la+1)],FUN.VALUE=0))
min lq median uq max neval
66.30495 67.71903 68.57001 70.17742 77.80069 100
251.59395 255.43306 258.35041 262.85742 296.51313 100
I suppose I could gain some speed with parLapply from the parallel package, but I haven't looked into that yet.
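The reason the rewrite works is that the cross-products for any subset of columns are just a sub-block of the full cross-products, so they never need to be recomputed. A small sketch of that identity, with made-up sizes and without the ridge adjustment (all names here are illustrative):

```r
set.seed(7)
x <- matrix(rnorm(500 * 8), ncol = 8)   # full set of candidate regressors
y <- rnorm(500)
idx <- c(1, 3, 4, 8)                    # columns used in one particular regression

XX <- crossprod(x)                      # computed once: t(x) %*% x
Xy <- crossprod(x, y)                   # computed once: t(x) %*% y

# Coefficients from the precomputed blocks
beta_sub <- solve(XX[idx, idx], Xy[idx])

# Reference: recompute the cross-products from the selected columns
beta_ref <- solve(crossprod(x[, idx]), crossprod(x[, idx], y))
```

With the ridge factor, the diagonal scaling has to be applied to XX before subsetting (as ols.quick does), which gives the same result as scaling the diagonal of each sub-block.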

Resources