Speed up Simulation in R with Code Optimization

The generic version of what I am trying to do is to conduct a simulation study where I manipulate a few variables to see how that impacts a result. I'm having some speed issues with R. The latest simulation worked with a few iterations (10 per experiment). However, when I moved to a large-scale version (10k per experiment), the simulation has now been running for 14 hours (and is still running).
Below is the code (with comments) that I am running. Being a rookie with R, I am struggling to optimize the simulation to be efficient. My hope is to learn from the comments and suggestions provided here to optimize this code and to apply these suggestions to future simulation studies.
Let me say a few things about what this code is supposed to do. I am manipulating two variables: effect size and sample size. Each combination is run 10k times (i.e., 10k experiments per condition). I initialize a data frame to store my results (called Results). I loop over three variables: Effect size, sample size, and iterations (10k).
Within the loops, I initialize four NULL components: p.test, p.rep, d.test, and d.rep. The former two capture the p-value of the initial t-test and the p-value of the replication (replicated under similar conditions). The latter two calculate the effect size (Cohen's d).
I generate my random data from a standard normal for the control condition (DVcontrol), and I use my effect size as the mean for the experimental condition (DVexperiment). I take the difference between the values and throw the result into the t-test function in R (paired-samples t-test). I store the results in a list called Trial and rbind it to the Results data frame. This process is repeated 10k times until completion.
# Set Simulation Parameters
## Effect Sizes (ES is equal to mean difference when SD equals Variance equals 1)
effect_size_range <- seq(0, 2, .1) ## ES
## Sample Sizes
sample_size_range <- seq(10, 1000, 10) ## SS
## Iterations for each ES-SS Combination
iter <- 10000
# Initialize the Vector of Results
Results <- data.frame()
# Set Random Seed
set.seed(12)
# Loop over the Different ESs
for(ES in effect_size_range) {
  # Loop over the Different Sample Sizes
  for(SS in sample_size_range) {
    # Create p-value Vectors
    p.test <- NULL
    p.rep <- NULL
    d.test <- NULL
    d.rep <- NULL
    # Loop over the iterations
    for(i in 1:iter) {
      # Generate Test Data
      DVcontrol <- rnorm(SS, mean=0, sd=1)
      DVexperiment <- rnorm(SS, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.test[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)
      # Generate Replication Data
      DVcontrol <- rnorm(iter, mean=0, sd=1)
      DVexperiment <- rnorm(iter, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.rep[i] <- mean(DVdiff) / sd(DVdiff)
    }
    # Results
    Trial <- list(ES=ES, SS=SS,
                  d.test=mean(d.test), d.rep=mean(d.rep),
                  p.test=mean(p.test), p.rep=mean(p.rep),
                  r=cor(p.test, p.rep, method="kendall"),
                  r.log=cor(log2(p.test)*(-1), log2(p.rep)*(-1), method="kendall"))
    Results <- rbind(Results, Trial)
  }
}
Thanks in advance for your comments and suggestions,
Josh

The general approach to optimization is to run a profiler to determine what portion of the code the interpreter spends the most time in, and then to optimize that portion. Let's say your code resides in a file called test.R. In R, you can profile it by running the following sequence of commands:
Rprof() ## Start the profiler
source( "test.R" ) ## Run the code
Rprof( NULL ) ## Stop the profiler
summaryRprof() ## Display the results
(Note that these commands will generate a file Rprof.out in the directory of your R session.)
If we run the profiler on your code (with iter <- 10, rather than iter <- 10000), we get the following profile:
# $by.self
#                  self.time self.pct total.time total.pct
# "rnorm"               1.56    24.53       1.56     24.53
# "t.test.default"      0.66    10.38       2.74     43.08
# "stopifnot"           0.32     5.03       0.86     13.52
# "rbind"               0.32     5.03       0.52      8.18
# "pmatch"              0.30     4.72       0.34      5.35
# "mean"                0.26     4.09       0.42      6.60
# "var"                 0.24     3.77       1.38     21.70
From here, we observe that rnorm and t.test are your most expensive operations (shouldn't really be a surprise as these are in your inner-most loop).
Once you have figured out where the expensive function calls are, the actual optimization consists of two steps:
Optimize the function, and/or
Optimize the number of times the function is called.
Since t.test and rnorm are built-in R functions, your only option for Step 1 above is to look for alternative packages that may have faster implementations of sampling from the normal distribution and/or running multiple t tests. Step 2 is really about restructuring your code in a way that does not recompute the same thing multiple times. For example, the following lines of code do not depend on i:
# Generate Test Data
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)
Does it make sense to move these outside the loop, or do you really need a new sample of your test data for each different value of i?
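To give a concrete (and hedged) illustration of Step 2: even if you do need a fresh sample for every i, you can still cut the number of rnorm and t.test calls dramatically by drawing all iter samples for one ES-SS condition at once and computing the one-sided t-test p-values column-wise. This is a sketch under that assumption, not a drop-in replacement for your exact design:
fast_condition <- function(ES, SS, iter) {
  ## each column is one simulated experiment of paired differences;
  ## the difference of independent N(0,1) and N(ES,1) draws has mean ES and sd sqrt(2)
  DVdiff <- matrix(rnorm(SS * iter, mean = ES, sd = sqrt(2)), nrow = SS)
  m <- colMeans(DVdiff)
  s <- apply(DVdiff, 2, sd)
  t.stat <- m / (s / sqrt(SS))
  ## one-sided p-values, equivalent to t.test(DVdiff, alternative="greater")$p.value
  p <- pt(t.stat, df = SS - 1, lower.tail = FALSE)
  d <- m / s  ## Cohen's d, as in your loop
  list(p = p, d = d)
}
Calling this once per ES-SS combination replaces iter separate calls to rnorm() and t.test() with a single call to rnorm() and a handful of vectorized operations; the same idea applies to the replication samples.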

Related

Data splitting: ordinal grouped data with custom probability of outcomes

The createDataPartition function in caret can split data while preserving the relative distribution of the outcome ratings. I am looking for something similar, but that can preserve groups and handle ordinal data.
I am trying to specify the target distribution of my outcomes: I want to preserve groups and see which groups I should conduct a follow-up experiment with in order to reach a target distribution (rather than simply the current distribution). I have written code that attempts this in a very blunt way:
# Load data
library(rethinking)
data(Trolley)
d <- Trolley
# Inspect current distribution of ratings
d$response <- factor(d$response)
round(summary(d$response)/dim(d)[1],2)
# Find 5 cases that roughly have my target distribution
targetdist <- c(0.3,0.1,0.1,0.1,0.1,0.1,0.1) # Arbitrary goal
# Unique cases
uniqcase <- unique(d$case)
# Poor method
runs <- 100
difmatrix <- matrix(NA,runs,2)
for(i in 1:runs){
  # Take subset
  difmatrix[i,1] <- i
  set.seed(i)
  casetests <- sample(uniqcase, 5)
  datasub <- subset(d, case %in% casetests)
  # Find ratings of subset
  difmatrix[i,2] <- sum(abs(round(summary(datasub$response)/dim(datasub)[1],2)-targetdist))
}
difmatrix[which.min(difmatrix[,2]),]
# Look at best distribution
set.seed(which.min(difmatrix[,2]))
casetests<- sample(uniqcase,5)
datasub <- subset(d, case %in% casetests)
round(summary(datasub$response)/dim(datasub)[1],2) # Current best distribution
In this toy example, the overall distribution in the data is:
0.13 0.09 0.11 0.23 0.15 0.15 0.15
I aim for a distribution of
0.3 0.1 0.1 0.1 0.1 0.1 0.1, and the best subset I find is:
0.21 0.12 0.12 0.22 0.12 0.11 0.10
I cannot help but think there is a better way to do it. For my actual case, I want to select about 200 cases from a group of 10,000, so it seems unlikely that I can stumble on a good choice by chance.
Thanks for reading. I hope this makes sense; I have been working on it for a while, yet I still have trouble formulating it concisely.

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000) {
  x <- rnorm(n)
  reg2 <- lm(x ~ data$Y)
  Max[i] <- max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test over 10^6 or even 10^9 iterations for precision reasons) and the size of Y.
Having read the following Stack topic, a first attempt was to optimize the whole loop using parallel regression (with calm()):
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000) {
  x <- rnorm(n)
  reg2 <- calm(cls, "x ~ test$Y, data = test")
  Max[i] <- max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended up much slower (by a factor of 6) than the original, unparallelized loop. My assumption is that creating and destroying the threads in each loop iteration slows the process down a lot in R.
A second attempt was to use .lm.fit() according to this Stack topic:
for(i in 1:10000) {
  x <- rnorm(n)
  reg2 <- .lm.fit(as.matrix(x), data$Y)
  Max[i] <- max(abs(reg2$residuals / sd(reg2$residuals)))
}
This resulted in much faster processing compared to the original version, such that we now have .lm.fit() < lm() < calm() in terms of overall processing time.
However, we are still looking for options to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
#Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)
for(i in 1:100) {
  reg <- lm(x[i, ] ~ sample$AGE)
  Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we made several tests and noted the following:
Following #Karo's contribution, we generated the matrix of normal samples outside the loop to save some execution time. We expected a noticeable impact, but our tests indicate that doing so produces the opposite result (i.e. a longer execution time). Maybe the effect reverses when increasing the number of simulations.
Following #BenBolker's suggestion, we also tested fastLm() and it reduces the execution time, but the results seem to differ (by a factor of 0.05) compared to the typical lm().
We are still struggling to effectively reduce the execution time. Following #Karo's suggestions, we will try to pass a vector directly to lm() and investigate parallelization (but failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- rnorm(n)
  reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
  return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
                       testfun(lm.fit),
                       testfun(fastLmPure),
                       times=1000))
Unit: milliseconds
                expr      min       lq      mean   median        uq      max
           testfun() 6.603822 8.234967  8.782436 8.332270  8.745622 82.54284
     testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
 testfun(fastLmPure) 5.964700 7.358141  7.818624 7.471030  7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
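A small sketch of that matrix-response idea, under the assumption that it is Y rather than X that varies across simulations (hypothetical data, not the poster's):
set.seed(101)
n <- 1e3
nsim <- 200
x <- rnorm(n)                              ## single, fixed predictor
Ymat <- matrix(rnorm(n * nsim), nrow = n)  ## one simulated response per column
fit <- .lm.fit(cbind(1, x), Ymat)          ## one call fits all nsim regressions
res <- fit$residuals                       ## n x nsim matrix of residuals
Max <- apply(res, 2, function(r) max(abs(r) / sd(r)))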
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000, testfun(fastLmPure), simplify=FALSE))
##    user  system elapsed
##  38.953   0.072  39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
                     mc.cores=5))
##    user  system elapsed
##  43.225   0.627  12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
  for (i in 1:10000000) {
    x <- rnorm(10)
  }
})
duration <- system.time({
  m <- matrix(rnorm(10000000 * 10), 10000000)
})
Both ways create 10 random values per iteration/matrix row, with the same number of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time: the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind, you probably should, and most likely can, avoid the loop altogether; you can, for instance, pass a vector into the lm function. If you still need to be faster after that, you can definitely parallelise in a number of ways; it would be easier to suggest how with a working example of data.
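One possible reading of that suggestion (a sketch with made-up data, not the poster's sample.txt): lm() accepts a matrix on the left-hand side, so many simulated responses can be fitted against the same predictor in a single call instead of a loop.
set.seed(1)
age <- rnorm(50, mean = 40, sd = 10)       ## hypothetical stand-in for sample$AGE
Ymat <- matrix(rnorm(50 * 20), nrow = 50)  ## 20 simulated responses, one per column
fit <- lm(Ymat ~ age)                      ## a single "mlm" fit for all 20 columns
dim(residuals(fit))                        ## 50 x 20 matrix of residuals
If you need studentized residuals specifically, check whether rstudent() supports such multi-response fits; otherwise the raw residual matrix above (as in the .lm.fit variant) may be enough.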

silhouette calculation in R for a large data

I want to calculate silhouette for cluster evaluation. There are some packages in R, for example cluster and clValid. Here is my code using cluster package:
# load the data
# a data from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header = F)
# apply kmeans
km_res <- kmeans(data, 20, iter.max = 1000,
                 nstart = 20, algorithm = "MacQueen")
# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))
# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say data with 50,000 obs; however, I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the data size is a bit larger. This might also be a problem for the Dunn index and other internal indices on large datasets.
I have 32 GB of RAM in my computer. The problem comes from calculating dist(data). I am wondering if it is possible not to calculate dist(data) in advance, but instead to compute the corresponding distances when they are required by the silhouette formula.
I appreciate your help regarding this problem and how I can calculate silhouette for large and very large datasets.
You can implement Silhouette yourself.
It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes distances twice, but at the same time the better memory efficiency may well make up for that.
It will still take a LONG time though.
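A rough sketch of what implementing it yourself could look like (my own illustration, untested at scale; it assumes a numeric matrix X, an integer cluster vector cl such as km_res$cluster, and no singleton clusters): each point's silhouette width is computed with on-the-fly Euclidean distances, so the n x n distance matrix is never materialized, at the cost of computing every distance twice.
silhouette_lowmem <- function(X, cl) {
  X <- as.matrix(X)
  tX <- t(X)                 ## p x n copy, reused for fast column-wise distances
  n <- nrow(X)
  out <- numeric(n)
  for (i in seq_len(n)) {
    d_i <- sqrt(colSums((tX - X[i, ])^2))            ## distances from point i to all points
    same <- cl == cl[i]
    a_i <- sum(d_i[same]) / (sum(same) - 1)          ## mean within-cluster distance (self excluded)
    b_i <- min(tapply(d_i[!same], cl[!same], mean))  ## mean distance to the nearest other cluster
    out[i] <- (b_i - a_i) / max(a_i, b_i)
  }
  out
}
## e.g. mean(silhouette_lowmem(data, km_res$cluster)) gives the average silhouette width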
You should consider using only a subsample (do you really need to consider all points?), as well as alternatives such as the Simplified Silhouette, in particular with k-means... You gain very little from extra data with such methods, so you may just use a subsample.
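And a minimal sketch of the subsampling idea, reusing data and km_res from the question (the subsample size of 10,000 is an arbitrary choice):
library(cluster)
set.seed(42)
idx <- sample(nrow(data), 10000)                       ## evaluate on 10k of the ~435k points
sil_sub <- silhouette(km_res$cluster[idx], dist(data[idx, ]))
mean(sil_sub[, "sil_width"])                           ## estimated average silhouette width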
Anony-Mousse's answer is perfect; subsampling in particular is very important for very large datasets due to the increase in computational cost.
Here is another solution for calculating internal measures such as the silhouette and Dunn index, using the clusterCrit R package. clusterCrit calculates clustering validation indices and does not require the entire distance matrix in advance. However, it might be slow, as Anony-Mousse discussed. See the link below for the clusterCrit documentation:
https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria
clusterCrit also calculates most of the internal measures for cluster validation.
Example:
intCriteria(as.matrix(data), km_res$cluster, c("Silhouette","Calinski_Harabasz","Dunn"))
If it is possible to calculate the silhouette index without using the full distance matrix, you can alternatively use the clues package, which improves on both the time and the memory used by the cluster package. Here is an example:
library(rbenchmark)
library(cluster)
library(clues)
set.seed(123)
x = c(rnorm(1000,0,0.9), rnorm(1000,4,1), rnorm(1000,-5,1))
y = c(rnorm(1000,0,0.9), rnorm(1000,6,1), rnorm(1000, 5,1))
cluster = rep(as.factor(1:3),each = 1000)
df <- cbind(x,y)
head(df)
              x           y
[1,] -0.50442808 -0.13527673
[2,] -0.20715974 -0.29498142
[3,]  1.40283748 -1.30334876
[4,]  0.06345755 -0.62755613
[5,]  0.11635896  2.33864121
[6,]  1.54355849 -0.03367351
Runtime comparison between the two functions
benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
          f2 = get_Silhouette(y = df, mem = cluster))
  test replications elapsed relative user.self sys.self user.child sys.child
1   f1          100   15.16    1.902     13.00     1.64         NA        NA
2   f2          100    7.97    1.000      7.76     0.00         NA        NA
Comparison in memory usage between the two functions
library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
36.6 kB
In conclusion, clues::get_Silhouette reduces both the time and the memory used for the same calculation.

Fast ANOVA computation in R

I have a dataframe with the following dimensions:
dim(b)
[1] 974 433685
The columns represent variables that I want to run ANOVAs on (i.e., I want to run 433,685 ANOVAs). Sample size is 974. The last column is the 'group' variable.
I've come up with 3 different methods, but all are too slow due to the number of tests.
First, let's generate a small practice dataset to play with:
dat = as.data.frame(matrix(runif(10000*500), ncol = 10000, nrow = 500))
dat$group = rep(letters[1:10], 50)
Method 1 (based on 'sapply'):
system.time(sapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
143.76 0.33 151.79
Method 2 (based on 'mclapply' from the 'parallel' package):
library(parallel)
options(mc.cores=3)
system.time(mclapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
141.76 0.21 142.58
Method 3 (based on 'cbind'-ing the LHS):
formula = as.formula( paste0("cbind(", paste(names(dat)[-length(dat)],collapse=","), ")~group") )
system.time(aov(formula, data=dat))
user system elapsed
10.00 0.22 10.25
In the practice dataset, Method 3 is a clear winner. However, when I do this on my actual data, computing on just 10 (of 433,685) columns using Method 3 takes this long:
user system elapsed
119.028 5.430 124.414
Not sure why it takes substantially longer on my actual data. I have access to a Linux cluster with upwards of 16 cores and 72GB of RAM.
Is there any way to compute this faster?
For simultaneously fitting many general linear models (such as ANOVA) using the same design matrix, the Bioconductor/R limma package provides a very fast lmFit() function. This is how to fit an ANOVA model using limma:
library(limma)
# generate some data
# (same dimensions as in your question)
nrows <- 1e4
ncols <- 5e2
nlevels <- 10
dat <- matrix(
  runif(nrows * ncols),
  nrow = nrows,
  ncol = ncols
)
group <- factor(rep(
  letters[1:nlevels],
  ncols / nlevels
))
# construct the design matrix
# (same as implicitly used in your question)
dmat <- model.matrix(~ group)
# fit the ANOVA model
fit <- lmFit(dat, dmat)
On my laptop it finished in 0.4 - 0.45 seconds, on data of the same dimensions as the data in your question.
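If you also want a per-variable test statistic rather than just the fitted models, a possible follow-up (a sketch building on the fit, dmat, and nlevels objects above; note these are limma's moderated F-statistics, not the classical aov() ones):
fit2 <- eBayes(fit)
## F-test across the group coefficients (columns 2..nlevels of the design matrix)
tab <- topTable(fit2, coef = 2:nlevels, number = Inf, sort.by = "none")
head(tab[, c("F", "P.Value", "adj.P.Val")])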

Speeding up time series simulation (for bootstrap)

I need to run a bootstrap on a time series with non-standard dependence. To do this I need to create a function that simulates the time series by making adjustments one time step at a time.
testing <- function(){
  sampleData <- as.zoo(data.frame(index=1:1000, vol=(rnorm(1000))^2, x=NA))
  # treat this as completely exogenous and unknown in connection to vol
  sampleData[,"x"] <- sampleData[,"vol"] + rnorm(1000)
  sampleData <- cbind(sampleData, mean=rollmean(sampleData[,"vol"], k=3, align="right"))
  sampleData <- cbind(sampleData,
                      vol1=lag(sampleData[,"vol"], k=-1),
                      x1=lag(sampleData[,"x"], k=-1),
                      mean1=lag(sampleData[,"mean"], k=-1))
  # get estimate
  mod <- lm(vol~vol1+x1+mean1, data=sampleData)
  res <- mod$residuals
  for(i in 5:1000){
    # recursively estimate
    sampleData[i,"vol"] <- as.numeric(predict(mod, newdata=data.frame(sampleData[i-1,]))) + res[i-3]
    # now we must update the other parameters
    # first our rolling average
    sampleData[i,"mean"] <- mean(sampleData[(i-3):i,"vol"])
    # re-update our lagged variables
    sampleData[i,"vol1"] <- sampleData[i-1,"vol"]
    sampleData[i,"mean1"] <- sampleData[i-1,"mean"]
  }
  lm(vol~vol1+x1+mean1, data=sampleData)
}
When I run this code and measure the run time I get
system.time(testing())
user system elapsed
2.711 0.201 2.915
This is a slight problem for me as I will be integrating this code to construct a bootstrap. This means any time taken here is multiplied by about 100 for each step, and I am repeating this a few thousand times. That means a single run will take hours (to days) to complete.
Is there anyway to speed this code up?
Kind regards,
Matthew
Here's how to avoid the overhead of predict.lm. Also note that I used a matrix instead of a zoo object, which would have been a tiny bit slower; you can see just how much the zoo object slowed down your code. That's the price you pay for convenience.
testing.jmu <- function() {
  if(!require(xts)) stop("xts package not installed")
  set.seed(21)  # for reproducibility
  sampleData <- .xts(data.frame(vol=(rnorm(1000))^2, x=NA), 1:1000)
  sampleData$x <- sampleData$vol + rnorm(1000)
  sampleData$mean <- rollmean(sampleData$vol, k=3, align="right")
  sampleData$vol1 <- lag(sampleData$vol, k=1)
  sampleData$x1 <- lag(sampleData$x, k=1)
  sampleData$mean1 <- lag(sampleData$mean, k=1)
  sampleMatrix <- na.omit(cbind(as.matrix(sampleData), constant=1))
  mod.fit <- lm.fit(sampleMatrix[,c("constant","vol1","x1","mean1")],
                    sampleMatrix[,"vol"])
  res.fit <- mod.fit$residuals
  for(i in 5:nrow(sampleMatrix)){
    sampleMatrix[i,"vol"] <-
      sum(sampleMatrix[i-1,c("constant","vol1","x1","mean1")] *
            mod.fit$coefficients) + res.fit[i-3]
    sampleMatrix[i,"mean"] <- mean(sampleMatrix[(i-3):i,"vol"])
    sampleMatrix[i,c("vol1","mean1")] <- sampleMatrix[i-1,c("vol","mean")]
  }
  lm.fit(sampleMatrix[,c("constant","vol1","x1","mean1")], sampleMatrix[,"vol"])
}
system.time(out <- testing.jmu())
#    user  system elapsed
#    0.05    0.00    0.05
coef(out)
#    constant        vol1          x1       mean1
#  1.08787779 -0.06487441  0.03416802 -0.02757601
Add the set.seed(21) call to your function and you'll see that my function returns the same coefficients as yours.
