Speeding up lots of GLMs in R - r

First off, sorry about the long post. Figured it's better to give context to get good answers (I hope!). Some time ago I wrote an R function that will get all pairwise interactions of variables in a data frame. This worked fine at the time, but now a colleague would like me to do this with a much larger dataset. They don't know how many variables they are going to have in the end but they are guessing approximately 2,500 - 3,000. My function below is way too slow for this (4 minutes for 100 variables). At the bottom of this post I have included some timings for various numbers of variables and total numbers of interactions. I have the results of calling Rprof() on the 100 variables run of my function, so If anyone wants to take a look at it let me know. I don't want to make a super long any longer than it needs to be.
What I'd like to know is if there is anything I can do to speed this function up. I tried looking going directly to glm.fit, but as far as I understood, for that to be useful the computation of the design matrices and all of that other stuff that I frankly don't understand, needs to be the same for each model, which is not the case for my analysis, although perhaps I am wrong about this.
Any ideas on how to make this run faster would be greatly appreciated. I am planning on using parallelization to run the analysis in the end but I don't know how many CPU's I am going to have access to but I'd say it won't be more than 8.
Thanks in advance,
Cheers
Davy.
getInteractions2 = function(data, fSNPcol, ccCol)
{
#fSNPcol is the number of the column that contains the first SNP
#ccCol is the number of the column that contains the outcome variable
require(lmtest)
a = data.frame()
snps = names(data)[-1:-(fSNPcol-1)]
names(data)[ccCol] = "PHENOTYPE"
terms = as.data.frame(t(combn(snps,2)))
attach(data)
fit1 = c()
fit2 = c()
pval = c()
for(i in 1:length(terms$V1))
{
fit1 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i])),family="binomial")
fit2 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i]))+I(get(as.character(terms$V1[i]))*get(as.character(terms$V2[i]))),family="binomial")
a = lrtest(fit1, fit2)
pval = c(pval, a[2,"Pr(>Chisq)"])
}
detach(data)
results = cbind(terms,pval)
return(results)
}
In the table below is the system.time results for increasing numbers of variables being passed through the function. n is the number, and Ints, is the number of pair-wise interactions given by that number of variables.
n Ints user.self sys.self elapsed
time 10 45 1.20 0.00 1.30
time 15 105 3.40 0.00 3.43
time 20 190 6.62 0.00 6.85
...
time 90 4005 178.04 0.07 195.84
time 95 4465 199.97 0.13 218.30
time 100 4950 221.15 0.08 242.18
Some code to reproduce a data frame in case you want to look at timings or the Rprof() results. Please don't run this unless your machine is super fast, or your prepared to wait for about 15-20 minutes.
df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5))
gtypes = matrix(nrow=2000, ncol=3000)
gtypes = apply(gtypes,2,function(x){x=sample(0:2, 2000, replace=T);x})
snps = paste("rs", 1000:3999,sep="")
df = cbind(df,gtypes)
names(df) = c("sid", "status", snps)
times = c()
for(i in seq(10,100, by=5)){
if(i==100){Rprof()}
time = system.time((pvals = getInteractions2(df[,1:i], 3, 2)))
print(time)
times = rbind(times, time)
if(i==100){Rprof(NULL)}
}
numI = function(n){return(((n^2)-n)/2)}
timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI),times)

So I have sort of solved this (with help from the R mailing lists) and am posting it up in-case it's useful to anyone.
Basically, where the SNPs or variables are independent (i.e. Not in LD, not correlated) you can centre each SNP/Variable at it's mean like so:
rs1cent <- rs1-mean(rs1)
rs2cent <- rs2 -mean(rs2)
you can then test for correlation between phenotype and interaction as a screening step:
rs12interaction <- rs1cent*rs2cent
cor(PHENOTYPE, rs12interaction)
and then fully investigate using the full glm any that seem to be correlated. cut-off choice is, as ever, arbitrary.
Other suggestions were to use a RAO score test, which involves only fitting the null hypothesis model this halving the computation time for this step, but I don't really understand how this works (yet! more reading required.)
Anyway there you go. Maybe be of use to someone someday.

Related

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000)
{
x<-rnorm(n)
reg2<-lm(x~data$Y)
Max[i]<-max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test over 10^6 or 10^9 iterations values for precision issues) and the size of Y.
Having read the following Stack topic, a very first attemp was to try optimizing the whole using parallel regression (with calm()):
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000)
{
x<-rnorm(n)
reg2 <- calm(cls, "x ~ test$Y, data = test")
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended with a much slower version (by a factor 6) when comparing with the original, unparalleled loop. My assumption is that we ask for creating /destroying the threads in each loop iteration and that slow down the process a lot in R.
A second attemp was to use lm.fit() according to this Stack topic:
for(i in 1:10000)
{
x<- rnorm(n)
reg2<- .lm.fit(as.matrix(x), data$Y)
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
It resulted in a much faster processing compared to the initial and orgininal version. Such that we now have: lm.fit() < lm() < calm(), speaking of overall processing time.
However, we are still looking for options to improve the efficiency (in term of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
#Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)
for(i in 1 : 100)
{
reg <- lm(x ~ data$AGE)
Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we made several tests and noted the following:
Following #Karo contribution, we generate the matrix of normal samples outside the loop to spare some execution time. We expected a noticeable impact, but run tests indicate that doing so produce the unexpected inverse results (i.e. a longer execution time). Maybe the effect reverse when increasing the number of simulations.
Following #BenBolker uggestion, we also tested fastlm() and it reduces the execution time but the results seem to differ (from a factor 0.05) compared to the typical lm()
We are still struggling we effectively reducing the execution time. Following #Karo suggestions, we will try to directly pass a vector to lm() and investigate parallelization (but failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fitLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
testfun(lm.fit),
testfun(fastLmPure),
times=1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines may separate X vectors runs a multiple regression, not many univariate regressions ...)
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
for(i in 1:10000000)
{
x <- rnorm(10)
}
})
duration <- system.time({
m <- matrix(rnorm(10000000*10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row with the same amount of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time, the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind you probably should and most likely can avoid the loop altogether, you can for instance pass a vector into the lm-function. If you still need to be faster after that you can definitely parallelise a number of ways, it would be easier to suggest how with a working example of data.

silhouette calculation in R for a large data

I want to calculate silhouette for cluster evaluation. There are some packages in R, for example cluster and clValid. Here is my code using cluster package:
# load the data
# a data from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header = F)
# apply kmeans
km_res <- kmeans(data,20,iter.max = 1000,
nstart=20,algorithm="MacQueen")
# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))
# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say data with 50,000 obs, however I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the data size is a bit large. This might be problem for Dunn index and other internal indices for large datasets.
I have 32GB RAM in my computer. The problem comes from calculating dist(data). I am wondering if it is possible to not calculating dist(data) in advance, and calculate corresponding distances when it is required in the silhouette formula.
I appreciate your help regarding this problem and how I can calculate silhouette for large and very large datasets.
You can implement Silhouette yourself.
It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes distances twice, but at the same time the better memory efficiency may well make up for that.
It will still take a LONG time though.
You should consider to only use a subsample (do you really need to consider all points?) as well as alternatives such as Simplified Silhouette, in particular with KMeans... You only gain very little with extra data on such methods. So you may just use a subsample.
Anony-Mousse answer is perfect, particularly subsampling. This is very important for very large datasets due to the increase in computational cost.
Here is another solution for calculating internal measures such as silhouette and Dunn index, using an R package of clusterCrit. clusterCrit is for calculating clustering validation indices, which does not require entire distance matrix in advance. However, it might be slow as Anony-Mousse discussed. Please see the below link for documentation for clusterCrit:
https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria
clusterCrit also calculates most of Internal measures for cluster validation.
Example:
intCriteria(data,km_res$cluster,c("Silhouette","Calinski_Harabasz","Dunn"))
If it is possible to calculate the Silhouette index, without using the distance matrix, alternatively you can use the clues package, optimizing both the time and the memory used by the cluster package. Here is an example:
library(rbenchmark)
library(cluster)
library(clues)
set.seed(123)
x = c(rnorm(1000,0,0.9), rnorm(1000,4,1), rnorm(1000,-5,1))
y = c(rnorm(1000,0,0.9), rnorm(1000,6,1), rnorm(1000, 5,1))
cluster = rep(as.factor(1:3),each = 1000)
df <- cbind(x,y)
head(df)
x y
[1,] -0.50442808 -0.13527673
[2,] -0.20715974 -0.29498142
[3,] 1.40283748 -1.30334876
[4,] 0.06345755 -0.62755613
[5,] 0.11635896 2.33864121
[6,] 1.54355849 -0.03367351
Runtime comparison between the two functions
benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
f2 = get_Silhouette(y = df, mem = cluster))
test replications elapsed relative user.self sys.self user.child sys.child
1 f1 100 15.16 1.902 13.00 1.64 NA NA
2 f2 100 7.97 1.000 7.76 0.00 NA NA
Comparison in memory usage between the two functions
library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
36.6 kB
As a conclusion clues::get_Silhouette, it reduces the time and memory used to the same.

Data perturbation - How to perform it?

I am doing some projects related to statistics simulation using R based on "Introduction to Scientific Programming and Simulation Using R" and in the Students projects session (chapter 24) i am doing the "The pipe spiders of Brunswick" problem, but i am stuck on one part of an evolutionary algorithm, where you need to perform some data perturbation according to the sentence bellow:
"With probability 0.5 each element of the vector is perturbed, independently
of the others, by an amount normally distributed with mean 0 and standard
deviation 0.1"
What does being "perturbed" really mean here? I dont really know which operation I should be doing with my vector to make this perturbation happen and im not finding any answers to this problem.
Thanks in advance!
# using the most important features, we create a ML model:
m1 <- lm(PREDICTED_VALUE ~ PREDICTER_1 + PREDICTER_2 + PREDICTER_N )
#summary(m1)
#anova(m1)
# after creating the model, we perturb as follows:
#install.packages("perturb") #install the package
library(perturb)
set.seed(1234) # for same results each time you run the code
p1_new <- perturb(m1, pvars=c("PREDICTER_1","PREDICTER_N") , prange = c(1,1),niter=200) # your can change the number of iterations to any value n. Total number of iteration would come to be n+1
p1_new # check the values of p1
summary(p1_new)
Perturbing just means adding a small, noisy shift to a number. Your code might look something like this.
x = sample(10, 10)
ind = rbinom(length(x), 1, 0.5) == 1
x[ind] = x[ind] + rnorm(sum(ind), 0, 0.1)
rbinom gets the elements to be modified with probability 0.5 and rnorm adds the perturbation.

Speed up Simulation in R with Code Optimization

The generic version of what I am trying to do is to conduct a simulation study where I manipulate a few variables to see how that impacts a result. I'm having some speed issues with R. The latest simulation worked with a few iterations (10 per experiment). However, when I moved to a large scale (10k per experiment) version, the simulation has been running for 14 hours (and is still running).
Below is the code (with comments) that I am running. Being a rookie with R, and am struggling to optimize the simulation to be efficient. My hope is to learn from the comments and suggestions provided here to optimize this code and use these comments for future simulation studies.
Let me say a few things about what this code is supposed to do. I am manipulating two variables: effect size and sample size. Each combination is run 10k times (i.e., 10k experiments per condition). I initialize a data frame to store my results (called Results). I loop over three variables: Effect size, sample size, and iterations (10k).
Within the loops, I initialize four NULL components: p.test, p.rep, d.test, and d.rep. The former two capture the p-value of the initial t-test and the p-value of the replication (replicated under similar conditions). The latter two calculate the effect size (Cohen's d).
I generate my random data from a standard normal for the control condition (DVcontrol), and I use my effect size as the mean for the experimental condition (DVexperiment). I take the difference between the values and throw the result into the t-test function in R (paired-samples t-test). I store the results in a list called Trials and I rbind this to the Results data frame. This process is repeated 10k times until completion.
# Set Simulation Parameters
## Effect Sizes (ES is equal to mean difference when SD equals Variance equals 1)
effect_size_range <- seq(0, 2, .1) ## ES
## Sample Sizes
sample_size_range <- seq(10, 1000, 10) ## SS
## Iterations for each ES-SS Combination
iter <- 10000
# Initialize the Vector of Results
Results <- data.frame()
# Set Random Seed
set.seed(12)
# Loop over the Different ESs
for(ES in effect_size_range) {
# Loop over the Different Sample Sizes
for(SS in sample_size_range) {
# Create p-value Vectors
p.test <- NULL
p.rep <- NULL
d.test <- NULL
d.rep <- NULL
# Loop over the iterations
for(i in 1:iter) {
# Generate Test Data
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)
DVdiff <- DVexperiment - DVcontrol
p.test[i] <- t.test(DVdiff, alternative="greater")$p.value
d.test[i] <- mean(DVdiff) / sd(DVdiff)
# Generate Replication Data
DVcontrol <- rnorm(iter, mean=0, sd=1)
DVexperiment <- rnorm(iter, mean=ES, sd=1)
DVdiff <- DVexperiment - DVcontrol
p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
d.rep[i] <- mean(DVdiff) / sd(DVdiff)
}
# Results
Trial <- list(ES=ES, SS=SS,
d.test=mean(d.test), d.rep=mean(d.rep),
p.test=mean(p.test), p.rep=mean(p.rep),
r=cor(p.test, p.rep, method="kendall"),
r.log=cor(log2(p.test)*(-1), log2(p.rep)*(-1), method= "kendall"))
Results <- rbind(Results, Trial)
}
}
Thanks in advance for your comments and suggestions,
Josh
The general approach to optimization is to run a profiler to determine what portion of the code the interpreter spends the most time in, and then to optimize that portion. Let's say your code resides in a file called test.R. In R, you can profile it by running the following sequence of commands:
Rprof() ## Start the profiler
source( "test.R" ) ## Run the code
Rprof( NULL ) ## Stop the profiler
summaryRprof() ## Display the results
(Note that these commands will generate a file Rprof.out in the directory of your R session.)
If we run the profiler on your code (with iter <- 10, rather than iter <- 10000), we get the following profile:
# $by.self
# self.time self.pct total.time total.pct
# "rnorm" 1.56 24.53 1.56 24.53
# "t.test.default" 0.66 10.38 2.74 43.08
# "stopifnot" 0.32 5.03 0.86 13.52
# "rbind" 0.32 5.03 0.52 8.18
# "pmatch" 0.30 4.72 0.34 5.35
# "mean" 0.26 4.09 0.42 6.60
# "var" 0.24 3.77 1.38 21.70
From here, we observe that rnorm and t.test are your most expensive operations (shouldn't really be a surprise as these are in your inner-most loop).
Once you figured out where the expensive function calls are, the actual optimization consists of two steps:
Optimize the function, and/or
Optimize the number of times the function is called.
Since t.test and rnorm are built-in R functions, your only option for Step 1 above is to look for alternative packages that may have faster implementations of sampling from the normal distribution and/or running multiple t tests. Step 2 is really about restructuring your code in a way that does not recompute the same thing multiple times. For example, the following lines of code do not depend on i:
# Generate Test Data
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)
Does it make sense to move these outside the loop, or do you really need a new sample of your test data for each different value of i?

Statistical inefficiency (block-averages)

I have a series of data, these are obtained through a molecular dynamics simulation, and therefore are sequential in time and correlated to some extent. I can calculate the mean as the average of the data, I want to estimate the the error associated to mean calculated in this way.
According to this book I need to calculate the "statistical inefficiency", or roughly the correlation time for the data in the series. For this I have to divide the series in blocks of varying length and, for each block length (t_b), the variance of the block averages (v_b). Then, if the variance of the whole series is v_a (that is, v_b when t_b=1), I have to obtain the limit, as t_b tends to infinity, of (t_b*v_b/v_a), and that is the inefficiency s.
Then the error in the mean is sqrt(v_a*s/N), where N is the total number of points. So, this means that only one every s points is uncorrelated.
I assume this can be done with R, and maybe there's some package that does it already, but I'm new to R. Can anyone tell me how to do it? I have already found out how to read the data series and calculate the mean and variance.
A data sample, as requested:
# t(ps) dH/dl(kJ/mol)
0.0000 582.228
0.0100 564.735
0.0200 569.055
0.0300 549.917
0.0400 546.697
0.0500 548.909
0.0600 567.297
0.0700 638.917
0.0800 707.283
0.0900 703.356
0.1000 685.474
0.1100 678.07
0.1200 687.718
0.1300 656.729
0.1400 628.763
0.1500 660.771
0.1600 663.446
0.1700 637.967
0.1800 615.503
0.1900 605.887
0.2000 618.627
0.2100 587.309
0.2200 458.355
0.2300 459.002
0.2400 577.784
0.2500 545.657
0.2600 478.857
0.2700 533.303
0.2800 576.064
0.2900 558.402
0.3000 548.072
... and this goes on until 500 ps. Of course, the data I need to analyze is the second column.
Suppose x is holding the sequence of data (e.g., data from your second column).
v = var(x)
m = mean(x)
n = length(x)
si = c()
for (t in seq(2, 1000)) {
nblocks = floor(n/t)
xg = split(x[1:(nblocks*t)], factor(rep(1:nblocks, rep(t, nblocks))))
v2 = sum((sapply(xg, mean) - m)**2)/nblocks
#v rather than v1
si = c(si, t*v2/v)
}
plot(si)
Below image is what I got from some of my time series data. You have your lower limit of t_b when the curve of si becomes approximately flat (slope = 0). See http://dx.doi.org/10.1063/1.1638996 as well.
There are a couple different ways to calculate the statistical inefficiency, or integrated autocorrelation time. The easiest, in R, is with the CODA package. They have a function, effectiveSize, which gives you the effective sample size, which is the total number of samples divided by the statistical inefficiency. The asymptotic estimator for the standard deviation in the mean is sd(x)/sqrt(effectiveSize(x)).
require('coda')
n_eff = effectiveSize(x)
Well it's never too late to contribute to a question, isn't it?
As I'm doing some molecular simulation myself, I did step uppon this problem but did not see this thread already. I found out that the method actually proposed by Allen & Tildesley seems a bit out dated compared to modern error analysis methods. The rest of the book is good enought to worth the look though.
While Sunhwan Jo's answer is correct concerning block averages method,concerning error analysis you can find other methods like the jacknife and bootstrap methods (closely related to one another) here: http://www.helsinki.fi/~rummukai/lectures/montecarlo_oulu/lectures/mc_notes5.pdf
In short, with the bootstrap method, you can make series of random artificial samples from your data and calculate the value you want on your new sample. I wrote a short piece of Python code to work some data out (don't forget to import numpy or the functions I used):
def Bootstrap(data):
B = 100 # arbitraty number of artificial samplings
es = 0.
means = numpy.zeros(B)
sizeB = data.shape[0]/4 # (assuming you pass a numpy array)
# arbitrary bin-size proportional to the one of your
# sampling.
for n in range(B):
for i in range(sizeB):
# if data is multi-column array you may have to add the one you use
# specifically in randint, else it will give you a one dimension array.
# Check the doc.
means[n] = means[n] + data[numpy.random.randint(0,high=data.shape[0])] # Assuming your desired value is the mean of the values
# Any calculation is ok.
means[n] = means[n]/sizeB
es = numpy.std(means,ddof = 1)
return es
I know it can be upgraded but it's a first shot. With your data, I get the following:
Mean = 594.84368
Std = 66.48475
Statistical error = 9.99105
I hope this helps anyone stumbling across this problem in statistical analysis of data. If I'm wrong or anything else (first post and I'm no mathematician), any correction is welcomed.

Resources