How to bootstrap using large datasets in R?

I would like to use the boot() and boot.ci() functions from library("boot") on a large data set (~20,000 observations) with type="bca".
If R (the number of bootstrap replicates) is too small (I have tried 1,000 to 10,000), then I get the following error:
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, :
estimated adjustment 'a' is NA
However, if I run 15,000 to 20,000+ bootstraps, then I get:
Cannot allocate vector of size # GB
(usually ranging from 1.7 to 6.4 GB, depending on the dataset and the number of bootstraps).
I read that I need more RAM, but I have a Windows desktop with 16 GB of RAM and I'm using 64-bit R, so my machine should be able to handle this.
How can I use bootstrapping methods on larger datasets if too few bootstraps cannot produce estimates and enough bootstraps run out of memory?
My code:
multRegress <- function(mydata){
  numVar <<- NCOL(mydata)
  Variables <<- names(mydata)[2:numVar]
  mydata <- cor(mydata, use = "pairwise.complete.obs")
  RXX <- mydata[2:numVar, 2:numVar]   # predictor intercorrelations
  RXY <- mydata[2:numVar, 1]          # predictor-criterion correlations
  RXX.eigen <- eigen(RXX)
  D <- diag(RXX.eigen$val)
  delta <- sqrt(D)
  lambda <- RXX.eigen$vec %*% delta %*% t(RXX.eigen$vec)
  lambdasq <- lambda^2
  beta <- solve(lambda) %*% RXY
  rsquare <<- sum(beta^2)
  RawWgt <- lambdasq %*% beta^2
  import <- (RawWgt / rsquare) * 100
  result <<- data.frame(Variables, Raw.RelWeight = RawWgt,
                        Rescaled.RelWeight = import)
}
# function passed to boot()
multBootstrap <- function(mydata, indices){
  mydata <- mydata[indices, ]          # resample rows according to boot's indices
  multWeights <- multRegress(mydata)
  return(multWeights$Raw.RelWeight)
}
# call boot() and boot.ci()
multBoot <- boot(thedata, multBootstrap, R = 15000)
multci <- boot.ci(multBoot, conf = 0.95, type = "bca")
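One possible way around the trade-off (a sketch, not tested at this scale): the multi-GB allocation typically comes from the n-by-R index matrix that boot() builds up front (20,000 x 15,000 entries is already a couple of GB on its own), and boot() has documented arguments to avoid or spread it. simple = TRUE draws the indices replicate by replicate instead of pre-allocating the matrix (it is only honoured for the defaults used here: sim = "ordinary", stype = "i", m = 0), and parallel = "snow" with cl/ncpus farms the replicates out to worker processes. The 4-core cluster below is an assumption about the machine.
library(boot)
library(parallel)

cl <- makeCluster(4)                  # assumed core count; adjust
clusterExport(cl, "multRegress")      # the statistic's helper must exist on the workers

multBoot <- boot(thedata, multBootstrap, R = 15000,
                 simple = TRUE,       # no n-by-R index matrix up front
                 parallel = "snow", ncpus = 4, cl = cl)
stopCluster(cl)

multci <- boot.ci(multBoot, conf = 0.95, type = "bca")
If boot.ci() then runs out of memory while estimating the BCa acceleration, it accepts precomputed empirical influence values through its L argument (for example from empinf() with type = "jack"). And the "estimated adjustment 'a' is NA" error at small R can be a sign that some replicates came back NA (e.g. resamples where a pairwise-complete correlation is undefined), so inspecting multBoot$t for NAs may be worthwhile before scaling R up.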

Related

Is it possible to work with vector larger than system RAM in R?

I'm using the R package kuenm to produce and project species distribution models.
I've produced the models without a problem, but when I try to evaluate the extrapolation risk for future projections with the function kuenm_mop I get the error:
Error: cannot allocate vector of size 92GB
The system I'm using has Windows 8.1 Pro and 64 GB of RAM (which I believe is the limiting factor here).
My question is: is it possible to work with a vector larger than my RAM?
This is the function I'm using:
library(kuenm)
sets_var <- "Set_1" #set of variables used
out_mop <- "MOP_results" #output directory
percent <- 10
paral <- FALSE
is_swd <- FALSE
M_var_dir <- "M_variables"
G_var_dir <- "G_variables"
kuenm_mmop(G.var.dir = G_var_dir, M.var.dir = M_var_dir, sets.var = sets_var, is.swd = is_swd,
           out.mop = out_mop, percent = percent, parallel = paral)
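In general it is possible, but only with objects that live on disk and are read in chunks; ordinary R vectors must fit in memory. Packages such as bigmemory and ff provide file-backed structures for this, though kuenm's MOP functions would still have to accept them, so in practice reducing the projection area or raster resolution per run is often the more direct fix. A minimal, kuenm-independent sketch with bigmemory (file names and sizes are made up for illustration):
library(bigmemory)

# A file-backed big.matrix is stored on disk and paged into RAM in chunks,
# so its size is limited by disk space rather than by physical memory.
big <- filebacked.big.matrix(nrow = 1e8, ncol = 1, type = "double",   # ~0.8 GB on disk; the same approach scales past RAM
                             backingfile = "big_vector.bin",
                             descriptorfile = "big_vector.desc")
big[1:5, 1] <- rnorm(5)   # read and write pieces without loading the whole object
big[1:5, 1]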

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data (by a friend who needs help). The first is a very large matrix (728,396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a data set (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model using both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7 GHz and 36 GB of RAM.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)

load("M1.RDATA")
load("DF1.RDATA")

clust = makeCluster(detectCores() - 3, outfile = "")
# I have 4 physical cores, 8 virtual. I've been using 5 because my CPU sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() # 5 cores

n = 728396

res_function = function(i){
  x = as.vector(M1[i, ])
  # Taking one row of genetic data to be used in the regression
  fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1,
               family = binomial(link = "logit"))
  # Running the model
  c(coef(summary(fit1))[2, 1:4],
    coef(summary(fit1))[3:6, 1],
    coef(summary(fit1))[3:6, 4],
    length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
  # Collecting the estimates, including whether there are any convergence error messages
}

start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
  if(!exists("pb")) pb <- tkProgressBar("Parallel task", min = 1, max = n)
  setTkProgressBar(pb, i)
  # This is some code I found here to keep track of my progress
  res_function(i)
}
end_time = Sys.time()
end_time - start_time

stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down.
I've read that BiocParallel, future, or even Microsoft R Open might work better, but I haven't had much success with any of them (likely due to my own lack of know-how). I've also read a bit about the bigmemory package for sharing the large matrix across cores more efficiently, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional oomph, if anyone knows more about this.
Any advice would be greatly appreciated!
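One change that may help with the memory pressure (a sketch, untested on your data, keeping the model exactly as in the question): iterate over M1 by row with the iterators package, so each task receives only the row it needs instead of every worker holding the full 728,396 x 276 matrix. Chunking the rows (e.g. with itertools::isplitRows) would reduce scheduling overhead further.
library(doParallel)
library(iterators)

clust <- makeCluster(5, outfile = "")
registerDoParallel(clust)

# Each task receives a single row of M1; DF1 is exported automatically
# because it appears in the loop body. (Progress bar omitted for brevity.)
model1 <- foreach(M1_row = iter(M1, by = "row"),
                  .combine = rbind, .packages = "lme4") %dopar% {
  x <- as.vector(M1_row)
  fit1 <- glmer(r ~ x + m + a + e + n + (1 | famid),
                data = DF1, family = binomial(link = "logit"))
  c(coef(summary(fit1))[2, 1:4],
    coef(summary(fit1))[3:6, 1],
    coef(summary(fit1))[3:6, 4],
    length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
}

stopCluster(clust)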

How to limit the number of iterations the pam function from the cluster library performs?

How can I reduce the number of iterations of the PAM clustering algorithm in the cluster package?
I am trying to produce a couple of plots showing how pam works, so I want to reduce the number of iterations to 2. I have cloned the cluster repo to my working directory and edited the pam.q file (in ./cluster/R) so that nMax equals 2.
# original
nMax <- 65536 # 2^16 (as 1+ n(n-1)/2 must be < max_int = 2^31-1)
# modified
nMax <- 2
However, even with no changes applied to the original file, the pam algorithm fails to run. If I load it with library(cluster) instead, it works as expected, but then I have no way to manipulate the number of iterations.
Sample code of what I'm trying to achieve is displayed below:
# -- Working code --
library(datasets)
data(iris)
library(cluster)
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
# -- Failing Code --
library(datasets)
data(iris)
source("./cluster/R/pam.q")
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
This is the error I'm getting, when running the "Failing Code" above:
Error in pam(clust_ex, k = 2) : object 'cl_Pam' not found
I expect the same output as for the working code when I source the pam.q file directly instead of loading the library.
Is there something I'm not doing quite right in the way I import the .q file? Or is there another way to change the number of iterations the pam algorithm performs?
nMax is the maximum number of objects, not the maximum number of iterations.
It is also not sufficient to just modify the .q file.
It's probably easier to do this with ELKI...
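That said, if the aim is only a couple of illustrative plots rather than literally capping the iteration count, two existing arguments of cluster::pam may already be enough, without touching the sources (a sketch using the iris columns from the question): do.swap = FALSE stops after the BUILD phase, and trace.lev prints what the build and swap phases do.
library(cluster)
data(iris)

df <- data.frame(iris$Petal.Length, iris$Petal.Width)

# BUILD phase only: the initial medoids, before any swap iterations
pam.build <- pam(df, k = 2, do.swap = FALSE)

# Full run, printing a trace of the build and swap steps
pam.full <- pam(df, k = 2, trace.lev = 2)

# Plotting pam.build and pam.full side by side shows how the swap
# phase refines the initial medoids.
pam.build$medoids
pam.full$medoids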

makeCluster with parallelSVM in R takes up all Memory and swap

I'm trying to train an SVM model on a large dataset (~110k training points). This is a sample of the code, where I am using the parallelSVM package to parallelize the training step on a subset of the training data on my 4-core Linux machine.
numcore = 4
train.time = c()
for(i in 1:5)
{
  cl = makeCluster(4)
  registerDoParallel(cores = numCore)
  getDoParWorkers()
  dummy = train_train[1:10000*i, ]
  begin = Sys.time()
  model.svm = parallelSVM(as.factor(target) ~ ., data = dummy,
                          numberCores = detectCores(), probability = T)
  end = Sys.time() - begin
  train.time = c(train.time, end)
  stopCluster(cl)
  registerDoSEQ()
}
The idea of this snippet is to estimate how long it will take to train the model on the entire dataset by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, this is the memory and swap usage history from the System Monitor. After 4 runs of the for loop, both memory and swap usage are at about 95%, and I get the following error:
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after calling stopCluster()?
Please take into consideration the fact that I am an absolute beginner in this field. A short explanation of the proposed solutions will be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new cluster with a number of nodes equal to numCore (which you haven't defined; the code assigns numcore, lower case). This cluster is never destroyed, so with each iteration of the loop you start more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop, as in the sketch below.)
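For completeness, a sketch of the restructured loop (untested; note that parallelSVM may also spin up its own workers through its numberCores argument, in which case the explicit cluster may be unnecessary):
library(doParallel)
library(parallelSVM)

numCore <- 4
cl <- makeCluster(numCore)
registerDoParallel(cl)      # register the cluster you created, once

train.time <- c()
for (i in 1:5) {
  # Note the parentheses: 1:(10000 * i) takes the first 10000*i rows,
  # whereas 1:10000*i (as in the question) selects every i-th row.
  dummy <- train_train[1:(10000 * i), ]
  begin <- Sys.time()
  model.svm <- parallelSVM(as.factor(target) ~ ., data = dummy,
                           numberCores = numCore, probability = TRUE)
  train.time <- c(train.time, Sys.time() - begin)
}

stopCluster(cl)             # tear the workers down once, at the end
registerDoSEQ()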

Memory leak when using MNP package in R

I have a question concerning memory use in R with the MNP package. My goal is to estimate a multinomial probit model and then use it to predict choices on a large set of data. I have split the predictor data into a list of pieces.
The problem is that when I loop over the list to predict, the memory used by R grows constantly and spills into swap space after reaching my computer's maximum memory. The allocated memory is not released even when hitting those boundaries. This happens even though I do not create any additional objects, so I don't understand what is going on.
Below I pasted an example code that suffers from the described problem. When running the example, the memory grows constantly and remains used even after removing all variables and calling gc().
The real data I have is much larger than what is generated in the example, so I need to find a workaround.
My questions are:
Why does this script use so much memory?
How can I force R to release the allocated memory after each step?
library(MNP)
nr <- 10000
draws <- 500
pieces <- 100
# Create artificial training data
trainingData <- data.frame(y = sample(c(1,2,3), nr, rep = T), x1 = sample(1:nr), x2 = sample(1:nr), x3 = sample(1:nr))
# Create artificial predictor data
predictorData <- list()
for(i in 1:pieces){
  predictorData[[i]] <- data.frame(y = NA, x1 = sample(1:nr), x2 = sample(1:nr), x3 = sample(1:nr))
}
# Estimate multinomial probit
mnp.out <- mnp(y ~ x1 + x2, trainingData, n.draws = draws)
# Predict using predictor data
predicted <- list()
for(i in 1:length(predictorData)){
  cat('|')
  mnp.pred <- predict(mnp.out, predictorData[[i]], type = 'prob')$p
  mnp.pred <- colnames(mnp.pred)[apply(mnp.pred, 1, which.max)]
  predicted[[i]] <- mnp.pred
  rm(mnp.pred)
  gc()
}
# Unite output into one string
predicted <- factor(unlist(predicted))
Here are the output statistics after running the script:
> rm(list = ls())
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 158950 8.5 407500 21.8 407500 21.8
Vcells 142001 1.1 33026373 252.0 61418067 468.6
Here are my specifications of R:
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MNP_2.6-2 MASS_7.3-14
The results don't look anomalous to me; I don't think this is evidence of a memory leak. I suspect you're misreading the output of gc(): the right-hand column is the maximum memory used while R has been tracking memory. If you use gc(reset = TRUE), the maximum shown resets to the current usage on the left-hand side, i.e. the 8.5 MB and 1.1 MB listed under "used".
I suspect that MNP simply consumes a lot of memory during the prediction phase, so there's not much that can be done other than breaking the prediction data into even smaller chunks, with fewer rows.
If you have multiple cores, you might consider using the foreach package along with doSMP or doMC. This gives you the speedup of independent calculations plus the benefit that the RAM allocated for the iterations is released when the workers finish (as each worker is a fork of R with its own memory space, I believe). A sketch of that approach follows.
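For illustration, here is one way the prediction loop could look with foreach and doMC (a sketch only: it assumes a Unix-like system such as the macOS shown in the sessionInfo, that four workers fit in RAM, and that MNP's predict() behaves well inside forked processes):
library(MNP)
library(foreach)
library(doMC)

registerDoMC(cores = 4)   # adjust to the number of available cores

# Each piece is predicted in a forked worker; the workers' memory is returned
# to the operating system when they exit instead of accumulating in the main
# R session, and mnp.out and predictorData are inherited by the forks.
predicted <- foreach(piece = predictorData, .combine = c, .packages = "MNP") %dopar% {
  p <- predict(mnp.out, piece, type = "prob")$p
  colnames(p)[apply(p, 1, which.max)]
}
predicted <- factor(predicted)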
