Memory leak when using MNP package in R

I have a question concerning memory use in R when using the MNP package. My goal is to estimate a multinomial probit model and then use the model to predict choices on a large set of data. I have split the predictor data into a list of pieces.
The problem is that when I loop over the list to predict, the memory used by R grows constantly and spills into swap space once it reaches the maximum memory of my computer. The allocated memory is not released even when those boundaries are hit. This happens even though I do not create any additional objects, so I don't understand what is going on.
Below I have pasted example code that suffers from the described problem. When the example runs, memory use grows constantly, and the memory remains in use even after all variables are removed and gc() is called.
The real data I have is much larger than what is generated in the example, so I need to find a workaround.
My questions are:
Why does this script use so much memory?
How can I force R to release the allocated memory after each step?
library(MNP)
nr <- 10000
draws <- 500
pieces <- 100
# Create artificial training data
trainingData <- data.frame(y = sample(c(1,2,3), nr, rep = T), x1 = sample(1:nr), x2 = sample(1:nr), x3 = sample(1:nr))
# Create artificial predictor data
predictorData <- list()
for(i in 1:pieces){
  predictorData[[i]] <- data.frame(y = NA, x1 = sample(1:nr), x2 = sample(1:nr), x3 = sample(1:nr))
}
# Estimate multinomial probit
mnp.out <- mnp(y ~ x1 + x2, trainingData, n.draws = draws)
# Predict using predictor data
predicted <- list()
for(i in 1:length(predictorData)){
  cat('|')
  mnp.pred <- predict(mnp.out, predictorData[[i]], type = 'prob')$p
  mnp.pred <- colnames(mnp.pred)[apply(mnp.pred, 1, which.max)]
  predicted[[i]] <- mnp.pred
  rm(mnp.pred)
  gc()
}
# Combine the output into a single factor
predicted <- factor(unlist(predicted))
Here are the output statistics after running the script:
> rm(list = ls())
> gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  158950  8.5     407500  21.8   407500  21.8
Vcells  142001  1.1   33026373 252.0 61418067 468.6
Here is my R session information:
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MNP_2.6-2 MASS_7.3-14

The results don't seem anomalous; I don't see evidence of a memory leak here. I suspect you're misreading the output of gc(): the right-hand column is the maximum memory used since R began tracking it. If you call gc(reset = TRUE), the maximum shown will be reset to the memory currently in use, i.e. the 8.5 MB and 1.1 MB listed under "used".
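A minimal way to check this in your own session (plain base R, nothing MNP-specific):
gc(reset = TRUE)  # reset the "max used" statistics to the current usage
# ... run one prediction step ...
gc()              # "max used" now reflects only allocations made since the reset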
I suspect that MNP just consumes a lot of memory during the prediction phase, so there's not much that can be done, other than to break up the prediction data into even smaller chunks, with fewer rows.
If you have multiple cores, you might consider using the foreach package along with doSMP or doMC. This gives you the speedup of independent calculations plus the benefit that the RAM allocated in an iteration is freed when that iteration completes (each task runs in a forked R process with its own memory space, I believe).
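A minimal sketch of what that could look like for the loop above, assuming a Unix-alike system where doMC is available (I have not tested this with MNP):
library(foreach)
library(doMC)
registerDoMC(cores = 4)  # adjust to your machine
# Each piece is predicted in a forked worker whose memory is released on exit
predicted <- foreach(piece = predictorData, .packages = "MNP") %dopar% {
  p <- predict(mnp.out, piece, type = "prob")$p
  colnames(p)[apply(p, 1, which.max)]
}
predicted <- factor(unlist(predicted))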

Related

Is it possible to work with a vector larger than system RAM in R?

I'm using the R package kuenm to produce and project species distribution models.
I've produced the models without a problem, but when I try to evaluate the extrapolation risk for future projections with the function kuenm_mop I get the error:
Error: cannot allocate vector of size 92GB
The system I'm using has Windows 8.1 Pro and 64GB of RAM (which I believe is the limiting factor here).
My question is: is it possible to work with a vector of greater size than my RAM?
This is the function I'm using:
library(kuenm)
sets_var <- "Set_1" #set of variables used
out_mop <- "MOP_results" #output directory
percent <- 10
paral <- FALSE
is_swd <- FALSE
M_var_dir <- "M_variables"
G_var_dir <- "G_variables"
kuenm_mmop(G.var.dir = G_var_dir, M.var.dir = M_var_dir, sets.var = sets_var, is.swd = is_swd, out.mop = out_mop, percent = percent, parallel = paral)
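For what it's worth, a vector that must be held in memory all at once cannot exceed RAM, but disk-backed objects can stand in for it when the computation can proceed in chunks. Below is a minimal sketch with the bigmemory package, using illustrative file names and dimensions; note that kuenm itself would have to be written to work on such objects, so this only shows the general mechanism:
library(bigmemory)
# A file-backed matrix lives on disk; only the parts you index are
# brought into RAM, so its size is bounded by disk space, not memory.
bm <- filebacked.big.matrix(nrow = 1e6, ncol = 100, type = "double",
                            backingfile = "big.bin",
                            descriptorfile = "big.desc")
bm[1:5, 1] <- rnorm(5)  # read and write in chunks by index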

How to bootstrap using large datasets?

I would like to use the boot() and boot.ci() functions from library("boot") for a large data set (~20,000 observations) with type = "bca".
If R (the number of bootstrap replicates) is too small (I have tried 1k-10k), then I get the following error:
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, :
estimated adjustment 'a' is NA
However, if I do 15k-20k+ bootstraps, then I get:
Cannot allocate vector size # GB
(usually ranging from 1.7 to 6.4 GB, depending on the dataset and the number of bootstraps).
I read that I need more RAM, but I have a Windows desktop with 16 GB of RAM and I'm using 64-bit R, which suggests my computer should be able to handle this.
How can I use bootstrapping methods on larger datasets if too few bootstraps cannot produce estimates and sufficiently many bootstraps exhaust the memory?
My code:
library(boot)
multRegress <- function(mydata){
  numVar <<- NCOL(mydata)
  Variables <<- names(mydata)[2:numVar]
  mydata <- cor(mydata, use = "pairwise.complete.obs")
  RXX <- mydata[2:numVar, 2:numVar]
  RXY <- mydata[2:numVar, 1]
  RXX.eigen <- eigen(RXX)
  D <- diag(RXX.eigen$val)
  delta <- sqrt(D)
  lambda <- RXX.eigen$vec %*% delta %*% t(RXX.eigen$vec)
  lambdasq <- lambda^2
  beta <- solve(lambda) %*% RXY
  rsquare <<- sum(beta^2)
  RawWgt <- lambdasq %*% beta^2
  import <- (RawWgt/rsquare) * 100
  result <<- data.frame(Variables, Raw.RelWeight = RawWgt,
                        Rescaled.RelWeight = import)
}
# function passed to boot
multBootstrap <- function(mydata, indices){
  mydata <- mydata[indices, ]
  multWeights <- multRegress(mydata)
  return(multWeights$Raw.RelWeight)
}
# call boot
multBoot <- boot(thedata, multBootstrap, 15000)
multci <- boot.ci(multBoot, conf = 0.95, type = "bca")
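One memory-related option worth knowing about (a hedged suggestion; it does not address the "estimated adjustment 'a' is NA" error): boot() accepts simple = TRUE, valid when sim = "ordinary" and stype = "i" with no weights, which generates the resampling indices one replicate at a time instead of allocating an R-by-n index matrix up front:
# Indices are generated per replicate rather than as one 15000 x 20000 matrix
multBoot <- boot(thedata, multBootstrap, R = 15000, simple = TRUE)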

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of it is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious whether I can speed them up; they're taking hours (maybe even days).
I've been given two large pieces of data by a friend who needs help. The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7 GHz and 36 GB of RAM.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
  x = as.vector(M1[i,])
  # Take one row of genetic data to be used in the regression
  fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
  # Run the model
  c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
  # Collect results, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
  if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
  setTkProgressBar(pb, i)
  # This is some code I found here to keep track of my progress
  res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down.
I've read that BiocParallel, future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know-how). I've also read a bit about the bigmemory package for sharing the large matrix across cores more efficiently, but I ran into several errors when I tried to use it (failed workers and such).
I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional oomph, if anyone knows more about this.
Any advice would be very appreciated!
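One idea that might be worth trying (a hedged sketch, untested on this data): iterate over the rows of M1 with the iterators package, so that each task receives only the single row it needs instead of every worker holding a full copy of the 728396 x 276 matrix:
library(iterators)
model1 = foreach(row = iter(M1, by = "row"), .packages = "lme4",
                 .combine = rbind) %dopar% {
  x <- as.vector(row)  # one row of genetic data, length 276
  fit1 <- glmer(r ~ x + m + a + e + n + (1 | famid),
                data = DF1, family = binomial(link = "logit"))
  coef(summary(fit1))[2, 1:4]  # collect whatever summaries are needed
}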

Conditional simulation (with Kriging) in R with parallelization?

I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán to answer the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones obtained using only one core at a time (no parallelization). It looks like a geometry problem, but I can't figure out how to fix it.
Next I will provide an example (using 4 cores) generating 2 simulations. You will see that after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines) and are different from the ones obtained using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the library() lines fails, run install.packages() for that package first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional simulations with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but the simulated maps show "vertical lines". I don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual-core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other. This is probably the source of the vertical artifacts you are seeing.
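You can see the interleaving directly in base R:
split(1:10, 1:2)
# $`1`
# [1] 1 3 5 7 9
#
# $`2`
# [1]  2  4  6  8 10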
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
My original answer below addresses only a minor effect, but it still might make sense to fix the seed, with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about kriging, it requires drawing random numbers. These random numbers will be different under parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.
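A minimal sketch of fixing both seeds, where clusterSetRNGStream must be called on the cluster cl after makeCluster and before parLapply:
set.seed(123)                         # reproducible sequential run
clusterSetRNGStream(cl, iseed = 123)  # reproducible, independent RNG streams per worker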

rm(list = ls()) on 64GB then Error: cannot allocate vector of size 11.6 Gb when trying a decision tree

I'm trying to run a decision tree via the caret package. I start my script fresh by removing everything from memory with rm(list = ls()), then I load my training data, which is 3M rows and 522 features. RStudio doesn't show the size in GB, but presumably, judging by the error message, it's 11.6.
If I'm running R with 64 GB, is this error expected? Is there any way around it without resorting to training on smaller data?
rm(list = ls())
library(tidyverse)
library(caret)
library(xgboost)
# read in data
training_data <- readRDS("/home/myname/training_data.rds")
The RStudio environment pane currently shows one object, training_data, with the dims mentioned above.
### Modelling
# tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE,
  allowParallel = TRUE
)
# Fit a decision tree (minus cad field)
print("begin decision tree regular")
mod_decitiontree <- train(
  cluster ~ .,
  tuneLength = 5,
  data = select(training_data, -c(cad, id)), # a data frame
  method = "rpart",
  trControl = train_control,
  na.action = na.pass
)
Loading required package: rpart
Error: cannot allocate vector of size 11.6 Gb
I could ask our admin to increase my RAM, but before doing that I want to make sure I'm not missing something. Don't I have lots of RAM available if I'm on 64 GB?
Do I have any options? I tried making my data frame a matrix and passing that to caret instead, but it threw an error. Is passing a matrix a worthwhile endeavour?
Here is your error message reproduced:
cannot allocate vector of size 11.6 Gb when trying a decision tree
This means that the specific failure happened when R requested another 11.6 GB of memory and was unable to get it. However, the tree-fitting calculation may require many such allocations, and, most likely, the remainder of the free RAM was already in use.
I don't know the details of your calculation, but I would say that even fitting a decision tree on a 1 GB data set is already very large. My advice would be to find a way to take a statistically accurate subsample of your data set, so that you don't need such large amounts of RAM.
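A hedged sketch of what that could look like with caret itself, assuming cluster is the outcome column as in the train() call above; createDataPartition draws a subsample that preserves the class proportions:
library(caret)
set.seed(123)
# Keep ~10% of the rows, stratified on the outcome
idx <- createDataPartition(training_data$cluster, p = 0.10, list = FALSE)
training_small <- training_data[idx, ]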
