Temporary objects in parallel computing in R - r

I have a pretty long R code which needs to be iterated several hundred times. I am using a 32 core and 32 GB RAM cloud service to do the job. To make the code run faster, I want to use parallel computing using foreach() command. I have set the codes working with no errors. However, I need to make sure if I am getting proper results. To illustrate my point I have set a simplified mock code:
foreach (i = 1:100) %dopar% {
age <- seq(from=20,to=79, by=1)
d <- as.data.frame(age)
d$gender <- rbinom(nrow(d),size = 1,prob = 0.5)
d$prob <- cut(d$age, breaks = c(20,30,40,50,60,70,80), include.lowest = T,right = F,labels = c(.001,.01,.1,.25,.3,.1))
d$prob <- as.numeric(as.character(d$prob))
d$event <- rbinom(nrow(d),size = 1,prob = d$prob)
save(d,file = paste("d_",i,".rda", sep = ""))
table(d$gender,d$event)
}
I am wondering if temporary objects, like ā€œdā€ in this example, is independent for each cluster when running this code. If there is only one object ā€œdā€ in the memory which is shared by different clusters, what is the solution for an independent object.
For reference, I am using the code proposed by this page (https://github.com/tobigithub/R-parallel) to make clusters.
Thanks in advance for your reply.

Related

R code runs when required parameter is not specified

I am assisting a colleague with adding functionality to one of his R packages.
I have implemented nonparametric bootstrapping using a for loop construct in R.
# perform resampling
# resample `subsample_size` values with or without replacement replicate_size times
for (i in 1:replicate_size) {
if (replacement == TRUE) { # bootstrapping
z <- sample(x, size = subsample_size, replace = TRUE)
zz <- sample(x, size = subsample_size, replace = TRUE)
} else { # subsampling
z <- sample(x, size = subsample_size, replace = FALSE)
zz <- sample(x, size = subsample_size, replace = FALSE)
}
# calculate statistic
boot_samples[i] <- min(zz) - max(z)
}
The above loop is nested within another for loop, which itself is nested within a function (details not shown). The code I'm dealing with is messy, and there are most certainly more efficient ways of coding things up, but I've had to leave it be since my colleague is only familiar with very basic and rudimentary coding constructs.
Upon running said function, I specified all required arguments (replicate_size, replacement) except subsample_size. subsample_size is needed to carry out the resampling. This mistake on my part was revealing because, for some strange reason, the code still runs without throwing an error regarding missing a value for subsample_size.
Question: Does anyone have any idea on why this happens?
I'd include more code, but it is very verbose and unwieldy (his code, not mine). Running the for loop outside the function does indeed raise the error regarding the missing value as expected.

Reducing NbClust memory usage

I need some help with massive usage of memory by the NbClust function.
On my data, memory balloons to 56GB at which point R crashes with a fatal error. Using debug(), I was able to trace the error to these lines:
if (any(indice == 23) || (indice == 32)) {
res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1,
md = md)$gamma
Debugging of Index.sPlussMoins revealed that the crash happens during a for loop. The iteration that it crashes at varies, and during the loop memory usage varies between 41 and 57Gb (I have 64 total):
for (k in 1:nwithin1) {
s.plus <- s.plus + (colSums(outer(between.dist1,
within.dist1[k], ">")))
s.moins <- s.moins + (colSums(outer(between.dist1,
within.dist1[k], "<")))
print(s.moins)
}
I'm guessing that the memory usage comes from the outer() function.
Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)?
At very least, it would be nice to get R to exit the function with an "cannot allocate vector of size..." instead of crashing. That way I would have an idea of just how much more memory I need to handle the matrix causing the crash.
Edit: I created a minimal example with a matrix the approximate size of the one I am using, although now it crashes at a different point, when the hclust function is called:
set.seed(123)
cluster_means = sample(1:25, 10)
mlist = list()
for(cm in cluster_means){
name = as.character(cm)
m = data.frame(matrix(rnorm(60000*60,mean=cm,sd=runif(1, 0.5, 3.5)), 60000, 60))
mlist[[name]] = m
}
test_data = do.call(cbind, cbind(mlist))
library(NbClust)
debug(fun = "NbClust")
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 30,
method = "ward.D2", index = "alllong", alphaBeale = 0.1)
debug: hc <- hclust(md, method = "ward.D2")
It seems to crash before using up available memory (according to my system monitor, 34Gb is being used when it crashes out of 64 total.
So is there any way I can do this without sub-sampling manageable sized matrices? And if I did, how do I know how much memory I will need for a matrix of a given size? I would have thought my 64Gb would be enough.
Edit:
I tried altering NbClust to use fastcluster instead of the stats version. It didn't crash, but did exit with a memory error:
Browse[2]>
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb
If you check the source code of Nbclust, you'll see that is all but optimized for speed or memory efficiency.
The crash you're reporting is not even during clustering - it's in the evaluation afterwards, specifically in the "Gamma, Gplus and Tau" index code. Disable these indexes and you may get further, but most likely you'll just have the same problem again in another index. Maybe you can pick only a few indices to run, specifically such indices that so not need a lot of memory?
I forked NbClust and made some changes that seem to have made it go for longer without crashing with bigger matrices. I changed some of the functions to use Rfast, propagate and fastcluster. However there are still problems.
I haven't run all my data yet and only run a few tests on dummy data with gap, so there is still time for it to fail. But any suggestions/criticisms would be welcome.
My (in progress) fork of NbCluster:
https://github.com/jbhanks/NbClust

R, why parLapply works very well on plm, but for lm, it's extremely slow?

I'm trying to use parLapply to speed up my program.
I have a function
ols.models <- function( mydata, temp.expr, temp.expr2) {
cl <- makeCluster(getOption('cl.cores',8))
clusterExport(cl,c('mydata'),envir = environment())
clusterEvalQ(cl,library(plm))
temp.reg <- parLapply(cl,temp.expr,function(x) {
plm(as.formula(x), index = c("ID_","DATE_"), model = "pooling", data = mydata)
}
)
temp.ols.reg <- parLapply(cl,temp.expr2,function(x) lm(as.formula(x),data=mydata))
# temp.ols.reg <- lapply(temp.expr2,function(x) lm(as.formula(x),data=mydata))
)
}
temp.expr and temp.expr2 are lists with 2,000 formulas.
temp.reg <- parLapply(cl,temp.expr,function(x) {
plm(as.formula(x), index = c("ID_","DATE_"), model = "pooling", data = mydata)}
)
This above part of code runs perfectly and greatly reduce calculation time. However, my computer stuck at the following piece of code
temp.ols.reg <- parLapply(cl,temp.expr2,function(x) lm(as.formula(x),data=mydata))
This line is very slow and keeps running for more than 20 mins, so I have to stop it. However, if I just use lapply, it's much faster and only takes 1 min to finish.
temp.ols.reg <- lapply(temp.expr2,function(x) lm(as.formula(x),data=mydata))
I'm wondering why there's such a big difference. parLapply works very well for plm, but for lm, it's extremely slow and memory usage explodes.
Is it because lm is from package 'stats' and it's kind of base environment package? Or this is caused by some environment related problems? This confuses me a lot.
Thanks.

Parallel processing with BiocParallel running much longer than serial

I am trying to use parallel processing to speed up running many Boosted Regression Trees in R. I am using the BiocParallel package (http://lcolladotor.github.io/2016/03/07/BiocParallel/#.WiqF7bQ-e3c). I have created some dummy data and then set up a function to run two BRT models, which I hoped to time in Serial then in Parallel. However, my Parallel run never seems to complete, while my Serial run only takes about 3 seconds.
##CAN I USE PARALLEL PROCESSING TO SPEED UP BRT'S?
##LOAD PACKAGES
library(BiocParallel)
library(dismo)
library(gbm)
library(MASS)
##CREATE RANDOM, CORRELATED DATA
## FROM https://www.r-bloggers.com/simulating-random-multivariate-correlated-data-continuous-variables/
R = matrix(cbind(1,.80,.2, .80,1,.7, .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
###########################################################
## MODEL
##########################################################
##WITH FUNCTIONS,
Tc<-c(4, 8) ##Tree Complexities
Lr<-c(0.01) ## Learning Rates
Vars <- split(expand.grid(Tc,Lr),seq(nrow(expand.grid(Tc,Lr))))
brt <- function(x){
a <- gbm.step(raw,gbm.x=c(2:3),gbm.y="response",tree.complexity=x[1],learning.rate=x[2],bag.fraction=0.65, family="gaussian")
b <- data.frame(model=paste("Tc= ",x[1]," _ ","Lr= ",x[2],sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
##Reassign model with unique name
assign(paste("patch.tc",x[1],".lr",x[2],sep=""),a, envir = .GlobalEnv)
assign(paste("RESULTS","patch.tc",x[1],".lr",x[2],sep=""),b, envir = .GlobalEnv)
print(b)
}
############################
###IN Serial
############################
system.time(
lapply(Vars, brt)
)
############################
###IN PARALLEL
############################
system.time(
bplapply(Vars, brt)
)
Some quick comments:
Always avoid assign(); if you find yourself using it, it's a good sign you're approaching the problem the wrong way.
Assign variables to global environment from within a function (using assign() or <<-) is always a bad idea and again, a hint that there is a better solution that you should use.
If you still choose to break 1 and 2 above, it will certainly not work when you use it parallel processing.
Instead, return your values (see below).
That dismo::gbm.step() function tries to plot by default (plot.main = TRUE). That will not work (actually invalid) in so called forked parallel processing, which is often the default go-to on Unix and macOS.
Plotting in parallel is often not what you want to do (unless you plot an image file or similar).
To your problem: After modifying your brt() to (according to 1-6):
brt <- function(x){
a <- gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1], learning.rate=x[2], bag.fraction=0.65, family="gaussian", plot.main = FALSE)
b <- data.frame(model=paste("Tc= ", x[1], " _ ", "Lr= ", x[2], sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
list(a = a, b = b)
}
it works for me bplapply(Vars, brt) as well as with future::future_lapply(Vars, brt). With parallel::parLapply(cl, Vars, brt) you need to take more care exporting globals.
PS. I would probably just return a and extract the b info outside.

Keep R Model in memory for Rest API

We have a GLM R model that is ~ 2 GB in size. We are using this model to service a REST API. We need < 3 second response times. The problem is that using the conventional saveRDS / readRDS functions takes way too long. Our last option is to find a way to keep the R Model in memory in between sessions.
At a high level:
Service a REST API.
Keep a multi gigabyte model in memory.
Keep response times below 3 seconds.
Stuff we've tried and doesn't work:
Shrinking the size of the model. Our data scientist says that's as small as he can get it.
I've experimented with compression and other settings on saveRDS. The best I could do is 12 seconds to load the model from file.
We tried Microsoft R Server. All the web API requests would be redirected to the same session. Problem is that we had to wrap a decent amount of code around keeping the session alive. Even then it would flake out often.
Microsoft R Server Real Time is out because it only accepts models generated by the ScaleR. I know ScaleR has a GLM function, but I've been told its not an option.
Faster IO doesn't seem to help. It appears that the bottleneck is the deserialization of the rData file. R being single threaded doesn't help.
Edit:
The question is what R for REST API library / service would allow us to statefully keep a model in memory in between calls.
Continuing on my comment from above as well as the suggestion of #TenniStats, the best approach is to reduce the size of the GLM. Consider the following:
#generating some sample data that's fairly large
sample.data <- data.frame('target' = sample(c(1:10), size = 5000000, replace = T),
'regressor1' = rnorm(5000000),
'regressor2' = rnorm(5000000),
'regressor3' = rnorm(5000000),
'regressor4' = rnorm(5000000),
'regressor5' = rnorm(5000000),
'regressor6' = rnorm(5000000),
'regressor7' = rnorm(5000000),
'regressor8' = rnorm(5000000),
'regressor9' = rnorm(5000000),
'regressor10' = rnorm(5000000))
#building a toy glm - this one is about 3.3 GB
lm.mod <- glm(sample.data, formula = target ~ ., family = gaussian)
#baseline predictions
lm.default.preds <- predict(lm.mod, sample.data)
#extracting coefficients
lm.co <- coefficients(lm.mod)
#applying coefficients to original data set by row and adding intercept
lightweight.preds <- lm.co[1] +
apply(sample.data[,2:ncol(sample.data)],
1,
FUN = function(x) sum(x * lm.co[2:length(lm.co)]))
#clearing names from vector for comparison
names(lm.default.preds) <- NULL
#taa daa
all.equal(lm.default.preds, lightweight.preds)
Then we can do the following:
#saving for our example and starting timing
saveRDS(lm.co, file = 'myfile.RDS')
start.time <- Sys.time()
#reading from file
coefs.from.file <- readRDS('myfile.RDS')
#scoring function
light.scoring <- function(coeff, new.data) {
prediction <- coeff[1] + sum(coeff[2:length(coeff)] * new.data)
names(prediction) <- NULL
return(prediction)
}
#same as before
light.scoring(coefs.from.file, sample.data[1, 2:11])
#~.03 seconds on my machine
Sys.time() - start.time

Resources