I'm trying to train a SVM model on a large dataset(~110k training points). This is a sample of the code where I am using the parallelSVM package to parallelize the training step on a subset of the training data on my 4 core Linux machine.
numcore = 4
train.time = c()
for(i in 1:5)
{
cl = makeCluster(4)
registerDoParallel(cores=numCore)
getDoParWorkers()
dummy = train_train[1:10000*i,]
begin = Sys.time()
model.svm = parallelSVM(as.factor(target) ~ .,data =dummy,
numberCores=detectCores(),probability = T)
end = Sys.time() - begin
train.time = c(train.time,end)
stopCluster(cl)
registerDoSEQ()
}
The idea of this snippet of code is to estimate the time it'll take to train the model on the entire dataset by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, this is the memory and swap history usage statistic from the System Monitor.After 4 runs of the for loop,both the memory and swap usage is about 95%,and I get the following error :
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after using the stopCluster() function ?
Please take into consideration the fact that I am an absolute beginner in this field. A short explanation of the proposed solutions will be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new cluster with number of nodes equal to numCore (which you haven't stated). This cluster is never destroyed, so with each iteration of the loop you're starting more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop.)
Related
I am supposed to perform a combined K-means + Gaussian mixture Models to determine a set of consensus clusters for a fixes number of clusters (k = 4). My data is composed of 231 cells from 4 different types of tumor which have a total of 19'177 variables (genes in this case).
I have never tried to perform this and I tried to follow the instructions from this R package : https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However I must have done something wrong since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes way too much time and ends up saying this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly my generated vector is too heavy and I must have understood things wrong in the tutorial.
Is someone familiar with diceR package and could explain to me if there is a way to make it work?
The consensus_cluster during it's execution "eats up" memory of R session. You have so many variables that their handling cannot be allocated in the memory.
So you have two choices: increase physical memory or use not full data, but its partial sample. Let's assume that physical memory increase is not feasible. Then you should use prep.data = "sample" option. However you'll need to wait. I model data and for GMM it was 8 hours to wait.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms =c("gmm", "km"), progress = TRUE,
prep.data = "sample")
Output (was not so patient to wait):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h
I'm modelling a quite big network with EpiModel in R, and the code takes very long to run, so I want to run it on multiple cores instead of just 1. I thought this was possible in EpiModel itself, but when I try it, my code just keeps running without starting the simulations. This is the code I am using:
library(EpiModel)
library(parallel)
nw <- network::network.initialize(n=6000, directed=FALSE)
formation <- ~edges + concurrent
target.stats<-c(1500, 600)
coef.diss <- dissolution_coefs(dissolution=~offset(edges), duration = 1)
est <- netest(nw, formation, target.stats, coef.diss)
dx<-netdx(est, nsims=10, nsteps=122, dynamic=FALSE, ncores=4)
init <- init.net(i.num=1, r.num=0)
param <-param.net(inf.prob=0.55, act.rate=0.6, rec.rate=0.05)
control<-control.net(type='SIR', nsteps= 122, nsims =10, ncores=4)
mainsim <- netsim(est, param, init, control)
plot(mainsim, y='si.flow')
When I set ncores to 1 it will run, but any other number doesn't work. Does anybody know how to solve this?
Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data ( by friend's who needs help). The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7ghz and 36gb of ram.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
x = as.vector(M1[i,])
#Taking one row of genetic data to be used in the regression
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
#Running the model
c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1#optinfo[["conv"]][["lme4"]][["messages"]]))
#Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
setTkprogressBar(pb, i)
#This is some code I found here to keep track of my progress
res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took me only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, Future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know how). I've also read a bit about the package "bigmemory" to more efficiently use the large matrix across cores, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional umph if anyone knows more about this.
Any advice would be very appreciated!
For a certain combination of parameters in the deeplearning function of h2o, I get different results each time I run it.
args <- list(list(hidden = c(200,200,200),
loss = "CrossEntropy",
hidden_dropout_ratio = c(0.1, 0.1,0.1),
activation = "RectifierWithDropout",
epochs = EPOCHS))
run <- function(extra_params) {
model <- do.call(h2o.deeplearning,
modifyList(list(x = columns, y = c("Response"),
validation_frame = validation, distribution = "multinomial",
l1 = 1e-5,balance_classes = TRUE,
training_frame = training), extra_params))
}
model <- lapply(args, run)
What would I need to do in order to get consistent results for the model each time I run this?
Deeplearning with H2O will not be reproducible if it is run on more than a single core. The results and performance metrics may vary slightly from what you see each time you train the deep learning model. The implementation in H2O uses a technique called "Hogwild!" which increases the speed of training at the cost of reproducibility on multiple cores.
So if you want reproducible results you will need to restrict H2O to run on a single core and make sure to use a seed in the h2o.deeplearning call.
Edit based on comment by Darren Cook:
I forgot to include the reproducible = TRUE parameter that needs to be set in combination with the seed to make it truly reproducible. Note that this will make it a lot slower to run. And is is not advisable to do this with a large dataset.
More information on "Hogwild!"
Users,
I am looking for a solution to "parallelize" my PLSR predictions in order to save pprocessing time. I was trying to use the "foreach" construct with "doPar" (cf. 2nd part of code below), but I was unable to allocate the predicted values as well as the model performance parameters (RMSEP) to the output variable.
The code:
set.seed(10000) # generate some data...
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method = "euclidean") # distance matrix to find close samples
eDm <- as.matrix(eD)
kns <- matrix(NA,nrow(x),10) # empty matrix to allocate 10 closest samples
for (i in 1:nrow(eDm)) { # identify closest samples in a loop and allocate to kns
kns[i,] <- head(order(eDm[,i]), 11)[-1]
}
So far I consider the code as "safe", but the next part is challenging me, since I never used the "foreach" construct before:
library(pls)
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
out <- foreach(j = 1:nrow(mat), .combine="rbind", .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", , subset=kns[j,])
predict(pls, ncomp=5, newdata=x[j,,drop=F])
RMSEP(pls, estimate="CV")$val[1,1,5]
}
stopCluster(cl)
As I understand, the code line starting with "RMSEP(pls,..." is simply overwriting the previously written data from the "predict" code line. Somehow I was assuming the .combine option would take care of this?
Many thanks for your help!
Best, Chega
If you want to return two objects from the body of a foreach loop, you need to put them into an object such as a list:
out <- foreach(j = 1:nrow(mat), .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", , subset=kns[j,])
list(p=predict(pls, ncomp=5, newdata=x[j,,drop=F]),
r=RMSEP(pls, estimate="CV")$val[1,1,5])
}
Only the "final value" of the loop body is returned to the master and then processed by the .combine function.
Note that I removed the .combine argument so that the result will be a list of lists of length 2. It's not clear to me that rbind is the appropriate function to use to process the results.
Since this question was originally answered, the pls package has been modified to allow the cross-validation to be run in parallel. The implementation is trivially easy--simply a matter of defining either a persistent cluster, or the number of cores to use in a transient cluster, in pls.options.
If transient clusters are used, implementation literally requires only two lines of code:
library(parallel)
pls.options(parallel=NumberOfCoresToUse)
No changes to the output variables are needed.
I haven't checked whether parallelizing at the calibration level, as in the question, would be more efficient. I suspect it would be, particularly when the number of calibration iterations is much larger than the number of cross-validation steps (especially when the number of CVs isn't a multiple of the number of cores used), but this approach is so straightforward that the extra coding effort may not be worth it.