How to run a single model using parallel processing in R? - r

I am running a single network model in R. I want to utilize parallel processing in R to run it. I see examples online but they are usually for multiple operations.
model.2 <- ergmm(network.M.CS ~ euclidean(d=2, G=2)+
nodematch("Party", diff = F) +
nodematch("State", diff = F) +
absdiff("Ideology")+
edgecov(Donor.Network),
response = "Norm.Num.Bill.CS",
family = c("Bernoulli"),
control=ergmm.control(burnin=20000, sample.size= 4000,interval=10),
verbose=T)
summary(model.2)
As an example, this is the model I would like to run. Any tips you can offer that would allow me to use parallel processing on this single model would be much appreciated. Thank you!

Related

Error in lme4::allFit() -- no applicable method for 'isGLMM'

I'm hitting a confusing error while trying to run the lme4::allFit() using some built-in parallelization. I fit an initial model m0, which uses a larger dataframe ckDF (n = 265,623 rows) to model a binary response to a number of categorical and continuous predictors in a logistic framework with a random intercept for year.
I'm interested in determining whether different optimizers yield different results, following some recommendations I've found online (e.g. by @BenBolker here). My data is fairly large and a fit usually takes ~20 minutes, so I'm hoping to use the parallel and ncpus parameters of allFit() to speed it up a bit. Here's my relevant code:
require(lme4)
require(parallel)
m0 <- glmer(returned ~ 1 + barge + site + barge:site +
(run + rearType + basin)^2 +
(tdg + temp + holdingTime)^2 +
(1|year),
data = ckDF, family = 'binomial',
control = glmerControl(optimizer='bobyqa',
optCtrl = list(maxfun = 1e5)))
af1 <- allFit(m0, parallel = 'multicore', ncpus = detectCores())
Upon doing this, I encounter the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: no applicable method for 'isGLMM' applied to an object of class "list"
Any ideas? It seems to me that when it constructs a bunch of nodes, somehow some of them don't import the lme4 package and thus do not recognize isGLMM(); but I don't know why allFit() would do this, since it's from lme4(). I tried looking under the hood and altering the function for my own allFit() package, but ran into other errors.
Any help would be appreciated. R Version: 3.6.1; lme4 Version: 1.1-21; platform: Windows 10 64-bit
Thanks to @user20650 & @Ben Bolker for the tips in comments above -- it worked and I was able to get allFit() to run as expected, by ensuring I use parallel = "snow" in my function call since I'm running in Windows. Just posting the edited code here for anyone else who finds this useful:
require(lme4); require(parallel)  # detectCores() and makeCluster() both live in parallel
# Define initial model (switched to defaults here)
m0 <- glmer(returned ~ 1 + barge + site + barge:site +
(run + rearType + basin)^2 +
(tdg + temp + holdingTime)^2 +
(1|year),
data = ckDF, family = 'binomial')
# Set up cluster for running allFit()
optCls <- makeCluster(detectCores()-1, type = "PSOCK")
clusterEvalQ(optCls,library("lme4"))
clusterExport(optCls, "ckDF")
# Use allFit() to look at differences in optimizers
system.time(af1 <- allFit(m0, parallel = 'snow',
ncpus = detectCores()-1, cl=optCls))
stopCluster(optCls)
Ended up taking ~40 minutes using 11 cores on my machine.
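The set-up steps in that answer (start a socket cluster, load packages on each worker, export the data, run, stop) are not specific to lme4. Here is a minimal sketch of the same pattern using only the base parallel package. The data frame `dat`, the function `fit_one`, and the choice of two workers are made up for illustration and are not part of the original answer:

```r
library(parallel)

# Hypothetical data and task, for illustration only
dat <- data.frame(x = 1:100, y = 2 * (1:100) + 1)
fit_one <- function(deg) coef(lm(y ~ poly(x, deg, raw = TRUE), data = dat))

cl <- makeCluster(2, type = "PSOCK")   # socket cluster: also works on Windows
clusterEvalQ(cl, library(stats))       # load the packages each worker needs
clusterExport(cl, "dat")               # copy the data to every worker
res <- parLapply(cl, 1:3, fit_one)     # one task per list element
stopCluster(cl)                        # always release the workers
```

Socket workers start as fresh R sessions, which is why packages and data have to be sent to them explicitly. Forked workers (parallel = 'multicore') inherit them from the parent session, but forking is not available on Windows.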

Consistent results with multiple runs of h2o deeplearning

For a certain combination of parameters in the deeplearning function of h2o, I get different results each time I run it.
args <- list(list(hidden = c(200,200,200),
loss = "CrossEntropy",
hidden_dropout_ratio = c(0.1, 0.1,0.1),
activation = "RectifierWithDropout",
epochs = EPOCHS))
run <- function(extra_params) {
model <- do.call(h2o.deeplearning,
modifyList(list(x = columns, y = c("Response"),
validation_frame = validation, distribution = "multinomial",
l1 = 1e-5,balance_classes = TRUE,
training_frame = training), extra_params))
}
model <- lapply(args, run)
What would I need to do in order to get consistent results for the model each time I run this?
Deeplearning with H2O will not be reproducible if it is run on more than a single core. The results and performance metrics may vary slightly from what you see each time you train the deep learning model. The implementation in H2O uses a technique called "Hogwild!" which increases the speed of training at the cost of reproducibility on multiple cores.
So if you want reproducible results you will need to restrict H2O to run on a single core and make sure to use a seed in the h2o.deeplearning call.
Edit based on comment by Darren Cook:
I forgot to include the reproducible = TRUE parameter, which needs to be set in combination with the seed to make it truly reproducible. Note that this will make it a lot slower to run, so it is not advisable to do this with a large dataset.
More information on "Hogwild!"
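Putting the answer's pieces together, a reproducible run might look like the sketch below. It uses the built-in iris data as a stand-in for the asker's training frame, and the layer sizes, epoch count, and seed value are arbitrary assumptions:

```r
library(h2o)

h2o.init(nthreads = 1)                 # restrict H2O to a single core
train <- as.h2o(iris)                  # stand-in data, not the asker's frames

model <- h2o.deeplearning(
  x = setdiff(names(iris), "Species"),
  y = "Species",
  training_frame = train,
  hidden = c(10, 10),                  # small network, chosen only for speed
  epochs = 5,
  seed = 1234,                         # fixed seed
  reproducible = TRUE                  # deterministic but slow training
)
```

In the asker's code, seed and reproducible could be added to the list passed to modifyList() in the same way.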

makeCluster with parallelSVM in R takes up all Memory and swap

I'm trying to train an SVM model on a large dataset (~110k training points). This is a sample of the code where I am using the parallelSVM package to parallelize the training step on a subset of the training data on my 4 core Linux machine.
numcore = 4
train.time = c()
for(i in 1:5)
{
cl = makeCluster(4)
registerDoParallel(cores=numCore)
getDoParWorkers()
dummy = train_train[1:10000*i,]
begin = Sys.time()
model.svm = parallelSVM(as.factor(target) ~ .,data =dummy,
numberCores=detectCores(),probability = T)
end = Sys.time() - begin
train.time = c(train.time,end)
stopCluster(cl)
registerDoSEQ()
}
The idea of this snippet of code is to estimate the time it'll take to train the model on the entire dataset by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, this is the memory and swap usage history from the System Monitor. After 4 runs of the for loop, both memory and swap usage are at about 95%, and I get the following error:
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after using the stopCluster() function ?
Please take into consideration the fact that I am an absolute beginner in this field. A short explanation of the proposed solutions will be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new cluster with number of nodes equal to numCore (which you haven't stated). This cluster is never destroyed, so with each iteration of the loop you're starting more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop.)
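A minimal sketch of that restructuring, assuming the doParallel and foreach packages and two workers. The foreach call is only a placeholder for the model fit; the asker's parallelSVM() call would go in its place:

```r
library(doParallel)   # also attaches foreach and parallel

cl <- makeCluster(2)        # create the cluster once, before the loop
registerDoParallel(cl)      # register that cluster rather than passing cores=

train.time <- c()
for (i in 1:5) {
  begin <- Sys.time()
  # Placeholder workload standing in for the model fit
  res <- foreach(j = 1:4, .combine = c) %dopar% sum(rnorm(1000 * i))
  elapsed <- as.numeric(difftime(Sys.time(), begin, units = "secs"))
  train.time <- c(train.time, elapsed)
}

stopCluster(cl)             # shut the workers down once, after the loop
registerDoSEQ()             # fall back to sequential execution
```

Because only one cluster is ever created and it is explicitly stopped, the number of worker R processes stays fixed no matter how many iterations run.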

Export Linear Mixed Effects Model Outputs in csv using Julia Language

I am new to Julia programming language, however, I am fitting a Linear Mixed Effects Model and I find it difficult to save the fixed and random effects estimates in .csv files.
An example code can be found:
using MixedModels
@time modelOutput = fit(lmm(Y~ A + B + (0 + A | group), data))
There is available reference about how to obtain the fixed (fixef(modelOutput)) and random (ranef(modelOutput)) effects however using a DataFrame I am facing errors.
Any advice is appreciated.
Okay, I actually took the time to do this for you. A CoefTable is a type defined in statmodels here. Given this information, we can extract the relevant information from the CoefTable instance as follows:
df = DataFrame(variable = ct.rownms,
Estimate = ct.mat[:,1],
StdError = ct.mat[:,2],
z_val = ct.mat[:,3])
This will give an nvar-by-4 DataFrame which you can then write to csv as described earlier using writetable("output.csv",df)
I had a number of problems getting the accepted answer to work; Julia has evolved a lot since then. I rewrote it based primarily on code from the jglmm R package, with some adaptation/cobbling-together from other sources ...
"""
outfun(m, outfn="output.csv")
output the coefficient table of a fitted model to a file
"""
outfun = function(m, outfn="output.csv")
ct = coeftable(m)
coef_df = DataFrame(ct.cols);
rename!(coef_df, ct.colnms, makeunique = true)
coef_df[!, :term] = ct.rownms;
CSV.write(outfn, coef_df);
end

Using parallel processors with the Amelia package

I want to create multiple data sets with Amelia, but the data set is large so it takes a long time. As a result, I'm trying to run the multiple imputation with parallel processors in Windows. Could someone help me?
library(Amelia)
library(parallel)
detectCores(all.tests = FALSE, logical = TRUE)
[1] 4
mi <- amelia(impute, m=10,
idvars=c("ID","SCHL","SEX","WAVE", "YEAR"),
parallel=c("snow"), cl=cluster(c("localhost")))
I don't know how to write up this command.
Try using the multicore package instead. Works for me:
library(Amelia)
library(multicore)
mi <- amelia(impute, m=10,
idvars=c("ID","SCHL","SEX","WAVE", "YEAR"),
parallel = "multicore" , ncpus = 4)
In the comments, you say that your posted code "works", but that execution time is the same when not using the parallel option. Perhaps your data set is relatively small and does not benefit from being split?
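Since the question is about Windows, where forking-based "multicore" parallelism is not available, a socket-based call may be the safer option. A minimal sketch, using the africa example data bundled with Amelia in place of the asker's impute data frame, and assuming two worker processes:

```r
library(Amelia)

data(africa)                      # small example data set shipped with Amelia

mi <- amelia(africa, m = 10,
             cs = "country", ts = "year",
             logs = "gdp_pc",
             p2s = 0,             # suppress progress output
             parallel = "snow",   # socket workers, usable on Windows
             ncpus = 2)

length(mi$imputations)            # number of imputed data sets returned
```

When no cl argument is supplied, amelia() creates a local cluster for the duration of the call, so there is no need to build one by hand.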
