Using parallel processing in vegan functions?

I am interested in executing the R function adonis from the vegan package in parallel. However, it isn't clear to me how exactly to make it run in parallel. Regardless of how I try to initialize it, it seems to take the same amount of time to execute. Can someone explain what I am doing wrong?
require(vegan)
require(parallel)
data(dune)
data(dune.env)
#This:
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
#Runs faster (4.49 s) than this (6.7 s):
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
#or this (6.7 s)
cl <- makeCluster(3)
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=cl))
stopCluster(cl)
Computer details:
R V4.0
Win 10x64
i5-8350 4 cores

I'm not sure how helpful this answer will really be, but I'll share a few of my own observations and things I've slowly pieced together. I don't pretend to be an expert on this, so take my answer with the understanding that there may be some inaccuracies in it. I'm a biologist first.
Some of these parallel libraries seem to reload the R environment and run any startup files (e.g. .Rprofile) on each worker. So there is an inherent time cost to using the parallel libraries, which means you will only see a benefit from parallelization when the computation is large enough to be worth it (in your example, the dune dataset is really small; I'll share my own benchmarks below). That said, there are a few things that seem to help.
Using the doParallel library, you can specify arguments to not load unnecessary information into your session like so:
library(doParallel)
library(phyloseq)   #UniFrac(); 'd' below is a phyloseq object that includes a tree
cl <- makeCluster(3, rscript_args = c("--no-init-file", "--no-site-file", "--no-environ"))
#alternatively: cl <- makePSOCKcluster(2)
registerDoParallel(cl)
unif_w  <- UniFrac(d, weighted = TRUE, parallel = TRUE, normalized = TRUE)
unif_uw <- UniFrac(d, weighted = FALSE, parallel = TRUE)
stopCluster(cl)
I noticed in my own work that adding the rscript_args option greatly improved my speeds (sorry, no benchmarks for this; I was hoping to get a quick answer out). If I remember the source where I got that suggestion from, I'll come back and share it.
This doesn't help with running adonis directly, but I think that initial time cost might explain why we don't see a time benefit from the parallel options built into adonis on the dune dataset. Here are my benchmarks.
> data("dune")
> data("dune.env")
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
user system elapsed
3.90 0.00 3.93
> #Runs faster (4.49 s) than this (6.7 s):
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
user system elapsed
0.71 0.04 6.53
Not a big difference on this set, but it IS slower in parallel. However, I repeated this with a large set I'm working with at the moment (bc is a distance matrix calculated from a species matrix of 887 species by 3734 sites):
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 1))
user system elapsed
109.95 21.27 131.22
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 4))
user system elapsed
3.44 1.41 82.36
Long story short: in this specific case, you will likely only see a benefit from adonis's parallel option on a larger dataset.
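To see where the crossover happens, one option is to scale the problem up with simulated data. This is only a rough sketch of mine (the matrix sizes and permutation count are arbitrary), not a proper benchmark:
library(vegan)
set.seed(1)
#simulate a larger community matrix: 300 sites x 200 species of random counts
comm <- matrix(rpois(300 * 200, lambda = 2), nrow = 300, ncol = 200)
env  <- data.frame(group = factor(sample(letters[1:4], 300, replace = TRUE)),
                   covar = runif(300))
#same call twice: serial vs. three workers
system.time(adonis(comm ~ group * covar, data = env, permutations = 9999))
system.time(adonis(comm ~ group * covar, data = env, permutations = 9999, parallel = 3))
The exact crossover will depend on your CPU and on how many permutations you ask for.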
I'm not sure how important computer specs are here, but I do have a large amount of memory intended for this kind of work. In my case the memory matters more for letting me work with large matrices a little more easily.
R version: 4.0.2
Windows 10, 64bit
AMD Ryzen 3600
64 GB RAM
Anyways, I'm still looking for other work-arounds and tricks.

Related

How long for forecast::auto.arima to work?

I am working through some demo code that accompanied a Medium post on high-frequency time series forecasting using the forecast::auto.arima function. Whether in this application or with other datasets I have tried, I have never been able to get a result from this function - it never seems to finish once I have executed it. Others obviously have, so I'm asking how long you need to wait to get a result from the function.
Example
Data on hourly energy use can be downloaded from Kaggle.
Using that data:
library(tidyverse)
library(lubridate)   #decimal_date()
library(forecast)
duq_pc <- read.csv(file.choose(),stringsAsFactors = F)
duq_new <- duq_pc[duq_pc$Datetime >= '2013-01-01 00:00:00' & duq_pc$Datetime <= '2017-09-30 00:00:00',]
duq_train <- duq_new[duq_new$Datetime <= '2016-12-31',]
duq_test <- duq_new[duq_new$Datetime >= '2017-01-01',]
msts_power <- msts(duq_train$DUQ_MW, seasonal.periods = c(24,24*7,24*365.25), start = decimal_date(as.POSIXct("2013-01-01 00:00:00")))
#Dynamic Harmonic regression with Auto Arima
fourier_power <- auto.arima(msts_power, seasonal=FALSE, lambda=0,
xreg=fourier(msts_power, K=c(10,10,10)))
Very interested to hear whether this is something that I would need to leave running overnight or whether other people are getting results in minutes.
In my case, running your code and measuring the time, it took about 40 minutes to finish. For what it's worth, I ran the script on a computer with an AMD Ryzen 2700 eight-core processor at 3.20 GHz and 16 GB of RAM.
It really depends on the size of your dataset and your computer specs. You can use the tictoc library for an easy way to record the elapsed time, and set trace = TRUE in your auto.arima call to print its progress to the console.
library(tictoc)
tic()
fourier_power <- auto.arima(msts_power, seasonal = FALSE, lambda = 0, trace = TRUE,
                            xreg = fourier(msts_power, K = c(10, 10, 10)))
toc()

R Torch memory usage on GPU for convolutional LSTM tutorial

The convolutional LSTM tutorial linked from the Torch for R front page is written to run on the CPU by default -- no calls to a GPU device. I can set it up and run it just fine.
When I make a GPU modification, as follows in the code below,
model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3,3), n_layers = 2)
#---CUDA modification
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
#---
optimizer <- optim_adam(model$parameters)
num_epochs <- 100
for (epoch in 1:num_epochs) {
model$train()
batch_losses <- c()
for (b in enumerate(dl)) {
optimizer$zero_grad()
preds <- model(b$x$to(device = device))[[2]][[2]][[1]] #last time step output from the last layer
loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
batch_losses <- c(batch_losses, loss$item())
loss$backward()
optimizer$step()
}
if (epoch %% 10 == 0)
cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}
I get 40-50 epochs in before hitting an obvious GPU out-of-memory error:
Error in (function (self, gradient, retain_graph, create_graph) :
CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 6.00 GiB total capacity; 4.40 GiB already allocated; 10.19 MiB free; 4.74 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFF51AFA7B200007FFF51AFA750 c10.dll!c10::Error::Error [<unknown file> # <unknown line number>]
My GPU is fairly dated, a 2014-2015 GeForce GTX 970M with 3 GB of native memory. But the tensors in this example are not particularly large: they are synthetic tensors of dim (100, 6, 1, 24, 24), although admittedly this structure preserves all of the hidden and cell states. I don't have the background to calculate what this 'should' be using, but my intuition is that something about this setup (or the current R torch implementation) isn't cleaning up after itself, particularly across training epochs.
Is anyone able to either reproduce or easily run a GPU modification of my example, and is there a straightforward solution here or is there simply a fundamental limit to the capacity of my GPU in this case?
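One thing that might be worth trying (this is only a hedged sketch of the idea, not a confirmed fix): drop the per-batch tensors explicitly and give R's garbage collector a chance to run between epochs, so the tensor finalizers can hand GPU memory back. cuda_empty_cache() exists in recent releases of the torch package; skip that line if your version lacks it. model, dl, optimizer and device are the objects defined in the code above.
for (epoch in 1:num_epochs) {
model$train()
batch_losses <- c()
for (b in enumerate(dl)) {
optimizer$zero_grad()
preds <- model(b$x$to(device = device))[[2]][[2]][[1]]
loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
batch_losses <- c(batch_losses, loss$item())
loss$backward()
optimizer$step()
rm(preds, loss) #drop references to this batch's graph
}
gc() #run finalizers so the freed tensors can release their CUDA buffers
if (exists("cuda_empty_cache")) cuda_empty_cache() #return cached blocks to the allocator
if (epoch %% 10 == 0)
cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}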

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data by a friend who needs help. The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7 GHz and 36 GB of RAM.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
x = as.vector(M1[i,])
#Taking one row of genetic data to be used in the regression
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
#Running the model
c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
#Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
setTkProgressBar(pb, i)
#This is some code I found here to keep track of my progress
res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, future, or even Microsoft R Open might work better, but I haven't had much success with any of them (likely due to my own lack of know-how). I've also read a bit about the bigmemory package for sharing the large matrix across cores more efficiently, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional oomph, if anyone knows more about that.
Any advice would be very appreciated!
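One idea that might help with the memory side (an untested sketch of mine; the chunk count of 500 is arbitrary): rather than exporting the whole 728396 x 276 matrix to every worker, stream blocks of rows to the workers with itertools::isplitRows(), so each task only carries its own chunk plus DF1. The progress bar is left out for brevity.
library(doParallel)
library(itertools) #isplitRows()
library(lme4)
clust = makeCluster(5)
registerDoParallel(clust)
model1 = foreach(M_chunk = isplitRows(M1, chunks = 500), .packages = "lme4", .combine = rbind) %dopar% {
#DF1 is still exported automatically because it is referenced inside the loop body
t(apply(M_chunk, 1, function(row) {
x = as.vector(row)
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
}))
}
stopCluster(clust)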

Parallel processing with xgboost and caret

I want to parallelize the model fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the nthread parameter controls the number of threads to use while fitting the models, in the sense of building the trees in parallel. Caret's train function performs parallelization in the sense of, for example, running a process for each iteration in a k-fold CV. Is this understanding correct, and if so, is it better to:
Register the number of cores (for example, with the doMC package and the registerDoMC function), set nthread=1 via caret's train function so it passes that parameter to xgboost, set allowParallel=TRUE in trainControl, and let caret handle the parallelization for the cross-validation; or
Disable caret parallelization (allowParallel=FALSE and no parallel back-end registration) and set nthread to the number of physical cores, so the parallelization is contained exclusively within xgboost.
Or is there no "better" way to perform the parallelization?
Edit: I ran the code suggested by @topepo, with tuneLength = 10 and search = "random", and specifying nthread = 1 on the last line (otherwise I understand that xgboost will use multithreading). These are the results I got:
xgb_par[3]
elapsed
283.691
just_seq[3]
elapsed
276.704
mc_par[3]
elapsed
89.074
just_seq[3]/mc_par[3]
elapsed
3.106451
just_seq[3]/xgb_par[3]
elapsed
0.9753711
xgb_par[3]/mc_par[3]
elapsed
3.184891
In the end, it turned out that, both for my data and for this test case, letting caret handle the parallelization was the better choice in terms of runtime.
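For reference, the setup that won out for me looks roughly like the sketch below (the cluster size and tuneLength are just the values I happened to use, and doParallel is shown as a portable stand-in for the doMC backend used in the answer):
library(caret)
library(xgboost)
library(doParallel)
set.seed(1)
dat <- twoClassSim(1000) #same simulated data as in the answer below
cl <- makeCluster(5)
registerDoParallel(cl)
set.seed(2)
mod <- train(Class ~ ., data = dat,
             method = "xgbTree", tuneLength = 10,
             nthread = 1, #passed through to xgboost: one thread per worker
             trControl = trainControl(search = "random", allowParallel = TRUE))
stopCluster(cl)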
It is not simple to project what the best strategy would be. My (biased) thought is that you should parallelize the process that takes the longest. Here, that would be the resampling loop since an open thread/worker would invoke the model many times. The opposite approach of parallelizing the model fit will start and stop workers repeatedly and theoretically slows things down. Your mileage may vary.
I don't have OpenMP installed but there is code below to test (if you could report your results, that would be helpful).
library(caret)
library(plyr)
library(xgboost)
library(doMC)
foo <- function(...) {
set.seed(2)
mod <- train(Class ~ ., data = dat,
method = "xgbTree", tuneLength = 50,
..., trControl = trainControl(search = "random"))
invisible(mod)
}
set.seed(1)
dat <- twoClassSim(1000)
just_seq <- system.time(foo())
## I don't have OpenMP installed
xgb_par <- system.time(foo(nthread = 5))
registerDoMC(cores=5)
mc_par <- system.time(foo())
My results (without OpenMP)
> just_seq[3]
elapsed
326.422
> xgb_par[3]
elapsed
319.862
> mc_par[3]
elapsed
102.329
>
> ## Speedups
> xgb_par[3]/mc_par[3]
elapsed
3.12582
> just_seq[3]/mc_par[3]
elapsed
3.189927
> just_seq[3]/xgb_par[3]
elapsed
1.020509

Get randomForest regression faster in R

I have to fit a random forest regression in R. My problem is that my data frame is huge: I have 12 variables and more than 400k entries. When I try to fit a randomForest regression (the code is at the bottom), the system takes many hours to process the data: after 5 or 6 hours of calculation, I am obliged to stop the operation without any output. Can someone suggest how I can get it faster?
Thanks
library(caret)
library(randomForest)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
I don't think parallelizing on a single PC (2-4 cores) is the answer. There are plenty of lower-hanging fruits to pick.
1) RF models increase in complexity with the number of training samples. The average tree depth would be something like log(480,000/5)/log(2) = 16.5 intermediate nodes. In the vast majority of examples, 2000-10000 samples per tree is fine. If you're competing to win on Kaggle, a small extra bit of performance really matters, as winner takes all; in practice, you probably don't need that.
2) Don't clone your data set in your R code, and try to keep only one copy of it (passing by reference is of course fine). It's not a big problem for this data set, as it is not that big (~38 MB) even for R.
3) Don't use the formula interface with the randomForest algorithm for large datasets; it will make an extra copy of the data set. But again, memory is not much of a problem here.
4) Use a faster RF algorithm: extraTrees, ranger or Rborist are available for R. extraTrees is not exactly an RF algorithm, but it is pretty close.
5) Avoid categorical features with more than 10 categories. RF can handle up to 32, but it becomes super slow because up to 2^32 possible splits have to be evaluated. extraTrees and Rborist handle more categories by testing only some randomly selected splits (which works fine). Another solution, as in Python's sklearn, is to assign every category a unique integer and handle the feature as numeric. You can convert your categorical features with as.numeric before running randomForest to do the same trick.
6) For much bigger data: split the data set into random blocks and train a few (~10) trees on each, then combine the forests or save them separately (a small sketch of this appears after the code block below). This will slightly increase the tree correlation. There are some nice cluster implementations for training like this, but it won't be necessary for datasets below 1-100 GB, depending on tree complexity etc.
#below I use solution 1-3) and get a run time of some minutes
library(randomForest)
#simulate data
dataset <- data.frame(replicate(12,rnorm(400000)))
dataset$Clip_pm25 = dataset[,1]+dataset[,2]^2+dataset[,4]*dataset[,3]
#dati <- data.frame(dataset) #no need to keep the data set an extra time in memory
#attach(dati) #if you attach dati you don't need to write dati$Clip_pm25, just Clip_pm25
#but avoid the formula interface for randomForest with large data sets because it costs extra memory and time
#split data in X and y manually
y = dataset$Clip_pm25
X = dataset[,names(dataset) != "Clip_pm25"]
rm(dataset);gc()
object.size(X) #38Mb, no problemo
#if you were using formula interface
#output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
#output.forest <- randomForest(dati$Clip_pm25 ~ ., ntree=250) # use dot to indicate all variables
#start small, and scale up slowly
rf = randomForest(X,y,sampsize=1000,ntree=5) #runtime ~15 seconds
print(rf) #~67% explained var
#you probably really don't need to exceed 5000-10000 samples per tree; you could grow 2000 trees to sample most of the training set
rf = randomForest(X,y,sampsize=5000,ntree=500) # runtime ~5 minutes
print(rf) #~87% explained var
#regarding parallel
#here you could implement some parallel looping
#.... but is it really worth it for a 2-4x speedup?
#coding parallel on a single PC is fun but rarely worth the effort
#If you work at a company or university with a decent computer cluster,
#then you can spawn the process across 20-80-200 nodes and get a ~10-60-150x speedup
#I can recommend the BatchJobs package
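As a hedged sketch of point 6 above (my own illustration, reusing the simulated X and y from the code just shown): train small forests on random row blocks and merge them with randomForest::combine(). Note that combine() drops the OOB error statistics of the merged forest, so you would evaluate it on held-out data instead.
blocks = split(seq_len(nrow(X)), sample(rep(1:10, length.out = nrow(X))))
forests = lapply(blocks, function(idx) randomForest(X[idx,], y[idx], ntree = 50, sampsize = min(5000, length(idx))))
rf_all = do.call(randomForest::combine, forests) #one combined forest with ~500 trees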
Since you are using caret, you could use method = "parRF". This is an implementation of parallel random forest.
For example:
library(caret)
library(randomForest)
library(doParallel)
cores <- 3
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
# 3 times cross validation.
my_control <- trainControl(method = "cv", number = 3 )
my_forest <- train(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
                   data = trainSet,
                   method = "parRF",
                   ntree = 250,
                   trControl = my_control)
Here is a foreach implementation as well:
foreach_forest <- foreach(ntree=rep(250, cores),
.combine=combine,
.multicombine=TRUE,
.packages="randomForest") %dopar%
randomForest(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
data = trainSet, ntree=ntree)
# don't forget to stop the cluster
stopCluster(cl)
Remember I didn't set any seeds; you might want to consider that as well. And here is a link to a randomForest package that also runs in parallel, but I have not tested it.
The other two answers are good. Another option is to use more recent packages that are purpose-built for high-dimensional / high-volume data sets. They run their code in lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
ranger (uses C++ compiler)
randomForestSRC (uses C++ compiler)
h2o (Java compiler - needs Java version 8 or higher)
Also, some additional reading here to give you more to go off on which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks of ranger against randomForest for growing data sizes - ranger is WAY faster because its runtime grows roughly linearly, rather than non-linearly as for randomForest, with rising tree/sample/split/feature sizes.
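As a quick hedged illustration only (reusing the simulated X and y from the first answer), the same regression in ranger looks like this; num.threads controls its built-in parallelism:
library(ranger)
dat_r <- data.frame(Clip_pm25 = y, X)
rf_ranger <- ranger(Clip_pm25 ~ ., data = dat_r, num.trees = 250, num.threads = 4)
rf_ranger$r.squared #OOB R-squared for the regression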
Good Luck!
