I am working through some demo code that accompanied a Medium post on high-frequency time series forecasting using the forecast::auto.arima function. Whether in this application or with other datasets, I have never been able to get a result from this function: once I have executed it, it never seems to finish calculating. Others obviously have, so I'm asking how long you need to wait to get a result from the function.
Example
Data on hourly energy use can be downloaded from Kaggle.
Using that data:
library(tidyverse)
library(forecast)
duq_pc <- read.csv(file.choose(),stringsAsFactors = F)
duq_new <- duq_pc[duq_pc$Datetime >= '2013-01-01 00:00:00' & duq_pc$Datetime <= '2017-09-30 00:00:00',]
duq_train <- duq_new[duq_new$Datetime <= '2016-12-31',]
duq_test <- duq_new[duq_new$Datetime >= '2017-01-01',]
msts_power <- msts(duq_train$DUQ_MW, seasonal.periods = c(24,24*7,24*365.25), start = decimal_date(as.POSIXct("2013-01-01 00:00:00")))
#Dynamic Harmonic regression with Auto Arima
fourier_power <- auto.arima(msts_power, seasonal=FALSE, lambda=0,
xreg=fourier(msts_power, K=c(10,10,10)))
Very interested to hear whether this is something that I would need to leave running overnight or whether other people are getting results in minutes.
In my case, running your code and timing it, it took about 40 minutes to finish. For what it's worth, I launched the script on a computer with an AMD Ryzen 2700 eight-core processor at 3.20 GHz and 16 GB of RAM.
It really depends on the size of your dataset and your computer specs. You can use the tictoc library for an easy way to record time, and set trace = TRUE in your auto.arima call to output its progress to console.
library(tictoc)
tic()
fourier_power <- auto.arima(msts_power, seasonal=FALSE, lambda=0, trace = TRUE,
xreg=fourier(msts_power, K=c(10,10,10)))
toc()
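If you would rather get a feel for the run time before committing to the full fit, one option is to time the same kind of call on growing subsets with base R's system.time. The following is only a sketch: it uses simulated AR(1) data and stats::arima as a cheap stand-in for the auto.arima call above, and time_fit and sizes are names made up for the illustration.

```r
set.seed(1)

# Hypothetical helper: fit a small ARIMA model to n simulated points
# and return the elapsed wall-clock time in seconds.
time_fit <- function(n) {
  y <- arima.sim(list(ar = 0.5), n = n)            # simulated AR(1) series
  system.time(arima(y, order = c(1, 0, 0)))[["elapsed"]]
}

sizes   <- c(500, 1000, 2000)                      # assumed subset sizes
timings <- vapply(sizes, time_fit, numeric(1))

# One row per subset size, to see how the run time scales
data.frame(n = sizes, elapsed_sec = timings)
```

If the elapsed time grows steeply with n on your machine, that is a hint the full hourly series will need a long run as well.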
Related
I'm using the psych package to compute tetrachoric correlations for a very large dataset, comprising 1000 variables and 288,059 cases.
The data can be downloaded here:
https://www.dropbox.com/s/iqwgdywqfjvlkku/data.csv.zip?dl=0
(4MB)
My code looks like the following:
library(psych)
library(tidyverse)
temp = read.csv("~/Temp/data.csv", sep=",")
tetravalues = tetrachoric(temp, delete=FALSE)
tetraframe = tetravalues$rho
write.csv(tetraframe, file="~/Temp/output.csv")
Currently, this bit of the code has been running for 8 hours and hasn't ended yet:
tetravalues = tetrachoric(temp, delete=FALSE)
According to the psych package manual (tetrachoric):
This is a computationally intensive function which can be speeded up considerably by using multiple cores and using the parallel package. The number of cores to use when doing polychoric or tetrachoric may be specified using the options command. The greatest step up in speed is going from 1 cores to 2. This is about a 50% savings. Going to 4 cores seems to have about at 66% savings, and 8 a 75% savings. The number of parallel processes defaults to 2 but can be modified by using the options command: options("mc.cores"=4) will set the number of cores to 4.
My laptop has 10 cores.
I'm new to R, and I haven't been able to figure out how to run my code in parallel.
Any ideas are appreciated.
You basically provided the answer to your question yourself.
You can adjust the number of cores in the code below.
Note that when you want to use your laptop for other things while the computation is running, I would not set the number of cores to the maximum.
Here is a quick intro about parallel computing in R.
library(psych)
library(tidyverse)
# Here you can pick the number of cores.
options("mc.cores"=4)
temp = read.csv("~/Temp/data.csv", sep=",")
tetravalues = tetrachoric(temp, delete=FALSE)
tetraframe = tetravalues$rho
write.csv(tetraframe, file="~/Temp/output.csv")
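If you don't want to hard-code the number, you could derive it from what the machine reports and leave one core free for other work. This is a minimal sketch using the base parallel package; n_detected and n_use are just names chosen for the example, and leaving exactly one core free is an assumption, not a rule.

```r
library(parallel)

n_detected <- detectCores()                 # can return NA on some platforms
if (is.na(n_detected)) n_detected <- 1L     # fall back to a single core

n_use <- max(1L, n_detected - 1L)           # keep one core for everything else
options(mc.cores = n_use)                   # the option the psych manual refers to

getOption("mc.cores")                       # check what was actually set
```

Set the option before calling tetrachoric so the value is in place when the computation starts.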
I'm modelling quite a big network with EpiModel in R, and the code takes very long to run, so I want to run it on multiple cores instead of just 1. I thought this was possible in EpiModel itself, but when I try it, my code just keeps running without starting the simulations. This is the code I am using:
library(EpiModel)
library(parallel)
nw <- network::network.initialize(n=6000, directed=FALSE)
formation <- ~edges + concurrent
target.stats<-c(1500, 600)
coef.diss <- dissolution_coefs(dissolution=~offset(edges), duration = 1)
est <- netest(nw, formation, target.stats, coef.diss)
dx<-netdx(est, nsims=10, nsteps=122, dynamic=FALSE, ncores=4)
init <- init.net(i.num=1, r.num=0)
param <-param.net(inf.prob=0.55, act.rate=0.6, rec.rate=0.05)
control<-control.net(type='SIR', nsteps= 122, nsims =10, ncores=4)
mainsim <- netsim(est, param, init, control)
plot(mainsim, y='si.flow')
When I set ncores to 1 it will run, but any other number doesn't work. Does anybody know how to solve this?
I am interested in executing the R function adonis from the vegan package in parallel. However, it isn't clear to me how exactly to make it run in parallel. Regardless of how I try to initialize it, it seems to take the same amount of time to execute. Can someone explain what I am doing wrong?
require(vegan)
require(parallel)
data(dune)
data(dune.env)
#This:
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
#Runs faster (4.49 s) than this (6.7 s):
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
#or this (6.7 s)
cl <- makeCluster(3)
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=cl))
stopCluster(cl)
Computer details:
R V4.0
Win 10x64
i5-8350 4 cores
I'm not sure how helpful this answer will really be, but I'll share a few of my own observations and things I've slowly pieced together. I don't pretend to be an expert on this, so take my answer realizing there may be some inaccuracies in here. I'm a biologist first.
Some of these parallel libraries seem to reload the R environment and run any startup files (e.g. .Rprofile) you have once per core. So there is an inherent time cost to using the parallel libraries, which means you will only see a benefit from parallel functions if the computation is large enough to be worth the parallelization (in your example, the dune dataset is really small; I'll share my own benchmarks below). That said, there are a few things that seem to help.
Using the doParallel library, you can specify arguments to not load unnecessary information into your session like so:
library(doParallel)
cl <- makeCluster(3, rscript_args = c("--no-init-file", "--no-site-file","--no-environ"))
#for linux .... cl <- makePSOCKcluster(2)
registerDoParallel(cl)
unif_w = UniFrac(d, weighted=T, parallel=T, normalized = T)
unif_uw = UniFrac(d, weighted=F, parallel=T)
stopCluster(cl)
I noticed in my own work that adding the rscript_args option greatly improved my speeds (sorry, no benchmarks for this, hoping to get a quick answer out). If I remember the source where I got that suggestion from, I'll come back to share.
This doesn't help with running Adonis, however I think that initial time cost might explain why we don't see a time benefit using the parallel options built in to Adonis on the Dune dataset. Here are my benchmarks.
> data("dune")
> data("dune.env")
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
user system elapsed
3.90 0.00 3.93
> #Runs faster (4.49 s) than this (6.7 s):
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
user system elapsed
0.71 0.04 6.53
Not a big difference on this set, but it IS slower in parallel. However, I repeated this with a large set I'm working with at the moment (bc is a distance matrix calculated from a species matrix of 887 species by 3734 sites):
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 1))
user system elapsed
109.95 21.27 131.22
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 4))
user system elapsed
3.44 1.41 82.36
Long story short, in this specific case you might only benefit from the parallel option in adonis when applying it to a larger dataset.
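To see that start-up cost in isolation, here is a small sketch with the base parallel package rather than adonis. The task (small_task, a name I made up) is deliberately trivial, so on most machines the time spent launching and stopping the workers should dominate the parallel run. I haven't included timings because they will differ from machine to machine.

```r
library(parallel)

small_task <- function(i) sqrt(i)   # hypothetical, deliberately cheap task

# Sequential run: no cluster to set up
seq_time <- system.time(seq_res <- lapply(1:100, small_task))

# Parallel run: the timing includes starting and stopping two workers
par_time <- system.time({
  cl <- makeCluster(2)
  par_res <- parLapply(cl, 1:100, small_task)
  stopCluster(cl)
})

# Compare user, system and elapsed times side by side
rbind(sequential = seq_time[1:3], parallel = par_time[1:3])

# Both approaches should produce the same values
identical(seq_res, par_res)
```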
I'm not sure how important computer specs are here, but I do have a large bit of memory intended for this kind of purpose. The memory in my case is more important for allowing me to work with large matrices a little easier.
R version: 4.0.2
Windows 10, 64bit
AMD Ryzen 3600
64gb DRAM
Anyways, I'm still looking for other work-arounds and tricks.
Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data (by a friend who needs help). The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7ghz and 36gb of ram.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
x = as.vector(M1[i,])
#Taking one row of genetic data to be used in the regression
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
#Running the model
cf = coef(summary(fit1))
#Computing the coefficient table once instead of three times
c(cf[2,1:4], cf[3:6,1], cf[3:6,4], length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
#Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
setTkProgressBar(pb, i)
#This is some code I found here to keep track of my progress
res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took me only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, Future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know-how). I've also read a bit about the package "bigmemory" to more efficiently use the large matrix across cores, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional oomph, if anyone knows more about this.
Any advice would be very appreciated!
I'm trying to train an SVM model on a large dataset (~110k training points). This is a sample of the code where I am using the parallelSVM package to parallelize the training step on a subset of the training data on my 4-core Linux machine.
numcore = 4
train.time = c()
for(i in 1:5)
{
cl = makeCluster(4)
registerDoParallel(cores=numCore)
getDoParWorkers()
dummy = train_train[1:10000*i,]
begin = Sys.time()
model.svm = parallelSVM(as.factor(target) ~ .,data =dummy,
numberCores=detectCores(),probability = T)
end = Sys.time() - begin
train.time = c(train.time,end)
stopCluster(cl)
registerDoSEQ()
}
The idea of this snippet of code is to estimate the time it'll take to train the model on the entire dataset by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, this is the memory and swap history usage statistic from the System Monitor. After 4 runs of the for loop, both the memory and swap usage are about 95%, and I get the following error:
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after using the stopCluster() function ?
Please take into consideration the fact that I am an absolute beginner in this field. A short explanation of the proposed solutions will be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new cluster with number of nodes equal to numCore (which you haven't stated). This cluster is never destroyed, so with each iteration of the loop you're starting more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop.)
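Put together, the restructured loop might look roughly like this. It's only a sketch: the foreach call on random numbers is a placeholder for the parallelSVM fit, and n_workers and train_time are names I've assumed for the example.

```r
library(doParallel)   # also attaches foreach and parallel

n_workers <- 2                   # assumed worker count for this sketch
cl <- makeCluster(n_workers)     # start the worker processes once
registerDoParallel(cl)           # every %dopar% call below reuses this cluster

train_time <- numeric(0)
for (i in 1:3) {
  begin <- Sys.time()
  # Placeholder workload standing in for the model fit
  res <- foreach(j = 1:4, .combine = c) %dopar% sum(rnorm(1000 * i))
  elapsed <- as.numeric(difftime(Sys.time(), begin, units = "secs"))
  train_time <- c(train_time, elapsed)
}

stopCluster(cl)    # shut the workers down exactly once, after the loop
registerDoSEQ()    # go back to sequential execution
```

Because the workers are created and stopped once, no extra R processes pile up from one iteration to the next.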