I'm trying to fit +- 70.000 values as a function of two variables using the loess() function several times. I want to use this fit to de-trend the data. My problem is that once I start the loess function, the R session takes up all available cores on the system, and that would be inconsiderate towards other users on the same computing cluster.
The relevant code would be analogous to the following:
# Approximation of the data
df <- data.frame(y = rpois(70000, rnorm(70000, 10, 2)), # y is count data
x = 50000 - rpois(70000, 100),
z = runif(70000))
# The problematic operation
fit <- loess(y ~ x + z, data = df)
When I run this example on my local machine, it only takes up 1 core, but on the cluster it takes as many cores as it could get (up to 48). Ideally, I would loess() to run on only 1 core.
I've tried to trace any multicore parameters in the code of loess, which I couldn't find. I know that loess calls stats:::simpleLoess, which in turn calls C code, which in turn calls Fortran code. I have no experience in C or Fortran and I haven't been able to figure out how I can restrict the CPU usage for this function.
Does anyone has any suggestion on how I can limit the CPU usage of the loess function?
I am not knowledgeable enough to comment on specifics about how all of this works, but I know that C++ and FORTRAN for R are usually built using the OpenMP framework for multi-thread programming. Empirically, I do know that your issue can be resolved if you set the OMP_NUM_THREADS argument before you launch R or if you set it within an R session.
Let's say you wanted to use 2 threads for the loess function. Before you launch R, you would do this ($ to signify typing this in a shell session):
$ OMP_NUM_THREADS=2 R [whatever other options you use to launch R]
Here's how to do it from within R (> to indicate an interactive R session):
> Sys.setenv("OMP_NUM_THREADS" = 2)
If you ever need to check the variable from within R, you can do the following (this will return a character vector with the number):
> Sys.getenv("OMP_NUM_THREADS")
# The result in our example will be "2"
For completeness, be sure to use ?Sys.setenv or ?Sys.getenv if you wish to get more information about those functions, and check out this site for details about OMP_NUM_THREADS.
Hope that helps!
So McG led me down a path that eventually gave me the ability to control the number of cores, which I'll post as another answer.
There were a few details I foolishly neglected to mention, namely that I was working on an RStudio server. For all other purposes, I indeed think that McG's answer would be excellent.
That answer helped me get the correct terms to google, and strolling around the search results I stumbled upon this thread that suggested that the RhpcBLASctl package has a function to set the number of cores as follows:
blas_set_num_threads(2)
Setting this in an RMarkdown document before running loess kept my CPU usage at 200% while running the loess function afterwards that was problematic before.
Related
I am developing an iterative algorithm that uses quantile regression models at each iteration. For that I use the rq function from the quantreg package in R. So far it has worked fine. However, I have found a dataset where, at one of the iterations, the rq function simply gets stuck. No error message, no warning. It simply goes on as if still working, but never finishes computation.
I provide here a very small minimal code example. You can download the problematic data on this link:
https://www.dropbox.com/s/yrlotit1ovk9yzd/r555.RData?dl=0
library(quantreg)
load('~r555.RData')
dependent = r$dependent
independent = r$independent
quantreg::rq(dependent ~ -1 + independent, tau=0.1)
If you execute the above mentioned code, the rq function will get stuck and never finish. Be aware that the data provided is part of the iterative process I am developing, so it has no direct interpretation by itself. I am writing to check for possible reasons on this behaviour and check for possible solutions.
Dont know if it matters, but I have tested this on two different computers running Windows10 and using different versions of the quantreg package.
Changing the default method="br" to method="fn" fixes the problem.
quantreg::rq(dependent ~ -1 + independent, tau=0.1, method="fn")
I was running a spatstat envelop to generate simulations sample, however, it got stuck and did not run. So, I attempted to close the application but fail.
RStudio diagnostic log
Additional error message:
This application has requested the Runtime to terminate it in an
unusual way. Please contact the application's support team for more
information
There are several typing errors in the command shown in the question. The argument rank should be nrank and the argument glocal should be global. I will assume that these were typed correctly when you ran the command.
Since global=TRUE this command will generate 2 * nsim = 198 realisations of a completely random pattern and calculate the L function for each one of them. In my experience it should take only a minute or two to compute this, unless the geometry of the window is very complicated. One hour is really extraordinary.
So I'm guessing either you have a very complicated window (so that the edge correction calculation is taking a long time) or that RStudio is hanging somehow.
Try setting correction="border" or correction="none" and see if that makes it run faster. (These are the fastest choices.) If that works, then read the help for Lest or Kest about edge corrections, and choose an edge correction that you like. If not, then try running the same command in R instead of RStudio.
Suppose that I want to do bootstrap procedure 1000 times on each of 100 different simulated data set.
At top level, I can set up foreach backend to distribute the 100 jobs to different CPUs. Then at the lower level, by using function boot from R package boot I can also invoke parallel computing by specifying 'parallel' option in the function.
The pseudo code may look like following.
library(doParallel}
registerDoParallel(cores=4)
foreach(i=seq(100, 5, length.out = 100), .combine=cbind) %dopar% {
sim.dat <- simualateData(i)
boot.res <- boot(sim.dat, mean, R=1000, parallel = 'multicore', ...)
## then extract results and combine
...
}
I am curious to know how the parallel computing really works in this case.
Would the two different levels of parallel computing work at the same time? how would they affect (interact? interrupt? disable?) each other?
More generally, I guess there are now more and more R functions that provide parallel computing option like boot for intensive simulation. In that situation, is there a need to specify the lower-level parallel provided the top level? Or vice versa?
What are the pros and cons, if any, for this two-level parallel setup?
Thanks for any clarification.
EDIT:
I should have explained more clearly the problem. Actually after the boot.res is returned, more additional calculations are to be done on it to finally get summary statistics from boot.res. That means the whole computation is not mutually independent bootstrapping procedure. In this case, only outer parallel loop would mess up the results. So if I understand correctly here, the best way would be using nested foreach parallel backend, but suppress 'parallel' option from boot.
Anyone please correct me if I am wrong. Regards.
END EDIT
I am using a package called missForest to estimate the missing values in my data set.
My question is: how can we parallelize this process to shorten the time that it takes to get the results?
Please refer to this example (from missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute missing values providing the complete matrix for illustration. Use ’verbose’ to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.
The new function has an additional argument "parallelize" which allows to either compute the single forests in a parallel fashion (parallelize="forests") or to compute several forests on multiple variables at the same time (parallelize="variables"). The default setting is without parallel computing (parallelize="no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
Create the randomForest model objects in parallel;
Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NA's.
Method 1 is rather easy to implement, except that you have to compute the error estimates yourself since the randomForest combine function doesn't compute them for you. However, if the randomForest objects don't take that long to compute and there are many columns containing NA's, you may get very little if any speed up, even though the operations in aggregate take a long time to compute.
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns in parallel at a time (where n is the number of worker processes), thus requiring another loop around the n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, thus losing the benefit of executing in parallel.
In general, to get a performance improvement you will need to implement both of these methods, and choose which to use based on your input data. If you just have a few columns with NA's but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute because this can be done more efficiently, although it's possible that it will still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on GitHub Gist, however it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest that is posted to CRAN will support parallel execution.
I've been struggling to perform this sort of analysis and posted on the stats site about whether I was taking things in the right direction, but as I've been investigating I've also found that my lovely beefy processor (linux OS, i7) is only actually using 1 of its cores. Turns out this is default behaviour, but I have a fairly large dataset and between 40 and 50 variables to select from.
A stepAIC function that is checking various different models seems like the ideal sort of thing for parellizing, but I'm a relative newb with R and I only have sketchy notions about parallel computing.
I've taken a look at the documentation for the packages parallel, and snowfall, but these seems to have some built-in list functions for parallelisation and I'm not sure how to morph the stepAIC into a form that can be run in parellel using these packages.
Does anyone know 1) whether this is a feasible exercise, 2) how to do what I'm looking to do and can give me a sort of basic structure/list of keywords I'll need?
Thanks in advance,
Steph
I think that a process in which a step depends on de last (as in step wise selection) is not trivial to do in parallel.
The simplest way to do something in parallel I know is:
library(doMC)
registerDoMC()
l <- foreach(i=1:X) %dopar% { fun(...) }
in my poor understanding of stepwise one extracts variables (or add forward/backward) of a model and measure the fitting in each step. If extracting a variable the model fit is best you keep this model, for example. In the foreach parallel function each step is blind to other step, maybe you could write your own function to perform this task as in
http://beckmw.wordpress.com/tag/stepwise-selection/
I looked for this code, and seems to me that you could use parallel computing with the vif_func function...
I think you also should check optimized codes to do that task as in the package leaps
http://cran.r-project.org/web/packages/leaps/index.html
hope this helps...