How can you limit CPU usage in R?

When I run an R script that trains a model with machine-learning frameworks like mxnet and tensorflow, I see in Task Manager that the CPU usage reaches 100%.
I have a 2x 2.7 GHz machine, and the PC becomes so slow that it eventually locks up.
Is there a way to limit CPU usage in R, even at the cost of slower model training?

MXNet looks at a number of environment variables:
https://mxnet.incubator.apache.org/faq/env_var.html
Since your training is CPU-bound, you could experiment by setting MXNET_CPU_WORKER_NTHREADS=2 at the command line, for example.
Note that you may have to restart R after you set the environment variable for it to take effect.
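For example, from within R (a sketch; Sys.setenv() is base R, and the variable name comes from the MXNet FAQ linked above — as noted, you may instead need to set it at the OS level and restart R):
# Limit MXNet's CPU worker threads; set this before library(mxnet) is loaded
Sys.setenv(MXNET_CPU_WORKER_NTHREADS = 2)
library(mxnet)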

0) As mentioned above, you can manipulate the environment variables that dictate how many workers you want.
1) You could adjust the context in your code to use only one of the CPUs, e.g. (Python syntax; R's mxnet package exposes the same mx.cpu() contexts):
z = nd.ones(shape=(3,3), ctx=mx.cpu(0))
2) You could resort to OS-level tools; on Windows there are a few: https://superuser.com/questions/214566/are-there-solutions-that-can-limit-the-cpu-usage-of-a-process
Vishaal

Related

protection from stack overflow in R with a lot of free RAM

I apologize in advance, since this post will not have a reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (~80000 columns, between 40 and 180 rows). The analyses involve several feature-selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine with two Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB of RAM and Windows 7 x64.
To speed up the running time of my analysis I thought I would use the doParallel package in combination with foreach. I would set up the cluster as follows:
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_cores set to various values between 2 and 10 (detectCores() tells me that I have 56 cores in total).
My problem is that even when setting number_of_cores to just 2, I get a "protection from stack overflow" error message. The thing is, I monitor the RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM are in use.
If I run the script sequentially it takes its sweet time (~3 hours with 42 rows and ~80000 columns), but it does run to the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object, in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait more than 3 hours every time I run the analyses? And more generally: how is it possible to hit a stack overflow when there is plenty of free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work on the same machine: since I am running a 10-fold cross-validation scheme, I open 5 separate instances of Rgui and run 2 folds in each instance. Proceeding this way, everything runs smoothly, and the process indeed takes about 10 times less than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, the machine evidently has the computational resources needed. Hence I cannot really get my head around the fact that %dopar% with 10 workers does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it is not important how much physical memory the machine has. Please do not call gc() explicitly at all, even if there was a problem with the heap usage, it just makes the program run slower but does not help: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability and it also indirectly reduces scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
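For example, from the shell (the default protection stack size is 50000 and the allowed maximum is 500000; my_analysis.R is a placeholder name):
# start R with a larger pointer protection stack
R --max-ppsize=500000 -f my_analysis.R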

Is H2O supposed to be so slow?

I'm running cross-validated deep learning training (nfolds=4) iteratively for feature selection on H2O through R. Currently I have only 2 layers (i.e. not deep) and between 8 and 50 neurons per layer. There are only 323 inputs and 12 output classes.
To train one model takes on average around 40 seconds on my Intel 4770K (32 GB RAM). During training, H2O is able to max out all CPU cores.
Now, to try to speed up the training, I've set up an EC2 instance in the Amazon cloud. I tried the largest compute-optimized instance (c4.8xlarge), but the speed-up was minimal: it took around 24 seconds to train one model with the same settings. Therefore I suspect there's something I've overlooked.
I started the training like this:
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '24G', nthreads=-1)
Just to compare the processors: the 4770K scores 10163 on the CPU benchmark, while the Intel Xeon E5-2666 v3 scores 24804 (with 36 vCPUs).
This speed-up is quite disappointing, to say the least, and is not worth all the extra work of installing and setting everything up in the Amazon cloud while paying over $2/hour.
Is there something else that needs to be done to get all cores working besides setting nthreads=-1 ?
Do I need to start making several clusters in order to get the training time down, or should I just start on a new deep learning library that supports GPUs?
To directly answer your question: no, H2O is not supposed to be slow. :-) It looks like you have a decent PC, and the Amazon instances (even though there are more vCPUs) do not use the best processors (like what you would find in a gaming PC). The base / max turbo frequency of your PC's processor is 3.5 GHz / 3.9 GHz, while the c4.8xlarge is only 2.9 GHz / 3.5 GHz.
I'm not sure that this is necessary, but since the c4.8xlarge instances have 60 GB of RAM, you could increase max_mem_size from '24G' to at least '32G', since that's what your PC has, or even something bigger. (Memory is not usually the limiting factor, so this may not change anything, but it may be worth a try.)
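For example, the same h2o.init() call as above with only max_mem_size changed ('48G' is an arbitrary illustrative value, not a recommendation from the original answer):
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '48G', nthreads = -1)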
Also, if you are concerned about EC2 price, maybe look into spot instances instead. If you require additional real speedup, you should consider using multiple nodes in your EC2 H2O cluster, rather than a single node.

Does multicore computing using R's doParallel package use more memory?

I just tested an elastic net with and without a parallel backend. The call is:
library(caret)   # provides train() and trainControl(); x and y are predictors and response
enetGrid <- data.frame(.lambda = 0, .fraction = c(.005))
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
enetTune <- train(x, y, method = "enet", tuneGrid = enetGrid, trControl = ctrl, preProc = NULL)
I ran it without a parallel backend registered (and got the warning message from %dopar% when the train call finished), and then again with one registered for 7 cores (of 8). The first run took 529 seconds, the second 313. But the first peaked at 3.3 GB of memory (as reported by the Sun cluster system), while the second peaked at 22.9 GB. I have 30 GB of RAM, and the task only gets more complicated from here.
Questions:
1) Is this a general property of parallel computation? I thought the workers shared memory...
2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar%? No, right?
Because I am interested in whether this is the expected result, this is closely related to, but not exactly the same as, this question; I'd be fine closing this and merging my question into that one (or marking that as a duplicate and pointing to this one, since this has more detail) if that's the consensus:
Extremely high memory consumption of new doParallel package
In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.
However, memory is shared between the processes that are forked by mclapply (as long as the processes don't modify it, which triggers a copy of the memory region in the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
In other words, using:
registerDoParallel(7)
may be much more memory efficient than using:
cl <- makeCluster(7)
registerDoParallel(cl)
since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
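If you want to verify which backend ended up registered, foreach provides introspection helpers; a quick check (the exact printed names may vary by doParallel version):
library(doParallel)
registerDoParallel(7)   # fork-based ("multicore") workers on Linux / Mac OS X
getDoParName()          # name of the registered backend
getDoParWorkers()       # should report 7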
However, the "snow" API allows you to use multiple machines, and that means that your memory size increases with the number of CPUs. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.
So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
There is a minimum number of characters, otherwise I would simply have typed: 1) Yes. 2) No, er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested packages don't use it.
http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf
http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf

Parallel processing in R limited

I'm running Ubuntu 12.04 and R 2.15.1 with the parallel and doParallel packages. When I run anything in parallel I'm limited to 100% CPU (a single core), when I should be able to reach up to 800%, since I am running with 8 cores. The system monitor shows that each child process gets only 12%.
What is going on that is limiting my execution speed?
The problem may be that the R process is restricted to one core (and the subprocesses inherit that).
Try this:
> system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
pid 3064's current affinity mask: fff
pid 3064's new affinity mask: fff
Now, if on your machine the current affinity mask reports 1, then this was the problem, and the line above should solve it (i.e. the second line should report fff or similar).
Simon Urbanek wrote a function mcaffinity that allows this control for multicore. As far as I know, it's still in R-devel.
For details, see e.g. this discussion on R-sig-hpc.
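A minimal sketch, assuming a more recent R where parallel::mcaffinity() is available on Linux:
library(parallel)
mcaffinity()      # NULL if affinity control is unsupported; otherwise the CPUs R may use
mcaffinity(1:8)   # allow this R process to run on CPUs 1 through 8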
Update, and addition to Xin Guo's answer:
If you use implicit parallelization via OpenBLAS and explicit parallelization (via parallel/snow/multicore) together, you may want to change the number of threads that OpenBLAS uses depending on whether you are inside an explicitly parallel section or not.
This is possible with OpenBLAS under Linux (I'm not aware of any other of the usual optimized BLAS implementations that provides a function to set the number of threads); see Simon Fuller's blog post for details.
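A hedged sketch of switching the BLAS thread count around an explicitly parallel section, assuming the RhpcBLASctl package (my suggestion; it is not mentioned in the original answer):
library(RhpcBLASctl)
blas_set_num_threads(1)   # single-threaded BLAS while the %dopar% workers run
# ... explicitly parallel section here ...
blas_set_num_threads(blas_get_num_procs())   # restore multi-threaded BLAS afterwards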
I experienced the same problem because of the libblas.so(.3gf) packages, and I do not know if this is also the cause of yours. When R starts, it calls the BLAS installed on your system to carry out linear-algebra computations. I have libopenblas.so(.3gf), and it is highly optimized with the "CPU affinity" option: when you do numerical vector or matrix computations, OpenBLAS spawns 8 threads and pins each thread to one specific, fixed CPU to speed up the code. However, this tells your system that all the CPUs are busy, so if further parallel tasks arrive, the system will try to squeeze them onto a single CPU so as to avoid interfering with the busy ones.
So this was my solution, which worked: I downloaded the OpenBLAS source and compiled it after changing the file Makefile.rule: there is one line "#NO_AFFINITY = 1", and I just deleted the "#" so that after compilation the affinity option is disabled. Then I installed the package and the problem was solved.
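The change in Makefile.rule amounts to uncommenting one line (a sketch; the line's position varies by OpenBLAS version):
# before: CPU affinity enabled by default
#NO_AFFINITY = 1
# after: CPU affinity disabled
NO_AFFINITY = 1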
For the reference of this, see https://github.com/ipython/ipython/issues/840
Please note that this is a trade-off: removing CPU affinity will lose you some efficiency in numerical computations. That is why, even though the OpenBLAS maintainer (Dr. Xianyi Zhang) knows about the problem, he still publishes the code with CPU affinity as the default option.
My guess is that you may simply have had the wrong code. I would like to post an example copied from http://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/ :
library(doMC)        # parallel backend for foreach; also attaches foreach
library(iterators)   # provides icount()
registerDoMC()

x <- iris[which(iris[, 5] != 'setosa'), c(1, 5)]
trials <- 10000
r <- foreach(icount(trials), .combine = cbind) %dopar% {
  ind <- sample(100, 100, replace = TRUE)
  result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
  coefficients(result1)
}
and you can define how many cores to use for the parallel backend with:
options(cores=4)
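Equivalently, you can pass the core count directly when registering the backend; registerDoMC() accepts a cores argument:
registerDoMC(cores = 4)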

How to let R use all the cores of the computer?

I have read that R uses only a single CPU core. How can I let R use all the available cores to run statistical algorithms?
Yes, for starters, see the High-Performance Computing Task View on CRAN. This lists details of packages that can be used in support of parallel computing on a single machine.
From R version 2.14.0, there is built-in support for parallel computing via the parallel package, which includes slightly modified versions of the existing snow and multicore packages. The parallel package has a vignette that you should read. You can view it using:
vignette(package="parallel", topic = "parallel")
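As a minimal sketch of the two APIs bundled in parallel (the toy workload is illustrative, not from the original answer):
library(parallel)
n_cores <- detectCores()
# Fork-based ("multicore") API, Unix only: workers share the parent's memory pages
res1 <- mclapply(1:100, function(i) mean(rnorm(1e5)), mc.cores = n_cores)
# Socket-based ("snow") API: works on all platforms, including Windows
cl <- makeCluster(n_cores)
res2 <- parLapply(cl, 1:100, function(i) mean(rnorm(1e5)))
stopCluster(cl)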
There are other ways to exploit your multiple cores, for example via use of a multi-threaded BLAS for linear algebra computations.
Whether any of this will speed up the "statistics calculations" you want to do depends on what those calculations are. Spawning multiple threads or workers entails an overhead cost to set them up, manage them and collect the results. Some operations see a benefit (some large, some small) from using multiple cores/threads; others are slowed down by the extra overhead.
In short, do not expect an n-fold decrease in your compute time from using n cores instead of just 1.
If you happen to run a few* iterations of the same thing (or the same code for a few* different parameter sets), the easiest way to go is to run several copies of R; the OS will allocate the work to different cores.
In the opposite case, go and learn how to use real parallel extensions.
*For the sake of this answer, "few" means less than or equal to the number of cores.
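A hedged sketch of the several-copies approach: split the work by a command-line argument and start one Rscript per chunk (job.R and the chunking scheme are illustrative, not from the original answer):
# job.R -- run as: Rscript job.R 1   (and 2, 3, 4 in separate shells)
chunk  <- as.integer(commandArgs(trailingOnly = TRUE)[1])
params <- split(1:20, rep(1:4, each = 5))[[chunk]]   # 4 chunks of 5 parameter sets
for (p in params) {
  # ... run the analysis for parameter p and save its result to a file ...
}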
