I'm running into memory issues using the bnlearn package's structure learning algorithms. Specifically, I notice that score-based methods (e.g. hc and tabu) use LOTS of memory, especially when given a non-empty starting network.
Memory usage wouldn't be an issue except that it continually brings down both my laptop (16GB RAM) and a VM I'm using (128 GB RAM), yet the data set in question is a discrete BN with 41 nodes and ~250 rows (69KB in memory). The issue occurs both when running sequentially with 16GB of RAM and in parallel on a VM (32GB/core).
One last bit of detail: Occasionally I can get 100-200 nets with a random start to run successfully, but then one net will randomly get too big and bring the system down.
My question: I'm new to BNs, so is this just inherent to the method, or is it a memory-management issue with the package?
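For context, the kind of call involved looks roughly like this (a sketch, not the asker's actual script; bnlearn's bundled alarm data set, which is discrete with 37 nodes, stands in for the 41-node data, and random.graph supplies the random starting network):

```r
library(bnlearn)

# The bundled "alarm" data set (discrete, 37 nodes) stands in for the
# asker's 41-node data; random.graph gives a random starting network.
data(alarm)
start <- random.graph(names(alarm))
fit   <- hc(alarm, start = start)   # score-based hill climbing
```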
Related
I am currently trying to run a parallelized RStan job on a computing cluster in R. I am doing this by specifying the following two options:
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
The above makes 48 cores available, and the total RAM I have is 180 GB. I always thought in theory that more cores and more RAM was better. I am running very long jobs and I am getting insufficient-memory errors on my cluster. I am wondering if I am perhaps not giving enough memory to each core. Is it possible that the 48 cores are splitting the 180 GB between them, and each core is then maxed out?
If I were to use the 180 GB of RAM with only 3 cores instead, would this get around the memory errors? Or will the total memory always be used up at some point, no matter how many cores I have, if it's a long job? Thanks!
RStan is only going to utilize as many cores as there are chains (by default 4). And if you are using a shell on a Linux cluster, it will fork rather than making copies of the data. So, if you have idle cores, you are better off utilizing them to parallelize computations within a Stan program using the map_rect function, if possible.
But all of that is probably unrelated to your memory problems. It should not require even 1 GB in most cases.
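To illustrate the point that the chain count bounds the useful number of cores, a sketch (the model file name and the data list are placeholders, not the asker's model):

```r
library(rstan)
rstan_options(auto_write = TRUE)

# With 4 chains, cores beyond 4 sit idle during sampling;
# "model.stan" and stan_data are placeholders.
stan_data <- list(N = 10, y = rnorm(10))
fit <- stan(file = "model.stan", data = stan_data,
            chains = 4, cores = 4, iter = 2000)
```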
I have recently inherited a legacy R script that at some point trains a gradient boost model with a large regression matrix. This task is parallelised using the doParallel::registerDoParallel function. Originally, the script started the parallel back end with:
cl = makeCluster(10)
doParallel::registerDoParallel(cl)
The workstation has 12 CPUs and 28 GB of RAM. With the regression matrix being just over 2 GB, I thought this setup would be manageable; however, it launched dozens of sub-processes, exhausting the memory and crashing R within a few seconds. Eventually I understood that on Linux a more predictable setup can be achieved by using the cores parameter:
doParallel::registerDoParallel(cores = 1)
Where cores is actually the number of sub-processes per CPU, i.e., with cores = 2 it launches 24 sub-processes on this 12-CPU machine. The problem is that with such a large regression matrix, even 12 sub-processes are too many.
How can the number of sub-processes launched by registerDoParallel be restricted to, say, 8? Would it help using a different parallelisation library?
Update: To ascertain the memory required by each sub-process, I ran the script without starting a cluster or using registerDoParallel; still, 12 back-end sub-processes were spawned. The culprit is the caret::train function; it seems to be managing parallelisation without oversight. I opened a new issue on GitHub asking for instructions on how to limit the resources used by this function.
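For reference, the worker count can be capped by registering an explicitly sized cluster (a sketch; this works because caret::train parallelises through whatever foreach back end is registered, as long as trainControl's allowParallel is left at its default of TRUE):

```r
library(doParallel)

cl <- makeCluster(8)          # exactly 8 worker processes
registerDoParallel(cl)
# ... caret::train() here runs through the registered back end ...
stopCluster(cl)
```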
I am trying to use caret to cross-validate an elastic net model using the glmnet implementation on an Ubuntu machine with 8 CPU cores & 32 GB of RAM. When I train sequentially, I am maxing out CPU usage on one core, but using 50% of the memory on average.
When I use registerDoMC(cores = xxx), do I need to worry about only registering xxx = floor(100/y) cores, where y is the memory usage of the model when using a single core (in %), in order not to run out of memory?
Does caret have any heuristics that allow it to figure out the max. number of cores to use?
Is there any set of heuristics that I can use to dynamically adjust the number of cores to use my computing resources optimally across different sizes of data and model complexities?
Edit:
FWIW, attempting to use 8 cores made my machine unresponsive. Clearly caret does not check to see if spawning xxx processes is likely to be problematic. How can I then choose the number of cores dynamically?
Clearly caret does not check to see if spawning xxx processes is likely to be problematic.
True; it cannot predict future performance of your computer.
You should get an understanding of how much memory you use when modeling sequentially. You can start the training, use top or other tools to estimate the amount of RAM used, then kill the process. If you use X GB of RAM sequentially, running on M cores will require roughly X(M+1) GB of RAM.
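That rule of thumb turns into a quick back-of-the-envelope calculation (the 4 GB sequential figure below is an assumed measurement, purely for illustration):

```r
x_gb   <- 4                          # RAM of one sequential run, measured with top (assumed)
ram_gb <- 32                         # total machine RAM
# X * (M + 1) <= RAM  =>  M <= RAM / X - 1
m_max  <- floor(ram_gb / x_gb - 1)   # largest worker count this machine can afford
```

With these numbers, m_max comes out to 7 workers.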
I'm running a foreach loop with the snow back end on a Windows machine. I have 8 cores to work with. The R script is executed via a system call embedded in a Python script, so there is an active Python instance too.
Is there any benefit to not having #workers = #cores, and instead #workers < #cores, so there is always an opening for system processes or the Python instance?
It runs successfully with #workers = #cores, but do I take a performance hit by saturating the cores (max possible threads) with the R worker instances?
It will depend on:
1) Your processor (specifically hyperthreading)
2) How much info has to be copied to/from the different images
3) Whether you're implementing this over multiple boxes (LAN)
For 1), hyperthreading helps. I know my machine does it, so I typically have twice as many workers as cores, and my code completes in about 85% of the time compared to matching the number of workers to cores. It won't improve more than that.
2) If you're not forking (using sockets, for instance), you're working in a distributed-memory paradigm, which means creating one copy in memory for every worker. This can take a non-trivial amount of time. Also, multiple images on the same machine may take up a lot of space, depending on what you're working on. I often match the number of workers to the number of cores, because doubling the workers would make me run out of memory.
This is compounded by 3), network speeds across multiple workstations. Locally, between machines, our switch transfers at about 20 MB/s, which is 10x faster than my internet download speed at home, but a snail's pace compared to making copies within the same box.
You might consider increasing R's nice value so that Python has priority when it needs to do something.
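A sketch of leaving headroom for the Python instance and the OS, using a doSNOW socket back end as in the question (the worker-count choice is an illustrative convention, not a rule):

```r
library(doSNOW)

n_workers <- max(1, parallel::detectCores() - 1)  # keep one core free
cl <- makeCluster(n_workers, type = "SOCK")
registerDoSNOW(cl)
# ... foreach(i = seq_len(n)) %dopar% { ...work... } ...
stopCluster(cl)
```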
I am running the termstrc yield curve analysis package in R across 10 years of daily bond price data for 5 different countries. This is highly compute-intensive: it takes 3200 seconds per country with a standard lapply, and if I use foreach and %dopar% (with doSNOW) on my 2009 i7 Mac, using all 4 cores (8 with hyperthreading), I get this down to 850 seconds. I need to re-run this analysis every time I add a country (to compute inter-country spreads), and I have 19 countries to go, with many more credit yield curves to come in the future. The time taken is starting to look like a major issue. By the way, the termstrc analysis function in question is accessed from R but is written in C.
Now, we're a small company of 12 people (read: limited budget), all equipped with 8 GB RAM, i7 PCs, of which at least half are used for mundane word processing / email / browsing tasks, that is, using at most 5% of their capacity. They are all networked over gigabit (but not 10-gigabit) Ethernet.
Could I cluster some of these underused PCs using MPI and run my R analysis across them? Would the network be affected? Each iteration of the yield curve analysis function takes about 1.2 seconds, so I'm assuming that if the granularity of parallel processing is to pass a whole function iteration to each cluster node, 1.2 seconds should be quite large compared with the gigabit Ethernet latency?
Can this be done? How? And what would the impact be on my co-workers? Can they continue to read their emails while I'm taxing their machines?
I note that Open MPI seems not to support Windows anymore, while MPICH seems to. Which would you use, if any?
Perhaps run an Ubuntu virtual machine on each PC?
Yes, you can. There are a number of ways. One of the easiest is to use Redis as a back end (as easy as running sudo apt-get install redis-server on an Ubuntu machine; rumor has it that you can run a Redis back end on a Windows machine too).
By using the doRedis package, you can very easily enqueue jobs on a task queue in Redis and then use one, two, ... idle workers to query the queue. Best of all, you can easily mix operating systems, so yes, your co-workers' Windows machines qualify. Moreover, you can use one, two, three, ... clients as you see fit, and scale up or down. The queue does not know or care; it simply supplies jobs.
Best of all, the doRedis vignette has working examples of a mix of Linux and Windows clients making a bootstrapping example go faster.
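A minimal doRedis sketch (the queue name "jobs" and the master's host name are placeholders):

```r
library(doRedis)

# On the master: register the queue and submit work.
registerDoRedis("jobs")
res <- foreach(i = 1:20, .combine = c) %dopar% sqrt(i)
removeQueue("jobs")

# On each co-worker's machine (any OS with R and doRedis installed):
# startLocalWorkers(n = 2, queue = "jobs", host = "master-hostname")
```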
Perhaps not the answer you were looking for, but this is one of those situations where an alternative is so much better that it's hard to ignore.
The cost of AWS clusters is ridiculously low (my emphasis) for exactly these types of computing problems. You pay only for what you use. I can guarantee you that you will save money (at the very least in opportunity costs) by not spending the time trying to convert 12 windows machines into a cluster. For your purposes, you could probably even do this for free. (IIRC, they still offer free computing time on clusters)
References:
Using AWS for parallel processing with R
http://blog.revolutionanalytics.com/2011/01/run-r-in-parallel-on-a-hadoop-cluster-with-aws-in-15-minutes.html
http://code.google.com/p/segue/
http://www.vcasmo.com/video/drewconway/8468
http://aws.amazon.com/ec2/instance-types/
http://aws.amazon.com/ec2/pricing/
Some of these instances are so powerful that you probably wouldn't even need to figure out how to set up your work on a cluster (given your current description). As you can see from the references, costs are ridiculously low, ranging from $1 to $4 per hour of compute time.
What about OpenCL?
This would require rewriting the C code, but would allow potentially large speedups. The GPU has immense computing power.