beginCluster function takes a very long time - r

I'm trying to parallelize a moving window operation on a large RasterStack with a function that calls overlay(). The issue is that setting up a cluster with raster::beginCluster takes a very long time (more than 18 hours so far) and still hasn't succeeded. Any suggestions? I'm working on a remote instance with 125 GB of RAM and 32 cores. I have tried setting up the cluster with 31, 28, or only 4 cores, and none of those worked; neither did beginCluster(31, type = "SOCK"). Setting up clusters with other packages, e.g. doParallel or snow, does work, though.
cheers,
Ben
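For reference, the raster workflow being attempted is roughly the pattern below. This is only a sketch with a toy three-layer stack, and calc() with sum stands in for the actual overlay()-based moving-window function; beginCluster() is the call that never finishes in the situation described above.
library(raster)
s <- stack(replicate(3, raster(matrix(runif(100), 10, 10)), simplify = FALSE))  # toy 3-layer RasterStack
beginCluster(4)                                   # the step that hangs here
out <- clusterR(s, calc, args = list(fun = sum))  # parallel per-cell computation
endCluster()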

Related

Not Running on All Cores: Package Ecole

I'm using the ecole package in R, specifically the ord_nms() function documented here (https://rdrr.io/github/phytomosaic/ecole/man/ord_nms.html), though the particular function may not be relevant.
I specify the number of cores and ask it to run in parallel, but based on Task Manager it only seems to run on a single core, even though I have 24, and it's very slow.
(p <- parallel::detectCores() - 1) # number of cores on your machine, minus one
m1 <- ord_nms(glt_WWTPXCMSspe, autopilot='medium', method='bray', weakties=F, parallel=p)
What's going wrong here? How can I make it run on all available cores? p returns 23.
I expected to see usage on all cores, but only cores 8 and 13 show any activity; the rest stay completely idle while the function runs for an hour or so.
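A quick sanity check, independent of ecole, is to confirm that a plain PSOCK cluster on this machine really spreads work over separate processes; this sketch only verifies the R/cluster setup, not ord_nms() itself.
library(parallel)
p  <- detectCores() - 1
cl <- makeCluster(p)                                    # PSOCK cluster, works on Windows
pids <- parSapply(cl, seq_len(p), function(i) Sys.getpid())
length(unique(pids))                                    # should equal p: one process per worker
stopCluster(cl)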

Parallelization loop R "repeat"

This is my first post on this forum, so if I'm not doing things correctly, please tell me and I'll fix it!
So here's my problem: at the end of one of my R programs, I run a simple loop using "repeat". The loop performs a series of calculations for each month of the year, 137 times over, i.e. about 1,644 month-level calculations in total. This takes time, about 50 minutes.
numero_profil_automatisation = read.csv2("./import_csv/numero_profil_automatisation.csv",
                                          header = TRUE, sep = ";")
n = 1
repeat {  # one iteration per profile (137 in total), one CSV written per profile
  annee = "2020"
  annee_profil = "2020"
  mois_de_debut = "1"
  mois_de_fin = "12"
  operateur = numero_profil_automatisation[n, 1]
  profil_choisi = numero_profil_automatisation[n, 3]
  couplage = numero_profil_automatisation[n, 4]
  subvention = numero_profil_automatisation[n, 5]
  type = numero_profil_automatisation[n, 6]
  seuil = 0.05
  graine = 4356
  resultat = calcul_depense(annee, annee_profil, mois_de_debut, mois_de_fin, operateur,
                            profil_choisi, couplage, subvention, type, graine, seuil)
  nom = paste("./resultat_csv/", operateur, annee, type, couplage, subvention,
              profil_choisi, ".csv", sep = "_")
  write.csv2(resultat, nom)
  n <- n + 1
  if (n == 138) break
}
I would like to optimize the code, so I talked to a friend of mine, a computational developer (who doesn't "know" R), who advised me, among other things, to parallelize the calculations. My new work computer has 4 physical cores (R detects 8 logical cores), so I should be able to save a lot of time.
Being an economist-statistician and not a developer, I'm completely out of my depth on this subject. I looked at some forums and articles and found a snippet that worked on my previous computer, which had only 2 physical cores (R detected 4 logical cores): it cut the computing time almost in half, from 2 h to 1 h. On my new computer, this snippet doesn't change the computing time at all; with or without it, the program runs in around 50 minutes (better processor, more RAM).
Below are the two lines of code I added just above the loop shown earlier (along with, of course, the corresponding packages loaded at the beginning of the script, which I can share if needed):
no_cores <- availableCores() - 1
plan(multicore, workers = no_cores)
Do you have any idea why it seemed to work on my previous computer but not on the new one, when nothing has changed except the computer? Or what corrective action I could take?
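One hedged observation: plan() by itself does not parallelize a plain repeat loop; the work has to go through a future-aware function such as future.apply::future_lapply(), and on Windows plan(multicore) silently falls back to sequential execution (plan(multisession) does not). Below is a sketch of the loop rewritten in that style, assuming calcul_depense() and numero_profil_automatisation are defined exactly as in the question.
library(future)
library(future.apply)

plan(multisession, workers = availableCores() - 1)  # multisession also works on Windows

resultats <- future_lapply(1:137, function(n) {
  operateur     <- numero_profil_automatisation[n, 1]
  profil_choisi <- numero_profil_automatisation[n, 3]
  couplage      <- numero_profil_automatisation[n, 4]
  subvention    <- numero_profil_automatisation[n, 5]
  type          <- numero_profil_automatisation[n, 6]
  resultat <- calcul_depense("2020", "2020", "1", "12", operateur,
                             profil_choisi, couplage, subvention, type, 4356, 0.05)
  nom <- paste("./resultat_csv/", operateur, "2020", type, couplage, subvention,
               profil_choisi, ".csv", sep = "_")
  write.csv2(resultat, nom)
  nom
}, future.seed = TRUE)   # reproducible parallel RNG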

Foreach in R: optimise RAM & CPU use by sorting tasks (objects)?

I have ~200 .Rds datasets that I perform various operations on (different scripts) in a pipeline of multiple scripts. In most of these scripts I started with a for loop and later upgraded to a foreach. My problem is that the dataset objects vary widely in size (a histogram of file sizes in MB, not reproduced here, showed the spread):
so if I optimise core-number usage (I have a 12-core, 16 GB RAM machine at the office and a 16-core, 32 GB RAM machine at home), it whips through the first 90 or so without incident, but then the larger files bunch up and max out the total RAM allocation (remember that .Rds files are compressed, so these objects are larger in RAM than on disk; the variability in file size at least gives an indication of the problem). This causes workers to crash and typically leaves me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to choose the core number based on the number and RAM footprint of the workers and the total available RAM, and I figured others might have been in a similar dilemma and developed strategies to address it. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening and closing them, use object.size() to get their sizes in RAM, sort largest to smallest, cut the list halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. Two workers (or any even-numbered multiple) should therefore be working at the 'mean' RAM usage. However, worker 1 will finish its job faster than any other job in the stack, move on to job 3 (the 2nd smallest), likely finish that quickly too, and then take job 4 (the 2nd largest) while worker 2 is still on the largest. By job 4, this approach has the machine processing the two largest objects concurrently, the opposite of what we want.
Approach 2: sort the objects by size in RAM, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until the batch would exceed the total available RAM. Run foreach on that batch. Repeat. This would work but requires some convoluted coding (probably a for loop wrapped around the foreach, passing it a new task list each time?); a rough sketch of this batching idea appears after the list of approaches. Also, if there are a lot of small tasks that won't exceed the RAM (as in my example), batching limited by the core count will mean all 12 or 16 have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small to large as in Approach 2, then run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance that the remaining workers can continue. Conceptually this means cores - 1 tasks fail and need to be re-run, but the code is easy and should run fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach; however, I should expect further losses, and therefore further reruns, unless I lower the core count.
Approach 4: as Approach 3, but somehow close the worker (reduce the core count) BEFORE the task is assigned, so the task doesn't have to trigger a RAM overrun and fail in order to reduce the worker count. This would also mean not having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do all this for me, but beggars can't be choosers! Conceptually this would be similar to Approach 4, above: for each worker, don't start the next task until there's sufficient RAM available.
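A rough sketch of the Approach 2 batching idea, as referenced above. Here `files` is the vector of .Rds paths, process_file() is a placeholder for the real per-dataset work, and the 3x in-RAM expansion factor and the RAM budget are guesses to tune.
library(foreach)
library(doParallel)

sizes <- file.size(files) * 3                  # crude guess at in-RAM size (files are compressed on disk)
ord   <- order(sizes)
files <- files[ord]; sizes <- sizes[ord]

ram_budget <- 24 * 1024^3                      # e.g. leave some headroom on a 32 GB machine
batch      <- cumsum(sizes) %/% ram_budget     # consecutive files grouped by (approximate) cumulative size

registerDoParallel(cores = 12)                 # or 16 on the home machine
for (b in unique(batch)) {
  res <- foreach(f = files[batch == b], .errorhandling = "pass") %dopar% process_file(f)
}
stopImplicitCluster()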
Any thoughts appreciated from folks who've run into similar issues. Cheers!
I've thought a bit about this too.
My problem is a bit different: I don't get any crashes, but rather slowdowns due to swapping when there isn't enough RAM.
Things that may work:
randomize the iterations so that the load is approximately evenly distributed (without needing to know the timings in advance)
similar to Approach 5, add some barriers (make some workers wait in a while loop with Sys.sleep()) while there isn't enough memory (e.g. as determined via the {memuse} package); a rough sketch of this idea follows below.
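A rough sketch of that memory-barrier idea, as flagged above. It is Linux-only (it reads MemAvailable from /proc/meminfo rather than using {memuse}), process_file() and the 2x headroom factor are placeholders, and a foreach backend is assumed to be registered already.
library(foreach)

free_ram_bytes <- function() {
  line <- grep("^MemAvailable:", readLines("/proc/meminfo"), value = TRUE)
  as.numeric(gsub("\\D", "", line)) * 1024     # /proc/meminfo reports the value in kB
}

wait_for_ram <- function(needed_bytes, poll_sec = 30) {
  while (free_ram_bytes() < needed_bytes) Sys.sleep(poll_sec)
}

res <- foreach(f = files, .errorhandling = "pass") %dopar% {
  wait_for_ram(2 * file.size(f))               # crude: wait until ~2x the file size is free
  process_file(f)
}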
Things I do in practice:
always store the results of iterations in foreach loops and test whether they have already been computed (i.e. the RDS file already exists); a tiny sketch of this pattern appears after this list
skip some iterations if needed
rerun the "intensive" iterations using fewer cores
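A tiny sketch of the "store and skip if already computed" pattern mentioned above; the output directory, process_file(), and the registered foreach backend are all placeholders.
library(foreach)

res <- foreach(f = files, .errorhandling = "pass") %dopar% {
  out <- file.path("results", paste0(basename(f), ".rds"))
  if (!file.exists(out)) saveRDS(process_file(f), out)   # only compute if not done yet
  out
}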

Generating CPU utilization levels

First, I would like to let you know that I recently asked this question already; however, it was considered unclear, see "Linux: CPU benchmark requiring longer time and different CPU utilization levels". This is a new attempt to formulate the question using a different approach.
What I need: In my research, I look at the CPU utilization of a computer and analyze the CPU utilization pattern within a period of time. For example, a CPU utilization pattern within time period 0 to 10 has the following form:
time, % CPU used
0 , 21.1
1 , 17
2 , 18
3 , 41
4 , 42
5 , 60
6 , 62
7 , 62
8 , 61
9 , 50
10 , 49
I am interested in finding a simple representation for a given CPU utilization pattern. For the evaluation part, I need to create some CPU utilization patterns on my laptop, which I will then record and analyse. These patterns should
span a time period of more than 5 minutes, ideally about 20 minutes;
show "some kind of dynamic behavior", in other words, the % CPU used should not be (almost) constant over time but should vary.
My Question: How can I create such a utilization pattern? Of course, I could just run an arbitrary program on my laptop and obtain some suitable CPU pattern. However, this solution is not ideal, since a reader of my work would have no way to repeat the experiment: they have no access to the program I used. It would therefore be much more beneficial to use something reproducible instead of an arbitrary program on my laptop (in my previous post I was thinking of open-source CPU benchmarks, for example). Can anyone recommend something?
Many thanks!
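One reproducible option, sketched here in R so that anyone with R can rerun it, is to generate the load from R with the parallel package: each worker busy-loops for a chosen fraction of every second (a duty cycle) and sleeps for the rest, and the duty cycle is stepped over time. This is a sketch under the assumption that duty-cycled busy loops approximate coarse, OS-sampled %CPU well enough; the pattern, step length, and worker count are placeholders.
library(parallel)

burn <- function(duty, seconds) {
  # For each ~1-second slot, busy-loop for `duty` of the slot and sleep the rest.
  end <- Sys.time() + seconds
  while (Sys.time() < end) {
    slot_start <- Sys.time()
    while (as.numeric(difftime(Sys.time(), slot_start, units = "secs")) < duty) {
      tmp <- sqrt(runif(1000))   # busy work, result discarded
    }
    Sys.sleep(max(0, 1 - duty))
  }
}

pattern <- c(0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.6, 0.6, 0.5, 0.5)  # target load per 2-minute step

cl <- makeCluster(4)             # 4 workers -> load on 4 cores
for (duty in pattern) {
  clusterCall(cl, burn, duty = duty, seconds = 120)   # 10 steps x 120 s = 20 minutes
}
stopCluster(cl)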
I suggest a moving average: select a window size and average over it. You'll need to decide what type of patterns you want to identify, since the wider the window, the more smoothing you get and the fewer "features" you'll see; and CPU activity is very bursty. For example, if you are trying to identify cache bottlenecks, you'll want a small window, probably in the 10 ms to 100 ms range. If instead you want to correlate with longer-term features, such as energy or load, you'll want a larger window, perhaps 10 seconds to minutes.
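For instance, assuming the sampled %CPU values are in a numeric vector, a centered moving average can be computed with stats::filter(); the window width of 3 samples is an arbitrary choice.
cpu <- c(21.1, 17, 18, 41, 42, 60, 62, 62, 61, 50, 49)    # the example series above
k <- 3
smoothed <- stats::filter(cpu, rep(1 / k, k), sides = 2)  # centered moving average
smoothed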
It looks like you are using OS-provided CPU usage rather than hardware registers. This means the OS is already doing some smoothing, and it may also be estimating some performance values. Try to find documentation on this if you are integrating over a smaller window. A word of warning: this level of information can be hard to find, and you may have to do a lot of digging. Depending on your familiarity with kernel code, it may be easier to read the code itself.

Shared memory for parallel processes

It's been a while since I last looked into this (e.g. the now-outdated nws package), so I wondered whether anything has "happened" in the meantime.
Is there a way to share memory across parallel processes?
I would like each process to have access to an environment object that plays the role of a meta object.
The rredis package provides functionality that is similar to nws. You could use rredis with the foreach and doRedis packages, or with any other parallel programming package, such as parallel.
What works quite efficiently is a shared matrix via the bigmemory package. You can serialize/unserialize pretty much every R object into such a matrix.
Unfortunately, the only way you can share the matrices between processes is via their descriptors, which are not deterministic (i.e. unless you communicate with the other process, you cannot get the descriptor). To solve this chicken-and-egg problem, you can save the descriptor to an agreed location on the filesystem. (The descriptor is really small; the only non-trivial thing it contains is the memory address of the actual big.matrix.)
If you are still interested, I can post the R code.
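A minimal sketch of the descriptor-on-disk handshake described above might look like this; the file path is an arbitrary choice, and the two halves run in two separate R sessions (session 1 must stay alive while session 2 attaches, since the data lives in its shared memory segment).
## R session 1: create a shared matrix and publish its descriptor
library(bigmemory)
x <- big.matrix(nrow = 2, ncol = 5, type = "double", init = 0)
x[1, ] <- 1:5
saveRDS(describe(x), "shared_desc.rds")     # tiny file holding the shared-memory handle

## R session 2: attach to the same memory via the descriptor (no copy of the data)
library(bigmemory)
y <- attach.big.matrix(readRDS("shared_desc.rds"))
y[1, ]   # 1 2 3 4 5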
You can do this efficiently via a new yaplr package.
First install it
devtools::install_github('adamryczkowski/yaplr')
R session number 1:
library(yaplr)
send_object(obj=1:10, tag='myobject')
# Server process spawned
R session number 2:
library(yaplr)
list_objects()
# size ctime
# myobject 62 Sat Sep 24 13:01:57 2016
retrieve_object(tag='myobject')
# [1] 1 2 3 4 5 6 7 8 9 10
remove_object('myobject')
quit_server()
The package uses bigmemory::big.matrix for efficient data transfer: when copying a big object between R processes, no unnecessary copies are made, only one serialization and one unserialization.
No network sockets are used.
