I am running a function that is similar to calculating a standard deviation, but takes much longer to run.
I intend to use the function to calculate a cumulative value, i.e. for each n, the standard-deviation-like function applied to days 1 to n.
However, because of the long calculation time, I want to run this over a cluster.
So I want to split the data up so that each node of the cluster finishes at roughly the same time. For example, if my function were as follows, the single-machine method would work in the following way:
library(xts)

vec <- xts(rnorm(1000), Sys.Date() - (1:1000))

lapply(1:length(vec), function(x) {
  Sys.sleep(30)
  sd(as.numeric(vec[1:x]))
})
(N.B. The Sys.sleep is added there to represent the extra time taken to process my custom function.)
However, let's say I wanted to split this over two machines instead of one. How would I split the vector 1:length(vec) so that I could give c(1:y) to machine 1 and c((y+1):length(vec)) to machine 2, and both machines would finish at roughly the same time? That is, what would the value of y be such that both processes complete at roughly the same time? And what if we were to do it over 10 machines: how would one go about finding the breaks in the original vector c(1:length(vec)) for that to work? (A rough sketch of what I mean by balancing cumulative cost rather than index count follows the code below.)
i.e. I would have
y <- 750 # This is just a guess as to where the split might fall.
vec <- xts(rnorm(1000), Sys.Date() - (1:1000))

# on machine 1 I would have
lapply(1:y, function(x) {
  Sys.sleep(30)
  sd(as.numeric(vec[1:x]))
})

# and on machine 2 I would have
lapply((y + 1):length(vec), function(x) {
  Sys.sleep(30)
  sd(as.numeric(vec[1:x]))
})
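To make the intent concrete, here is a rough sketch of what I mean; the cost model (a constant plus a term linear in x) is only a guess, not a measurement of the real function, and the point is simply to balance cumulative cost rather than the number of indices:

# Rough sketch only: the per-index cost model below is hypothetical.
n        <- 1000
machines <- 2                                 # or 10, etc.
cost     <- 30 + seq_len(n)                   # hypothetical cost of the x-th call
groups   <- cut(cumsum(cost), breaks = machines, labels = FALSE)
idx      <- split(seq_len(n), groups)         # idx[[1]] for machine 1, idx[[2]] for machine 2, ...
sapply(idx, length)                           # chunk sizes; max(idx[[1]]) is a candidate for y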
The parallel package is now part of base R, and can help run R on moderately sized clusters, including on Amazon EC2. The function parLapplyLB will distribute work from an input vector over the worker nodes of a cluster.
One thing to know is that makePSOCKcluster is (currently as of R 2.15.2) limited to 128 workers by the NCONNECTIONS constant in connections.c.
Here's a quick example of a session using the parallel package that you can try on your own machine:
library(parallel)
help(package=parallel)
## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
# cl <- makePSOCKcluster(hosts, master=system("hostname -i", intern=TRUE))
## for testing, start a cluster on your local machine
cl <- makePSOCKcluster(rep("localhost", 3))
## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })
## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))
## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
  ## summarize a bunch of random sample means
  summary(
    sapply(1:runif(1, 100, 2000),
           function(j) { mean(rnorm(10000)) }))
}))
## shut down worker processes
stopCluster(cl)
The Bioconductor group has set up a really easy way to get started: Using a parallel cluster in the cloud
For more about using the parallel package on EC2, see: R in the Cloud and for R on clusters in general, see: CRAN Task View: High-Performance and Parallel Computing with R.
Finally, another well established option external to R is Starcluster.
Look at the snow package -- specifically the clusterApplyLB function, which provides a load-balanced apply.
That will actually handle the distribution of work to the nodes/cores more intelligently than just an even partition.
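For example, applied to the question's setup, a minimal sketch might look like this (the host names are placeholders; use rep("localhost", 2) to test locally):

library(snow)
library(xts)

vec <- xts(rnorm(1000), Sys.Date() - (1:1000))

## "machine1"/"machine2" are placeholder host names
cl <- makeSOCKcluster(c("machine1", "machine2"))
clusterEvalQ(cl, library(xts))
clusterExport(cl, "vec")

## clusterApplyLB hands out one index at a time, so slow indices
## don't force an up-front guess at the split point y
res <- clusterApplyLB(cl, seq_along(vec),
                      function(x) sd(as.numeric(vec[1:x])))

stopCluster(cl)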
Consider using Hadoop (aka MapReduce) via RHIPE.
I have created the following MWE to illustrate the code. Consider a system with 20 queues and servers (such that for each server there is a FCFS queue). When an arrival comes, it joins the shortest queue. Say you wish to compare 20 simulations, each over 400 units of time from start to finish.
You can see in my code that the queues are added using a for loop (indeed, this becomes necessary when simulating asymptotically large queueing systems).
library(simmer)
library(simmer.plot)

set.seed(1337)

sim <- simmer()

queues <- vector(length = 10)
for (i in 1:10) {
  queues[i] <- paste0("q_", i)
}

queueing_system <- trajectory() %>%
  select(function() queues, policy = "shortest-queue") %>%
  seize_selected(1) %>%
  timeout(function() rpois(1, 10) + 1) %>%
  release_selected()

for (i in 1:10) {
  sim %>% add_resource(queues[i], 1)
}

sim %>%
  add_generator("path", queueing_system, function() rexp(1, 20))

environments <- lapply(1:20, function(o) {
  sim %>% run(400)
})
queue1 <- list()
for (a in 1:20) {
  exqueue <- get_mon_resources(environments[a])[which(get_mon_resources(environments[a])[1] == "q_1"), ]
  exqueue$total <- exqueue[, 3] + exqueue[, 4]
  queue1[[a]] <- exqueue
}
head(queue1[[1]])
head(queue1[[2]]) #both are the same!
Note the following two lines in particular, where I pull out the total number of people in queue #1 after a discrete event has occurred:
head(queue1[[1]])
head(queue1[[2]]) #both are the same!
Indeed, they are the same between all sims (not just 1 and 2). Is there a way to generate unique sample paths for each instance?
The simulation environment, like every environment in R, has reference semantics. Therefore, here
environments<-lapply(1:20, function(o){
sim%>%run(400)
})
when o=1, the simulation runs. Then, when o=2, ..., 20, you are calling run() on the same simulation environment, which is already at t=400, so nothing more runs. But even if you reset the simulation:
environments<-lapply(1:20, function(o){
sim %>% reset() %>% run(400)
})
This effectively runs 20 different simulations, but you obtain a list of 20 references to the last simulation (again, due to reference semantics).
How to solve this? Simply instantiate a new simulation environment for each replication; i.e., put your call to simmer() inside the replication function, as the introduction vignette does.
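For example, a minimal sketch reusing the queues vector and the queueing_system trajectory from the question:

# Build a fresh simmer() environment inside the replication function
environments <- lapply(1:20, function(o) {
  env <- simmer()
  for (i in 1:10) env %>% add_resource(queues[i], 1)
  env %>%
    add_generator("path", queueing_system, function() rexp(1, 20)) %>%
    run(400)
})

head(get_mon_resources(environments[[1]]))
head(get_mon_resources(environments[[2]]))  # now different sample paths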
R is single-threaded.
Using R, how to check how many cores/threads are running R in Windows and Linux? (Or how many Rs are running)
Using R, how to check the usage of each core that is running R in Windows and Linux? (Or the percentage of CPU each R is using)
For example, suppose I have two R sessions open, each running a project. I would expect two processes running R, each using some percentage of CPU. Then I open a third R session. How can I use that third session to check the number of R processes (2 in this case) and the percentage of CPU each of them is using?
If you open multiple R sessions, the operating system will typically schedule each one on a different core, up to the maximum number of cores that you have. This happens automatically on Windows and Mac. If you want to know how many cores you have, you can run:
library(parallel)
detectCores()
On Linux you can send a ps command to the system: it gives you the average CPU usage and the memory usage of the processes called rsession:
splitted <- strsplit(system("ps -C rsession -o %cpu,%mem,pid,cmd", intern = TRUE), " ")
df <- do.call(rbind, lapply(splitted[-1],
                            function(x) data.frame(
                              cpu = as.numeric(x[2]),
                              mem = as.numeric(x[4]),
                              pid = as.numeric(x[5]),
                              cmd = paste(x[-c(1:5)], collapse = " "))))
df
# cpu mem pid cmd
#1 0.8 0.7 11001 /usr/lib/rstudio/bin/rsession
#2 0.0 0.2 12397 /usr/lib/rstudio/bin/rsession
#3 0.1 0.7 14960 /usr/lib/rstudio/bin/rsession
#4 0.4 0.2 26122 /usr/lib/rstudio-server/bin/rsession
#5 0.3 8.3 35782 /usr/lib/rstudio/bin/rsession
You can probably improve it to get the parent id and the instantaneous CPU usage with other options passed to ps or top and deduce the number of cores used by each session.
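For instance (assuming GNU ps; the extra columns are my own choice, not part of the answer's code), something like this adds the parent pid and sorts by the %cpu column:

## GNU ps assumed; adds the parent pid and sorts by %cpu
system("ps -C rsession -o %cpu,%mem,pid,ppid,cmd --sort=-%cpu", intern = TRUE)

## note that ps reports an average over the process lifetime; for a snapshot,
## one could parse `top` in batch mode instead, e.g. `top -b -n 1`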
On Windows you can try this:
a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
df[grepl("Rgui|rstudio", df$process),]
# process cpu
# 105 Rgui 0
# 108 rstudio 0
Using R, how to check how many cores/threads are running R in Windows
and Linux? (Or how many Rs are running)
One valid answer that I haven't seen here yet is simply to use the ps R package: with the function ps() you can subset the returned table to processes with the name "rsession":
ps::ps()[ps::ps()$name == "rsession",]
The number of rows gives you the number of rsession processes existing on the computer/server:
nrow(ps::ps()[ps::ps()$name == "rsession",])
I am not entirely sure what the function ps_num_threads() does, but it might also be interesting to check whether the result makes sense:
ps::ps_num_threads(ps::ps_handle())
Unfortunately, I did not find anything about %CPU usage in the ps R package, but you can try the function I quote in my other answer; it should work under Linux.
For those who would like to know the number of cores/CPUs and/or the number of workers/compute nodes that 1) are available on the machine or 2) are allocated by the HPC cluster on which the current R program is running, try this (using functions from the parallel and future packages):
library(parallel) # for parallel::mclapply() and the total core count on the compute node / workstation: detectCores()
library(future)   # for the cores / workers available to this job: availableWorkers() / availableCores()

workers <- availableWorkers()
cat(sprintf("#workers/#availableCores/#totalCores: %d/%d/%d, workers:\n",
            length(workers), availableCores(), detectCores()))
print(workers)
For those interested, a friend and I have created a Github Repository with functions to probe ongoing processes on a computer/server from a R console.
Here is the link: https://github.com/mathosi/cluster_check
To answer the question you can use the function I made ps.to.df():
devtools::source_url("https://github.com/mathosi/cluster_check/blob/master/ps_to_df.R?raw=TRUE")
ps.to.df() #List All processes sorted by %CPU usage
ps.to.df(bylist.selection = "C rsession") #List All processes named 'rsession' sorted by %CPU usage
The output is a data.frame, so you can sort or subset it any way you want in R to look for whatever you want to see!
I haven't yet tried all the possible queries ps.to.df() supports, but I guess it should handle others as well.
There is room to improve the flexibility and readability of the outputs, and maybe to create additional functions too. Anyone interested can join and contribute.
There is a simpler method, using the benchmarkme package.
library(benchmarkme)
get_cpu()$no_of_cores
I'm currently developing an R package that will be using parallel computing to solve some tasks, through means of the "parallel" package.
I'm getting some really awkward behavior when using clusters defined inside functions of my package, where parLapply assigns a job to a worker and waits for it to finish before assigning a job to the next worker.
Or at least this is what appears to be happening, through the observation of the log file "cluster.log" and the list of running processes in the unix shell.
Below is a mockup version of the original function declared inside my package:
.parSolver <- function(varMatrix, var1) {
  no_cores <- detectCores()

  # Rows in varMatrix
  rows <- 1:nrow(varMatrix[,])

  # Split rows in n parts
  n <- no_cores
  parts <- split(rows, cut(rows, n))

  # Initiate cluster
  cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
  clusterEvalQ(cl, library(raster))
  clusterExport(cl, "varMatrix", envir = environment())
  clusterExport(cl, "var1", envir = environment())

  rParts <- parLapply(cl = cl, X = 1:n, fun = function(x) {
    part <- rasterize(varMatrix[parts[[x]], ], raster(var1), .....)
    print(x)
    return(part)
  })

  do.call(merge, rParts)
}
NOTES:
I'm using makePSOCKcluster because I want the code to run on Windows and Unix systems alike, although this particular problem only manifests itself on a Unix system.
The functions rasterize and raster are defined in library(raster), which is loaded on each worker of the cluster.
The weird part to me is that if I execute the exact same code as the parSolver function in the global environment, everything works smoothly: all workers take a job at the same time and the task completes in no time.
However if I do something like:
library(myPackage)
varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)
the described problem appears.
It appears to be a load-balancing problem; however, that does not explain why it works fine in one situation and not in the other.
Am I missing something here?
Thanks in advance.
I don't think parLapply is running sequentially. More likely, it's just running inefficiently, making it appear to run sequentially.
I have a few suggestions to improve it:
Don't define the worker function inside parSolver
Don't export all of varMatrix to each worker
Create the cluster outside of parSolver
The first point is important, because as your example now stands, all of the variables defined in parSolver will be serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any function, the serialization won't capture any unwanted variables.
The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
Here's a fake, but self-contained example that is similar to yours that demonstrates my suggestions:
# Define worker function outside of any function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
  library(raster)
  mat * var1
}

parSolver <- function(cl, varMatrix, var1) {
  parts <- splitIndices(nrow(varMatrix), length(cl))
  varMatrixParts <- lapply(parts, function(i) varMatrix[i, , drop = FALSE])
  rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
  do.call(rbind, rParts)
}
library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)
Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code, as well as making it a bit more efficient.
I am trying to use the doParallel and foreach packages, but I'm getting a reduction in performance with the bootstrapping example in the guide found here: CRANpage.
library(doParallel)
library(foreach)
registerDoParallel(3)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
  r <- foreach(icount(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(result1)
  }
})[3]
ptime
This example returns 56.87.
When I change the dopar to just do to run it sequentially instead of in parallel, it returns 36.65.
If I do registerDoParallel(6) it gets the parallel time down to 42.11 but is still slower than sequentially. registerDoParallel(8) gets 40.31 still worse than sequential.
If I increase trials to 100,000 then the sequential run takes 417.16 and the parallel run with 3 workers takes 597.31. With 6 workers in parallel it takes 425.85.
My system is
Dell Optiplex 990
Windows 7 Professional 64-bit
16GB RAM
Intel i-7-2600 3.6GHz Quad-core with hyperthreading
Am I doing something wrong here? If I do the most contrived thing I can think of (replacing the computational code with Sys.sleep(1)), then I get a reduction in run time closely proportional to the number of workers. I'm left wondering why the example in the guide decreases performance for me, while for them it sped things up.
The underlying problem is that doParallel executes attach for every task execution on the workers of the PSOCK cluster in order to add the exported variables to the package search path. This resolves various scoping issues, but can hurt performance significantly, particularly with short duration tasks and large amounts of exported data. This doesn't happen on Linux and Mac OS X with your example, since they will use mclapply, rather than clusterApplyLB, but it will happen on all platforms if you explicitly register a PSOCK cluster.
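For illustration, explicitly registering a PSOCK cluster looks roughly like this (a sketch, not a fix: it simply forces the clusterApplyLB code path, and therefore the per-task attach overhead, on every platform):

library(doParallel)

# Explicitly registered PSOCK cluster: clusterApplyLB is used on Linux
# and Mac OS X too, instead of mclapply
cl <- makePSOCKcluster(3)
registerDoParallel(cl)

# ... run the foreach() bootstrap loop from the question here ...

stopCluster(cl)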
I believe that I've figured out how to resolve the task scoping problems in a different way that doesn't hurt performance, and I'm working with Revolution Analytics to get the fix into the next release of doParallel and doSNOW, which also has the same problem.
You can work around this problem by using task chunking:
ptime2 <- system.time({
  chunks <- getDoParWorkers()
  r <- foreach(n = idiv(trials, chunks = chunks), .combine = 'cbind') %dopar% {
    y <- lapply(seq_len(n), function(i) {
      ind <- sample(100, 100, replace = TRUE)
      result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
      coefficients(result1)
    })
    do.call('cbind', y)
  }
})[3]
This results in only one task per worker, so each worker only executes attach once, rather than trials / 3 times. It also results in fewer but larger socket operations, which can be performed more efficiently on most systems, but in this case, the critical issue is attach.
For several efforts I'm involved in at the moment, I am running large datasets with numerous parameter combinations through a series of functions. The functions have a wrapper (so I can mclapply) for ease of operation on a cluster. However, I run into two major challenges.
a) My parameter combinations are large (think 20k to 100k). Sometimes particular combinations will fail (e.g. survival is too high and mortality is too low so the model never converges as a hypothetical scenario). It's difficult for me to suss out ahead of time exactly which combinations will fail (life would be easier if I could do that). But for now I have this type of setup:
library(plyr)      # for failwith() and llply()
library(parallel)  # for mclapply()

failsafe <- failwith(NULL, my_wrapper_function)
# This is what I run
# Note that input_variables contains a list of variables in each list item
results <- mclapply(input_variables, failsafe, mc.cores = 72)
# On my local dual core mac, I can't do this so the equivalent would be:
results <- llply(input_variables, failsafe, .progress = 'text')
The skeleton for my wrapper function looks like this:
my_wrapper_function <- function(tlist) {
  run <- tryCatch(my_model(tlist$a, tlist$b, tlist$sA, tlist$Fec, m = NULL), error = function(e) NULL)
  ...
  return(run)
}
Is this the most efficient approach? If for some reason a particular combination of variables crashes the model, I need it to return NULL and carry on with the rest. However, I still find that this fails less than gracefully.
b) Sometimes a certain combination of inputs does not crash the model but takes too long to converge. I set a limit on the computation time on my cluster (say 6 hours) so I don't waste resources on something that is stuck. How can I include a timeout so that if a function call takes more than x amount of time on a single list item, it moves on? Measuring the time spent is trivial, but a function in mid-simulation can't be interrupted to check the time, right?
Any ideas, solutions or tricks are appreciated!
You may well be able to manage graceful exits upon timeout using a combination of tryCatch() and evalWithTimeout() from the R.utils package.
See also this post, which presents similar code and unpacks it in a bit more detail.
require(R.utils)
myFun <- function(x) {Sys.sleep(x); x^2}
## evalWithTimeout() times out evaluation after 3.1 seconds, and then
## tryCatch() handles the resulting error (of class "TimeoutException") with
## grace and aplomb.
myWrapperFunction <- function(i) {
tryCatch(expr = evalWithTimeout(myFun(i), timeout = 3.1),
TimeoutException = function(ex) "TimedOut")
}
sapply(1:5, myWrapperFunction)
# [1] "1" "4" "9" "TimedOut" "TimedOut"