Future solutions - r

I am working with a large data set that I use for certain calculations. Because the data set is huge, my machine takes excessively long to do the job, so I decided to use the future package to distribute the work across several machines and speed up the calculations.
My problem is that through future (using PuTTY and SSH) I can connect to those machines in parallel, but the main machine does all the work itself, without any distribution. Maybe you can advise on a solution:
How to make it work on all machines;
Also, how to check whether the process is working (I mean some function or anything else that could help verify that the workers are functioning, if such a thing exists).
My code:
library(future)
workers <- c("000.000.0.000", "111.111.1.111")
plan(remote, envir = parent.frame(), workers= workers, myip = "222.222.2.22")
start <- proc.time()
cl <- makeClusterPSOCK(
c("000.000.0.000", "111.111.1.111"), user = "...",
rshcmd = c("plink", "-ssh", "-pw", "..."),
rshopts = c("-i", "V:\\vbulavina\\privatekey.ppk"),
homogeneous = FALSE)
setwd("V:/vbulavina/r/inversion")
a <- source("fun.r")
f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})
time_elapsed_parallel <- proc.time() - start
time_elapsed_parallel
The f and l objects are supposed to be computed in parallel, but the master machine is doing all the work, so I'm a bit confused about whether I can do something about it.
PS: I tried plan() with remote, multiprocess, multisession, and cluster, and nothing worked.
PS2: My local machine runs Windows and I am trying to connect to Kubuntu and Debian machines (the firewall is off on all of them).
Thanks in advance.

Author of future here. First, make sure you can set up the PSOCK cluster, i.e. connect to the two workers over SSH and run Rscript on them. You do this as follows:
library(future)
workers <- c("000.000.0.000", "111.111.1.111")
cl <- makeClusterPSOCK(workers, user = "...",
rshcmd = c("plink", "-ssh", "-pw", "..."),
rshopts = c("-i", "V:/vbulavina/privatekey.ppk"),
homogeneous = FALSE)
print(cl)
### socket cluster with 2 nodes on hosts '000.000.0.000', '111.111.1.111'
(If the above makeClusterPSOCK() stalls or doesn't work, add argument verbose = TRUE to get more info - feel free to report back here.)
Next, with the PSOCK cluster set up, tell the future system to parallelize over those two workers:
plan(cluster, workers = cl)
Test that futures are actually resolved remotely, e.g.
f <- future(Sys.info()[["nodename"]])
print(value(f))
### [1] "000.000.0.000"
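The same kind of check can be rehearsed without any remote machines. The sketch below is only an illustration, assuming two local background sessions (plan(multisession)) as stand-ins for the SSH workers: each future reports the process ID of the R session that evaluated it, and resolved() polls a running future without blocking. The variable names are made up for the example.

```r
library(future)

# Assumption: two local background R sessions stand in for the remote machines
plan(multisession, workers = 2)

main_pid <- Sys.getpid()

# Each future returns the process ID of the R session that evaluated it
f1 <- future(Sys.getpid())
f2 <- future(Sys.getpid())
worker_pids <- c(value(f1), value(f2))
print(worker_pids)  # should differ from main_pid

# resolved() checks on a future without blocking; value() waits for the result
f3 <- future({ Sys.sleep(1); "done" })
print(resolved(f3))
status <- value(f3)
print(status)

# Shut the background sessions down again
plan(sequential)
```

With a real PSOCK cluster, the same pattern applies after plan(cluster, workers = cl); replacing Sys.getpid() with Sys.info()[["nodename"]] then shows which machine did the work.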
I leave the remaining part, which also needs adjustments, for now - let's make sure to get the workers up and running first.
Continuing, using source() in parallel processing complicates things, especially when the parallelization is done on different machines. For instance, calling source("my_file.R") on another machine requires that the file my_file.R is available on that machine too. Even if it is, it also complicates things when it comes to the automatic identification of variables that need to be exported to the external machine. A safer approach is to incorporate all the code in the main script. Having said all this, you can try to replace:
f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})
with
futureSource <- function(file, envir = parent.frame(), ...) {
expr <- parse(file)
future(expr, substitute = FALSE, envir = envir, ...)
}
f <- futureSource("pasos.r")
l <- futureSource("pasos2.R")
As long as pasos.r and pasos2.R don't call source() internally, this could/should work.
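To illustrate the "safer approach" of keeping all code in the main script, here is a minimal sketch. The functions step_one() and step_two() and the data in dat are hypothetical stand-ins for whatever pasos.r and pasos2.R do; local background sessions are used so the example runs on one machine. Because the functions are defined in the main script, the future framework can identify them as globals and export them to the workers.

```r
library(future)

# Assumption: local background sessions instead of the remote PSOCK cluster
plan(multisession, workers = 2)

# Hypothetical stand-ins for the contents of pasos.r and pasos2.R
step_one <- function(x) sum(x^2)
step_two <- function(x) mean(x)

dat <- 1:10

# step_one, step_two and dat are picked up as globals and shipped to the workers
f <- future(step_one(dat))
l <- future(step_two(dat))

res <- list(one = value(f), two = value(l))
print(res)

plan(sequential)
```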
BTW, what version of Windows are you on? Because with an up-to-date Windows 10, you have built-in support for SSH and you no longer need to use PuTTY.
UPDATE 2018-07-31: Continue answer regarding using source() in futures.

Related

'mc.cores' > 1 is not supported on Windows

I am new to programming and I am trying to use parallel processing for R in windows, using an existing code.
Following is the snippet of my code:
if (length(grep("linux", R.version$os)) == 1){
num_cores = detectCores()
impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
}
# else if(length(grep("mingw32", R.version$os)) == 1){
# num_cores = detectCores()
# impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
#
# }
else{
impact_list <- lapply(len_a, impact_func)
}
return(sum(unlist(impact_list, use.names = F)))
This works fine. I am using R on Windows, so the code enters the 'else' branch and runs with lapply() rather than with parallel processing.
I added the 'else if' branch to make it work on Windows. When I uncomment the 'else if' block and run it, I get the error "'mc.cores' > 1 is not supported on Windows".
Please suggest how I can use parallel processing on Windows so that the code takes less time to run.
Any help will be appreciated.
(disclaimer: I'm author of the future framework here)
The future.apply package provides parallel versions of R's built-in "apply" functions. It's cross platform, i.e. it works on Linux, macOS, and Windows. The package allows you to often just replace an existing lapply() with a future_lapply() call, e.g.
library(future.apply)
plan(multisession)
your_fcn <- function(len_a) {
impact_list <- future_lapply(len_a, impact_func)
sum(unlist(impact_list, use.names = FALSE))
}
Regarding mclapply() per se: If you use parallel::mclapply() in your code, make sure that there is always an option not to use it. The reason is that it is not guaranteed to work in all environments, that is, it might be unstable and crash R. In R-devel thread 'mclapply returns NULLs on MacOS when running GAM' (https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html), the author of mclapply() wrote on 2020-04-28:
Do NOT use mcparallel() in packages except as a non-default option that user can set for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don't know the resource available so only the user can tell you when it's safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.
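One way to follow that advice is to make parallelism strictly opt-in. The sketch below is an assumption about how this could look, not code from the thread: opt_in_lapply() is a hypothetical helper that runs sequentially unless the user requests more than one core (via the cores argument or the "mc.cores" option), uses mclapply() on Unix-alikes, and falls back to a PSOCK cluster on Windows. impact_func() is a made-up placeholder.

```r
library(parallel)

# Hypothetical helper: parallel only when the user explicitly asks for > 1 core
opt_in_lapply <- function(X, FUN, ..., cores = getOption("mc.cores", 1L)) {
  if (cores <= 1L) {
    # Default: plain sequential lapply()
    lapply(X, FUN, ...)
  } else if (.Platform$OS.type == "windows") {
    # Forking is unavailable on Windows, so use a socket (PSOCK) cluster
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl), add = TRUE)
    parLapply(cl, X, FUN, ...)
  } else {
    mclapply(X, FUN, ..., mc.cores = cores)
  }
}

# Placeholder workload
impact_func <- function(i) i^2
len_a <- 1:20

serial_total   <- sum(unlist(opt_in_lapply(len_a, impact_func, cores = 1L)))
parallel_total <- sum(unlist(opt_in_lapply(len_a, impact_func, cores = 2L)))
print(c(serial_total, parallel_total))
```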

How to shut down an open R cluster connection using parallel

In the question here, the OP mentioned using kill to stop each individual process. I wasn't aware that connections remain open if you push "stop" while running this in parallel in RStudio on Windows 10, and like a fool I tried to run the same thing 4-5 times, so now I have about 15 open connections on my poor 3-core machine eating up all of my CPU. I can restart R, but then I have to rebuild all of these unsaved objects, which will take a good hour, and I'd rather not waste the time. Likewise, the answers in the linked post are great, but all of them are about how to prevent the issue in the future, not how to actually solve it once you have it.
So I'm looking for something like:
# causes problem
lapply(c('doParallel','doSNOW'), library, character.only = TRUE)
n_c <- detectCores()-1
cl<- makeCluster(n_c)
registerDoSNOW(cl)
stop()
stopCluster(cl) #not reached
# so to close off the connection we use something like
a <- showConnections()
cls$description %>% kill
The issue is very frustrating, any help would be appreciated.
Use
autoStopCluster <- function(cl) {
stopifnot(inherits(cl, "cluster"))
env <- new.env()
env$cluster <- cl
attr(cl, "gcMe") <- env
reg.finalizer(env, function(e) {
message("Finalizing cluster ...")
message(capture.output(print(e$cluster)))
try(parallel::stopCluster(e$cluster), silent = FALSE)
message("Finalizing cluster ... done")
})
cl
}
and then set up your cluster as:
cl <- autoStopCluster(makeCluster(n_c))
Old cluster objects no longer reachable will then be automatically stopped when garbage collected. You can trigger the garbage collector by calling gc(). For example, if you call:
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
gc()
and watch your OS's process monitor, you'll see lots of workers being launched, but eventually, when the garbage collector runs, only the most recent set of cluster workers remains.
EDIT 2018-09-05: Added debug output messages to show when the registered finalizer runs, which happens when the garbage collector runs. Remove those message() lines and use silent = TRUE if you want it to be completely silent.
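For workers that are already orphaned, a possible cleanup is to close the socket connections that the current session still holds. This is a sketch under assumptions, not a guaranteed fix: it assumes the workers were started from this session as PSOCK workers, that a worker exits once its connection to the master is closed, and that there are no other socket connections you want to keep.

```r
library(parallel)

# Start a cluster and pretend the handle was lost before stopCluster() ran
cl <- makeCluster(2)

# List the connections this session holds; row names are connection numbers
conns <- showConnections(all = FALSE)
print(conns[, c("description", "class"), drop = FALSE])

# Socket connections to workers have class "sockconn"
sock_ids <- as.integer(rownames(conns)[conns[, "class"] == "sockconn"])

# Close each socket connection; the worker on the other end should then exit
for (id in sock_ids) {
  close(getConnection(id))
}

remaining <- showConnections(all = FALSE)
n_left <- sum(remaining[, "class"] == "sockconn")
print(n_left)
```

If worker processes still linger afterwards, they have to be ended from the operating system's process monitor.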

parallel reads from Athena (AWS) database, via R

I've got a largish dataset on an Athena database on AWS. I'd like to read from it in parallel, and I'm accustomed to the foreach package's approach to forking from within R.
I'm using RJDBC
Here's what I am trying:
out <- foreach(i = 1:length(fipsvec), .combine = rbind, .errorhandling = "remove") %dopar% {
coni <- dbConnect(driver, "jdbc:awsathena://<<location>>/",
s3_staging_dir="my_directory",
user="...",
password="...")
print(paste0("starting ", i))
sqlstring <- paste0("SELECT ",
"My_query_body",
fipsvec[i]
)
row <- fetch(dbSendQuery(coni, sqlstring), -1, block = 999)
print(i)
dbDisconnect(coni)
rm(coni)
gc()
return(row)
}
(Sorry I can't make this reproducible -- I obviously can't hand out the keys to the DB online.)
When I run this, the first c = number of cores steps run fine, but then it hangs and does nothing -- indefinitely as far as I can tell. htop shows no activity on any of the cores. And when I change the for loop to only loop over c entries, the output is what I expect. When I change from parallel to serial (%do% instead of %dopar%), it also works fine.
Does this have something to do with the connection not being closed properly, or somehow being defined redundantly? I've placed the connection within the parallel loop, so each core should have its own connection in its own environment. But I don't know enough about databases to tell whether this is sufficiently distinct.
I'd appreciate answers that help me understand what's going on under the hood here -- it's all voodoo to me at this point.
Are you passing the RJDBC package (and its dependencies: methods, DBI, and rJava) into the cluster anywhere?
If not, the first line of your code should look something like the one below:
results <- foreach(i = 1:length(fipsvec),
.combine = rbind,
.errorhandling = "remove",
.packages=c('methods','DBI','rJava','RJDBC')) %dopar% {
One thing that I suspect (but don't know) might make things a little hairier is that RJDBC uses a JVM to execute the queries. Not super knowledgeable about how rJava handles JVM initialization, and if each of the threads may be trying to re-use the same JVM simultaneously, or if they have enough information about the external environment to properly initialize one in the first place.
Another troubleshooting step if the above doesn't work might be to move the assignment step for driver into the %dopar% environment.
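The idea of giving every worker its own packages and its own driver can be sketched with base R's parallel package, used here only to keep the example self-contained. Everything database-related is replaced by stand-ins: library(stats) plays the role of loading DBI/rJava/RJDBC, and worker_resource (a hypothetical name) plays the role of a driver or connection object created inside each worker rather than copied from the master.

```r
library(parallel)

cl <- makeCluster(2)

# Per-worker setup: this block is evaluated once in every worker process
invisible(clusterEvalQ(cl, {
  library(stats)                    # stand-in for DBI, rJava, RJDBC
  worker_resource <- Sys.getpid()   # stand-in for a per-worker driver/connection
  NULL
}))

# Each task uses the resource that lives in its own worker's global environment
res <- parLapply(cl, 1:4, function(i) {
  c(task = i, pid = get("worker_resource", envir = globalenv()))
})

stopCluster(cl)

pids <- vapply(res, function(r) r[["pid"]], numeric(1))
print(unique(pids))  # the worker PIDs, none equal to the master's
```

With foreach, the analogous move is to call the driver constructor and dbConnect() inside the %dopar% body so that nothing JVM-backed is shipped from the master.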
On another track, how many rows are in your result set? If the result set is in the million+ row range and can be returned with a single query, I actually came across an opportunity for optimization within the RJDBC package and have an open pull request on github ( https://github.com/s-u/RJDBC/pull/50 ) that I haven't heard anything on but have been using myself for a couple months. There's a basic benchmark documented in the pull request, I found the speedup to be substantial on the particular query I was running.
If it seems applicable you can install the branch with:
library(devtools)
devtools::install_github("msummersgill/RJDBC",ref = "harmonize", force = TRUE)

curl memory usage in R for multiple files in parLapply loop

I have a project that's downloading ~20 million PDFs multithreaded on an ec2. I'm most proficient in R and it's a one off so my initial assessment was that the time savings from bash scripting wouldn't be enough to justify the time spent on the learning curve. So I decided just to call curl from within an R script. The instance is a c4.8xlarge, rstudio server over ubuntu with 36 cores and 60 gigs of memory.
With any method I've tried it runs up to the max ram fairly quickly. It runs alright but I'm concerned swapping the memory is slowing it down. curl_download or curl_fetch_disk work much more quickly than the native download.file function (one pdf per every .05 seconds versus .2) but those both run up to max memory extremely quickly and then seem to populate the directory with empty files. With the native function I was dealing with the memory problem by suppressing output with copious usage of try() and invisible(). That doesn't seem to help at all with the curl package.
I have three related questions if anyone could help me with them.
(1) Is my understanding of how memory is utilized correct in that needlessly swapping memory would cause the script to slow down?
(2) curl_fetch_disk is supposed to be writing direct to disk, does anyone have any idea as to why it would be using so much memory?
(3) Is there any good way to do this in R or am I just better off learning some bash scripting?
Current method with curl_download
getfile_sweep.fun <- function(url
,filename){
invisible(
try(
curl_download(url
,destfile=filename
,quiet=T
)
)
)
}
Previous method with native download.file
getfile_sweep.fun <- function(url
,filename){
invisible(
try(
download.file(url
,destfile=filename
,quiet=T
,method="curl"
)
)
)
}
parLapply loop
len <- nrow(url_sweep.df)
gc.vec <- unlist(lapply(0:35, function(x) x + seq(
from=100,to=len,by=1000)))
gc.vec <- gc.vec[order(gc.vec)]
start.time <- Sys.time()
ptm <- proc.time()
cl <- makeCluster(detectCores()-1,type="FORK")
invisible(
parLapply(cl,1:len, function(x){
invisible(
try(
getfile_sweep.fun(
url = url_sweep.df[x,"url"]
,filename = url_sweep.df[x,"filename"]
)
)
)
if(x %in% gc.vec){
gc()
}
}
)
)
stopCluster(cl)
Sweep.time <- proc.time() - ptm
Sample of data -
Sample of url_sweep.df:
https://www.dropbox.com/s/anldby6tcxjwazc/url_sweep_sample.rds?dl=0
Sample of existing.filenames:
https://www.dropbox.com/s/0n0phz4h5925qk6/existing_filenames_sample.rds?dl=0
Notes:
1- I do not have such powerful system available to me, so I cannot reproduce every issue mentioned.
2- All the comments are being summarized here
3- It was stated that machine received an upgrade: EBS to provisioned SSD w/ 6000 IOPs/sec, however the issue persists
Possible issues:
A- If memory swapping starts to happen, then you are not working purely with RAM anymore, and I think R would have a harder and harder time finding available contiguous memory.
B- The workload and the time it takes to finish it, compared to the number of cores
C- The parallel setup and the fork cluster
Possible solutions and troubleshooting:
B- Limiting memory usage
C- Limiting number of cores
D- If the code runs fine on a smaller machine like a personal desktop, then the issue is with how the parallel usage is set up, or something with the fork cluster.
Things to still try:
A- In general, running jobs in parallel incurs overhead, and the more cores you have, the more you will see its effects. When you pass a lot of jobs that each take very little time (think less than a second), the overhead of constantly dispatching jobs increases. Try limiting the cores to 8, just like your desktop, and run your code. Does it run fine? If yes, then try increasing the workload as you increase the cores available to the program.
Start at the lower end of the spectrum for the number of cores and the amount of RAM, and increase them as you increase the workload to see where it falls over.
B- I will post a summary about parallelism in R; this might help you catch something that we have missed
What worked:
Limiting the number of cores fixed the issue. As mentioned by the OP, he also made other changes to the code; however, I do not have access to them.
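A rough sketch of the "fewer cores, bigger batches" idea follows. It is not the OP's final code (which was never posted): fetch_one() is a hypothetical stand-in for the download function, the worker count of 2 is arbitrary, and a default socket cluster is used instead of a FORK cluster so the example runs on any platform. parLapply() already splits its input across workers; building the chunks explicitly with splitIndices() just makes the batch size something you can tune while raising the worker count step by step.

```r
library(parallel)

# Hypothetical stand-in for downloading file number i
fetch_one <- function(i) sprintf("file_%03d.pdf", i)

n_tasks   <- 1000L
n_workers <- 2L          # start low, then raise it while watching memory use

# Group the task indices into a few batches per worker
chunks <- splitIndices(n_tasks, n_workers * 4L)

cl <- makeCluster(n_workers)

# Each call handles a whole batch; fetch_one is passed along explicitly
res <- parLapply(cl, chunks, function(idx, f) vapply(idx, f, character(1)),
                 f = fetch_one)

stopCluster(cl)

out <- unlist(res, use.names = FALSE)
print(length(out))
```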
You can use the async interface instead. Short example below:
cb_done <- function(resp) {
filename <- basename(urltools::path(resp$url))
writeBin(resp$content, filename)
}
pool <- curl::new_pool()
for (u in urls) curl::curl_fetch_multi(u, pool = pool, done = cb_done)
curl::multi_run(pool = pool)

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
What they all seem to assume is that I have some kind of loop function operating on e.g. every Nth row of a data set divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that what I have is of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage, I guess). How can I farm the first 8 jobs out to 8 cores so that, whenever a core finishes its job and returns a treated dataset to the global environment, it simply takes whichever job is next in line?
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
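If the many tempjobN variables become unwieldy, the same thing can be written with explicit futures kept in a list. The sketch below is only an illustration: this myFunction() is a made-up placeholder that returns a one-row data frame, the argument lists in jobs are invented, and base rbind is used in place of rbindlist() to avoid an extra dependency. With two workers and three jobs, the third future is launched as soon as a worker becomes free.

```r
library(future)
plan(multisession, workers = 2)

# Placeholder for the real myFunction(filepath, string.arg1, string.arg2)
myFunction <- function(path, arg1, arg2) {
  data.frame(path = path, key = paste(arg1, arg2), stringsAsFactors = FALSE)
}

# One list of arguments per independent job (invented values)
jobs <- list(
  list("a.csv", "x", "asc"),
  list("b.csv", "y", "desc"),
  list("c.csv", "z", "asc")
)

# Launch one future per job; each is picked up by the next free worker
futures <- lapply(jobs, function(job) future(do.call(myFunction, job)))

# Collect the results in job order and stack them
results <- lapply(futures, value)
df <- do.call(rbind, results)
print(df)

plan(sequential)
```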
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
list(filePath1, stringArgs11, stringArgs21),
list(filePath2, stringArgs12, stringArgs22),
...
list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
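library(parallel)
cl <- makeCluster(detectCores())
# Each element of args is itself a list of arguments, so unpack it with do.call();
# myFunction is passed along explicitly so that the workers receive it
df <- parSapply(cl, args, function(a, f) do.call(f, a), f = myFunction)
stopCluster(cl)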
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.

Resources