Processes become zombies while R parallel session is still working

I'm trying to query my DB a large number of times and apply some logic to each query's result set.
I'm using ROracle and %dopar% in order to do so. (BTW, my first try was with RJDBC, but I switched to ROracle because I got "Error reading from connection"; I no longer get that error, but I have the problem described below.)
The problem is that most of the processes die (become zombies) during the parallel session. I monitor this using the top command on my Linux system, the log file that shows the progress of my parallel loop, and by watching my DB during the session. When I start the program, I see that the workers are loaded and the program progresses at a high pace, but then most of them die and the program becomes slow (or stops working altogether) with no error message.
Here is some example code of what I'm trying to do:
library(doParallel)
library(ROracle)

temp <- function(i) {
  # Because you can't get access to my DB, it's irrelevant to fill in the
  # following rows (where I put three dots) - but I checked my DB connection
  # and it works fine.
  drv <- ...
  host <- ...
  port <- ...
  sid <- ...
  connect.string <- paste(...)
  conn_oracle <- dbConnect(drv, username = ..., password = ..., dbname = connect.string)
  myData <- dbGetQuery(conn_oracle, sprintf("SELECT '%s' FROM dual", i))
  print(i)
  dbDisconnect(conn_oracle)
  myData  # return the query result rather than the dbDisconnect() status
}
cl <- makeCluster(10, outfile = "par_log.txt")
registerDoParallel(cl)
output <- foreach(i = 1:100000, .inorder = TRUE, .verbose = TRUE, .combine = 'rbind',
                  .packages = c('ROracle'),
                  .export = c('temp')) %dopar% { temp(i) }
stopCluster(cl)
Any help will be appreciated!
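For comparison, here is a minimal sketch of the same loop reworked so that each worker opens one connection per chunk of indices rather than one per task. This chunking scheme is an assumption on my part, not the original approach, and the connection placeholders ("...") are kept from the question:
library(doParallel)
library(ROracle)

# Sketch only: each worker handles one chunk of indices and reuses a single
# connection for the whole chunk, instead of opening/closing 100,000 connections.
run_chunk <- function(ids) {
  drv <- ...                     # driver setup as in the question (placeholder)
  connect.string <- paste(...)   # connect string as in the question (placeholder)
  conn_oracle <- dbConnect(drv, username = ..., password = ..., dbname = connect.string)
  on.exit(dbDisconnect(conn_oracle))
  do.call(rbind, lapply(ids, function(i) {
    dbGetQuery(conn_oracle, sprintf("SELECT '%s' FROM dual", i))
  }))
}

chunks <- split(1:100000, rep(1:10, length.out = 100000))  # 10 chunks, one per worker
cl <- makeCluster(10, outfile = "par_log.txt")
registerDoParallel(cl)
output <- foreach(ids = chunks, .combine = 'rbind',
                  .packages = 'ROracle',
                  .export = 'run_chunk') %dopar% run_chunk(ids)
stopCluster(cl)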

Related

Can writing to a file while running foreach in R cause connection errors?

I'm fitting models to about 1,000 different time series in parallel. The code seems to run on some machines but not on others. The error I'm getting (in some cases) is:
Error in { : task 135 failed - “cannot open the connection”
The task # differs from time to time. Here's a sample of my foreach code:
parallel_df <- foreach(sku = SKUList, .packages = loadedNamespaces(), .combine = rbind) %dopar% {
  # This prints out the status of the operation in a file
  # so that you can check how deep into the SKUList the code is.
  # See "Status.txt"
  writeLines(paste("Training Until: ", dt, "\nFitting for: ", sku, "\nKey",
                   match(sku, SKUList), "of", length(SKUList), "keys"),
             con = "Status.txt")
  # Other things the loop does
}
My question is: could the connection error be caused by that exact line (writeLines)? I'm guessing it could be the issue because it is the only line in the entire loop that uses an external connection (i.e., a text file to which the code writes the status).
If so, why does this cause problems on certain machines and not on others? It works fine on my personal laptop, but not on other work machines.
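If that is the suspicion, one workaround (a minimal sketch of my own, not something from the question) is to give each worker its own status file, so that no two workers ever write to the same connection:
# Sketch: per-worker status files keyed by the worker's process ID.
# SKUList, dt, the model-fitting code, and the registered parallel backend
# are assumed to be exactly as in the question.
parallel_df <- foreach(sku = SKUList, .packages = loadedNamespaces(), .combine = rbind) %dopar% {
  status_file <- paste0("Status_", Sys.getpid(), ".txt")   # hypothetical naming scheme
  writeLines(paste("Training Until: ", dt, "\nFitting for: ", sku, "\nKey",
                   match(sku, SKUList), "of", length(SKUList), "keys"),
             con = status_file)
  # Other things the loop does
}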

How to shut down an open R cluster connection using parallel

In the question here, the OP mentioned using kill to stop each individual process. Well, I wasn't aware that connections remain open if you push "stop" while running this in parallel in RStudio on Windows 10, and like a fool I tried to run the same thing 4-5 times, so now I have about 15 open connections on my poor 3-core machine eating up all of my CPU. I could restart R, but then I would have to recreate all of my unsaved objects, which would take a good hour, and I'd rather not waste the time. Likewise, the answers in the linked post are great, but all of them are about how to prevent the issue in the future, not how to actually solve it once you have it.
So I'm looking for something like:
# causes problem
lapply(c('doParallel','doSNOW'), library, character.only = TRUE)
n_c <- detectCores()-1
cl<- makeCluster(n_c)
registerDoSNOW(cl)
stop()
stopCluster(cl) #not reached
# so to close off the connections we want something like
a <- showConnections()
a$description %>% kill   # pseudocode - this is roughly what I'm after
The issue is very frustrating; any help would be appreciated.
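For what it's worth, a minimal sketch of the brute-force route is to signal the orphaned worker processes directly from the current session; the PIDs below are made-up placeholders that you would read off Task Manager or your process monitor:
# Sketch: kill orphaned R worker processes by PID without restarting R.
orphan_pids <- c(11111, 22222, 33333)   # hypothetical PIDs from the process monitor
tools::pskill(orphan_pids)

# Optionally also drop any stale R-side connections in this session.
closeAllConnections()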
Use
autoStopCluster <- function(cl) {
  stopifnot(inherits(cl, "cluster"))
  env <- new.env()
  env$cluster <- cl
  attr(cl, "gcMe") <- env
  reg.finalizer(env, function(e) {
    message("Finalizing cluster ...")
    message(capture.output(print(e$cluster)))
    try(parallel::stopCluster(e$cluster), silent = FALSE)
    message("Finalizing cluster ... done")
  })
  cl
}
and then set up your cluster as:
cl <- autoStopCluster(makeCluster(n_c))
Old cluster objects no longer reachable will then be automatically stopped when garbage collected. You can trigger the garbage collector by calling gc(). For example, if you call:
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
gc()
and watch your OS's process monitor, you'll see lots of workers being launched, but eventually, when the garbage collector runs, only the most recent set of cluster workers remains.
EDIT 2018-09-05: Added debug output messages to show when the registered finalizer runs, which happens when the garbage collector runs. Remove those message() lines and use silent = TRUE if you want it to be completely silent.

Future solutions

I am working with a large data set, which I use for certain calculations. Since it is a huge data set, the machine I am working on takes excessively long to do the job, so I decided to use the future package in order to distribute the work between several machines and speed up the calculations.
So, my problem is that through future (using PuTTY & SSH) I can connect to those machines (in parallel), but the work itself is done by the main one, without any distribution. Maybe you can advise a solution:
How to make it work on all machines;
Also, how to check whether the processes are actually working (I mean some function or anything that could help verify their functionality, if such a thing exists).
My code:
library(future)
workers <- c("000.000.0.000", "111.111.1.111")
plan(remote, envir = parent.frame(), workers= workers, myip = "222.222.2.22")
start <- proc.time()
cl <- makeClusterPSOCK(
  c("000.000.0.000", "111.111.1.111"), user = "...",
  rshcmd = c("plink", "-ssh", "-pw", "..."),
  rshopts = c("-i", "V:\\vbulavina\\privatekey.ppk"),
  homogeneous = FALSE)
setwd("V:/vbulavina/r/inversion")
a <- source("fun.r")
f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})
time_elapsed_parallel <- proc.time() - start
time_elapsed_parallel
The f and l objects are supposed to be computed in parallel, but the master machine is doing all the work, so I'm a bit confused about whether I can do something about it.
PS: I tried plan() with remote, multiprocess, multisession, and cluster - and nothing.
PS2: my local machine is Windows and I'm trying to connect to Kubuntu and Debian machines (the firewall is off on all of them).
Thanks in advance.
Author of future here. First, make sure you can set up the PSOCK cluster, i.e. connect to the two workers over SSH and run Rscript on them. You do this as:
library(future)
workers <- c("000.000.0.000", "111.111.1.111")
cl <- makeClusterPSOCK(workers, user = "...",
                       rshcmd = c("plink", "-ssh", "-pw", "..."),
                       rshopts = c("-i", "V:/vbulavina/privatekey.ppk"),
                       homogeneous = FALSE)
print(cl)
### socket cluster with 2 nodes on hosts '000.000.0.000', '111.111.1.111'
(If the above makeClusterPSOCK() stalls or doesn't work, add argument verbose = TRUE to get more info - feel free to report back here.)
Next, with the PSOCK cluster set up, tell the future system to parallelize over those two workers:
plan(cluster, workers = cl)
Test that futures are actually resolved remotely, e.g.
f <- future(Sys.info()[["nodename"]])
print(value(f))
### [1] "000.000.0.000"
I leave the remaining part, which also needs adjustments, for now - let's make sure to get the workers up and running first.
Continuing, using source() in parallel processing complicates things, especially when the parallelization is done on different machines. For instance, calling source("my_file.R") on another machine requires that the file my_file.R is available on that machine too. Even if it is, it also complicates things when it comes to the automatic identification of variables that need to be exported to the external machine. A safer approach is to incorporate all the code in the main script. Having said all this, you can try to replace:
f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})
with
futureSource <- function(file, envir = parent.frame(), ...) {
  expr <- parse(file)
  future(expr, substitute = FALSE, envir = envir, ...)
}
f <- futureSource("pasos.r")
l <- futureSource("pasos2.R")
As long as pasos.r and pasos2.R don't call source() internally, this could/should work.
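A short usage sketch of that, assuming the two files evaluate cleanly on the workers:
# Collect the results once the remote evaluations have finished.
v_pasos  <- value(f)   # blocks until pasos.r has finished on its worker
v_pasos2 <- value(l)   # likewise for pasos2.R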
BTW, what version of Windows are you on? Because with an up-to-date Windows 10, you have built-in support for SSH and you no longer need to use PuTTY.
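For instance, with the built-in OpenSSH client the PuTTY-specific arguments can be dropped; a sketch only - the key path is hypothetical, and the .ppk key would first need to be converted to OpenSSH format:
cl <- makeClusterPSOCK(workers, user = "...",
                       rshcmd = "ssh",                              # built-in OpenSSH client
                       rshopts = c("-i", "V:/vbulavina/id_rsa"),    # hypothetical OpenSSH key path
                       homogeneous = FALSE)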
UPDATE 2018-07-31: Continued the answer regarding the use of source() in futures.

parallel reads from Athena (AWS) database, via R

I've got a largish dataset in an Athena database on AWS. I'd like to read from it in parallel, and I'm accustomed to the foreach package's approach to forking from within R.
I'm using RJDBC.
Here's what I am trying:
out <- foreach(i = 1:length(fipsvec), .combine = rbind, .errorhandling = "remove") %dopar% {
  coni <- dbConnect(driver, "jdbc:awsathena://<<location>>/",
                    s3_staging_dir = "my_directory",
                    user = "...",
                    password = "...")
  print(paste0("starting ", i))
  sqlstring <- paste0("SELECT ",
                      "My_query_body",
                      fipsvec[i])
  row <- fetch(dbSendQuery(coni, sqlstring), -1, block = 999)
  print(i)
  dbDisconnect(coni)
  rm(coni)
  gc()
  return(row)
}
(Sorry I can't make this reproducible -- I obviously can't hand out the keys to the DB online.)
When I run this, the first c = (number of cores) tasks run fine, but then it hangs and does nothing, indefinitely as far as I can tell. htop shows no activity on any of the cores. When I change the loop to iterate over only c entries, the output is what I expect. When I change from parallel to serial (%do% instead of %dopar%), it also works fine.
Does this have something to do with the connection not being closed properly, or somehow being defined redundantly? I've placed the connection within the parallel loop, so each core should have its own connection in its own environment. But I don't know enough about databases to tell whether this is sufficiently distinct.
I'd appreciate answers that help me understand what's going on under the hood here -- it's all voodoo to me at this point.
Are you passing the RJDBC package (and its dependencies methods, DBI, and rJava) into the cluster anywhere?
If not, the first line of your code should look something like this:
results <- foreach(i = 1:length(fipsvec),
                   .combine = rbind,
                   .errorhandling = "remove",
                   .packages = c('methods', 'DBI', 'rJava', 'RJDBC')) %dopar% {
One thing that I suspect (but don't know) might make things a little hairier is that RJDBC uses a JVM to execute the queries. I'm not super knowledgeable about how rJava handles JVM initialization, whether each of the threads may be trying to re-use the same JVM simultaneously, or whether they have enough information about the external environment to properly initialize one in the first place.
Another troubleshooting step if the above doesn't work might be to move the assignment step for driver into the %dopar% environment.
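A sketch of that last suggestion; the Athena driver class name and jar path are placeholders (not taken from the question), and the registered foreach backend is assumed to be set up as in the question:
# Sketch: build the JDBC driver object inside each worker so every process
# initializes its own rJava/JVM-side driver.
out <- foreach(i = 1:length(fipsvec),
               .combine = rbind,
               .errorhandling = "remove",
               .packages = c('methods', 'DBI', 'rJava', 'RJDBC')) %dopar% {
  driver_i <- JDBC(driverClass = "com.example.AthenaDriver",   # placeholder class name
                   classPath = "/path/to/AthenaJDBC.jar")      # placeholder jar path
  coni <- dbConnect(driver_i, "jdbc:awsathena://<<location>>/",
                    s3_staging_dir = "my_directory",
                    user = "...", password = "...")
  row <- fetch(dbSendQuery(coni, paste0("SELECT ", "My_query_body", fipsvec[i])), -1)
  dbDisconnect(coni)
  row
}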
On another track, how many rows are in your result set? If the result set is in the million+ row range and can be returned with a single query, I actually came across an opportunity for optimization within the RJDBC package and have an open pull request on github ( https://github.com/s-u/RJDBC/pull/50 ) that I haven't heard anything on but have been using myself for a couple months. There's a basic benchmark documented in the pull request, I found the speedup to be substantial on the particular query I was running.
If it seems applicable you can install the branch with:
library(devtools)
devtools::install_github("msummersgill/RJDBC",ref = "harmonize", force = TRUE)

%dopar% parallel foreach loop fails to exit when called from inside a function (R)

I have written the following code (running in RStudio for Windows) to read a long list of very large text files into memory using a parallel foreach loop:
open.raw.txt <- function() {
  files <- choose.files(caption = "Select .txt files for import")
  cores <- detectCores() - 2
  registerDoParallel(cores)
  data <- foreach(file.temp = files[1:length(files)], .combine = cbind) %dopar%
    as.numeric(read.table(file.temp)[, 4])
  stopImplicitCluster()
  return(data)
}
Unfortunately, however, the function fails to complete, and debugging shows that it gets stuck at the foreach loop stage. Oddly, Windows Task Manager indicates that I am at close to full processor capacity (I have 32 cores, and this should use 30 of them) for around 10 seconds, then it drops back to baseline. However, the loop never completes, which suggests it does the work and then gets stuck.
Even more bizarrely, if I remove the 'function' bit and just run each step one-by-one as follows:
files <- choose.files(caption = "Select .txt files for import")
cores <- detectCores() - 2
registerDoParallel(cores)
data <- foreach(file.temp = files[1:length(files)], .combine = cbind) %dopar%
  as.numeric(read.table(file.temp)[, 4])
stopImplicitCluster()
Then it all works fine. What is going on?
Update: I ran the function and then left it for a while (around an hour), and it finally completed. I am not quite sure how to interpret this, given that multiple cores are only used for the first 10 seconds or so. Could the issue be with how the tasks are being shared out? Or maybe memory management? I'm new to parallelism, so I'm not sure how to investigate this.
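One way to investigate is a diagnostic sketch like the following; the explicit cluster, log file name, and cat() call are additions of mine, not part of the original code, and files is assumed to be the vector selected in the question:
library(doParallel)

# Sketch: back the loop with an explicit cluster whose worker output goes to a
# log file, so you can see when each task actually starts on a worker.
cl <- makeCluster(detectCores() - 2, outfile = "worker_log.txt")
registerDoParallel(cl)
data <- foreach(file.temp = files, .combine = cbind) %dopar% {
  cat(format(Sys.time()), "reading", file.temp, "\n")   # appears in worker_log.txt
  as.numeric(read.table(file.temp)[, 4])
}
stopCluster(cl)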
The problem is that you have multiple processes opening and closing the same file. Usually when a file is opened by one process it is locked to other processes, which prevents reading the file in parallel.
