Parallel R, multiple errors

I am running a foreach loop and am not sure how to interpret some strange behaviour. The functions have been tested individually; the loop roughly looks like this:
cl <- makeCluster(2, outfile = "")
registerDoParallel(cl)
outputs <- foreach(k = 1:x, .packages = "various packages") %dopar% {
  # do things (contains a for loop)
  # return stuff for output
}
stopCluster(cl)
I get these errors:
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning message:
closing unused connection 6 (<-computer:11531)
and
Error in serialize(data, node$con) : error writing to connection
The red dot in RStudio disappears and the console cursor is active again, suggesting the program has quit.
Now, I know these are pretty non-specific errors, but the really weird thing is that the cluster still runs after the errors. The processes spawn, they take up the expected amount of CPU and memory for as long as I would expect the job to take, but I get no return data.
I thought this strange behaviour might be useful to pin down the problem.
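One way to narrow this down (a sketch, not from the original post; it reuses the loop skeleton above) is to catch errors inside each task so that a failure comes back as a value instead of killing the worker's socket connection:
outputs <- foreach(k = 1:x, .packages = "various packages",
                   .errorhandling = "pass") %dopar% {
  tryCatch({
    # do things (contains a for loop)
    # return stuff for output
  }, error = function(e) {
    cat("task", k, "failed:", conditionMessage(e), "\n")  # visible because outfile = ""
    e  # the error object ends up in outputs[[k]]
  })
}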

Related

Can writing to a file while running foreach in R cause connection errors?

I'm fitting models to about 1000 different time series in parallel. The code seems to run on some machines but not on others. The error I'm getting (in some cases) is:
Error in { : task 135 failed - “cannot open the connection”
The task # differs from time to time. Here's a sample of my foreach code:
parallel_df <- foreach(sku = SKUList, .packages = loadedNamespaces(), .combine = rbind) %dopar% {
  # This prints out the status of the operation to a file
  # so that you can check how deep into the SKUList the code is.
  # See "Status.txt"
  writeLines(paste("Training Until: ", dt, "\nFitting for: ", sku, "\nKey", match(sku, SKUList),
                   "of", length(SKUList), "keys"), con = "Status.txt")
  # Other things the loop does
}
My question is: could the connection error be caused by this exact line (writeLines)? I'm guessing it could be, because that is the only line in the entire loop that uses an external connection (i.e., a text file to which the code writes the status).
If so, why does this cause problems on certain machines and not on others? It works fine on my personal laptop, but not on other work machines.
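If that line is indeed the culprit, one way to rule it out (a hedged sketch, not from the original question; the file naming is illustrative) is to give each worker its own status file, so that several workers never try to open Status.txt at the same time:
parallel_df <- foreach(sku = SKUList, .packages = loadedNamespaces(), .combine = rbind) %dopar% {
  # one status file per worker process avoids concurrent writes to the same connection
  status_file <- paste0("Status_", Sys.getpid(), ".txt")
  writeLines(paste("Fitting for:", sku, "- key", match(sku, SKUList), "of", length(SKUList)),
             con = status_file)
  # Other things the loop does
}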

R Parallelisation Error unserialize(socklist[[n]])

In a nutshell, I am trying to parallelise my whole script over dates using snow and adply, but I continually get the error below.
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
I have set up the parallelisation process in the following way:
Cores = detectCores(all.tests = FALSE, logical = TRUE)
cl = makeCluster(Cores, type="SOCK")
registerDoSNOW(cl)
clusterExport(cl, c("Var1","Var2","Var3","Var4"), envir = environment())
exposureDaily <- adply(.data = dateSeries, .margins = 1, .fun = MainCalcFunction,
                       .expand = TRUE, Var1, Var2, Var3, Var4,
                       .parallel = TRUE)
stopCluster(cl)
Where dateSeries might look something like
> dateSeries
marketDate
1 2016-04-22
2 2016-04-26
MainCalcFunction is a very long script with many of my own functions contained within it. As the script is so long, reproducing it wouldn't be practical, and a hypothetical small function would defeat the purpose, as I have already got this methodology to work with other, smaller functions. I can say that within MainCalcFunction I call all my libraries, necessary functions, and a file containing all other variables aside from those exported above, so that I don't have to export a long list of libraries and other objects.
MainCalcFunction can run successfully in its entirety over 2 dates using adply without parallelisation, which tells me that it is not a bug in the code that is causing the parallelisation to fail.
Initially I thought (from experience) that the parallelisation over dates was failing because there was another function within the code that utilised parallelisation, however I have subsequently rebuilt the whole code to make sure that there was no such function.
I have pored over the script with a fine-tooth comb to see if there was any place where I accidentally didn't export something that I needed, and I can't find anything.
Some ideas as to what could be causing the code to fail are:
The use of various option valuation functions in fOptions and rquantlib
The use of type sock
I am aware of this question already asked and also this question, and while the first question has helped me, it hasn't yet helped solve the problem. (Note: that may be because I haven't used it correctly, having mainly used loginfo("text") to track where the code is. Potentially, there is a way to change that so that I log warning and/or error messages instead?)
Please let me know if there is any other information I can provide to help in solving this. I would be so appreciative if someone could provide some guidance, as the code takes close to 40 minutes to run for a day and I need to run it for close to a year, therefore parallelisation is essential!
EDIT
I have tried to implement the suggestion in the first question included above by utilising the outfile option. Given I am using Windows, I have done this by including the following lines before exporting the key objects and running MainCalcFunction:
reportLogName <- paste("logout_parallel.txt", sep="")
addHandler(writeToFile,
file = paste(Save_directory,reportLogName, sep="" ),
level='DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting log file", getwd()))
mc<-detectCores()
cl<-makeCluster(mc, outfile="")
registerDoParallel(cl)
Similarly, at the beginning of MainCalcFunction, after having sourced my libraries and functions I have included the following to print to file:
reportLogName <- paste(testDate, "_logout.txt", sep = "")
addHandler(writeToFile,
           file = paste(Save_directory, reportLogName, sep = ""),
           level = 'DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting test function ", getwd(), sep = ""))
In the MainCalcFunction function I have then put loginfo("text") statements at key junctures to inform me of where the code is at.
This has resulted in some text files being available after the code fails with the aforementioned error. However, these text files provide no more information on the cause of the error, only on how far the code got. This is despite a tryCatch statement embedded in MainCalcFunction that, on any error, calls logerror(e).
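For reference, one hedged sketch of wrapping the call so that warnings as well as errors reach the log file (safeCalc is an illustrative name, not part of the original code; logwarn and logerror come from the same logging package used above):
safeCalc <- function(testDate, ...) {
  withCallingHandlers(
    tryCatch(
      MainCalcFunction(testDate, ...),
      error = function(e) {
        logerror(paste("Error on", testDate, ":", conditionMessage(e)))
        stop(e)  # re-throw so adply/foreach still sees the failure
      }
    ),
    warning = function(w) {
      logwarn(paste("Warning on", testDate, ":", conditionMessage(w)))
      invokeRestart("muffleWarning")
    }
  )
}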
I am posting this answer in case it helps anyone else with a similar problem in the future.
Essentially, the error unserialize(socklist[[n]]) doesn't tell you a lot, so solving it is a matter of narrowing down the issue.
Firstly, be absolutely sure the code runs over several dates, non-parallel, with no errors.
Ensure the parallelisation is set up correctly. There are some obvious initial errors that many other questions respond to, e.g., hidden parallelisation inside the code which means parallelisation is occurring twice.
Once you are sure that there is no problem with the code and the parallelisation is set up correctly, start narrowing down. The issue is likely (unless something has been missed above) something in the code which isn't a problem when it is run in serial, but becomes a problem when run in parallel. The easiest way to narrow down is by setting outfile = "Log.txt" in whichever makeCluster function you use, e.g., cl <- makeCluster(cores - 1, outfile = "Log.txt"). Then add print("Point in code") statements throughout your function to narrow down where the issue is occurring.
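A minimal sketch of that pattern, reusing the object names from the question (MainCalcFunction, dateSeries, Var1 to Var4) and assuming a doParallel backend:
library(doParallel)
cl <- makeCluster(detectCores() - 1, outfile = "Log.txt")  # worker output lands in Log.txt
registerDoParallel(cl)
results <- foreach(d = dateSeries$marketDate, .errorhandling = "pass") %dopar% {
  print(paste("starting", d))                              # written to Log.txt by the worker
  out <- MainCalcFunction(d, Var1, Var2, Var3, Var4)
  print(paste("finished", d))
  out
}
stopCluster(cl)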
In my case, the problem was the line jj = closeAllConnections(). This line works fine in non-parallel but breaks the code when in parallel. I suspect it has something to do with the function closing all connections including socket connections that are required for the parallelisation.
Try running using plain R instead of running in RStudio.

Tracing back a rare error

I'm running code which takes a very long time to compute. I made my code parallel using foreach() %dopar% and run it on a cluster.
It generally runs fine but sometimes crashes, and I get the following error:
Error in { : task 4 failed - "missing value where TRUE/FALSE needed"
Calls: %dopar% -> <Anonymous>
Execution halted
Now, it says Execution halted, but only for this particular core, so the others keep running; at the end it fails to produce output, but it doesn't tell me beforehand.
I guess it's a problem with an if statement. I tried simulating the code on my computer, but the error is so rare that I can't reproduce it.
The code easily runs for 100 hours, doing as many as 100,000 iterations, and only one of them will fail.
My questions are: Can I trace back where the error occurred? (I run the code on a cluster, so I don't have all the nice RStudio tools.)
Also, is it possible to still output from a foreach() loop even if one of the tasks crashed?
Or is there perhaps a method people use so that I can make the crash happen on my computer?
I can write out the code if needed; please ask if it helps.
The foreach ".errorhandling" argument is intended to help in this situation. If you want foreach to pass errors through, then use .errorhandling="pass". If you want it to filter out errors (which reduces the length of the result), then use .errorhandling="remove". The default value is "stop" which throws an error indicating which task failed.
Unfortunately, most parallel backends don't support tracebacks, but doMPI does. You simply call "startMPIcluster" with verbose=TRUE, and the traceback will be written to the log file of the worker that had the error. Here's an example that generates an error on task 42:
suppressMessages(library(doMPI))
cl <- startMPIcluster(4, verbose=TRUE)
registerDoMPI(cl)
g <- function(i) {
  if (i == 42) {
    if (NULL) cat('hello, world\n')   # deliberately triggers "argument is of length zero"
  }
  7
}
f <- function(i) g(i)
r <- foreach(i=1:50, .errorhandling='pass') %dopar% f(i)
print(r)
closeCluster(cl)
mpi.quit()
Since it uses .errorhandling="pass", the script runs to completion, with an error object returned in element 42 of the result list. In addition, one of the log files contains a traceback of the error (along with many other messages):
waiting for a taskchunk...
executing taskchunk 42 containing 1 tasks
error executing task: argument is of length zero
traceback (most recent call first):
> g(i)
> f(i)
> eval(expr, envir, enclos)
> eval(expr, envir)
> executeTask(taskchunk$argslist[[1]])
> executeTaskChunk(cl$workerid, taskchunk, envir, err, cores)
returning error results for taskchunk 42
Unfortunately, doMPI is mostly used on Linux systems, so this isn't helpful to most Mac and Windows users.
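For Windows and Mac users, a rough alternative (a sketch assuming a doParallel SOCK backend; g is an illustrative function, not from the answer above) is to combine .errorhandling='pass' with a tryCatch inside the task that prints which task failed before passing the error back:
library(doParallel)
cl <- makeCluster(2, outfile = "")  # so the workers' messages reach the console
registerDoParallel(cl)
g <- function(i) { if (i == 4) stop("boom") else 7 }
r <- foreach(i = 1:6, .errorhandling = "pass") %dopar% {
  tryCatch(g(i), error = function(e) {
    cat("task", i, "failed:", conditionMessage(e), "\n")  # printed via outfile = ""
    e                                                     # returned in r[[i]]
  })
}
stopCluster(cl)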

Timeout while reading csv file from url in R

I currently have a script in R with a for loop that runs around 2000 times; on each iteration it queries data from a database via a URL, using read.csv to put the data into a variable.
My problem is: when I query small amounts of data (around 10,000 rows) it takes around 12 seconds per iteration and it's fine. But now I need to query around 50,000 rows of data per iteration and the query time increases quite a lot, to 50 seconds or so. That is fine for me, but sometimes the server takes longer to send the data (≈75-90 seconds), and apparently the connection times out and I get these errors:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open: HTTP status was '0 (nil)'
or this one:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : InternetOpenUrl failed: 'The operation timed
out'
I don't get the same warning every time; it alternates between those two.
Now, what I want is to prevent my program from stopping when this happens, or simply to avoid the timeout by telling R to wait longer for the data. I have tried these settings at the start of my script as a possible solution, but it keeps happening.
options(timeout=190)
setInternet2(use=NA)
setInternet2(use=FALSE)
setInternet2(use=NA)
Any other suggestions or workarounds? Maybe skip to the next iteration when this happens, and store the loop numbers of the iterations where the error occurred, so that only those i's that were skipped due to the connection error get queried again at the end? The ideal solution would be, of course, to avoid having this error at all.
A solution using the RCurl package:
You can change the timeout option using
curlSetOpt(timeout = 200)
or by passing it into the call to getURL
getURL(url_vect[i], timeout = 200)
A solution using base R:
Simply download each file using download.file, and then worry about manipulating those files later.
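For illustration, a minimal sketch of that approach (url_vect and the "downloads" directory are assumed names, not from the original question):
dir.create("downloads", showWarnings = FALSE)
for (i in seq_along(url_vect)) {
  destfile <- file.path("downloads", paste0("query_", i, ".csv"))
  ok <- tryCatch({
    download.file(url_vect[i], destfile, quiet = TRUE)
    TRUE
  }, error = function(e) FALSE, warning = function(w) FALSE)
  if (!ok) message("Download ", i, " failed; it can be retried later")
}
# once the files are on disk, read them at leisure with read.csv()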
I see this is an older post, but it still comes up early in the list of Google results, so...
If you are downloading via WinInet (rather than the curl, internal, or wget options), settings, including the timeout, are inherited from the system. Thus, you cannot set the timeout in R. You must change the Internet Explorer settings. See these Microsoft references for details:
https://support.microsoft.com/en-us/kb/181050
https://support.microsoft.com/en-us/kb/193625
This is partial code that I show you, but you can modify it to your needs:
# connect to website; wrap the call so a failed request does not stop the loop
webpage <- tryCatch(
  getURL(url_vect[i]),
  error = function(e) {
    message("Request ", i, " failed: ", conditionMessage(e))
    NA  # placeholder so the loop can carry on
  }
)
In my case, url_vect[i] was one of the URLs I was copying information from. Sadly, this will increase the time you need to wait for the program to finish.
UPDATED
tryCatch how to example
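Building on the tryCatch idea, here is a rough sketch of the bookkeeping the question asks about, i.e. recording which iterations failed so that only those are retried at the end (failed_i is an illustrative name, not from the original code):
failed_i <- integer(0)
for (i in seq_along(url_vect)) {
  res <- tryCatch(read.csv(url_vect[i]), error = function(e) NULL)
  if (is.null(res)) {
    failed_i <- c(failed_i, i)  # remember the failure and move on
    next
  }
  # ... process res ...
}
# second pass over only the failed queries
for (i in failed_i) {
  Sys.sleep(10)                 # give the server a moment before retrying
  res <- tryCatch(read.csv(url_vect[i]), error = function(e) NULL)
  # ... process res, or log i again if it still fails ...
}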

Results of workers not returned properly - snow - debug

I'm using the snow package in R to execute a function on a SOCK cluster with multiple (3) machines running Linux. I tried to run the code with both parLapply and clusterApply.
In case of any error at the worker level, the results of the worker nodes are not returned properly to the master, making it very hard to debug. I'm currently logging every heartbeat of the worker nodes independently using futile.logger. It seems as if the results are properly computed. But when I try to print the result at the master node (after receiving the output from the workers) I get an error which says: Error in checkForRemoteErrors(val): 8 nodes produced errors; first error: missing value where TRUE/FALSE needed.
Is there any way to debug the results of the workers more deeply?
The checkForRemoteErrors function is called by parLapply and clusterApply to check for task errors, and it will throw an error if any of the tasks failed. Unfortunately, although it displays the error message, it doesn't provide any information about what worker code caused the error. But if you modify your worker/task function to catch errors, you can retain some extra information that may be helpful in determining where the error occurred.
For example, here's a simple snow program that fails. Note that it uses outfile='' when creating the cluster so that output from the program is displayed, which by itself is a very useful debugging technique:
library(snow)
cl <- makeSOCKcluster(2, outfile='')
problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}
r <- parLapply(cl, 1:2, problem)
When you execute this, you see the error message from checkForRemoteErrors and some other messages, but nothing that tells you that the if statement caused the error. To catch errors when calling problem, we define workerfun:
workerfun <- function(i) {
  tryCatch({
    problem(i)
  },
  error = function(e) {
    print(e)
    stop(e)
  })
}
Now we execute workerfun with parLapply instead of problem, first exporting problem to the workers:
clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, workerfun)
Among the other messages, we now see
<simpleError in if (NA) j <- 999 else j <- i: missing value where TRUE/FALSE needed>
which includes the actual if statement that generated the error. Of course, it doesn't tell you the file name and line number of the expression, but it's often enough to let you solve the problem.
Check the range of your observations and how they vary. I have noticed that when there are lots of decimal places (4, 5, 6), it throws glm.nb off. To solve this, I just round the observations to 2 decimal places.

Resources