I am trying to transfer a large object from one R session to another using a socketConnection.
With the following I can establish the connection between the two sessions and see that it works for small messages:
socket <- serverSocket(11927)
r <- callr::r_session$new()
r$call(function() {
assign(
"conection",
socketConnection(port = 11927, open = "wb", blocking = TRUE, server = FALSE),
envir = .GlobalEnv
)
NULL
})
con <- socketAccept(socket, open = "wb", blocking = TRUE)
close(socket)
r$read()
serialize("hello world", con)
r$run(function() {
serialize(paste("hello from there: ", unserialize(conection)), conection)
})
unserialize(con)
Now if I try to serialize a large value, for example:
r$run(function() {
x <- runif(256*256*3)
serialize(x, conection)
TRUE
})
The serialization never finishes. It's worth noting that this works as expected on Linux. I didn't try on Windows.
I think this should work, because the parallel package, which also uses socket connections to transfer objects, works as expected and I can transfer large objects pretty quickly with it. For example:
cl <- parallel::makePSOCKcluster(1)
parallel::clusterEvalQ(cl, {
get_batch <- function() runif(256*256*3)
})
out <- parallel::clusterCall(cl, "get_batch")
Any idea on what could be causing this behavior on macOS?
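Not an answer to why serialize() stalls on macOS, but one workaround sketch is to serialize to a raw vector first and push it over the connection length-prefixed with writeBin()/readBin() (send_object() and receive_object() below are made-up helper names):
# Hypothetical helpers: serialize to a raw vector, send a 4-byte length prefix, then the payload.
send_object <- function(obj, con) {
  payload <- serialize(obj, connection = NULL)     # raw vector
  writeBin(length(payload), con, endian = "big")   # length prefix
  writeBin(payload, con)
  flush(con)
}

receive_object <- function(con) {
  n <- readBin(con, what = "integer", n = 1L, endian = "big")
  payload <- raw(0)
  while (length(payload) < n) {                    # read until the full payload has arrived
    payload <- c(payload, readBin(con, what = "raw", n = n - length(payload)))
  }
  unserialize(payload)
}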
I am using odbc in R to get data from SQL Server.
Recently I had an issue: for some unknown reason, my query may take hours to return results from the SQL Server, although it was fine before. The returned data is only 10,000 rows, and my data team colleagues haven't figured out the reason. My old code was:
getSqlData = function(server, sqlstr){
con = odbc::dbConnect(odbc(),
Driver = "SQL Server",
Server = server,
Trusted_Connection = "True")
result = odbc::dbGetQuery(con, sqlstr)
dbDisconnect(con)
return(result)
}
At first, I was trying to find a timeout parameter for dbGetQuery(). Unfortunately, there is no such parameter for this function, so I decided to monitor the runtime myself.
getSqlData = function(server, sqlstr){
  con = odbc::dbConnect(odbc(),
                        Driver = "SQL Server",
                        Server = server,
                        Trusted_Connection = "True")
  result = tryCatch(
    {
      # withTimeout() is provided by the R.utils package
      a = R.utils::withTimeout(odbc::dbGetQuery(con, sqlstr),
                               timeout = 600, onTimeout = "error")
      return(a)
    },
    error = function(cond) {
      msg <- "The query timed out:"
      msg <- paste(msg, sqlstr, sep = " ")
      error(logger, msg)  # 'logger' is a log4r-style logger defined elsewhere
      return(NULL)
    },
    finally = {
      dbDisconnect(con)
    }
  )
  return(result)
}
This forces the function to stop if dbGetQuery() doesn't finish in 10 minutes. However, I get this warning message:
In connection_release(conn#ptr) : There is a result object still in use.
The connection will be automatically released when it is closed
My understanding is this means the query is still running and the connection is not closed.
Is there a way to force the connection to be closed and force the query to stop?
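One pattern that at least avoids leaving a result open on disconnect (a sketch using the standard DBI interface that odbc implements; it does not interrupt a query already running on the server, and getSqlDataExplicit is just an illustrative name) is to manage the result explicitly with dbSendQuery()/dbFetch()/dbClearResult():
library(DBI)

getSqlDataExplicit <- function(server, sqlstr) {
  con <- dbConnect(odbc::odbc(),
                   Driver = "SQL Server",
                   Server = server,
                   Trusted_Connection = "True")
  res <- dbSendQuery(con, sqlstr)
  out <- tryCatch(dbFetch(res), error = function(cond) NULL)
  dbClearResult(res)   # explicitly release the result before disconnecting
  dbDisconnect(con)
  out
}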
The other thing I notice is that even if I set timeout = 1, it does not raise the error after 1 second; it runs for around 1 minute and then raises the error. Does anyone know why it behaves like this?
Thank you.
I am running some code and it is really important for me to catch the error and save it for later, but not include it in my final result of the foreach. I have used tryCatch and even tried forcing an error using stop(). Here is a snippet of my code:
## error handling options: "stop", "remove" or "pass"
error_handle <- "remove"

cores <- round(detectCores() * percent)   # 'percent', 'id', 'func', etc. are defined elsewhere
cl <- makeCluster(cores)
registerDoParallel(cl)

predict_load_all <- foreach(i = 1:length(id), .export = func,
                            .packages = (.packages()),
                            .errorhandling = error_handle) %dopar% {
  possibleError <- tryCatch({
    weather_cast <- data.frame(udata$date, j, coeff_i,
                               predict(hour_fits[[j]], newdata = udata))
  }, error = function(e) return(paste0("The hour '", j, "'",
                                       " caused the error: '", e, "'")))

  if (!exists("weather_cast")) {
    # possibleError <- data.frame('coeff_item' = coeff_i, 'Error' = possibleError)
    possibleError <- data.frame('Error' = possibleError)
    write_csv(possibleError,
              file.path(path_predict, 'Error_weather_cast.csv'), append = TRUE)
    stop('error')
  }

  colnames(weather_cast) <- c("Date", "Hour", "coeff_item", "Predicted_Load")
  ifelse(j == 1,
         predict_load <- weather_cast,
         predict_load <- rbind(predict_load, weather_cast))
  predict_load <- spread(predict_load, Hour, Predicted_Load)
  predict_load
}
I am running the foreach to produce predict_load_all. possibleError is the error which needs to be saved, and it is bound by a tryCatch. The idea is that when an error occurs (so the exists() check on weather_cast fails), the error is written to a file, and then stop() induces an error which is ignored by the 'remove' setting of .errorhandling, so that iteration of the foreach is skipped. This way, I get the error saved and a result list without the errors.
This doesn't seem to be saving an error file.
Any ideas?
In my experience (and I have a lot more "learning experiences" with parallel processing than actual success stories, unfortunately), this is not an easy task. The best solution I found was to add outfile = '' to the makeCluster command. For me, in a non-interactive mode, this caused (some of?) the error messages to get written to the output file instead of being discarded completely.
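A minimal sketch of that pattern (the log file name and the toy loop are illustrations only):
library(doParallel)

cl <- makeCluster(2, outfile = "cluster_log.txt")  # workers' stdout/stderr go to this file
registerDoParallel(cl)

res <- foreach(i = 1:4, .errorhandling = "remove") %dopar% {
  message("starting iteration ", i)                # shows up in cluster_log.txt
  if (i == 3) stop("simulated failure in iteration ", i)
  i^2
}

stopCluster(cl)
res  # the failed iteration is simply dropped by .errorhandling = "remove"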
I have a problem with the future package. In my task I try to set up an asynchronous process using futures. If I run my script for the first time (in a clean R session), everything works fine and as expected. Running the same function a second time, within the same R session, ends up in an endless wait. The execution stops at the line where the futures are started. No errors are thrown; the code just runs forever. If I interrupt the code manually, the browser is called from the line:
Sys.sleep(interval).
Interrupting a little bit earlier, the call is made from:
Called from: socketSelect(list(con), write = FALSE, timeout = timeout).
I have written a small program which has basically the same structure as my script, and the same problem occurs. While not obvious in this little example, this structure has some advantages in my original code:
library(future)
library(parallel)
asynchronousfunction <- function() {
  Threads.2.start <- availableCores()
  cl <- parallel::makePSOCKcluster(Threads.2.start)
  plan(cluster, workers = cl)

  threads <- lapply(1:Threads.2.start, function(index) {
    future::cluster({ Sys.getpid() }, persistent = TRUE, workers = cl[[index]])
  })

  while (!any(resolved(threads))) {
    Sys.sleep(0.1)
  }

  threads <- lapply(1:Threads.2.start, function(index) {
    future::cluster({ Sys.getpid() }, persistent = TRUE, workers = cl[[index]])
  })

  stopCluster(cl = cl)
}

asynchronousfunction()  # First call to the function: everything works fine.
asynchronousfunction()  # Second call to the function: endless execution.
I am working on Windows 10, with R version 3.4.2 and future package version 1.6.2.
I hope you guys can help me.
Thanks in advance.
Best regards,
Harvard
Author of future here. It looks like you've tried to overdo it a bit and I am not 100% sure what you're trying to achieve. Things that look suspicious to me are your use of:
cluster() - call future() instead.
cluster(..., workers = cl[[index]]) - don't specify workers when you set up a future.
Is there a reason why you want to use persistent = TRUE?
resolve(threads) basically does the same as your while() loop.
You are not collecting the values of the futures, i.e. you're not calling value() or values().
For troubleshooting, you can get more details on what's going on under the hood by setting options(future.debug = TRUE).
If I rewrite your example staying as close as possible to what you have now, a working version would look like this:
library("future")
asynchronousfunction <- function() {
  n <- availableCores()
  cl <- makeClusterPSOCK(n)
  plan(cluster, workers = cl)

  fs <- lapply(1:n, function(index) {
    future({ Sys.getpid() }, persistent = TRUE)
  })

  ## Can be replaced by resolve(fs)
  while (!any(resolved(fs))) {
    Sys.sleep(0.1)
  }

  fs <- lapply(1:n, function(index) {
    future({ Sys.getpid() }, persistent = TRUE)
  })

  parallel::stopCluster(cl = cl)
}
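To also collect the results (the value() point above), the futures can be resolved and their values gathered; a small self-contained illustration:
library(future)

plan(multisession, workers = 2)

fs <- lapply(1:2, function(index) future(Sys.getpid()))
resolve(fs)                # waits for all futures; replaces the while() loop
pids <- lapply(fs, value)  # collect the values
str(pids)

plan(sequential)           # shut the background workers down again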
Instead of rolling your own lapply() + future(), would it be sufficient for you to use future_lapply()? For example,
asynchronousfunction <- function() {
  n <- availableCores()
  cl <- makeClusterPSOCK(n)
  plan(cluster, workers = cl)

  pids <- future_lapply(1:n, function(ii) {
    Sys.getpid()
  })
  str(pids)

  pids <- future_lapply(1:n, function(ii) {
    Sys.getpid()
  })
  str(pids)

  parallel::stopCluster(cl = cl)
}
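Note that depending on the package versions in use, future_lapply() is exported from the future.apply package rather than from future itself; a minimal sketch assuming that layout:
library(future)
library(future.apply)  # provides future_lapply() in recent releases

plan(multisession, workers = 2)
pids <- future_lapply(1:2, function(ii) Sys.getpid())
str(pids)
plan(sequential)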
I stumbled upon an error:
> getBDLsearch("czas")
Error in file(con, "r") : cannot open the connection
...so I started pulling the function apart to find where the problem is. It's very simple, so I'll just paste it:
require(htmltools)
getBDLsearch <- function(query = "", debug = 0, raw = FALSE) {
url <- paste0('https://api.mojepanstwo.pl/bdl/search?q=', htmlEscape(query))
if (raw) {
document <- jsonlite::fromJSON(txt = url,simplifyVector=FALSE)
return(document)
}
else {
document <- jsonlite::fromJSON(txt = url,simplifyDataFrame=TRUE)
return(document)
}
}
( https://github.com/pbiecek/SmarterPoland )
The thing is, when I run the individual lines manually, it works like a charm and the variable "document" gets filled in nicely. I'm curious: why is that?
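One way to narrow this down (a workaround sketch, not an explanation of why the direct call fails here) is to download the response text explicitly and hand the string to jsonlite::fromJSON():
library(htmltools)
library(jsonlite)

url <- paste0("https://api.mojepanstwo.pl/bdl/search?q=", htmlEscape("czas"))

# Fetch the body first, then parse the string; if this step fails,
# the problem is the connection/download rather than the JSON parsing.
txt <- paste(readLines(url, warn = FALSE), collapse = "\n")
document <- fromJSON(txt, simplifyDataFrame = TRUE)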
I'm using RPostgreSQL and sqldf inside my function like this:
MyFunction <- function(Connection) {
options(sqldf.RPostgreSQL.user = Connection[1],
sqldf.RPostgreSQL.password = Connection[2],
sqldf.RPostgreSQL.dbname = Connection[3],
sqldf.RPostgreSQL.host = Connection[4],
sqldf.RPostgreSQL.port = Connection[5])
# ... some sqldf() stuff
}
How do I test that the connection is valid?
You can check that an existing connection is valid using isPostgresqlIdCurrent.
conn <- dbConnect("RPgSQL", your_database_details)
isPostgresqlIdCurrent(conn)
For testing new connections, I don't think that there is a way to know if a connection is valid without trying it. (How would R know that the database exists and is available until it tries to connect?)
For most analysis purposes, just stopping on an error and fixing the login details is the best approach. So just call dbConnect and don't worry about extra check functions.
If you are creating some kind of application where you need to handle errors gracefully, a simple tryCatch wrapper should do the trick.
conn <- tryCatch(dbConnect(wherever), error = function(e) do_something)
My current design uses tryCatch:
Connection <- c('usr','secret','db','host','5432')
CheckDatabase <- function(Connection) {
require(sqldf)
require(RPostgreSQL)
options(sqldf.RPostgreSQL.user = Connection[1],
sqldf.RPostgreSQL.password = Connection[2],
sqldf.RPostgreSQL.dbname = Connection[3],
sqldf.RPostgreSQL.host = Connection[4],
sqldf.RPostgreSQL.port = Connection[5])
out <- tryCatch(
{
sqldf("select TRUE;")
},
error=function(cond) {
out <- FALSE
}
)
return(out)
}
if (!CheckDatabase(Connection)) {
stop("Not valid PostgreSQL connection.")
} else {
message("PostgreSQL connection is valid.")
}
One approach is to simply try executing the code and catch any errors with a nice informative error message. Have a look at the documentation of tryCatch to see the details of how this works.
The following blog post provides an introduction to the exception-based style of programming.
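A minimal sketch of that approach with RPostgreSQL (the connection details are placeholders):
library(DBI)
library(RPostgreSQL)

con <- tryCatch(
  dbConnect(PostgreSQL(),
            user = "usr", password = "secret",
            dbname = "db", host = "host", port = 5432),
  error = function(e) {
    message("Could not connect: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(con)) {
  message("PostgreSQL connection is valid.")
  dbDisconnect(con)
}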