Database connection intermittently fails when using %dopar% in R

I am trying to access an SQL Server database from R and need to parallelise the process for higher throughput using doSNOW. When setting up the cluster, I first initialise a connection on each worker, but for some of the workers in the cluster the database connection fails without explanation.
library(doSNOW)

cl <- makeCluster(10)
registerDoSNOW(cl)
clusterEvalQ(cl, {
  library(RODBC)
  # "<connection string>" stands for the actual ODBC connection string
  dbhandle <- odbcDriverConnect("<connection string>")
})
This code prints a list of the connections and whilst some have been successfully initialised, others have failed (returned -1). This happens randomly and different connections fail each time the code is run.
[[1]]
[1] -1
[[2]]
RODBC Connection 1
Details:
case=nochange
DRIVER=SQL Server
SERVER=redacted
UID=
Trusted_Connection=Yes
WSID=redacted
DATABASE=redacted
[[3]]
[1] -1
[[4]]
RODBC Connection 1
Details:
case=nochange
DRIVER=SQL Server
SERVER=redacted
UID=
Trusted_Connection=Yes
WSID=redacted
DATABASE=redacted

As per the comments, adding Sys.sleep(Sys.getpid()/1000) fixes the problem:
clusterEvalQ(cl, {
  Sys.sleep(Sys.getpid() / 1000)
  library(RODBC)
  dbhandle <- odbcDriverConnect("<connection string>")
})
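If staggering the workers by PID feels fragile, a retry loop on each worker is another option. This is only a rough sketch, not tested against the original setup: "<connection string>" remains a placeholder, the attempt count and back-off range are arbitrary, and success is detected by checking whether odbcDriverConnect() returned an "RODBC" handle rather than -1.
clusterEvalQ(cl, {
  library(RODBC)
  dbhandle <- -1
  for (attempt in 1:5) {
    dbhandle <- odbcDriverConnect("<connection string>")
    if (inherits(dbhandle, "RODBC")) break   # success: a valid RODBC handle
    Sys.sleep(runif(1, 0.5, 2))              # random back-off before retrying
  }
  dbhandle
})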

Related

R times out when accessing POSTGRESQL db

I'm using the RPostgreSQL package to access a database, and R times out and crashes every time I connect ("The previous R session was terminated due to an unexpected crash"). The terminal says 'abrt-cli status' timed out.
I've tried accessing the same db through SSH and it works. Our sysadmin thinks the credentials are correct, and says R seems to be working fine for him (although he's doing completely different work that has nothing to do with accessing postgresql databases). I've tried naming the server rather than using 'localhost', just in case, and that doesn't work either. But even if the credentials were wrong, I feel like I'd get an error rather than a crash.
I realize this might have to do with our local configuration, but I'm at the end of my rope. I'd be extremely grateful for any ideas people have.
# Load DBI and the RPostgreSQL driver
knitr::opts_chunk$set(echo = TRUE)
library(DBI)
library(RPostgreSQL)
library(tidycensus)
# The user needs to input their password
input_password <- rstudioapi::askForPassword("Database password")
db <- [CENSORED, I PROMISE IT'S CORRECT :)]
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'leviem'
db_password <- input_password
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv,
                  dbname = db,
                  host = host_db,
                  port = db_port,
                  user = db_user,
                  password = db_password)
# List all the tables available
dbListTables(conn)
# Close the connection
dbDisconnect(conn)
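One way to narrow this down would be to test the same credentials with DBI::dbCanConnect() and the newer RPostgres driver, adding a short libpq connect_timeout so a bad connection fails quickly with an error instead of hanging. This is only a sketch: the switch to RPostgres and the connect_timeout pass-through are assumptions, and the variables are the ones defined above.
# Rough sketch: test the same credentials with the RPostgres driver and a
# short libpq connect_timeout so a hang becomes an error instead of a crash.
library(DBI)

ok <- dbCanConnect(RPostgres::Postgres(),
                   dbname          = db,
                   host            = host_db,
                   port            = db_port,
                   user            = db_user,
                   password        = db_password,
                   connect_timeout = 5)   # seconds, passed through to libpq
print(ok)   # FALSE if the connection cannot be established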

Translate Python MySQL ssh port forwarding solution to R (dbplyr)

I'm trying to query a MySQL server through an R/Tidyverse/dbplyr workflow. My MySQL access requires configuring SSH and port forwarding.
I have this code working in Python (below), but I'm struggling to get started with the SSH/port forwarding equivalent in R. Any pointers to solutions or equivalent R packages are appreciated. Thanks.
import pymysql
import paramiko
import pandas as pd
from paramiko import SSHClient
from sshtunnel import SSHTunnelForwarder
from os.path import expanduser
pkeyfilepath = '/.ssh/id_ed25519'
home = expanduser('~')
mypkey = paramiko.Ed25519Key.from_private_key_file(home + pkeyfilepath)
sql_hostname = 'mycompany.com'
sql_username = 'me'
sql_password = '***'
sql_main_database = 'my_db'
sql_port = 3306
ssh_host = 'jumphost.mycompany.com'
ssh_user = 'me'
ssh_port = 22
with SSHTunnelForwarder(
        (ssh_host, ssh_port),
        ssh_username=ssh_user,
        ssh_pkey=mypkey,
        remote_bind_address=(sql_hostname, sql_port)) as tunnel:
    conn = pymysql.connect(host='127.0.0.1', user=sql_username,
                           passwd=sql_password, db=sql_main_database,
                           port=tunnel.local_bind_port)
    query = '''SELECT VERSION();'''
    data = pd.read_sql_query(query, conn)
    print(data)
    conn.close()
There are several ways to do ssh port forwarding for R. In no particular order:
I do the port forwarding outside of R entirely. All of my work is remote, and for one particular client I need access to various instances of SQL Server, Redis, MongoDB, remote filesystems, and a tunnel-hop to another network only accessible from the ssh bastion host. I tend to work in more than just R, so it's important to me to generalise this. It is not for everybody or every task.
For this, I use a mishmash of autossh and my ssh-agent (in KeePass/KeeAgent).
The ssh package has a function to create a tunnel. The premise is that you have already created a "session" to which you can add one or more forwarding rules. ssh::ssh_tunnel is blocking, meaning you cannot call it in the same R process and continue to work. Demo:
# R session 1
sess <- ssh::ssh_connect("user@remote")
# insert passphrase
ssh::ssh_tunnel(sess, 21433, "otherremote:1433")
# / Waiting for connection on port 21433...
# R session 2
con <- DBI::dbConnect(..., port=21433)
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
This connection will stay alive so long as con is not closed and the remote end does not close it (e.g., activity timeout).
Note: I cannot get the ssh package to use my ssh-agent, so all passwords must be typed in or otherwise passed in not-ideal ways. There are many ways to not have to type it, such as using the keyring package (secure) or envvars, both of which would pass the password to ssh_connect(..., passwd=<>).
The same as above, but using callr so that you don't need to keep an explicit session active in your working R process (though you will still have another R session running in the background).
bgr <- callr::r_bg(function() {
  ssh <- ssh::ssh_connect("r2@remote", passwd = keyring::key_get("r2", "remote"))
  ssh::ssh_tunnel(ssh, port = 21433, "otherremote:1433")
}, supervise = TRUE)
con <- DBI::dbConnect(..., port = 21433)   # as in the first demo
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
### when your work is done
bgr$kill()
If you do this, I strongly recommend supervise=TRUE, which ensures the background R process is killed when this (primary) R session exits. This reduces the risk of phantom unused R sessions hanging around: besides clogging up the process tree, if one of these phantom processes is still actively forwarding a port, nothing else can forward that port and subsequent attempts to tunnel will fail. This approach lets you continue working, but you are no longer directly in control of the process doing the forwarding.
FYI, I generally prefer keyring::key_get("r2", "remote") for password management in situations like this: (1) it prevents me from having to set that envvar each time I start R, which would inadvertently store the plain-string password in ~/.Rhistory if the history is saved; (2) it prevents me from having to set that envvar permanently in the global environment, which is prone to other stupid mistakes I make; and (3) the password is much better protected, since it sits in the native credential store of your OS. Having said that, you can replace the use of keyring::key_get(..) above with Sys.getenv("mypass") in a pinch, or when the code runs on a headless system where a credential manager is unavailable.
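For completeness, the keyring entry used above is created once, interactively; the service/username pair "r2"/"remote" is just the naming convention used in this answer.
# One-time, interactive: store the password in the OS credential store under
# the same service/username pair that key_get("r2", "remote") looks up later.
keyring::key_set(service = "r2", username = "remote")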
And if you want this to be a little more resilient to timeout disconnects, you can instead use
bgr <- callr::r_bg(function() {
  ssh <- ssh::ssh_connect("r2@remote", passwd = keyring::key_get("r2", "remote"))
  while (!inherits(try(ssh::ssh_tunnel(ssh, port = 21433, "otherremote:1433"), silent = TRUE), "try-error")) Sys.sleep(1)
}, supervise = TRUE)
which will repeatedly make the tunnel so long as the attempt does not error. You may need to experiment with this to get it "perfect".
callr is really just using processx under the hood to start a background R process and allow you to continue working. If you don't want the "weight" of another R process solely to forward ports, you can use processx to start an explicit call to ssh that does everything you need it to do.
proc <- processx::process$new("ssh", c("-L", "21433:otherremote:1433", "r2@remote", "-N"))
### prompts for password
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
### when done
proc$kill()
# [1] TRUE
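To tie this back to the original MySQL/dbplyr question: once any of the tunnels above is up, the DBI connection just points at the forwarded local port. A rough sketch, assuming a tunnel such as ssh -L 3307:mycompany.com:3306 me@jumphost.mycompany.com -N is already running, that the RMariaDB package is used as the backend, and that dbplyr is installed; "some_table" is a hypothetical table name.
library(DBI)
library(dplyr)

# Connect to the MySQL server through the locally forwarded port.
con <- dbConnect(RMariaDB::MariaDB(),
                 host     = "127.0.0.1",
                 port     = 3307,          # local end of the tunnel
                 username = "me",
                 password = "***",         # your MySQL password
                 dbname   = "my_db")

dbGetQuery(con, "SELECT VERSION();")

# dbplyr workflow: reference a table lazily and let dplyr generate the SQL.
tbl(con, "some_table") %>%
  summarise(n = n())

dbDisconnect(con)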

R: postgres connection keeps timing out or breaking

I am running R code which connects to a PostgreSQL database. The connection is defined outside the loop, but it times out and keeps breaking. If I instead put the connection inside the loop and kill it after each use, we hit the limit on the number of connections.
Additionally, when we run the R code in a loop and store the outputs in the database, it works for the first 15 minutes, but then the connection breaks saying it cannot connect.
I get the following errors:
RS-DBI driver: (could not connect ------ on dbname "abc": could not connect to server: Connection timed out (0x0000274C/10060)
    Is the server running on host "123.456.567.890" and accepting
    TCP/IP connections on port 5432?
)Error in diagnosticTestsPg(project_path, modelbank, modelproduct, modelwaterfall,  :
  object 'conn' not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Here, conn is the connection to the database
Is there a way to fix this, or a workaround to keep the connection in place for as long as the loop runs?
id <- tryCatch(
  withCallingHandlers(
    id <- f(),
    error = function(e) {
      write.to.log(sys.calls())
    },
    warning = function(w) {
      write.to.log(sys.calls())
      invokeRestart("muffleWarning")
    }
  ),
  error = function(e) { print("recovered from error") }
)
Where f() has the db connection details
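One pattern that avoids both extremes (a single long-lived connection that times out vs. a new connection every iteration) is to check the handle at the top of each iteration and reconnect only when it has gone stale. A rough sketch, assuming a DBI-based driver such as RPostgres; get_conn(), the credentials, and n_iterations are placeholders, not part of the original code.
library(DBI)

# Placeholder helper: reuse the connection while it is valid, else reconnect.
get_conn <- function(conn = NULL) {
  if (!is.null(conn) && dbIsValid(conn)) {
    return(conn)
  }
  dbConnect(RPostgres::Postgres(),
            dbname = "abc", host = "123.456.567.890", port = 5432,
            user = "user", password = Sys.getenv("PGPASSWORD"))
}

conn <- NULL
for (i in seq_len(n_iterations)) {   # n_iterations: however long the loop runs
  conn <- get_conn(conn)             # reconnect only if the old handle is dead
  # ... run f() / write results using conn ...
}
if (!is.null(conn)) dbDisconnect(conn)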

How to check if stopCluster (R) worked

When I try to remove a cluster from my workspace with stopCluster, it does not seem to work. Below is the code I am using.
> cl <- makeCluster(3)
> cl
socket cluster with 3 nodes on host ‘localhost’
> stopCluster(cl)
> cl
socket cluster with 3 nodes on host ‘localhost’
Note that cl still prints as a socket cluster with 3 nodes after I have supposedly removed it. Shouldn't I get an error that object cl is not found? How do I know that my cluster has actually been removed? A related question: if I close R, is the cluster terminated and my computer returned to its normal state, able to use all of its cores again?
You shouldn't get an error that cl is not found, until you run rm(cl). Stopping a cluster doesn't remove the object from your environment.
Use showConnections to see that no connections are active:
> require(parallel)
Loading required package: parallel
> cl <- makeCluster(3)
> cl
socket cluster with 3 nodes on host ‘localhost’
> showConnections()
description class mode text isopen can read can write
3 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes" "yes"
4 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes" "yes"
5 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes" "yes"
> stopCluster(cl)
> showConnections()
description class mode text isopen can read can write
>
Whether or not your computer is "returned to its normal state" depends on the type of cluster you create. If it's just a simple socket or fork cluster, then gracefully stopping the parent process should cause all the child processes to terminate. If it's a more complicated cluster, it's possible terminating R will not stop all the jobs it started on the nodes.
Unfortunately, the print.SOCKcluster method doesn't tell you if the cluster object is usable. However, you can find out if it's usable by printing the elements of the cluster object, thus using the print.SOCKnode method. For example:
> library(parallel)
> cl <- makeCluster(3)
> for (node in cl) try(print(node))
node of a socket cluster on host ‘localhost’ with pid 29607
node of a socket cluster on host ‘localhost’ with pid 29615
node of a socket cluster on host ‘localhost’ with pid 29623
> stopCluster(cl)
> for (node in cl) try(print(node))
Error in summary.connection(connection) : invalid connection
Error in summary.connection(connection) : invalid connection
Error in summary.connection(connection) : invalid connection
Note that print.SOCKnode actually sends a message via the socket connection in order to get the process ID of the corresponding worker, as seen in the source code:
> parallel:::print.SOCKnode
function (x, ...)
{
    sendCall(x, eval, list(quote(Sys.getpid())))
    pid <- recvResult(x)
    msg <- gettextf("node of a socket cluster on host %s with pid %d",
        sQuote(x[["host"]]), pid)
    cat(msg, "\n", sep = "")
    invisible(x)
}
<bytecode: 0x2f0efc8>
<environment: namespace:parallel>
Thus, if you've called stopCluster on the cluster object, you'll get errors trying to use the socket connections.
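If you want that check as a simple TRUE/FALSE rather than reading error messages, the same idea can be wrapped in a small helper. This is just a sketch built on the behaviour above (any round-trip over the node sockets errors once the cluster is stopped), not an official parallel API.
library(parallel)

# TRUE if the workers still answer a round-trip call, FALSE once stopped.
cluster_is_alive <- function(cl) {
  !inherits(try(clusterCall(cl, Sys.getpid), silent = TRUE), "try-error")
}

cl <- makeCluster(3)
cluster_is_alive(cl)   # TRUE
stopCluster(cl)
cluster_is_alive(cl)   # FALSE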

Communicating with a TCP socket (server)

I am trying to send text to a server listening on a TCP port with R, and then read the response text from the server.
This is quite trivial in bash; for a server listening on port 12345:
> echo "text" | nc localhost 12345
response
The server keeps running and can be queried any time again after this.
But if I try the same thing from within R with socketConnection, I either never get a response, or it is printed and not captured.
I have tried the following:
con <- socketConnection(port=12345)
con <- socketConnection(port=12345, blocking=TRUE, timeout=2)
writeLines("text", con) # server does not receive a thing
flush(con) # has no effect
readLines(con) # still, nothing happens and gets nothing back
close(con) # server confirms receipt, but I no longer can get the result...
The server only receives the data after closing the connection, so nothing can be read
con <- pipe("nc localhost 12345")
writeLines("text", con)
Now, "result" is printed to STDOUT, so I cannot capture it...
If using a temporary file that contains "text":
res <- readLines(pipe("nc localhost 12345 < tempfile"))
That works, but requires an intermediate, temporary file.
How do I get server communication to work in R so that I can write and then read from the same connection?
I compiled and ran this simple server, leading to
Socket created
bind done
Waiting for incoming connections...
Then in R I created a connection
con <- socketConnection("127.0.0.1", port = 8888)
the server responded
Connection accepted
and back in R...
writeLines("all the world's a stage", con)
x = readLines(con)
x
## [1] "all the world's a stage"
close(con)
to which the server responded
Client disconnected
and then quit, as expected. Not sure how this differs from what you've tried.
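For what it's worth, the write-then-read exchange can be wrapped in a small helper using only base R; blocking = TRUE makes readLines wait for the reply, and the timeout value here is arbitrary.
# Open a connection, send one line, read one line back, then clean up.
query_socket <- function(host, port, text, timeout = 5) {
  con <- socketConnection(host, port = port, blocking = TRUE, timeout = timeout)
  on.exit(close(con), add = TRUE)
  writeLines(text, con)
  readLines(con, n = 1)
}

query_socket("127.0.0.1", 8888, "all the world's a stage")
## with the echo server above: [1] "all the world's a stage"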
