R: Postgres connection keeps timing out or breaking

I am running R code that connects to a PostgreSQL database. The connection is defined outside the loop, but it times out and keeps breaking. If I instead open the connection inside the loop and kill it after each use, we hit the limit on the number of connections.
Additionally, when the R code runs in a loop and the outputs are stored in the database, it works for the first 15 minutes, but then the connection breaks, saying it cannot connect.
I get the following errors:
RS-DBI driver: (could not connect ------ on dbname "abc": could not connect to server: Connection timed out (0x0000274C/10060)
    Is the server running on host "123.456.567.890" and accepting
    TCP/IP connections on port 5432?
)Error in diagnosticTestsPg(project_path, modelbank, modelproduct, modelwaterfall,  :
  object 'conn' not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Here, conn is the database connection.
Is there a way to fix this, or a workaround to keep the connection alive until the loop finishes?

id <- tryCatch(
  withCallingHandlers(
    id <- f(),
    error = function(e) {
      write.to.log(sys.calls())
    },
    warning = function(w) {
      write.to.log(sys.calls())
      invokeRestart("muffleWarning")
    }
  ),
  error = function(e) { print("recovered from error") }
)
where f() contains the database connection details.
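One workaround is to check whether the connection is still valid at the top of each iteration and reconnect only when it has dropped, so the connection limit is never exceeded. Below is a minimal sketch, not your exact code: the helper get_conn(), the loop variable n_models, the RPostgres driver and the credentials are assumptions to substitute with your own.

library(DBI)

# Hypothetical helper: returns a live connection, reconnecting only if the
# existing one has timed out or been dropped by the server.
get_conn <- function(conn = NULL) {
  if (is.null(conn) || !DBI::dbIsValid(conn) ||
      inherits(try(DBI::dbGetQuery(conn, "SELECT 1"), silent = TRUE), "try-error")) {
    conn <- DBI::dbConnect(
      RPostgres::Postgres(),
      host = "123.456.567.890", port = 5432, dbname = "abc",
      user = "user", password = "password"
    )
  }
  conn
}

conn <- NULL
for (i in seq_len(n_models)) {        # n_models: placeholder for your loop length
  conn <- get_conn(conn)              # reconnect only when needed
  # ... run diagnosticTestsPg(...) and write the results using conn ...
}
DBI::dbDisconnect(conn)               # single disconnect after the loop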

Related

How to resolve "Error: could not receive data from server" - database connection in r?

I am running into
Error: could not receive data from server: Software caused connection abort (0x00002745/10053)
upon trying to connect to a Postgres database using the DBI package in R. Note that I am in a work environment and therefore subject to a corporate firewall. Can that explain the error, or is there something else that could be happening?
Here is the code I'm using
# Connect to trayaway dev
con <- DBI::dbConnect(
  RPostgres::Postgres(),
  host = host, port = 5432, dbname = "postgres",
  user = user, password = password
)
error below:
Error: could not receive data from server: Software caused connection abort (0x00002745/10053)
A solution was found: I tried the same code over WiFi and it works; when hardwired, the connection string fails to connect to the database. So this is a corporate firewall issue. Thank you.
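For this kind of network/firewall diagnosis, a quick check from R can help separate driver problems from connectivity problems. A minimal sketch, assuming a DBI version that provides dbCanConnect() and that libpq's connect_timeout parameter is passed through by RPostgres:

library(DBI)

# Fail fast instead of hanging: connect_timeout is forwarded to libpq.
ok <- DBI::dbCanConnect(
  RPostgres::Postgres(),
  host = host, port = 5432, dbname = "postgres",
  user = user, password = password,
  connect_timeout = 5   # seconds
)
if (!ok) {
  message("Cannot reach the server on port 5432 - likely a network/firewall issue.")
}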

database connection intermittently fails when using dopar

I am trying to access an SQL Server database from R and need to parallelise the process for higher throughput using doSNOW. When setting up the cluster, I first initialise the connection, but for some of the cores in the cluster, database connection fails without explanation.
library(doSNOW)
cl <- makeCluster(10)
registerDoSNOW(cl)
clusterEvalQ(cl, {
  library(RODBC)
  dbhandle <- odbcDriverConnect(%connectionstring%)
})
This code prints a list of the connections and whilst some have been successfully initialised, others have failed (returned -1). This happens randomly and different connections fail each time the code is run.
[[1]]
[1] -1
[[2]]
RODBC Connection 1
Details:
case=nochange
DRIVER=SQL Server
SERVER=redacted
UID=
Trusted_Connection=Yes
WSID=redacted
DATABASE=redacted
[[3]]
[1] -1
[[4]]
RODBC Connection 1
Details:
case=nochange
DRIVER=SQL Server
SERVER=redacted
UID=
Trusted_Connection=Yes
WSID=redacted
DATABASE=redacted
As per the comments, adding Sys.sleep(Sys.getpid() / 1000) fixes the problem:
clusterEvalQ(cl, {
  Sys.sleep(Sys.getpid() / 1000)   # stagger connection attempts across workers
  library(RODBC)
  dbhandle <- odbcDriverConnect(%connectionstring%)
})
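If staggering by PID feels fragile, another option is to retry the connection a few times with a short backoff on each worker. A minimal sketch, assuming the failures are transient and that connection_string is a placeholder for the %connectionstring% used above:

clusterEvalQ(cl, {
  library(RODBC)
  dbhandle <- -1
  for (attempt in 1:5) {
    # connection_string: placeholder for your ODBC connection string
    dbhandle <- odbcDriverConnect(connection_string)
    if (inherits(dbhandle, "RODBC")) break   # success: a valid RODBC handle
    Sys.sleep(2 * attempt)                   # back off before retrying
  }
  dbhandle
})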

How to pull data from postgresql data into dataframe for use with sqldf

I have successfully connected to a Postgres DB in RStudio using the RPostgreSQL package and pulled back the necessary data. No problem there.
The issue is that, now that I have the dataset in RStudio, I want to be able to query it as a data frame using sqldf. This is where the problem lies.
I have already tried the following code:
tab1 <- DBI::dbGetQuery(con, "
  SELECT a.user_id,
         a.some_id1,
         a.some_id2,
         a.some_var1,
         a.some_var2,
         a.some_var3,
         a.some_var4,
         a.some_var5,
         b.some_var6
  FROM sessions a
  LEFT JOIN session_experiments b
    ON a.some_id1 = b.some_id2
   AND a.some_var1 = b.some_var1
")
Again, this returns the data I want to see in RStudio.
I then try something like...
tab2 <- sqldf("SELECT COUNT (DISTINCT some_id1) FROM tab1")
...and I see the following error.
Error in postgresqlNewConnection(drv, ...) :
RS-DBI driver: (could not connect postgres#localhost:5432 on dbname "test": could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
)
Error in !dbPreExists : invalid argument type
I admit postgresql is not a package I have used before, so I would appreciate some help.
Thanks in advance.
OK, it seems the issue lies in the fact that when dropping back to sqldf you are required to specify the driver and dbname explicitly, as in the following example:
sqldf(query, drv="SQLite", dbname=":memory:")
I had no idea about this, but it resolves my issue, so I will consider the question answered.
Read more here:
https://www.r-bloggers.com/using-postgresql-in-r-a-quick-how-to/
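Applied to the example above, that looks like the following sketch. The sqldf.driver option shown as an alternative is an assumption about your sqldf version, so check ?sqldf:

library(sqldf)

# Force sqldf to use its bundled SQLite backend instead of the PostgreSQL
# driver it auto-detects when RPostgreSQL is loaded.
tab2 <- sqldf("SELECT COUNT(DISTINCT some_id1) FROM tab1",
              drv = "SQLite", dbname = ":memory:")

# Alternatively, set it once per session (assumed to be supported):
options(sqldf.driver = "SQLite")
tab2 <- sqldf("SELECT COUNT(DISTINCT some_id1) FROM tab1")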

Spark JDBC connection to SQL Server times out often

I'm running Spark v2.2.1 via sparklyr v0.6.2 and pulling data from SQL Server via JDBC. I seem to be experiencing a network issue, because many times (not every time) an executor doing a write to SQL Server fails with the error:
Prelogin error: host <my server> port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:...
I am running my sparklyr session with the following configurations:
spark_conf = spark_config()
spark_conf$spark.executor.cores <- 8
spark_conf$`sparklyr.shell.driver-memory` <- "8G"
spark_conf$`sparklyr.shell.executor-memory` <- "12G"
spark_conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
spark_conf$spark.network.timeout <- 400
But interestingly the network timeout I've set above does not seem to apply based on the executor logs:
18/06/11 17:53:44 INFO BlockManager: Found block rdd_9_16 locally
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:3 ClientConnectionId: d3568a9f-049f-4772-83d4-ed65b907fc8b Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:d3568a9f-049f-4772-83d4-ed65b907fc8b
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:2 ClientConnectionId: ecb084e6-99a8-49d1-9215-491324e8d133 Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:ecb084e6-99a8-49d1-9215-491324e8d133
18/06/11 17:53:45 ERROR Executor: Exception in task 10.0 in stage 26.0 (TID 77)
Can someone help me understand what a prelogin error is and how to avoid this issue? Here is my write function:
function(df, tbl, db, server = NULL, user, pass, mode = "error",
         options = list(), ...) {
  sparklyr::spark_write_jdbc(
    df,
    tbl,
    options = c(
      list(
        url = paste0("jdbc:sqlserver://", server, ".nciwin.local;",
                     "databaseName=", db, ";",
                     "user=", user, ";",
                     "password=", pass, ";"),
        driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
      ),
      options
    ),
    mode = mode, ...
  )
}
I've just updated my JDBC driver to version 6.0, but I don't think it made a difference. I hope I installed it correctly; I just dropped it into my Spark/jars folder and then added it to Spark/conf/spark-defaults.conf.
EDIT
I am reading 23M rows in 24 partitions into Spark. My cluster has 4 nodes with 8 cores and 18G of memory each. With my current configuration I have 4 executors with 8 cores and 12G per executor. My function to read in the data looks like this:
function(sc, tbl, db, server = NULL, user, pass, repartition = 0,
         options = list(), ...) {
  sparklyr::spark_read_jdbc(
    sc,
    tbl,
    options = c(
      list(
        url = paste0("jdbc:sqlserver://", server, ".nciwin.local;"),
        user = user,
        password = pass,
        databaseName = db,
        dbtable = tbl,
        driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
      ),
      options
    ),
    repartition = repartition, ...
  )
}
I set repartition to 24 when running. As such, I'm not seeing the connection to the suggested post.
EDIT 2
I was able to fix my issue by getting rid of repartitioning. Can anyone explain why repartitioning with sparklyr is not effective in this case?
As explained in the other question, as well as in some other posts (Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, Converting mysql table to spark dataset is very slow compared to same from csv file, Partitioning in spark while reading from RDBMS via JDBC, spark reading data from mysql in parallel) and off-site resources (Parallelizing Reads), by default the Spark JDBC source reads all data sequentially into a single node.
There are two ways of parallelizing reads:
Range splitting based on a numerical column, with the lowerBound, upperBound, partitionColumn and numPartitions options all required, where partitionColumn is a stable numeric column (pseudocolumns might not be a good choice); see the sketch after this list for how this maps onto the read function above:
spark_read_jdbc(
  ...,
  options = list(
    ...,
    lowerBound = "0",        # Adjust to fit your data
    upperBound = "5000",     # Adjust to fit your data
    numPartitions = "42",    # Adjust to fit your data and resources
    partitionColumn = "some_numeric_column"
  )
)
predicates list - not supported in sparklyr at the moment.
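Applied to the read function above, the range-splitting options are passed together (all four are required). A minimal sketch with placeholder values; the table name, column name, bounds and jdbc_url are assumptions you would replace with your real JDBC URL and the actual MIN()/MAX() of a numeric key in your table:

df <- sparklyr::spark_read_jdbc(
  sc,
  "my_table",                                   # name to register in Spark (placeholder)
  options = list(
    url             = jdbc_url,                 # the JDBC URL built in the read function above
    dbtable         = "my_table",
    driver          = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    partitionColumn = "some_numeric_column",    # a stable numeric column in your table
    lowerBound      = "1",                      # real MIN() of that column
    upperBound      = "23000000",               # real MAX() of that column
    numPartitions   = "24"                      # matches the 24 partitions mentioned above
  )
)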
Repartitioning (sparklyr::sdf_repartition) doesn't resolve the problem because it happens after the data has been loaded. Since a shuffle (required for repartitioning) is among the most expensive operations in Spark, it can easily crash the node.
As a result, using the repartition parameter of spark_read_jdbc or sdf_repartition is just cargo-cult practice, and most of the time it does more harm than good. If the data is small enough to be piped through a single node, then increasing the number of partitions will usually decrease performance. Otherwise it will just crash.
That being said, if the data is already processed by a single node, it raises the question of whether it makes sense to use Apache Spark at all. The answer will depend on the rest of your pipeline, but considering only the component in question, it is likely to be negative.

Socket connection error in doRedis on AWS EC2

I have set up an instance to use as a Redis worker. All ports are open. When I issue
library("doRedis")
redisWorker(host="ZZZ-23-20-XXX-XXX.compute-1.amazonaws.com", queue="jobs")
i get the error
Error in socketConnection(host, port, open = "a+b", blocking = TRUE, timeout = timeout) :
cannot open the connection
In addition: Warning message:
In socketConnection(host, port, open = "a+b", blocking = TRUE, timeout = timeout) :
ZZZ-23-20-XXX-XXX.compute-1.amazonaws.com:6379 cannot be opened
Any ideas what could be going on? I have also used the internal EC2 IP (10.XXX.XXX.ZZZ) and still get the same error. The server is up, running, and pingable.
I am running the latest and greatest of R, doRedis, and Ubuntu 12.04, all fully updated. This has been discussed before but no solution was found: doRedis with strange socket connection error in Ubuntu Linux, R, and RStudio
I have had similar issues, although with registerDoRedis(), since you cannot set a timeout there, and I believe the problem is with the timeout value used in the function redisConnect.
In R, if you run fix(redisConnect) you can see that the default timeout is as follows:
redisConnect <- function (host = "localhost", port = 6379, returnRef = FALSE,
timeout = 2147483647L)
It seems this huge timeout value is causing the issue. To check, change the line where it is used from this:
con <- socketConnection(host, port, open = "a+b", blocking = TRUE,
timeout = timeout)
To this:
con <- socketConnection(host, port, open = "a+b", blocking = TRUE,
timeout = 30)
I find that works, although as soon as you reload the package the change gets wiped. I only found this today, so I will submit a bug to the developer. I'm running R 2.15 on OS X, by the way.
The function you are using should default to a timeout of 30, or you can try setting it in the function call to be sure, rather than fix()'ing the underlying code.
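Following that suggestion, the call would look something like this. A sketch under the assumption that your doRedis version exposes a timeout argument on redisWorker, as the answer implies; check ?redisWorker:

library(doRedis)

# Pass a short timeout explicitly instead of patching redisConnect() with fix().
# Assumption: the timeout argument is forwarded to the underlying socket connection.
redisWorker(
  host = "ZZZ-23-20-XXX-XXX.compute-1.amazonaws.com",
  queue = "jobs",
  timeout = 30   # seconds
)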
