How to overcome BiocParallel errors (error in socket connection) - r

I'm trying to run an R package that requires parallel processing (xcms). I run into an error that I know other people have experienced:
Errors in makeCluster(multicore): cannot open the connection,
Error in parallel processing: port cannot be open,
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : cannot open the connection in Stan (from R) (the one most similar to mine that doesn't use a cluster/Linux, and had no answer seven years ago)
Error in socketConnection(port = port, server = TRUE, blocking = TRUE, : cannot open the connection
The only way I can get the code to run is by disabling parallel processing with
register(SerialParam())
I am using a Windows computer with 11 available cores. Any advice would be appreciated.
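If serial execution is too slow, one thing sometimes worth trying is registering an explicit SOCK-based backend with fewer workers than the machine offers. This is only a sketch of the general idea, not a confirmed fix; the worker count below is illustrative and not taken from the question:
library(BiocParallel)
# Illustrative value: try fewer workers than the 11 available
param <- SnowParam(workers = 4, type = "SOCK")
register(param)
# xcms then picks up the registered backend via bpparam()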

Related

How to connect to shinyapps with a firewall installed?

I am trying to connect to shinyapps via RStudio using the setAccountInfo function from the rsconnect package:
rsconnect::setAccountInfo(name='MYACCOUNTNAME',
token='TOKEN',
secret='<SECRET>')
But I am getting the following error:
Error in function (type, msg, asError = TRUE) :
Failed to connect to api.shinyapps.io port 443: Timed out
I am on my office PC, and one of the more likely causes is the corporate firewall, so my questions are:
Is there a way to workaround this problem and connect anyway?
If not, what would be the instruction I would have to give the IT department to be capable of connecting?
The following options should help you see what's happening:
library(rsconnect)
options(rsconnect.http.trace = TRUE, rsconnect.error.trace = TRUE, rsconnect.http.verbose = TRUE)
rsconnect::setAccountInfo(name='MYACCOUNTNAME',
token='TOKEN',
secret='<SECRET>')
By running this you should see what IP addresses rsconnect is trying to use. Try adding this to a whitelist for your firewall.
If this doesn't work, it may be a proxy issue. See Issue setting up my shinyapps.io + AUTHORIZE ACCOUNT + time out port 443, which should help you set up a proxy in RStudio.
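For reference, a minimal sketch of pointing R at a corporate proxy before calling setAccountInfo; the proxy host and port are placeholders you would need to get from your IT department:
# Hypothetical proxy address; replace with the one your IT department provides
Sys.setenv(http_proxy  = "http://proxy.example.com:8080",
           https_proxy = "http://proxy.example.com:8080")
rsconnect::setAccountInfo(name = 'MYACCOUNTNAME',
                          token = 'TOKEN',
                          secret = '<SECRET>')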

Spark JDBC connection to SQL Server times out often

I'm running Spark v2.2.1 via sparklyr v0.6.2 and pulling data from SQL Server via JDBC. I seem to be experiencing a network issue, because many times (not every time) an executor doing a write to SQL Server fails with this error:
Prelogin error: host <my server> port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:...
I am running my sparklyr session with the following configurations:
spark_conf = spark_config()
spark_conf$spark.executor.cores <- 8
spark_conf$`sparklyr.shell.driver-memory` <- "8G"
spark_conf$`sparklyr.shell.executor-memory` <- "12G"
spark_conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
spark_conf$spark.network.timeout <- 400
But interestingly the network timeout I've set above does not seem to apply based on the executor logs:
18/06/11 17:53:44 INFO BlockManager: Found block rdd_9_16 locally
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:3 ClientConnectionId: d3568a9f-049f-4772-83d4-ed65b907fc8b Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:d3568a9f-049f-4772-83d4-ed65b907fc8b
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:2 ClientConnectionId: ecb084e6-99a8-49d1-9215-491324e8d133 Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:ecb084e6-99a8-49d1-9215-491324e8d133
18/06/11 17:53:45 ERROR Executor: Exception in task 10.0 in stage 26.0 (TID 77)
Can someone help me understand what a prelogin error is and how to avoid this issue? Here is my write function:
function (df, tbl, db, server = NULL, user, pass, mode = "error",
options = list(), ...)
{
sparklyr::spark_write_jdbc(
df,
tbl,
options = c(
list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;",
"databaseName=", db, ";",
"user=", user, ";",
"password=", pass, ";"),
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
options),
mode = mode, ...)
}
I've just updated my JDBC driver to version 6.0, but I don't think it made a difference. I hope I installed it correctly: I just dropped it into my Spark/jars folder and then added it to Spark/conf/spark-defaults.conf.
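As an aside, one way to make sure the driver jar is picked up, without hand-editing spark-defaults.conf, is to pass it through the sparklyr config, which forwards sparklyr.shell.* entries as spark-submit flags. The jar path below is a placeholder, not the actual file name:
# Placeholder path; point this at the SQL Server JDBC jar you downloaded
spark_conf$`sparklyr.shell.driver-class-path` <- "C:/Spark/jars/mssql-jdbc.jar"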
EDIT
I am reading in 23M rows in 24 partitions into Spark. My cluster has 4 nodes with 8 cores each and 18G memory. With my current configurations I have 4 executors with 8 cores each and 12G per executor. My function to read in the data looks as such:
function (sc, tbl, db, server = NULL, user, pass, repartition = 0, options = list(), ...)
{
sparklyr::spark_read_jdbc(
sc,
tbl,
options = c(
list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;"),
user = user,
password = pass,
databaseName = db,
dbtable = tbl,
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
options),
repartition = repartition, ...)
}
I set repartition to 24 when running. As such, I don't see how the suggested post applies to my case.
EDIT 2
I was able to fix my issue by getting rid of repartitioning. Can anyone explain why repartitioning with sparklyr is not effective in this case?
As explained in the other question, as well as some other posts (Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, Converting mysql table to spark dataset is very slow compared to same from csv file, Partitioning in spark while reading from RDBMS via JDBC, spark reading data from mysql in parallel) and off-site resources (Parallelizing Reads), by default the Spark JDBC source reads all data sequentially into a single node.
There are two ways of parallelizing reads:
Range splitting based on a numerical column, with the lowerBound, upperBound, partitionColumn and numPartitions options required, where partitionColumn is a stable numeric column (pseudocolumns might not be a good choice):
spark_read_jdbc(
...,
options = list(
...
lowerBound = "0", # Adjust to fit your data
upperBound = "5000", # Adjust to fit your data
numPartitions = "42", # Adjust to fit your data and resources
partitionColumn = "some_numeric_column"
)
)
predicates list - not supported in sparklyr at the moment.
Repartitioning (sparklyr::sdf_repartition) doesn't resolve the problem because it happens after the data has been loaded. Since the shuffle (required for repartitioning) is among the most expensive operations in Spark, it can easily crash the node.
As a result, using either the repartition parameter of spark_read_jdbc or sdf_repartition is just a cargo-cult practice, and most of the time it does more harm than good. If the data is small enough to be piped through a single node, then increasing the number of partitions will usually decrease performance; otherwise it will just crash.
That being said, if the data is already processed by a single node, it raises the question of whether it makes sense to use Apache Spark at all. The answer will depend on the rest of your pipeline, but considering only the component in question, the answer is likely to be negative.
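For completeness, here is a hedged sketch of how the reader function from the question could pass the range-splitting options through spark_read_jdbc instead of repartitioning afterwards. The partition column name and bounds are placeholders; they would need to match an actual stable numeric column in the source table:
function (sc, tbl, db, server = NULL, user, pass, options = list(), ...)
{
  sparklyr::spark_read_jdbc(
    sc,
    tbl,
    options = c(
      list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;"),
           user = user,
           password = pass,
           databaseName = db,
           dbtable = tbl,
           driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
           partitionColumn = "id",      # placeholder: a stable numeric column
           lowerBound = "1",
           upperBound = "23000000",     # roughly the 23M rows mentioned above
           numPartitions = "24"),
      options),
    ...)
}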

How to make a client / server connection using Rserver and Windows Server 2008

I am searching for a robust solution to perform extensive computations on a remote server dedicated to computational tasks. The server runs Windows Server 2008 R2 and has R x64 3.4.1 installed on it. I've searched for free solutions and am now focusing on the Rserve/RSclient packages.
However, I can't connect any client (using RSclient) to the instanced server.
This is how I'm proceeding at the moment from the server side:
library(Rserve)
run.Rserve(config.file = "Rserv.conf")
using the following Rserv.conf file:
port 6311
remote enable
plaintext enable
control enable
r-control enable
The server is now instantiated from the R session (it's a bit ugly, but I will change that later on):
running Rserve in this R session (pid=...), 1 server(s)
Now I'm trying to connect from a remote computer (client side) using:
library(RSclient)
c = RS.connect(host = "...")
The connection then seems to succeed; checking c:
> c
Rserve QAP1 connection 0x000000000fbe9f50 (socket 764, queue length 0)
The error occurs when I try to eval anything, for example:
> RS.server.eval(c,"0<1")
Error in RS.server.eval(c, "0<1") : command failed with status code 0x4e: no control line present (control commands disabled or server shutdown)
I've read the available guides but still can't connect. What is wrong? It seems to be related to control lines, but I enabled them in the config file.
For me, the problem was solved by starting the Rserve instance with the command:
R CMD Rserve --RS-port 9000 --RS-enable-remote --RS-enable-control
instead of starting it in the R environment (library(Rserve), run.Rserve(config.file = "Rserv.conf")). You may try this on Windows as well.
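If you start Rserve that way on a non-default port, the client call needs to match it; a small sketch (host left as "..." as in the question):
library(RSclient)
# Connect to the non-default port used in the command above (9000 instead of 6311)
c <- RS.connect(host = "...", port = 9000)
RS.server.eval(c, "0 < 1")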
Refer to https://github.com/s-u/Rserve/wiki/rserve.conf.
port 6311
remote enable -> this should be remote true
plaintext enable
control enable
r-control enable
Likewise, refer to the link and try the remaining directives with their actual values.
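Putting that together, the config file would look roughly like this, assuming the other directives follow the same boolean syntax suggested for remote (worth verifying against the linked wiki for your Rserve version):
port 6311
remote true
plaintext true
control true
r-control true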

"permission denied" when trying to open serial connection

I am using the serial package in R to read serial input over USB (COM3). When I try to open/read the connection:
library(serial)
con <- serialConnection(name="test_con", port="COM3", mode="4, n, 8, 1", ...)
open(con)
read.serialConnection(con)
I get the following error message:
Error: object 'binned_spikes' not found
Sometimes open(con) works but read.serialConnection() never does.
I have tried restarting my computer etc. and have tested the serial connection in Tera Term, where the device definitely works and is on the right port. I've also tried this for all 4 ports and always get the same error message in R.
Thanks if you can help!

socket connection error in doredis on AWS EC2

I have set up an instance to use as a Redis worker. All ports are open. When I issue
library("doRedis")
redisWorker(host="ZZZ-23-20-XXX-XXX.compute-1.amazonaws.com", queue="jobs")
i get the error
Error in socketConnection(host, port, open = "a+b", blocking = TRUE, timeout = timeout) :
cannot open the connection
In addition: Warning message:
In socketConnection(host, port, open = "a+b", blocking = TRUE, timeout = timeout) :
ZZZ-23-20-XXX-XXX.compute-1.amazonaws.com:6379 cannot be opened
Any ideas what could be going on? I have also used the internal EC2 IP (10.XXX.XXX.ZZZ) and still get the same error. The server is up, running and pingable.
I am running the latest versions of R and doRedis on Ubuntu 12.04, all fully updated. This has been discussed before, but no solution was found: doRedis with strange socket connection error in Ubuntu Linux, R, and RStudio
I have had similar issues, although with registerDoRedis(), where you cannot set a timeout; I believe the problem is with the timeout value used in the function redisConnect.
In R, if you run fix(redisConnect), you can see the default for timeout is as follows:
redisConnect <- function (host = "localhost", port = 6379, returnRef = FALSE,
timeout = 2147483647L)
It seems this huge timeout value is causing the issue. To check, change it on the line where it is used, from this:
con <- socketConnection(host, port, open = "a+b", blocking = TRUE,
timeout = timeout)
To this:
con <- socketConnection(host, port, open = "a+b", blocking = TRUE,
timeout = 30)
I find that works, although as soon as you reload the package the change gets wiped. I only found this today, so I will submit a bug report to the developer. I'm running R 2.15 on OS X, by the way.
The function you are using should default to timeout 30, or you can try setting it on the function call to be sure rather than fix()'ing the underlying code.
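A minimal sketch of that suggestion, assuming your installed doRedis version exposes a timeout argument on redisWorker (check ?redisWorker):
library(doRedis)
# Pass a short, explicit timeout instead of patching redisConnect with fix()
redisWorker(host = "ZZZ-23-20-XXX-XXX.compute-1.amazonaws.com",
            queue = "jobs",
            timeout = 30)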
