I'm running Spark v2.2.1 via sparklyr v0.6.2 and pulling data from SQL Server via jdbc. I seem to be experiencing some network issue because many times (not every time) my executor doing a write to SQL Server fails with error:
Prelogin error: host <my server> port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:...
I am running my sparklyr session with the following configurations:
spark_conf = spark_config()
spark_conf$spark.executor.cores <- 8
spark_conf$`sparklyr.shell.driver-memory` <- "8G"
spark_conf$`sparklyr.shell.executor-memory` <- "12G"
spark_conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
spark_conf$spark.network.timeout <- 400
But interestingly the network timeout I've set above does not seem to apply based on the executor logs:
18/06/11 17:53:44 INFO BlockManager: Found block rdd_9_16 locally
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:3 ClientConnectionId: d3568a9f-049f-4772-83d4-ed65b907fc8b Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:d3568a9f-049f-4772-83d4-ed65b907fc8b
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:2 ClientConnectionId: ecb084e6-99a8-49d1-9215-491324e8d133 Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:ecb084e6-99a8-49d1-9215-491324e8d133
18/06/11 17:53:45 ERROR Executor: Exception in task 10.0 in stage 26.0 (TID 77)
Can someone help me understand what a prelogin error is and how to avoid this issue? Here is my write function:
function (df, tbl, db, server = NULL, user, pass, mode = "error",
options = list(), ...)
{
sparklyr::spark_write_jdbc(
df,
tbl,
options = c(
list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;",
"databaseName=", db, ";",
"user=", user, ";",
"password=", pass, ";"),
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
options),
mode = mode, ...)
}
I've just updated my jdbc driver to version 6.0, but I don't think it made a difference. I hope i installed it correctly. I just dropped it into my Spark/jars folder and then added it into Spark/conf/spark-defaults.conf.
EDIT
I am reading in 23M rows in 24 partitions into Spark. My cluster has 4 nodes with 8 cores each and 18G memory. With my current configurations I have 4 executors with 8 cores each and 12G per executor. My function to read in the data looks as such:
function (sc, tbl, db, server = NULL, user, pass, repartition = 0, options = list(), ...)
{
sparklyr::spark_read_jdbc(
sc,
tbl,
options = c(
list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;"),
user = user,
password = pass,
databaseName = db,
dbtable = tbl,
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
options),
repartition = repartition, ...)
}
I set repartition to 24 when running. As such, I'm not seeing the connection with the post suggested.
EDIT 2
I was able to fix my issue by getting rid of repartitioning. Can anyone explain why repartitioning with sparklyr is not effective in this case?
As explained in the other question, as well as some other posts (Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, Converting mysql table to spark dataset is very slow compared to same from csv file, Partitioning in spark while reading from RDBMS via JDBC, spark reading data from mysql in parallel) and off-site resources (Parallelizing Reads), by default Spark JDBC source reads all data sequentially into a single node.
There are two ways of parallelizing reads:
Range splitting based on a numerical column with lowerBound, upperBound, partitionColumn and numPartitions options required, where partitionColumn is a stable numeric column (pseudocolumns might not be a good choice)
spark_read_jdbc(
...,
options = list(
...
lowerBound = "0", # Adjust to fit your data
upperBound = "5000", # Adjust to fit your data
numPartitions = "42", # Adjust to fit your data and resources
partitionColumn = "some_numeric_column"
)
)
predicates list - not supported in sparklyr at the moment.
Repartitioning (sparklyr::sdf_repartition doesn't resolve the problem because it happens after data has been loaded. Since shuffle (required for repartition) belongs to the most expensive operations in Spark it can easily crash the node.
As a result using:
repartition parameter of spark_read_jdbc:
sdf_repartition
is just a cargo cult practice, and most of the time does more harm than good. If data is small enough to be piped through a single node, then increasing number of partitions will usually decreases performance. Otherwise it will just crash.
That being said - if data is already processed by a single node it raises a question, if it makes sense to use Apache Spark at all. The answer will depend on the rest of your pipeline, but considering only component in question, it likely be negative.
Related
I am running SQL-code on a oracle database. Some commands require to run them via sqlplus. Is there a way to avoid my commandline solution but directly running sqlplus via, e.g. dbSendStatement().
Pseudo code to not share any sensible information
# Via dbSendStatement ------------------------------------------------------------------------------
con <- odbc::dbConnect(odbc::odbc(),
Driver = "oracle",
Host = "HOST",
Port = "PORT",
SVC = "SVC",
UID = Sys.getenv("USRDWH"),
PWD = Sys.getenv("PWDDWH"),
ssl = "true",
timeout = 10)
# Error
odbc::dbSendStatement(con, "EXEC SQL CODE")
# actual error message:
#> Error in new_result(connection#ptr, statement, immediate) :
#> nanodbc/nanodbc.cpp:1594: 00000: [RStudio][OracleOCI] (3000) Oracle Caller Interface: ORA-00900: invalid SQL statement
# Via system command -------------------------------------------------------------------------------
cmd <- paste0("sqlplus ",
Sys.getenv("USRDWH"), "/", Sys.getenv("PWDDWH"),
"#", "HOST", ":", "PORT", "/", "SVC", " ",
"#", "EXEC script.sql")
cmd
#> [1] "sqlplus USR/PWD#HOST:PORT/SVC #EXEC script.sql"
# Works
system(cmd,
intern = TRUE)
Code like that always connects directly to the database. sqlplus is a specific client tool; it doesn't have its own API for those kind of interactions. In other words, you always connect to the database; you can't connect to sqlplus as it is not a service.
Your best option would be to convert your SQL in such a way that you can run it natively in your code using a direct database connection (i.e. don't use sqlplus). If your SQL commands cannot be adapted, then you will need to write a shell interaction to manipulate sqlplus as you did with cmd in your example.
That said, this implementation in your example is very insecure, as it will allow anyone with access to your host to see the database username, password, and connection info associated with the process while it is running. There are much more secure ways of scripting this, including the use of an auto-open Oracle Wallet to hold the credentials so you don't have to embed them in your code (which is always a bad idea, too).
Using Oracle Wallet, your cmd call would then look more like this:
sqlplus /#TNS_ALIAS #EXEC script.sql
This is still not perfect, but is a step or two in the right direction.
I'm trying to query a MySQL server through an R/Tidyverse/dbplyr workflow. My MySQL access requires configuring SSH and port forwarding.
I have this code working using python (below), but I'm struggling to get started with the SSH/port forwarding equivalent in R. Any pointers to solutions or equivalent R packages appreciated. thanks.
import pymysql
import paramiko
import pandas as pd
from paramiko import SSHClient
from sshtunnel import SSHTunnelForwarder
from os.path import expanduser
pkeyfilepath = '/.ssh/id_ed25519'
home = expanduser('~')
mypkey = paramiko.Ed25519Key.from_private_key_file(home + pkeyfilepath)
sql_hostname = 'mycompany.com'
sql_username = 'me'
sql_password = '***'
sql_main_database = 'my_db'
sql_port = 3306
ssh_host = 'jumphost.mycompany.com'
ssh_user = 'me'
ssh_port = 22
with SSHTunnelForwarder(
(ssh_host, ssh_port),
ssh_username=ssh_user,
ssh_pkey=mypkey,
remote_bind_address=(sql_hostname, sql_port)) as tunnel:
conn = pymysql.connect(host='127.0.0.1', user=sql_username,
passwd=sql_password, db=sql_main_database,
port=tunnel.local_bind_port)
query = '''SELECT VERSION();'''
data = pd.read_sql_query(query, conn)
print(data)
conn.close()
There are several ways to do ssh port forwarding for R. In no particular order:
I forward it externally to R. All of my work is remote, and for one particular client I need access to various instances of SQL Server, Redis, MongoDB, remote filesystems, and a tunnel-hop to another network only accessible from the ssh bastion host. I tend to do work in more than R, so it's important to me that I generalize this. It is not for everybody or every task.
For this, I used a mismash of autossh and my ssh-agent (in KeePass/KeeAgent).
The ssh package does have a function to Create a Tunnel. The premise is that you have already created a "session" to which you can add a forwarding rule(s). When using ssh::ssh_tunnel, it is blocking, meaning you cannot use it in the same R process and continue to work. Demo:
# R session 1
sess <- ssh::ssh_connect("user#remote")
# insert passphrase
ssh::ssh_tunnel(sess, 21433, "otherremote:1433")
# / Waiting for connection on port 21433...
# R session 2
con <- DBI::dbConnect(..., port=21433)
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
This connection will stay alive so long as con is not closed and the remote end does not close it (e.g., activity timeout).
Note: I cannot get the ssh package to use my ssh-agent, so all passwords must be typed in or otherwise passed in not-ideal ways. There are many ways to not have to type it, such as using the keyring package (secure) or envvars, both of which would pass the password to ssh_connect(..., passwd=<>).
The above, but using callr so that you don't need to explicit sessions active (though you will still have another R session.
bgr <- callr::r_bg(function() {
ssh <- ssh::ssh_connect("r2#remote", passwd=keyring::key_get("r2", "remote"))
ssh::ssh_tunnel(ssh, port=21433, "otherremote:1433")
}, supervise = TRUE)
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
### when your work is done
bgr$kill()
If you do this, I strongly recommend the use of supervise=TRUE, which ensures the background R process is killed when this (primary) R session exits. This will reduce the risk of having phantom unused R sessions hanging around; in addition to just clogging up the process tree, if one of these phantom R processes is actively forwarding a port, that means nothing else can forward that port. This allows you to continue working, but you are not longer in control of the process doing the forwarding ... and subsequent attempts to tunnel will fail.
FYI, I generally prefer using keyring::key_get("r2", "remote") for password management in situations like this: (1) it prevents me from having to set that envvar each time I start R ... which will inadvertently store the plain-string password in ~/.Rhistory, if saved; (2) it prevents me from having to set that envvar in the global environment permanently, which is prone to other stupid mistakes I make; and (3) is much better protected since it is using the native credentials of your base OS. Having said that, you can replace the above use of keyring::key_get(..) with Sys.getenv("mypass") in a pinch, or in a case where the code is running on a headless system where a credentials manager is unavailable.
And if you want this to be a little more resilient to timeout disconnects, you can instead use
bgr <- callr::r_bg(function() {
ssh <- ssh::ssh_connect("r2#remote", passwd=keyring::key_get("r2", "remote"))
while (!inherits(try(ssh::ssh_tunnel(ssh, port=21433, "otherremote:1433"), silent=TRUE), "try-error")) Sys.sleep(1)
}, supervise = TRUE)
which will repeatedly make the tunnel so long as the attempt does not error. You may need to experiment with this to get it "perfect".
callr is really just using processx under the hood to start a background R process and allow you to continue working. If you don't want the "weight" of another R process solely to forward ports, you can use processx to start an explicit call to ssh that does everything you need it to do.
proc <- processx::process$new("ssh", c("-L", "21433:otherremote:1433", "r2#remote", "-N"))
### prompts for password
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
### when done
proc$kill()
# [1] TRUE
I have acces to an oneway export function to a public company database through Elastic Search. I have problems connection to it from R and the elastic package.
I have server name(URL), username and password, but I don't have any port number. They describe it as a rest API. Do I have to use the elastic package or is there an easier way around it. The only information I have to the database is: http://distribution.virk.dk/cvr-permanent/virksomhed/_search?.
Host="Distribution.virk.dk"
index="cvr-permanent"
type="virksomhed"
The above link works with HTTR, but I wish to use elastic for automation purposes, when making a large request of data.
so my connect looks like
host = "distribution.virk.dk"
port = ''
path = ''
schema = "http"
user = "user_name"
pass = "secret"
connect(es_host = host,es_user=user, transport=schema, port=port, es_pwd = pass)
Even though I set port to blank it returns 9200.
If I try to use Search
>Search(index="cvr-permanent", type="virksomhed", q='"cvrNummer":"33647093"', size=10)
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to distribution.virk.dk port 9200: Timed out
(elastic maintainer here)
You should be able to pass in httr::authenticate() to elastic::Search and other functions from the pkg, e.g,.
x <- Search(config = c(httr::verbose(), authenticate("foo", "bar")))
You should see the Authorization: Basic XXXXXX header in the request headers
does that work?
I am trying to query AWS ElasticSearch Service (AWS ES) through a package in R called elastic. I am getting an error when trying to connect to the server.
Here is an example:
install.packages("elastic")
library(elastic)
aws_endpoint = "<secret>"
# I am certain the endpoint exists and is correct, as it functions with Kibana
aws_port = 80
# I have tried 9200, 9300, and 443 with no success
connect(es_host = aws_endpoint,
es_port = 80,
errors = "complete")
ping()
Search(index = "foobar", size = 1)$hits$hits
Whether pinging the server, or actually trying to search a document, both retrieve this error:
Error: 404 - no such index
ES stack trace:
type: index_not_found_exception
reason: no such index
resource.type: index_or_alias
resource.id: us-east-1.es.amazonaws.com
index: us-east-1.es.amazonaws.com
I have gone into my AWS ES dashboard and made certain I am using indexes that exist. Why this error?
I imagine I am misunderstanding something about transport protocols. elastic interacts with elasticsearch's HTTP API. I thought this was fine.
How do I establish an approriate connection between R and AWS ES?
R version 3.3.0 (2016-05-03); elastic_0.7.8
Solved it.
es_path must be specified as an empty string (""). Otherwise, connect() understands the AWS region (i.e. us-east-1.es.amazonaws.com) as the path. I imagine connect() adds the misunderstood path in the HTTP request, following the format shown here.
connect(es_host = aws_endpoint,
es_port = 80,
errors = "complete",
es_path = ""
)
Just to be clear, the parameters I actually used is shown below, but they should not make a difference. Fixing es_path is the key.
connect(es_host = aws_endpoint,
es_port = 443,
es_path = "",
es_transport_schema = "https")
I am making the move from RSQLite to RMySQL and I am confused by the user and password fields. FWIW, I'm running Windows 7, R 2.12.2, MySQL 5.5 (all 64 bit), and RMySQL 0.7-5.
I installed RMySQL as prescribed in this previous SO question, and as far as I know it works (i.e., I can load the package with library(RMySQL)). But when I try to run the tutorial from the R data import guide, I get a "failed to connect to database..." error. This is the code from the tutorial from the guide:
library(RMySQL) # will load DBI as well
## open a connection to a MySQL database
con <- dbConnect(dbDriver("MySQL"), user = "root", password = "root", dbname = "pookas")
## list the tables in the database
dbListTables(con)
## load a data frame into the database, deleting any existing copy
data(USArrests)
dbWriteTable(con, "arrests", USArrests, overwrite = TRUE)
dbListTables(con)
## get the whole table
dbReadTable(con, "arrests")
## Select from the loaded table
dbGetQuery(con, paste("select row_names, Murder from arrests",
"where Rape > 30 order by Murder"))
dbRemoveTable(con, "arrests")
dbDisconnect(con)
On the second line I get the following error:
> con <- dbConnect(dbDriver("MySQL"), user = "richard", password = "root", dbname = "pookas")
Error in mysqlNewConnection(drv, ...) :
RS-DBI driver: (Failed to connect to database: Error: Access denied for user 'richard'#'localhost' (using password: NO)
)
I have tried with and without user and password and with admin as user. I have also tried using a dbname that I made before with the command line and with one that doesn't exist.
Any tips? Is there a good reference here? Thanks!
That is most likely a setup issue on the server side. Make sure that networked access is enabled.
Also, a local test with the command-line client is not equivalent as that typically uses sockets. The mysql server logs may be helpful.
First try to connect to MySQL server using MySQL Workbench or command line mysql using the same parameter. If it connects then R should also be able to connect.
Typically this issue comes when MySQL server doesn't allow connections from remote machines.
As people have told you, you can try to connect to the host with other application as mysql workbench. How odd! When I have tried in RStudio to connect to my db with your code without indicate the host in the command I haven't been able to connect.
I have needed to indicate the host ( host = 'localhost' ) in the command.