Connecting to AWS Elasticsearch Service using R - Getting 404 Error

I am trying to query AWS ElasticSearch Service (AWS ES) through a package in R called elastic. I am getting an error when trying to connect to the server.
Here is an example:
install.packages("elastic")
library(elastic)
aws_endpoint = "<secret>"
# I am certain the endpoint exists and is correct, as it functions with Kibana
aws_port = 80
# I have tried 9200, 9300, and 443 with no success
connect(es_host = aws_endpoint,
        es_port = 80,
        errors = "complete")
ping()
Search(index = "foobar", size = 1)$hits$hits
Whether pinging the server or actually trying to search for a document, both calls return this error:
Error: 404 - no such index
ES stack trace:
type: index_not_found_exception
reason: no such index
resource.type: index_or_alias
resource.id: us-east-1.es.amazonaws.com
index: us-east-1.es.amazonaws.com
I have gone into my AWS ES dashboard and made certain I am querying indices that exist. Why am I getting this error?
I imagine I am misunderstanding something about transport protocols. elastic interacts with Elasticsearch's HTTP API, so I thought this would be fine.
How do I establish an appropriate connection between R and AWS ES?
R version 3.3.0 (2016-05-03); elastic_0.7.8

Solved it.
es_path must be specified as an empty string (""). Otherwise, connect() interprets the regional part of the endpoint (i.e. us-east-1.es.amazonaws.com) as the path. I imagine connect() then appends that misinterpreted path to the HTTP request, following the format shown here.
connect(es_host = aws_endpoint,
        es_port = 80,
        errors = "complete",
        es_path = "")
To be clear, the parameters I actually used are shown below, but they should not make a difference; fixing es_path is the key.
connect(es_host = aws_endpoint,
        es_port = 443,
        es_path = "",
        es_transport_schema = "https")
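With es_path set, the original calls from the question should then work as expected (foobar is the placeholder index name from the question):
ping()
Search(index = "foobar", size = 1)$hits$hits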

Related

Problems connecting to SFTP with R and RCurl

I am attempting to connect to an SFTP site to pull data. It used to work, but for some reason it stopped working a couple of weeks ago. The owners of the SFTP site say nothing has changed on their end, and I can pull data easily without error using WinSCP.
protocol <- "sftp"
server <- "sftp.xxxx.net"
userpwd <- "user:password"
file <- "/public/bpus_dailytx.csv"
url <- paste0(protocol, "://", server, file)
data <- getURL(url = url, userpwd=userpwd, verbose = TRUE)
When I run this, I now get the following info:
* Trying xxx.xx.xx.xxx...
* Connected to sftp.xxxx.net (xxx.xx.xx.xxx) port 22 (#0)
* SSH MD5 fingerprint: 34rh3ie93hhr39hhdik3
* SSH authentication methods available: publickey,keyboard-interactive
* Using SSH public key file '(nil)'
* Using SSH private key file ''
* SSH public key authentication failed: Unable to extract public key from private key file: Unable to open private key file
* No identity would match
* Authentication failure
* Closing connection 0
Error in function (type, msg, asError = TRUE) : Authentication failure
It connects OK but then the authentication fails. Any ideas what could be going on here? Again, this code used to work, but something has changed. What other ways could I attempt to pull the data?
Edit: WinSCP configuration screenshots (not reproduced here).
It is difficult to say why your code stopped working, because we do not have enough information about the configuration (on both your machine and the server) from when it was working.
Because your keyfile was not in the proper format for RCurl, my leading hypothesis is that, although the server folks said nothing changed on their end, they removed the password authentication option. Your code attempts password authentication only. If password authentication were still available, this line in your output would look something like:
SSH authentication methods available: publickey,password,keyboard-interactive
It is, as you noted, now:
SSH authentication methods available: publickey,keyboard-interactive
Therefore, the solution here was to convert your keyfile from PuTTY to OpenSSH format using PuTTYgen and then use the following RCurl code pointing to your new keyfile:
library(RCurl)

protocol <- "sftp"
server <- "sftp.xxxx.net"
file <- "/public/bpus_dailytx.csv"
url <- paste0(protocol, "://", server, file)
keypasswd <- "your_keypasswd"
ssh.private.keyfile <- "your_path_to_keyfile"
username <- "your_username"
data <- getURL(url = url, keypasswd = keypasswd,
               ssh.private.keyfile = ssh.private.keyfile,
               username = username, verbose = TRUE)
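Since the target file is a CSV, the text returned by getURL() can be parsed straight into a data frame; a minimal sketch, assuming the file has a header row:
# Parse the downloaded text into a data frame (assumes a header row).
df <- read.csv(text = data, stringsAsFactors = FALSE)
head(df)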
And I will add a special thanks to @Tensibai for the assistance. It would have taken me much longer to arrive at this solution without their insight into the keyfile format.

Connect To Amazon DocumentDB with R Mongolite

I have my own AWS DocumentDB cluster and I'm trying to connect to it in R using the mongolite package.
I tried to do this with mongolite's ssl_options(), with this code:
library(mongolite)

mong <- mongo(collection = "test", db = "test",
              url = '*******************.docdb.amazonaws.com:27017',
              verbose = TRUE,
              options = ssl_options(ca = 'rds-combined-ca-bundle.pem',
                                    weak_cert_validation = TRUE))
But I get this error:
> Error: No suitable servers found (`serverSelectionTryOnce` set):
> [socket timeout calling ismaster on
> '***********************-central-1.docdb.amazonaws.com:27017']
Can someone help me solve this problem?
You can connect to Amazon DocumentDB over TLS with the mongolite package (https://jeroen.github.io/mongolite/index.html) using a connection string like the following example:
j <- mongo(
  url = "mongodb://<yourUsername>:<yourPassword>@docdb-2019-02-21-02-57-28.cluster-ccuszbx3pn5e.us-east-1.docdb.amazonaws.com:27017/?ssl=true",
  options = ssl_options(weak_cert_validation = TRUE, ca = "rds-combined-ca-bundle.pem")
)
The error you are seeing typically occurs when (1) the host URL in the connection string is incorrect or does not match the cluster you are trying to connect to, or (2) the client machine you are issuing the connection from is in a different region or VPC than your Amazon DocumentDB cluster.
For additional troubleshooting: https://docs.aws.amazon.com/documentdb/latest/developerguide/troubleshooting.html
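Once connected, a quick round trip confirms the cluster is reachable and writable; a minimal sketch using the connection object from above (the document contents are arbitrary):
# Insert a throwaway document and read it back to verify the connection.
j$insert('{"source": "connection-test", "value": 1}')
j$find('{"source": "connection-test"}')
j$count()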

Spark JDBC connection to SQL Server times out often

I'm running Spark v2.2.1 via sparklyr v0.6.2 and pulling data from SQL Server via JDBC. I seem to be experiencing some network issue, because many times (not every time) my executor doing a write to SQL Server fails with the error:
Prelogin error: host <my server> port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:...
I am running my sparklyr session with the following configurations:
spark_conf = spark_config()
spark_conf$spark.executor.cores <- 8
spark_conf$`sparklyr.shell.driver-memory` <- "8G"
spark_conf$`sparklyr.shell.executor-memory` <- "12G"
spark_conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
spark_conf$spark.network.timeout <- 400
But interestingly the network timeout I've set above does not seem to apply based on the executor logs:
18/06/11 17:53:44 INFO BlockManager: Found block rdd_9_16 locally
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:3 ClientConnectionId: d3568a9f-049f-4772-83d4-ed65b907fc8b Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:d3568a9f-049f-4772-83d4-ed65b907fc8b
18/06/11 17:53:45 WARN SQLServerConnection: ConnectionID:2 ClientConnectionId: ecb084e6-99a8-49d1-9215-491324e8d133 Prelogin error: host nciensql14.nciwin.local port 1433 Error reading prelogin response: Connection timed out (Read failed) ClientConnectionId:ecb084e6-99a8-49d1-9215-491324e8d133
18/06/11 17:53:45 ERROR Executor: Exception in task 10.0 in stage 26.0 (TID 77)
Can someone help me understand what a prelogin error is and how to avoid this issue? Here is my write function:
function(df, tbl, db, server = NULL, user, pass, mode = "error",
         options = list(), ...) {
  sparklyr::spark_write_jdbc(
    df,
    tbl,
    options = c(
      list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;",
                        "databaseName=", db, ";",
                        "user=", user, ";",
                        "password=", pass, ";"),
           driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
      options),
    mode = mode, ...)
}
I've just updated my JDBC driver to version 6.0, but I don't think it made a difference. I hope I installed it correctly: I dropped it into my Spark/jars folder and then added it to Spark/conf/spark-defaults.conf.
EDIT
I am reading 23M rows into Spark in 24 partitions. My cluster has 4 nodes, each with 8 cores and 18G of memory. With my current configuration I have 4 executors, each with 8 cores and 12G of memory. My function to read in the data looks like this:
function(sc, tbl, db, server = NULL, user, pass, repartition = 0,
         options = list(), ...) {
  sparklyr::spark_read_jdbc(
    sc,
    tbl,
    options = c(
      list(url = paste0("jdbc:sqlserver://", server, ".nciwin.local;"),
           user = user,
           password = pass,
           databaseName = db,
           dbtable = tbl,
           driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
      options),
    repartition = repartition, ...)
}
I set repartition to 24 when running. As such, I'm not seeing how the suggested post relates to my case.
EDIT 2
I was able to fix my issue by getting rid of repartitioning. Can anyone explain why repartitioning with sparklyr is not effective in this case?
As explained in the other question, as well as some other posts (Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, Converting mysql table to spark dataset is very slow compared to same from csv file, Partitioning in spark while reading from RDBMS via JDBC, spark reading data from mysql in parallel) and off-site resources (Parallelizing Reads), by default the Spark JDBC source reads all data sequentially into a single node.
There are two ways of parallelizing reads:
Range splitting based on a numeric column, with the lowerBound, upperBound, partitionColumn and numPartitions options required, where partitionColumn is a stable numeric column (pseudocolumns might not be a good choice):
spark_read_jdbc(
  ...,
  options = list(
    ...,
    lowerBound = "0",          # Adjust to fit your data
    upperBound = "5000",       # Adjust to fit your data
    numPartitions = "42",      # Adjust to fit your data and resources
    partitionColumn = "some_numeric_column"
  )
)
predicates list - not supported in sparklyr at the moment.
Repartitioning (sparklyr::sdf_repartition) doesn't resolve the problem because it happens after the data has been loaded. Since the shuffle (required for repartition) is among the most expensive operations in Spark, it can easily crash the node.
As a result, using either the repartition parameter of spark_read_jdbc or sdf_repartition is just a cargo cult practice, and most of the time does more harm than good. If the data is small enough to be piped through a single node, then increasing the number of partitions will usually decrease performance; otherwise it will just crash.
That being said, if the data is already processed by a single node, it raises the question of whether it makes sense to use Apache Spark at all. The answer will depend on the rest of your pipeline, but considering only the component in question, it is likely to be negative.
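To tie this back to the read wrapper from the question, the range-splitting options can be passed straight through its options argument; a rough sketch, where the wrapper name (read_sql), the partitioning column and the bounds are placeholders you would adjust to your table:
# `read_sql` stands for the wrapper function defined in the question,
# and `sc` is an active spark_connection.
df <- read_sql(
  sc, tbl = "my_table", db = "my_db", server = "nciensql14",
  user = "user", pass = "pass",
  repartition = 0,  # let the JDBC source create the partitions instead
  options = list(
    partitionColumn = "id",   # a stable numeric column in the table
    lowerBound = "0",
    upperBound = "23000000",  # roughly the row count from the question
    numPartitions = "24"
  )
)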

Connection issue R elastic package for Elastic search - one way entry

I have access to a one-way export function for a public company database through Elasticsearch. I have problems connecting to it from R with the elastic package.
I have the server name (URL), username and password, but I don't have any port number. They describe it as a REST API. Do I have to use the elastic package, or is there an easier way around it? The only information I have about the database is: http://distribution.virk.dk/cvr-permanent/virksomhed/_search?
Host="Distribution.virk.dk"
index="cvr-permanent"
type="virksomhed"
The above link works with httr, but I wish to use elastic for automation purposes when making large requests for data.
So my connect call looks like:
host = "distribution.virk.dk"
port = ''
path = ''
schema = "http"
user = "user_name"
pass = "secret"
connect(es_host = host,es_user=user, transport=schema, port=port, es_pwd = pass)
Even though I set the port to blank, it defaults to 9200.
If I try to use Search:
>Search(index="cvr-permanent", type="virksomhed", q='"cvrNummer":"33647093"', size=10)
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to distribution.virk.dk port 9200: Timed out
(elastic maintainer here)
You should be able to pass httr::authenticate() to elastic::Search and other functions from the package, e.g.:
x <- Search(config = c(httr::verbose(), authenticate("foo", "bar")))
You should see the Authorization: Basic XXXXXX header in the request headers. Does that work?
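For reference, a fuller call against the endpoint from the question might look like the sketch below; the index, type and field come from the question, the credentials are placeholders, and it assumes the API is reachable over plain HTTP on port 80:
library(elastic)
library(httr)

# es_path must be empty so the host is not misread as a path
# (see the first question above).
connect(es_host = "distribution.virk.dk", es_port = 80,
        es_path = "", es_transport_schema = "http")

# Pass basic authentication through to the underlying httr request.
Search(index = "cvr-permanent", type = "virksomhed",
       q = "cvrNummer:33647093",  # standard Lucene field:value syntax
       size = 10,
       config = c(verbose(), authenticate("user_name", "secret")))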

websocket connection does not work

I seem to be struggling with websockets in R. I wanted to download streaming data from the Bitcoin exchange MtGox directly into R, but R cannot establish the connection.
The websocket specs are defined as:
Host: websocket.mtgox.com or socketio.mtgox.com
Port: 80 or 443 ( ssl )
Namespace: /mtgox (Including beginning slash)
url for more details: https://en.bitcoin.it/wiki/MtGox/API/Streaming
and my code is:
require(websockets)
con = websocket("https://socketio.mtgox.com/mtgox",port=443)
and I always end up with an error:
> con = websocket("https://socketio.mtgox.com/mtgox",port=443)
Error in websocket("https://socketio.mtgox.com/mtgox", port = 443) :
Connection error
Does anyone have an idea what is wrong?
Many thanks.
I've looked at the source code and manual here - https://github.com/rstudio/R-Websockets
The R Websocket library is out of date and not compliant with the WebSocket protocol as it stands.
So you'd need to fix the library or find an alternate one. Fixing the library isn't that hard depending on your ability. I managed to do it here -
https://github.com/zeenogee/R-Websockets
Mine is (lazily) hard-coded to MtGox - use at your own risk! You'd need to remove the current websockets library and install this one. Don't forget that your code only performs the basic connection; there are a couple more steps to see actual data:
set_callback("receive", function(DATA, WS, HEADER) cat(rawToChar(DATA)), con)
service(con)
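Putting it together with the patched fork linked above, the full flow would look roughly like this (the endpoint is the one from the question; keeping the connection serviced in a loop is an assumption about how the data would be consumed):
# Requires the patched fork of R-Websockets linked above.
require(websockets)
con <- websocket("https://socketio.mtgox.com/mtgox", port = 443)
set_callback("receive", function(DATA, WS, HEADER) cat(rawToChar(DATA)), con)
# Keep servicing the connection so incoming messages are processed.
while (TRUE) service(con)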
