I'm trying to query a MySQL server through an R/Tidyverse/dbplyr workflow. My MySQL access requires configuring SSH and port forwarding.
I have this code working using python (below), but I'm struggling to get started with the SSH/port forwarding equivalent in R. Any pointers to solutions or equivalent R packages appreciated. thanks.
import pymysql
import paramiko
import pandas as pd
from paramiko import SSHClient
from sshtunnel import SSHTunnelForwarder
from os.path import expanduser
pkeyfilepath = '/.ssh/id_ed25519'
home = expanduser('~')
mypkey = paramiko.Ed25519Key.from_private_key_file(home + pkeyfilepath)
sql_hostname = 'mycompany.com'
sql_username = 'me'
sql_password = '***'
sql_main_database = 'my_db'
sql_port = 3306
ssh_host = 'jumphost.mycompany.com'
ssh_user = 'me'
ssh_port = 22
with SSHTunnelForwarder(
(ssh_host, ssh_port),
ssh_username=ssh_user,
ssh_pkey=mypkey,
remote_bind_address=(sql_hostname, sql_port)) as tunnel:
conn = pymysql.connect(host='127.0.0.1', user=sql_username,
passwd=sql_password, db=sql_main_database,
port=tunnel.local_bind_port)
query = '''SELECT VERSION();'''
data = pd.read_sql_query(query, conn)
print(data)
conn.close()
There are several ways to do ssh port forwarding for R. In no particular order:
I forward it externally to R. All of my work is remote, and for one particular client I need access to various instances of SQL Server, Redis, MongoDB, remote filesystems, and a tunnel-hop to another network only accessible from the ssh bastion host. I tend to do work in more than R, so it's important to me that I generalize this. It is not for everybody or every task.
For this, I used a mismash of autossh and my ssh-agent (in KeePass/KeeAgent).
The ssh package does have a function to Create a Tunnel. The premise is that you have already created a "session" to which you can add a forwarding rule(s). When using ssh::ssh_tunnel, it is blocking, meaning you cannot use it in the same R process and continue to work. Demo:
# R session 1
sess <- ssh::ssh_connect("user#remote")
# insert passphrase
ssh::ssh_tunnel(sess, 21433, "otherremote:1433")
# / Waiting for connection on port 21433...
# R session 2
con <- DBI::dbConnect(..., port=21433)
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
This connection will stay alive so long as con is not closed and the remote end does not close it (e.g., activity timeout).
Note: I cannot get the ssh package to use my ssh-agent, so all passwords must be typed in or otherwise passed in not-ideal ways. There are many ways to not have to type it, such as using the keyring package (secure) or envvars, both of which would pass the password to ssh_connect(..., passwd=<>).
The above, but using callr so that you don't need to explicit sessions active (though you will still have another R session.
bgr <- callr::r_bg(function() {
ssh <- ssh::ssh_connect("r2#remote", passwd=keyring::key_get("r2", "remote"))
ssh::ssh_tunnel(ssh, port=21433, "otherremote:1433")
}, supervise = TRUE)
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
### when your work is done
bgr$kill()
If you do this, I strongly recommend the use of supervise=TRUE, which ensures the background R process is killed when this (primary) R session exits. This will reduce the risk of having phantom unused R sessions hanging around; in addition to just clogging up the process tree, if one of these phantom R processes is actively forwarding a port, that means nothing else can forward that port. This allows you to continue working, but you are not longer in control of the process doing the forwarding ... and subsequent attempts to tunnel will fail.
FYI, I generally prefer using keyring::key_get("r2", "remote") for password management in situations like this: (1) it prevents me from having to set that envvar each time I start R ... which will inadvertently store the plain-string password in ~/.Rhistory, if saved; (2) it prevents me from having to set that envvar in the global environment permanently, which is prone to other stupid mistakes I make; and (3) is much better protected since it is using the native credentials of your base OS. Having said that, you can replace the above use of keyring::key_get(..) with Sys.getenv("mypass") in a pinch, or in a case where the code is running on a headless system where a credentials manager is unavailable.
And if you want this to be a little more resilient to timeout disconnects, you can instead use
bgr <- callr::r_bg(function() {
ssh <- ssh::ssh_connect("r2#remote", passwd=keyring::key_get("r2", "remote"))
while (!inherits(try(ssh::ssh_tunnel(ssh, port=21433, "otherremote:1433"), silent=TRUE), "try-error")) Sys.sleep(1)
}, supervise = TRUE)
which will repeatedly make the tunnel so long as the attempt does not error. You may need to experiment with this to get it "perfect".
callr is really just using processx under the hood to start a background R process and allow you to continue working. If you don't want the "weight" of another R process solely to forward ports, you can use processx to start an explicit call to ssh that does everything you need it to do.
proc <- processx::process$new("ssh", c("-L", "21433:otherremote:1433", "r2#remote", "-N"))
### prompts for password
DBI::dbGetQuery(con, "select 1 as n")
# n
# 1 1
### when done
proc$kill()
# [1] TRUE
Related
I am running SQL-code on a oracle database. Some commands require to run them via sqlplus. Is there a way to avoid my commandline solution but directly running sqlplus via, e.g. dbSendStatement().
Pseudo code to not share any sensible information
# Via dbSendStatement ------------------------------------------------------------------------------
con <- odbc::dbConnect(odbc::odbc(),
Driver = "oracle",
Host = "HOST",
Port = "PORT",
SVC = "SVC",
UID = Sys.getenv("USRDWH"),
PWD = Sys.getenv("PWDDWH"),
ssl = "true",
timeout = 10)
# Error
odbc::dbSendStatement(con, "EXEC SQL CODE")
# actual error message:
#> Error in new_result(connection#ptr, statement, immediate) :
#> nanodbc/nanodbc.cpp:1594: 00000: [RStudio][OracleOCI] (3000) Oracle Caller Interface: ORA-00900: invalid SQL statement
# Via system command -------------------------------------------------------------------------------
cmd <- paste0("sqlplus ",
Sys.getenv("USRDWH"), "/", Sys.getenv("PWDDWH"),
"#", "HOST", ":", "PORT", "/", "SVC", " ",
"#", "EXEC script.sql")
cmd
#> [1] "sqlplus USR/PWD#HOST:PORT/SVC #EXEC script.sql"
# Works
system(cmd,
intern = TRUE)
Code like that always connects directly to the database. sqlplus is a specific client tool; it doesn't have its own API for those kind of interactions. In other words, you always connect to the database; you can't connect to sqlplus as it is not a service.
Your best option would be to convert your SQL in such a way that you can run it natively in your code using a direct database connection (i.e. don't use sqlplus). If your SQL commands cannot be adapted, then you will need to write a shell interaction to manipulate sqlplus as you did with cmd in your example.
That said, this implementation in your example is very insecure, as it will allow anyone with access to your host to see the database username, password, and connection info associated with the process while it is running. There are much more secure ways of scripting this, including the use of an auto-open Oracle Wallet to hold the credentials so you don't have to embed them in your code (which is always a bad idea, too).
Using Oracle Wallet, your cmd call would then look more like this:
sqlplus /#TNS_ALIAS #EXEC script.sql
This is still not perfect, but is a step or two in the right direction.
How do I ssh ipv6 address in R ssh package
library(ssh)
# works
session <- ssh_connect("user#10.1.1.0")
# gives error
session <- ssh_connect("user#24:022f:0313:112:0::2")
Error in parse_host(host, default_port = 22) :
host string contains multiple ':' characters
Since you can't install source packages, one super hacky way to do this is to call the C function ssh_connect() calls directly:
.Call(
ssh:::C_start_session, "2405:0200:0313:112:41::42", 22, "user", NULL, ssh:::askpass, FALSE
)
That C interface is highly unlikely to change so it should be a pretty safe hack until the package eventually supports IPv6.
For those that stumble on this before the rOpenSci folks make any changes, the fork : https://github.com/hrbrmstr/ssh : also adds support for using a local SSH config file. Which means you can add a Host entry for IPv6 addresses (along with any other config options) and they'll be looked up.
i.e. if one has:
Host awickedcoolhost
User boringusername
Hostname ::1
IdentityFile ~/.ssh/id_rsa
Port 22222
in ~/.ssh/config, one can (with the fork) do:
ssh_connect("awickedcoolhost", config="~/.ssh/config")
and all the overrides in that entry should work.
While I currently have a workaround method, I feel like there has got to be a better way to do this, something like SSHTunnelForwarder in Python.
(found here) https://sshtunnel.readthedocs.io/en/latest/
My current workaround:
1) Write a /.ssh/config file that specifies local port forwarding options.
2) In a terminal window, execute ssh -N location_of_db
3) In R, run the following:
library(RPostgres)
drv <- RPostgres::Postgres()
dbName <- "my_database"
host <- "127.0.0.1"
port <- '5439'
user <- "username"
db <- dbConnect(drv, dbname=dbName, host=host, port=port, user=user,
password=readLines("/Users/me/keys/db_password.txt"))
Is there an all R way to do this?
Note: In lieu of 1) and 2) you can just connect and set up local port forwarding with -L option on the ssh command (now in the comments), but this requires that you locate your ssh-agent for the db and provide other security info, as set up by the sys admin.
I am searching for a robust solution to perform extensive computations on a remote server, dedicated to computational tasks. The server is on Windows 2008 R2 and has R x64 3.4.1 installed on it. I've searched for free solutions and am now focusing on the Rserver/RSclient packages solutions.
However, I can't connect any client (using RSclient) to the instanced server.
This is how I'm proceeding at the moment from the server side:
library(Rserve)
run.Rserve(config.file = "Rserv.conf")
using the following Rserv.conf file:
port 6311
remote enable
plaintext enable
control enable
r-control enable
The server is now intanciated using the Rsession (It's a bit ugly, but will change that latter on):
running Rserve in this R session (pid=...), 1 server(s)
Now, i'm trying to connect using a remote computer (Client-side) using:
library(RSclient)
c = RS.connect(host = "...")
The connection then seems to succeed, checking for c:
> c
Rserve QAP1 connection 0x000000000fbe9f50 (socket 764, queue length 0)
The error occurs when i try to eval anything, for example:
> RS.server.eval(c,"0<1")
Error in RS.server.eval(c, "0<1") : command failed with status code 0x4e: no control line present (control commands disabled or server shutdown)
I've read the available guides but still failed in connecting. What is wrong? It seems to be related to control lines but I authorized them in the config file.
for me the problem was solved by initiating the Rserve instance with the command:
R CMD Rserve --RS-port 9000 --RS-enable-remote --RS-enable-control
instead of starting it in the R environment (library(Rserve), run.Rserve(config.file = "Rserv.conf")). You may try this on Windows as well.
Refer https://github.com/s-u/Rserve/wiki/rserve.conf.
port 6311
remote enable -> it should be remote true
plaintext enable
control enable
r-control enable
Likewise refer the link and try with actual values
I’ve successfully used snowfall to setup a cluster on a single server with 16 processors.
require(snowfall)
if (sfIsRunning() == TRUE) sfStop()
number.of.cpus <- 15
sfInit(parallel = TRUE, cpus = number.of.cpus)
stopifnot( sfCpus() == number.of.cpus )
stopifnot( sfParallel() == TRUE )
# Print the hostname for each cluster member
sayhello <- function()
{
info <- Sys.info()[c("nodename", "machine")]
paste("Hello from", info[1], "with CPU type", info[2])
}
names <- sfClusterCall(sayhello)
print(unlist(names))
Now, I am looking for complete instructions on how to move to a distributed model. I have 4 different Windows machines with a total of 16 cores that I would like to use for a 16 node cluster. So far, I understand that I could manually setup a SOCK connection or leverage MPI. While it appears possible, I haven’t found clear and complete directions as to how.
The SOCK route appears to depend on code in a snowlib script. I can generate a stub from the master side with the following code:
winOptions <-
list(host="172.01.01.03",
rscript="C:/Program Files/R/R-2.7.1/bin/Rscript.exe",
snowlib="C:/Rlibs")
cl <- makeCluster(c(rep(list(winOptions), 2)), type = "SOCK", manual = T)
It yields the following:
Manually start worker on 172.01.01.03 with
"C:/Program Files/R/R-2.7.1/bin/Rscript.exe"
C:/Rlibs/snow/RSOCKnode.R
MASTER=Worker02 PORT=11204 OUT=/dev/null SNOWLIB=C:/Rlibs
It feels like a reasonable start. I found code for RSOCKnode.R on GitHub under the snow package:
local({
master <- "localhost"
port <- ""
snowlib <- Sys.getenv("R_SNOW_LIB")
outfile <- Sys.getenv("R_SNOW_OUTFILE") ##**** defaults to ""; document
args <- commandArgs()
pos <- match("--args", args)
args <- args[-(1 : pos)]
for (a in args) {
pos <- regexpr("=", a)
name <- substr(a, 1, pos - 1)
value <- substr(a,pos + 1, nchar(a))
switch(name,
MASTER = master <- value,
PORT = port <- value,
SNOWLIB = snowlib <- value,
OUT = outfile <- value)
}
if (! (snowlib %in% .libPaths()))
.libPaths(c(snowlib, .libPaths()))
library(methods) ## because Rscript as of R 2.7.0 doesn't load methods
library(snow)
if (port == "") port <- getClusterOption("port")
sinkWorkerOutput(outfile)
cat("starting worker for", paste(master, port, sep = ":"), "\n")
slaveLoop(makeSOCKmaster(master, port))
})
It’s not clear how to actually start a SOCK listener on the workers, unless it is buried in snow::recvData.
Looking into the MPI route, as far as I can tell, Microsoft MPI version 7 is a starting point. However, I could not find a Windows alternative for sfCluster. I was able to start the MPI service, but it does not appear to listen on port 22 and no amount of bashing against it with snowfall::makeCluster has yielded a result. I’ve disabled the firewall and tried testing with makeCluster and directly connecting to the worker from the master with PuTTY.
Is there a comprehensive, step-by-step guide to setting up a snowfall cluster on Windows workers that I’ve missed? I am fond of snowfall::sfClusterApplyLB and would like to continue using that, but if there is an easier solution, I’d be willing to change course. Looking into Rmpi and parallel, I found alternative solutions for the master side of the work, but still little to no specific detail on how to setup workers running Windows.
Due to the nature of the work environment, neither moving to AWS, nor Linux is an option.
Related questions without definitive answers for Windows worker nodes:
How to set up cluster slave nodes (on Windows)
Parallel R on a Windows cluster
Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?
There were several options for HPC infrastructure considered: MPICH, Open MPI, and MS MPI. Initially tried to use MPICH2 but gave up as the latest stable release 1.4.1 for Windows dated back by 2013 and no support since those times. Open MPI is not supported by Windows. Then only the MS MPI option is left.
Unfortunately snowfall does not support MS MPI so I decided to go with pbdMPI package, which supports MS MPI by default. pbdMPI implements the SPMD paradigm in contrast withRmpi, which uses manager/worker parallelism.
MS MPI installation, configuration, and execution
Install MS MPI v.10.1.2 on all machines in the to-be Windows HPC cluster.
Create a directory accessible to all nodes, where R-scripts / resources will reside, for example, \HeadMachine\SharedDir.
Check if MS MPI Launch Service (MsMpiLaunchSvc) running on all nodes.
Check, that MS MPI has the rights to run R application on all the nodes on behalf of the same user, i.e. SharedUser. The user name and the password must be the same for all machines.
Check, that R should be launched on behalf of the SharedUser user.
Finally, execute mpiexec with the following options mentioned in Steps 7-10:
mpiexec.exe -n %1 -machinefile "C:\MachineFileDir\hosts.txt" -pwd
SharedUserPassword –wdir "\HeadMachine\SharedDir" Rscript hello.R
where
-wdir is a network path to the directory with shared resources.
–pwd is a password by SharedUser user, for example, SharedUserPassword.
–machinefile is a path to hosts.txt text file, for example С:\MachineFileDir\hosts.txt. hosts.txt file must be readable from the head node at the specified path and it contains a list of IP addresses of the nodes on which the R script is to be run.
As a result of Step 7 MPI will log in as SharedUser with the password SharedUserPassword and execute copies of the R processes on each computer listed in the hosts.txt file.
Details
hello.R:
library(pbdMPI, quiet = TRUE)
init()
cat("Hello World from
process",comm.rank(),"of",comm.size(),"!\n")
finalize()
hosts.txt
The hosts.txt - MPI Machines File - is a text file, the lines of which contain the network names of the computers on which R scripts will be launched. In each line, after the computer name is separated by a space (for MS MPI), the number of MPI processes to be launched. Usually, it equals the number of processors in each node.
Sample of hosts.txt with three nodes having 2 processors each:
192.168.0.1 2
192.168.0.2 2
192.168.0.3 2