Attempting to download files from SFTP using R

I'm trying to implement R in the workplace and save a bit of time from all the data churning we do.
A lot of files we receive are sent to us via SFTP as they contain sensitive information.
I've looked around on StackOverflow & Google but nothing seems to work for me. I tried using the RCurl library, following an example I found online, but it doesn't allow me to include the port (22) as part of the login details.
library(RCurl)
protocol <- "sftp"
server <- "hostname"
userpwd <- "user:password"
tsfrFilename <- "Reports/Excelfile.xlsx"
outFilename <- "~/Test.xlsx"
url <- paste0(protocol, "://", server, "/", tsfrFilename)
data <- getURL(url = url, userpwd=userpwd)
I end up getting the following error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string:
Any help would be greatly appreciated as this will save us loads of time!
Thanks,
Shan

Looks like a similar situation here: Using R to download SAS file from ftp-server
I'm no expert in R, but there it looks like getBinaryURL() worked instead of getURL() in the example given.
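As a rough sketch against the original question's variables (untested here; getBinaryURL() returns a raw vector rather than a character string, so embedded nul bytes in a binary file such as an .xlsx no longer break the string conversion):

library(RCurl)
# fetch the file as raw bytes instead of text
bin <- getBinaryURL(url = url, userpwd = userpwd)
# write the bytes straight to disk
writeBin(bin, "~/Test.xlsx")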
Hope that helps
M

Note that there are two similarly named packages, RCurl and curl. With RCurl, I successfully used key files to connect via SFTP:
opts <- list(
  ssh.public.keyfile = pubkey,      # public key file name
  ssh.private.keyfile = privatekey, # private key file name
  keypasswd = keypasswd             # optional key password
)
RCurl::getURL(url=uri, .opts = opts, curl = RCurl::getCurlHandle())
For this to work, you first need to create the key files, e.g. via PuTTY or a similar tool.
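For illustration, a minimal end-to-end sketch under the same assumptions (the key paths and the sftp URL are placeholders you would substitute):

library(RCurl)
opts <- list(
  ssh.public.keyfile = "~/.ssh/id_rsa.pub", # placeholder path
  ssh.private.keyfile = "~/.ssh/id_rsa",    # placeholder path
  keypasswd = "key-password"                # only if the key is protected
)
# getBinaryURL() avoids the embedded-nul problem for binary files
bin <- RCurl::getBinaryURL("sftp://hostname/Reports/Excelfile.xlsx", .opts = opts)
writeBin(bin, "Excelfile.xlsx")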

I too was having problems specifying the port options when using the getURI() and getURL() functions.
To specify the port, you simply pass it as a named argument, port = ####, instead of port(####). For example:
data <- getURI(url = url,
userpwd = userpwd,
port = 22)
Now, like @MarkThomas pointed out, whenever you get an encoding error, try getBinaryURL() instead of getURI(). In most cases, this will allow you to download SAS files as well as .csv files encoded in UTF-8 or LATIN1.
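Putting the two fixes together for the original question, a sketch might look like this (same placeholder host and credentials as the question; untested):

library(RCurl)
url <- "sftp://hostname/Reports/Excelfile.xlsx"
# pass the port as a named curl option rather than in the URL,
# and fetch raw bytes to sidestep the embedded-nul error
bin <- getBinaryURL(url = url, userpwd = "user:password", port = 22)
writeBin(bin, "~/Test.xlsx")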

Related

Using RCurl with ftp server with embedded nulls

I've been working on this issue for a few days now and even after contacting the site administrators, I've had no luck in solving it.
I would like to automate the download of a specific file from an ftp server without using any software besides R.
userpwd = "MyUserName:MyPassword"
url <- "ftp://arthurhou.pps.eosdis.nasa.gov/gpmdata/2014/04/01/imerg/3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
dat <- try(getURL(url, userpwd = userpwd,verbose=TRUE,ftp.use.epsv = FALSE))
When I run this, I get the error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: '‰HDF\r\n\032\n\0\0\0\0\0\b\b\0\004\0\020\0\0\0\0\0\0\0\0\0\0\0\0\0ÿÿÿÿÿÿÿÿ…' (remainder of the binary HDF5 dump truncated)
I've also tried rewriting the link with percent-encoded slashes, i.e. url <- "ftp://arthurhou.pps.eosdis.nasa.gov%2Fgpmdata%2F2014%2F04%2F01%2Fimerg%2F3B-HHR.MS.MRG.3IMERG.20140401-S213000-E215959.1290.V03D.HDF5", yet this returns the same error as before.
If anyone would like to try this for themselves, you can register an email at: http://pmm.nasa.gov/data-access/downloads/gpm, and then use the email as the username and password.
This worked for me:
library(httr)
url <- "ftp://arthurhou.pps.eosdis.nasa.gov/gpmdata/2014/04/01/imerg/3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
output_file <- "3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
my_email <- "someone@example.com"
GET(url, authenticate(my_email, my_email),
write_disk(output_file))
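An RCurl alternative along the same lines is to reuse the question's own options but swap in getBinaryURL(), so the HDF5 bytes land in a raw vector instead of a string (a sketch, untested):

library(RCurl)
dat <- getBinaryURL(url, userpwd = userpwd, ftp.use.epsv = FALSE)
writeBin(dat, output_file)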

Reading an online xlsx file into R

I am trying to download spreadsheets from AQR data library into R directly.
I have this link: http://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx which prompts a download. However, when trying the following code:
> url1<-"http://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx"
> download.file(url1,destfile="example.xlsx")
I get this error
trying URL 'http://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx'
Error in download.file(url1, destfile = "example.xlsx") : cannot open URL 'http://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx'
https://www.aqr.com/library/data-sets/value-and-momentum-everywhere-portfolios-monthly is the page from which I am trying to download the data (under the full set data link).
Could you provide some guidance?
It looks like that link redirects to https, which download.file does not support by default. If you have wget or curl installed you can use
download.file("https://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx",
"example.xlsx",
method = "wget")
or
download.file("https://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx",
"example.xlsx",
method = "curl")
These and other options are discussed at Download a file from HTTPS using download.file()
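Note also that on R 3.2.0 and later, download.file() supports https directly via the bundled libcurl method, so no external wget or curl is needed. A sketch (mode = "wb" matters for binary files like .xlsx, especially on Windows):

# R >= 3.2.0 ships a libcurl method that handles https without external tools
download.file("https://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx",
              "example.xlsx",
              method = "libcurl",
              mode = "wb")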
I'm not quite sure what is causing the problem for you, but the following worked for me:
library(XLConnect)
##
con <- "http://www.aqr.com/~/media/files/data-sets/value-and-momentum-everywhere-portfolios-monthly.xlsx"
download.file(con,"xlsxFile.xlsx",mode="wb")
##
newWB <- loadWorkbook(
file="xlsxFile.xlsx",
create=F)
##
R> getSheets(newWB)
[1] "VME Portfolios" "Definitions" "Data Sources" "Disclosures"

Connect to the Twitter Streaming API using R

I just started playing around with the Twitter Streaming API, and from the command line I can redirect the raw JSON responses to a file using the command below:
curl https://stream.twitter.com/1/statuses/sample.json -u USER:PASSWORD -o "somefile.txt"
Is it possible to stay completely within R and leverage RCurl to do the same thing? Instead of just saving the output to a file, I would like to parse each response that is returned. I have parsed twitter search results in the past, but I would like to do this as each response is received. Essentially, apply a function to each JSON response.
Thanks in advance.
EDIT: Here is the code that I have tried in R (I am on Windows, unfortunately). I need to include the reference to the .pem file to avoid the error. However, the code just "runs" and I cannot seem to see what is returned. I have tried print, cat, etc.
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
getURL("https://stream.twitter.com/1/statuses/sample.json",
userpwd="USER:PWD",
cainfo = "cacert.pem")
I was able to figure out the basics, hopefully this helps.
#==============================================================================
# Streaming twitter using RCURL
#==============================================================================
library(RCurl)
library(rjson)
# set the directory
setwd("C:\\")
#### redirects output to a file
WRITE_TO_FILE <- function(x) {
if (nchar(x) > 0) {
write.table(x, file="Twitter Stream Capture.txt", append=T, row.names=F, col.names=F)
}
}
### Windows users will need to get this certificate to authenticate
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
### write the raw JSON data from the Twitter firehose to a text file
getURL("https://stream.twitter.com/1/statuses/sample.json",
userpwd = "USER:PASSWORD",
cainfo = "cacert.pem",
write=WRITE_TO_FILE)
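To apply a function to each response as it arrives, as the question asks, the write callback can parse instead of writing to disk. A sketch, with the caveat that libcurl chunks are not guaranteed to align with JSON document boundaries, so robust code would buffer and split on newlines:

PARSE_CHUNK <- function(x) {
  if (nchar(x) > 0) {
    # assumes the chunk holds one complete JSON document -- see caveat above
    tweet <- try(rjson::fromJSON(x), silent = TRUE)
    if (!inherits(tweet, "try-error") && !is.null(tweet$text)) {
      print(tweet$text)
    }
  }
}
getURL("https://stream.twitter.com/1/statuses/sample.json",
       userpwd = "USER:PASSWORD",
       cainfo = "cacert.pem",
       write = PARSE_CHUNK)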
Try the Twitter API package for R, twitteR:
install.packages('twitteR')
library(twitteR)
I think this is what you need.

How to open Excel 2007 File from Password Protected Sharepoint 2007 site in R using RODBC or RCurl?

I am interested in opening an Excel 2007 file in R 2.11.1 using RODBC. The Excel file resides in the shared documents page of a MOSS2007 website. I currently download the .xlsx file to my hard drive and then import to R using the following code:
library(RODBC)
con<-odbcConnectExcel2007("C:/file location/file.xlsx")
data<-sqlFetch(con, "worksheet name")
close(con)
When I type in the web url for the document into the odbcConnectExcel2007 connection, an error message pops up with:
ODBC Excel Driver Login Failed: Invalid internet Address.
followed by the following message in my R console:
ERROR: Could not SQLDriverConnect
Any insights you can provide would be greatly appreciated.
Thanks!
UPDATE:
The site I am attempting to download from is password protected. I tried another approach using the getURL() function from the RCurl package:
x = getURL("http://website.com/file.xlsx", userpwd = "uname:pw")
The error that I receive is:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: 'PK\003\004\024\0\006\0\b\0\0\0!\0dA»ï\001\0\0O\n\0\0\023\0Ò\001[Content_Types].xml…' (remainder of the binary .xlsx dump truncated)
I have no idea what this means. Any help would be appreciated. Thanks!
Two solutions worked for me.
If you do not need to automate the script that pulls the data, you can map a network drive to the SharePoint folder from which you want to extract the Excel document.
If you need to automate a script to pull the Excel file every couple of minutes, I recommend sending your authentication credentials in a request that automatically saves the file to a local drive. From there you can read it into R for further data wrangling.
library("httr")
library("openxlsx")
user <- "<USERNAME>"
password <- "<PASSWORD>"
url <- "https://sharepoint.company/file_to_obtain.xlsx"
httr::GET(url,
authenticate(user, password, type="ntlm"),
write_disk("C:/tempfile.xlsx", overwrite = TRUE))
df <- openxlsx::read.xlsx("C:/tempfile.xlsx")
You can obtain the correct URL to the file by clicking on the SharePoint location and removing "?Web=1" after the file ending (xlsx, xlsb, xls, ...). USERNAME and PASSWORD are usually your Windows credentials. It helps to store them in a key manager, for example:
library("keyring")
keyring::key_set_with_value(service = "Windows", username = "Key", password = "<PASSWORD>")
and then authenticating via
authenticate(user, keyring::key_get("Windows", "Key"), type = "ntlm")
In some instances it may be sufficient to pass
authenticate(":", ":", type="ntlm")
if only your Windows credentials are required and the code is running from your machine.

RGoogleDocs (or RCurl) giving SSL certificate problem

I was using one of my favorite R packages today to read data from a Google spreadsheet. It would not work. This problem is occurring on all my machines (I use Windows) and it appears to be new. I am using version 0.4-1 of RGoogleDocs.
library(RGoogleDocs)
ps <-readline(prompt="get the password in ")
sheets.con = getGoogleDocsConnection(getGoogleAuth("fxxxh@gmail.com", ps, service = "wise"))
ts2=getWorksheets("OnCall",sheets.con)
And this is what I get after running the last line.
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
I did some reading and came across information that was interesting but, at least for me, not useful:
When I try to interact with a URL via https, I get an error of the form
Curl: SSL certificate problem, verify that the CA cert is OK
I got the big-picture message but did not know how to implement the solution in my script. I added the following line before calling getWorksheets:
x = getURLContent("https://www.google.com", ssl.verifypeer = FALSE)
That did not work so I tried
ts2=getWorksheets("OnCall",sheets.con,ssl.verifypeer = FALSE)
That also did not work.
Interestingly enough, the following line works
getDocs(sheets.con,folders = FALSE)
What do you suggest I try to get it working again? Thanks.
I no longer have this problem. I do not quite remember exactly when or how I overcame it, and cannot remember who helped me get there, but here is a typical session that works.
library(RGoogleDocs)
if(exists("ps")) print("got password, keep going") else ps <-readline(prompt="get the password in ") #conditional password asking
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
sheets.con = getGoogleDocsConnection(getGoogleAuth("fjh@gmail.com", ps, service = "wise"))
#WARNING: this would prevent curl from detecting a 'man in the middle' attack
ts2=getWorksheets("name of workbook here",sheets.con)
names(ts2)
sheet.1 <-sheetAsMatrix(ts2$"Sheet 1",header=TRUE, as.data.frame=TRUE, trim=TRUE) #Get one sheet
other <-sheetAsMatrix(ts2$"whatever name of tab",header=TRUE, as.data.frame=TRUE, trim=TRUE) #Get other sheet
Does it help you?
Maybe you don't have the certificate bundle installed. I installed them on OS X. You can also find them on the curl site.
