Using RCurl with an FTP server with embedded nulls - r

I've been working on this issue for a few days now and even after contacting the site administrators, I've had no luck in solving it.
I would like to automate the download of a specific file from an ftp server without using any software besides R.
userpwd = "MyUserName:MyPassword"
url <- "ftp://arthurhou.pps.eosdis.nasa.gov/gpmdata/2014/04/01/imerg/3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
dat <- try(getURL(url, userpwd = userpwd,verbose=TRUE,ftp.use.epsv = FALSE))
When I run this, I get the error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: '‰HDF\r\n\032\n\0\0\0\0\0\b\b\0\004\0\020\0\0\0\0\0\0\0\0\0\0\0\0\0ÿÿÿÿÿÿÿÿ…' [raw binary HDF5 header, truncated]
I've tried percent-encoding the slashes in the link, i.e. url <- "ftp://arthurhou.pps.eosdis.nasa.gov%2Fgpmdata%2F2014%2F04%2F01%2Fimerg%2F3B-HHR.MS.MRG.3IMERG.20140401-S213000-E215959.1290.V03D.HDF5", yet this returns the same error as before.
If anyone would like to try this for themselves, you can register an email at: http://pmm.nasa.gov/data-access/downloads/gpm, and then use the email as the username and password.

This worked for me:
library(httr)
url <- "ftp://arthurhou.pps.eosdis.nasa.gov/gpmdata/2014/04/01/imerg/3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
output_file <- "3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
my_email <- "someone#example.com"
GET(url, authenticate(my_email, my_email),
write_disk(output_file))
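If you want to stay with RCurl, note what the error is telling you: getURL() tries to return the binary HDF5 payload as a character string, and the embedded nulls in that string are what R rejects. A minimal sketch (same credentials convention as in the question) that fetches raw bytes instead:
library(RCurl)
userpwd <- "MyUserName:MyPassword"
url <- "ftp://arthurhou.pps.eosdis.nasa.gov/gpmdata/2014/04/01/imerg/3B-HHR.MS.MRG.3IMERG.20140401-S150000-E152959.0900.V03D.HDF5"
# getBinaryURL() returns a raw vector, so no string conversion occurs
bin <- getBinaryURL(url, userpwd = userpwd, ftp.use.epsv = FALSE)
writeBin(bin, con = basename(url))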

Related

R httr GET request - connection time-out

I am trying to programmatically download files like the one below from an FTP server.
The home page openly provides a username ("fire") and password ("burnt"), and I can download the files without problems from a browser.
When I try to do the same in R using httr::GET()
library("httr")
GET(url = "ftp://fuoco.geog.umd.edu/gfed4/monthly/GFED4.0_MQ_200301_BA.hdf",
authenticate(user = "fire", password = "burnt"),
write_disk(file.path(tempdir(), "GFED4.0_MQ_200301_BA.hdf"),
overwrite = TRUE))
I get the following error
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Timeout was reached: Connection time-out
I would greatly appreciate any idea to fix this problem, many thanks!
The problem seems to be that FTP isn't supported by library(httr); see the discussion in the httr issue tracker.
I'd give library(RCurl) a go instead:
library(RCurl)
url <- "ftp://fuoco.geog.umd.edu/gfed4/monthly/GFED4.0_MQ_200301_BA.hdf"
content <- getBinaryURL(url, userpwd = "fire:burnt", ftp.use.epsv = FALSE)
writeBin(content, con = basename(url))
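If you prefer the newer curl package, which does still speak FTP, an equivalent sketch:
library(curl)
h <- new_handle()
handle_setopt(h, userpwd = "fire:burnt", ftp_use_epsv = FALSE)
curl_download("ftp://fuoco.geog.umd.edu/gfed4/monthly/GFED4.0_MQ_200301_BA.hdf",
              destfile = "GFED4.0_MQ_200301_BA.hdf", handle = h)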

Attempting to download files from SFTP using R

I'm trying to implement R in the workplace and save a bit of time from all the data churning we do.
A lot of files we receive are sent to us via SFTP as they contain sensitive information.
I've looked around on StackOverflow & Google but nothing seems to work for me. I tried using the RCurl library from an example I found online, but it doesn't allow me to include the port (22) as part of the login details.
library(RCurl)
protocol <- "sftp"
server <- "hostname"
userpwd <- "user:password"
tsfrFilename <- "/Reports/Excelfile.xlsx"
outFilename <- "~/Test.xlsx"
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getURL(url = url, userpwd = userpwd)
I end up getting the error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string:
Any help would be greatly appreciated as this will save us loads of time!
Thanks,
Shan
Looks like a similar situation here: Using R to download SAS file from ftp-server
I'm no expert in R, but there it looks like getBinaryURL() worked instead of getURL() in the example given.
Hope that helps
M
Note that there are two similarly named packages, RCurl and curl. For RCurl, I successfully used key files to connect via SFTP:
opts <- list(
  ssh.public.keyfile = pubkey,      # file name
  ssh.private.keyfile = privatekey, # file name
  keypasswd = keypasswd             # optional password
)
RCurl::getURL(url = uri, .opts = opts, curl = RCurl::getCurlHandle())
For this to work, you need to create the key files first, e.g. via PuTTY or similar.
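A sketch of the setup those options assume; the key paths and URI below are placeholders, not values from the question:
library(RCurl)
pubkey <- "~/.ssh/id_rsa.pub"  # public key file (placeholder)
privatekey <- "~/.ssh/id_rsa"  # private key file (placeholder)
uri <- "sftp://hostname:22/Reports/Excelfile.xlsx"
opts <- list(ssh.public.keyfile = pubkey, ssh.private.keyfile = privatekey)
dat <- RCurl::getURL(url = uri, .opts = opts, curl = RCurl::getCurlHandle())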
I too was having problems specifying the port options when using the getURI() and getURL() functions.
In order to specify the port, you simply add the port as port = #### instead of port(####). For example:
data <- getURI(url = url,
               userpwd = userpwd,
               port = 22)
Now, as @MarkThomas pointed out, whenever you get an encoding error, try getBinaryURL() instead of getURI(). In most cases this will let you download SAS files as well as .csv files encoded in UTF-8 or LATIN1!
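Putting both points together for the original question, a sketch (host, path, and credentials are the asker's placeholders; this assumes your libcurl build includes SFTP support):
library(RCurl)
url <- "sftp://hostname/Reports/Excelfile.xlsx"
# getBinaryURL() returns raw bytes, avoiding the embedded-nul error for the binary .xlsx
content <- getBinaryURL(url, userpwd = "user:password", port = 22)
writeBin(content, con = "~/Test.xlsx")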

Downloading zip files from FTP server in R: Connection refused, port 80

I'm trying to download some zip files from a larger directory on an FTP server. Currently I have code to load the directory listing, search for zip files, and then download all files with a .zip extension.
url <- "ftp://ftp.zakupki.gov.ru/fcs_regions/Adygeja_Resp/protocols/"
userpw <- "free:free"
protocol <- getURL(url, userpwd=userpw, ftp.use.epsv=TRUE, dirlistonly=TRUE)
filenames <- protocol <- strsplit(protocol, "\r*\n")[[1]]
write.table(filenames, "names.txt", sep="\t")
zips <- sapply(filenames,function(x) substr(x,nchar(x)-2,nchar(x)))== "zip"
downloads <- filenames[zips]
con <- getCurlHandle(ftp.use.epsv = TRUE, userpwd=userpw)
mapply(function(x,y) writeBin(getBinaryURL(x, curl = con, dirlistonly = FALSE), y), x = downloads, y = paste("C://temp//",downloads, sep = ""))
Last night I ran the code and was able to download the files with no problems; however, when I tried running it again today I received the following error:
Error in function (type, msg, asError = TRUE) :
Failed to connect to protocol_Adygeja_Resp_2014030100_2014040100_20140710102838_001.xml.zip port 80: Connection refused
I've tried turning off the internet2 setting in R, as well as changing the ftp.use.epsv setting. I'm quite certain the code listed above ran fine the first time, however, and none of the setting changes I've tried have helped.
Thanks
Your code worked for me, but you might want to give it a try with the more modern curl package:
library(curl)
# Get dir listing ---------------------------------------------------------
list_h <- new_handle()
handle_setopt(list_h, userpwd = userpw, ftp_use_epsv = TRUE, dirlistonly = TRUE)
con <- curl(url, "r", handle = list_h)
protocol <- readLines(con)
close(con)
# Save off a list of the filenames ----------------------------------------
writeLines(protocol, con = "names.txt")
# Filter out only .zip files ----------------------------------------------
just_zips <- grep("\\.zip$", protocol, value = TRUE)
# Download the files ------------------------------------------------------
dl_h <- new_handle()
handle_setopt(dl_h, userpwd = userpw, ftp_use_epsv = TRUE)
for (i in seq_along(just_zips)) {
  curl_fetch_disk(url = sprintf("%s%s", url, just_zips[i]),
                  path = sprintf("/tmp/%s", just_zips[i]),
                  handle = dl_h)
}
You'll need to change /tmp but this worked fine on my Mac. I don't have a handy-enough Windows system to try it on there.

How to open Excel 2007 File from Password Protected Sharepoint 2007 site in R using RODBC or RCurl?

I am interested in opening an Excel 2007 file in R 2.11.1 using RODBC. The Excel file resides in the shared documents page of a MOSS2007 website. I currently download the .xlsx file to my hard drive and then import to R using the following code:
library(RODBC)
con <- odbcConnectExcel2007("C:/file location/file.xlsx")
data <- sqlFetch(con, "worksheet name")
close(con)
When I type in the web url for the document into the odbcConnectExcel2007 connection, an error message pops up with:
ODBC Excel Driver Login Failed: Invalid internet Address.
followed by the following message in my R console:
ERROR: Could not SQLDriverConnect
Any insights you can provide would be greatly appreciated.
Thanks!
UPDATE:
The site I am attempting to download from is password protected. I tried another method using the function getURL() from the package RCurl:
x = getURL("http://website.com/file.xlsx", userpwd = "uname:pw")
The error that I receive is:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: 'PK\003\004\024\0\006\0\b\0\0\0!\0dA»ï\001\0\0O\n\0\0\023\0Ò\001[Content_Types].xml…' [raw binary .xlsx (zip) content, truncated]
I have no idea what this means. Any help would be appreciated. Thanks!
Two solutions worked for me.
If you do not need to automate the script that pulls the data, you can map a network drive pointing to the SharePoint folder from which you want to extract the Excel document.
If you need to automate a script to pull the Excel file every couple of minutes, I recommend sending your authentication credentials in a request that automatically saves the file to a local drive. From there you can read it into R for further data wrangling.
library("httr")
library("openxlsx")
user <- "<USERNAME>"
password <- "<PASSWORD>"
url <- "https://sharepoint.company/file_to_obtain.xlsx"
httr::GET(url,
          authenticate(user, password, type = "ntlm"),
          write_disk("C:/tempfile.xlsx", overwrite = TRUE))
df <- openxlsx::read.xlsx("C:/tempfile.xlsx")
You can obtain the correct URL to the file by clicking on the SharePoint location and removing "?Web=1" after the file ending (xlsx, xlsb, xls, ...). USERNAME and PASSWORD are usually Windows credentials. It helps to store them in a key manager, such as:
library("keyring")
keyring::key_set_with_value(service = "Windows", username = "Key", password = "<PASSWORD>")
and then authenticate via
authenticate(user, keyring::key_get("Windows", "Key"), type = "ntlm")
In some instances it may be sufficient to pass
authenticate(":", ":", type = "ntlm")
if only your Windows credentials are required and the code is running from your machine.
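For example, a sketch of that no-credential call, reusing the URL and output path from the snippet above:
httr::GET(url,
          authenticate(":", ":", type = "ntlm"),
          write_disk("C:/tempfile.xlsx", overwrite = TRUE))
df <- openxlsx::read.xlsx("C:/tempfile.xlsx")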

RGoogleDocs (or RCurl) giving SSL certificate problem

I was using one of my favorite R packages today to read data from a Google spreadsheet. It would not work. This problem occurs on all my machines (I use Windows) and appears to be new. I am using version 0.4-1 of RGoogleDocs.
library(RGoogleDocs)
ps <- readline(prompt = "get the password in ")
sheets.con <- getGoogleDocsConnection(getGoogleAuth("fxxxh@gmail.com", ps, service = "wise"))
ts2 <- getWorksheets("OnCall", sheets.con)
And this is what I get after running the last line.
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
I did some reading and came across some interesting, but not useful to me at least, information.
When I try to interact with a URL via https, I get an error of the form
Curl: SSL certificate problem, verify that the CA cert is OK
I got the big-picture message but did not know how to implement the solution in my script. I added the following line before the getWorksheets call.
x = getURLContent("https://www.google.com", ssl.verifypeer = FALSE)
That did not work so I tried
ts2=getWorksheets("OnCall",sheets.con,ssl.verifypeer = FALSE)
That also did not work.
Interestingly enough, the following line works
getDocs(sheets.con,folders = FALSE)
What do you suggest I try to get it working again? Thanks.
I no longer have this problem. I do not quite remember exactly when or how I overcame it, or who helped me get there, but here is a typical session which works.
library(RGoogleDocs)
if(exists("ps")) print("got password, keep going") else ps <-readline(prompt="get the password in ") #conditional password asking
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
sheets.con = getGoogleDocsConnection(getGoogleAuth("fjh#gmail.com", ps, service ="wise"))
#WARNING: this would prevent curl from detecting a 'man in the middle' attack
ts2=getWorksheets("name of workbook here",sheets.con)
names(ts2)
sheet.1 <-sheetAsMatrix(ts2$"Sheet 1",header=TRUE, as.data.frame=TRUE, trim=TRUE) #Get one sheet
other <-sheetAsMatrix(ts2$"whatever name of tab",header=TRUE, as.data.frame=TRUE, trim=TRUE) #Get other sheet
Does it help you?
Maybe you don't have the certificate bundle installed. I installed those on OS X. You can also find them on the curl site.
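If the bundle is present but curl can't find it, you can point RCurl at a CA bundle file explicitly. A minimal sketch, using the copy of cacert.pem that ships inside RCurl itself (cainfo is the libcurl option naming a CA bundle file):
library(RCurl)
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
x <- getURLContent("https://www.google.com", cainfo = cafile)
This keeps certificate verification on, unlike the ssl.verifypeer = FALSE workaround above.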
