RCurl - How to retrieve data from sftp?

I tried to retrieve data from an SFTP server with the code below:
library(RCurl)
protocol <- "sftp"
server <- "xxxx@sftp.xxxx.com"
userpwd <- "xxx:yyy"
tsfrFilename <- "cccccc.tsv"
ouptFilename <- "out.csv"
opts <- list(
  # ssh.public.keyfile = "true", # file name
  ssh.private.keyfile = "xxxxx.ppk",
  keypasswd = "userpwd"
)
# Run #
## Download Data
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getURL(url = url, .opts = opts, userpwd=userpwd)
and I received this error message:
Error in function (type, msg, asError = TRUE) : Authentication failure
What am I doing wrong?
Thanks

With a private key you do not need a password along with your username, so your getURL statement becomes:
data <- getURL(url = url, .opts = opts, username="username")
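Putting the pieces together, a minimal sketch of a key-based SFTP download might look like the following. All of the server, path, key, and username values here are placeholders, not the asker's real ones:

```r
library(RCurl)

# Placeholder values -- substitute your own.
opts <- list(
  ssh.private.keyfile = "~/.ssh/id_rsa",  # key in OpenSSH format
  keypasswd = "key-passphrase"            # only needed if the key itself is encrypted
)

url <- "sftp://sftp.example.com/path/to/file.tsv"
data <- getURL(url = url, .opts = opts, username = "username")

# Parse the tab-separated payload once it has been fetched
df <- read.delim(textConnection(data))
```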

I had exactly the same problem and have just spent an hour trying things out. What worked for me was changing the format of the private key to OpenSSH.
To do this, I used the PuTTY key generator, puttygen. Go to the menu item "Conversions", import the original private key, and export it in OpenSSH format. I exported the converted key to the same folder as my original key under a new filename, keeping the *.ppk extension.
Then I used the following commands:
opts <- list(
  ssh.private.keyfile = "<path to my new OpenSSH Key>.ppk"
)
data <- getURL(url = URL, .opts = opts, username = username, verbose = TRUE)
This seemed to work fine.
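If you prefer the command line to the PuTTYgen GUI, the same conversion can be done in one step, assuming the puttygen command-line tool is installed (the file names here are examples):

```shell
# Convert a PuTTY .ppk key to OpenSSH private-key format
puttygen mykey.ppk -O private-openssh -o mykey_openssh
```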


Getting data from an FTP file directly into the environment or as a file in R

I am trying to get a file from an FTP server with httr and RCurl, and neither method works.
This is a real case; the user and password credentials are valid.
First HTTR
library(httr)
GET(url = "ftp://77.72.135.237/2993309138_Tigres.xls", authenticate("xxxxxxx", "xxxxxx"),
write_disk("~/Downloads/2993309138_Tigres.xls", overwrite = T))
#> Error: $ operator is invalid for atomic vectors
Second RCurl
library(RCurl)
my_data <- getURL(url = "ftp://77.72.135.237/2993309138_Tigres.xls", userpwd = "xxxxxx")
#> Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding): embedded nul in string: 'ÐÏ\021ࡱ\032á'
Is it server side bug or mine? :)
It seems to have something to do with the encoding. Try it like this (you'll have to fill in the authentication):
library(httr)
content(
  GET(url = "ftp://77.72.135.237/2993309138_Tigres.xls",
      authenticate("...", "..."),
      write_disk("~/Downloads/2993309138_Tigres.xls", overwrite = TRUE)
  ),
  type = "text/csv",
  as = "text",
  encoding = "WINDOWS-1251"
)

XLSX data upload with RestRserve

I would like to use RestRserve to accept an uploaded .xlsx file for processing. I have tried the below with a .csv successfully, but slight modifications for .xlsx with get_file were not fruitful.
ps <- r_bg(function(){
  library(RestRserve)
  library(readr)
  library(xlsx)
  app = Application$new(content_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
  app$add_post(
    path = "/echo",
    FUN = function(request, response) {
      cnt <- request$get_file("xls")
      dt <- xlsx::read.xlsx(cnt, sheetIndex = 1, header = TRUE)
      response$set_body("some function")
    }
  )
  backend = BackendRserve$new()
  backend$start(app, http_port = 65080)
})
What have you tried? According to the documentation, the request$get_file() method returns a raw vector - a binary representation of the file. I'm not aware of R packages/functions that read an xls/xlsx file directly from a raw vector (such functions may exist, I just don't know of them).
Instead, you can write the body to a file and then read it the normal way:
library(RestRserve)
library(readxl)

app = Application$new()
app$add_post(
  path = "/xls",
  FUN = function(request, response) {
    fl = tempfile(fileext = '.xlsx')
    xls = request$get_file("xls")
    # need to drop attributes, as writeBin()
    # can't write an object with attributes
    attributes(xls) = NULL
    writeBin(xls, fl)
    xls = readxl::read_excel(fl, sheet = 1)
    response$set_body("done")
  }
)
backend = BackendRserve$new()
backend$start(app, http_port = 65080)
Also note that the content_type argument controls response encoding, not request decoding.
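For completeness, the endpoint above can be exercised from a client with httr's multipart upload. The port and field name match the server sketch; the file path is a placeholder:

```r
library(httr)

# "xls" must match the field name the handler reads with request$get_file("xls")
resp <- POST(
  "http://localhost:65080/xls",
  body = list(xls = upload_file("data.xlsx")),
  encode = "multipart"
)
content(resp, as = "text")
```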

Cannot open an HTTPS connection to GitHub in R

I am trying to fetch some data in R from a GitHub repo, but the connection fails.
remote.file = function(URL) {
  temporaryFile <- tempfile()
  download.file(URL, destfile = temporaryFile, method = "curl")
  return(temporaryFile)
}
URL = "https://raw.githubusercontent.com/sidooms/MovieTweetings/master/latest/ratings.dat"
Ratings = read.table(remote.file(URL), sep = ":", header = FALSE)[, c(1, 3, 5, 7)]
I get the following error:

R - RCurl scrape data from a password-protected site

I'm trying to scrape some table data from a password-protected website (I have a valid username/password) using R and have yet to succeed.
For an example, here's the website to log in to my dentist: http://www.deltadentalins.com/uc/index.html
I have tried the following:
library(httr)
download <- "https://www.deltadentalins.com/indService/faces/Home.jspx?_afrLoop=73359272573000&_afrWindowMode=0&_adf.ctrl-state=12pikd0f19_4"
terms <- "http://www.deltadentalins.com/uc/index.html"
values <- list(
  username = "username",
  password = "password",
  TARGET = "",
  SMAUTHREASON = "",
  POSTPRESERVATIONDATA = "",
  bundle = "all",
  dups = "yes"
)
POST(terms, body = values)
GET(download, query = values)
I have also tried:
your.username <- 'username'
your.password <- 'password'
require(SAScii)
require(RCurl)
require(XML)
agent="Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)
# list parameters to pass to the website (pulled from the source html)
params <- list(
  'lt' = "",
  '_eventID' = "",
  'TARGET' = "",
  'SMAUTHREASON' = "",
  'POSTPRESERVATIONDATA' = "",
  'SMAGENTNAME' = agent,
  'username' = your.username,
  'password' = your.password
)
# log into the form
html = postForm('https://www.deltadentalins.com/siteminderagent/forms/login.fcc', .params = params, curl = curl)
html
I can't get either to work. Are there any experts out there that can help?
Updated 3/5/16 to work with the package RSelenium
#### FRONT MATTER ####
library(devtools)
library(RSelenium)
library(XML)
library(plyr)
######################
## This block will open the Firefox browser, which is linked to R
RSelenium::checkForServer()
remDr <- remoteDriver()
startServer()
remDr$open()
url="yoururl"
remDr$navigate(url)
This first section loads the required packages, sets the login URL, and then opens it in a Firefox instance. I type in my username & password, and then I'm in and can start scraping.
infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
infoTable
Table1 <- infoTable[[1]]
Apps <- Table1[, 1] # Application Numbers
For this example, the first page contained two tables. The first is the one I'm interested in; it holds a table of application numbers and names. I pull out the first column (application numbers).
Links2 <- paste("https://yourURL?ApplicantID=", Apps, sep = "")
The data I want are stored in individual applications, so this bit creates the links that I want to loop through.
### Grabs contact info table from each page
LL <- lapply(seq_along(Links2),
  function(i) {
    url <- Links2[i]
    remDr$navigate(url)
    infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
    if ("First Name" %in% colnames(infoTable[[2]])) {
      infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[2]][1, ])
    } else {
      infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[3]][1, ])
    }
    infoTable2
  }
)
results <- do.call(rbind.fill, LL)
results
write.csv(results, "C:/pathway/results2.csv")
This final section loops through the link for each application, then grabs the table with their contact information (which is either table 2 OR table 3, so R has to check first). Thanks again to Chinmay Patil for the tip on relenium!

how to download a large binary file with RCurl *after* server authentication

I originally asked this question about performing this task with the httr package, but I don't think it's possible using httr. So I've re-written my code to use RCurl instead -- but I'm still tripping up on something, probably related to the writefunction, and I really don't understand why.
You should be able to reproduce my work by using the 32-bit version of R, so you hit memory limits if you read anything into RAM. I need a solution that downloads directly to the hard disk.
To start, this code works -- the zipped file is appropriately saved to disk.
library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk
Now here's some RCurl code that does not work. As stated in the previous question, reproducing this exactly will require creating an extract on IPUMS.
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <- list(
  "login[email]" = your.email,
  "login[password]" = your.password,
  "login[is_for_login]" = 1
)
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  followlocation = TRUE,
  autoreferer = TRUE,
  ssl.verifypeer = FALSE,
  curl = curl
)
params <- list(
  "login[email]" = your.email,
  "login[password]" = your.password,
  "login[is_for_login]" = 1
)
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
Now that I'm logged in, I try the same commands as above, but with the curl object to keep the cookies.
filename <- tempfile()
f <- CFILE(filename, mode = "wb")
This line breaks:
curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)
# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) :
embedded nul in string: [[binary jibberish here]]
The answer to my previous post referred me to this C-level writefunction answer, but I'm clueless about how to re-create that curl_writer C program (on Windows?)...
dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)
...or why it's even necessary, given that the five lines of code at the top of this question work without anything as involved as getNativeSymbolInfo. I just don't understand why passing in that extra curl object, which stores the authentication cookies and tells it not to verify SSL, would cause otherwise-working code to break.
From this link create a file named curl_writer.c and save it to C:\<folder where you save your R files>
#include <stdio.h>
/**
* Original code just sent some message to stderr
*/
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
fwrite(buffer,size,nmemb,(FILE *)stream);
return size * nmemb;
}
Open a command window, go to the folder where you saved curl_writer.c, and compile it with R CMD SHLIB:
c:> cd "C:\<folder where you save your R files>"
c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
Open R and run your script
C:> R
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <- list(
  "login[email]" = your.email,
  "login[password]" = your.password,
  "login[is_for_login]" = 1
)
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  followlocation = TRUE,
  autoreferer = TRUE,
  ssl.verifypeer = FALSE,
  curl = curl
)
params <- list(
  "login[email]" = your.email,
  "login[password]" = your.password,
  "login[is_for_login]" = 1
)
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
# Load the DLL you created
# "writer" is the name of the function
# "curl_writer" is the name of the dll
dyn.load("curl_writer.dll")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
# Note that "URL" parameter is upper case, in your code it is lowercase
# I'm not sure if that has something to do
# "writer" is the symbol defined above
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL = extract.path, writedata = f@ref, writefunction = writer, curl = curl)
close(f)
This is now possible with the httr package. Thanks Hadley!
https://github.com/hadley/httr/issues/44
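Following that issue, httr's write_disk() streams a response body straight to a file instead of holding it in RAM. A sketch of the authenticated download, reusing the URLs and form fields from the question (credentials are placeholders; httr keeps the session cookies between requests to the same host):

```r
library(httr)

values <- list(
  "login[email]"        = "email@address.com",
  "login[password]"     = "password",
  "login[is_for_login]" = 1
)

# Log in first; the session cookies are retained for later requests
POST("https://usa.ipums.org/usa-action/users/validate_login", body = values)

# Stream the large extract directly to disk
GET(
  "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz",
  write_disk(tempfile(fileext = ".csv.gz")),
  progress()
)
```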
