R downloading file from S3

I need to download a file from an S3 bucket hosted by my company, so far I am able to retrieve data using the paws package:
library("aws.s3")
library("paws")
paws::s3(config = list(endpoint = "myendpoint"))
mycsv_raw <- s3$get_object(Bucket = "mybucket", key="myfile.csv")
mycsv <- rawToChar(mycsv_raw$Body)
write.csv(mycsv)
However, this is not good because I need to manually convert the raw file to a CSV - and that might be more difficult for other types of files. Is there not a way to just download the file locally, directly as a CSV?
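A minimal sketch of one way around the manual conversion, assuming the paws client created above: get_object() returns the body as a raw vector, so base R's writeBin() can write those bytes straight to a local file, whatever the file type:
# write the raw bytes returned by get_object() directly to disk
obj <- s3$get_object(Bucket = "mybucket", Key = "myfile.csv")
writeBin(obj$Body, "myfile.csv")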
When I try using aws.s3 I get an error in curl::curl_fetch_memory(url, handle = handle) : could not resolve host xxxx. Do you have any idea how to make that work? I am of course in a locked-down corporate environment... But I am using the same endpoint in both cases, so why does it work with one and not the other?
Sys.setenv(
  AWS_S3_ENDPOINT = "https://xxxx"
)
test <- get_object(object = "myfile.csv", bucket = "mybucket",
                   file = "mydownloadedfile.csv")
Error in curl::curl_fetch_memory(url, handle = handle) : could not resolve host xxxx
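One possible explanation for the difference, offered as a guess rather than a confirmed fix: aws.s3 expects AWS_S3_ENDPOINT to be a bare hostname (whether HTTPS is used is controlled by its separate use_https argument), so keeping the "https://" prefix in the environment variable can leave curl trying to resolve a host name that doesn't exist. Saving straight to disk is also done with save_object() rather than get_object(). A hedged sketch, with the endpoint and region values as placeholders to adjust:
Sys.setenv(AWS_S3_ENDPOINT = "xxxx")  # hostname only, no scheme
aws.s3::save_object(object = "myfile.csv", bucket = "mybucket",
                    file = "mydownloadedfile.csv",
                    region = "")      # empty region is a guess for a non-AWS endpoint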

Related

googleCloudStorageR, gcs_save() works but gcs_load() does not

We have already authenticated and now have the following code snippet:
googleCloudStorageR::gcs_save(
  iris,
  file = 'bucket-folder/iris.rda',
  bucket = 'our-gcs-bucket'
)
googleCloudStorageR::gcs_load(
  file = 'bucket-folder/iris.rda',
  bucket = 'our-gcs-bucket'
)
Here, gcs_save() works fine to save the RDA in GCS, but gcs_load() does not work. We receive the error:
Error in curl::curl_fetch_disk(url, x$path, handle = handle): Failed to open file C:\Users\myname\path-to-file\iris.rda.
Request failed [ERROR]. Retrying in 1 seconds...
Error in curl::curl_fetch_disk(url, x$path, handle = handle): Failed to open file C:\Users\myname\path-to-file\iris.rda.
Request failed [ERROR]. Retrying in 1.3 seconds...
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failed to open file C:\Users\myname\path-to-file\iris.rda.
Error: Request failed before finding status code: Failed to open file C:\Users\myname\path-to-file\iris.rda.
I am confused as to why gcs_load() appears to be attempting to open the file locally, rather than grabbing it from our GCS bucket. Are we using the gcs_load() function wrong here? How can we retrieve our saved RDA from GCS?
Edit: also perhaps helpful and worth noting: iris.rda saved in GCS is listed as having Type text/plain in the GCS UI. Shouldn't the type here be something like RDA or RData, rather than text/plain?
Sorry, missed this at the time. It's due to saving the object with a folder name in the file path, which GCS just regards as a "/" in the object name; when you download it again, it tries to write into that folder structure, but the folder does not exist locally.
To fix it, you could create the folder locally with dir.create():
googleCloudStorageR::gcs_save(
  iris,
  file = 'bucket-folder/iris.rda',
  bucket = 'our-gcs-bucket'
)
dir.create("bucket-folder")
googleCloudStorageR::gcs_load(
  file = 'bucket-folder/iris.rda',
  bucket = 'our-gcs-bucket'
)
✓ Saved bucket-folder/iris.rda to bucket-folder/iris.rda ( 1.1 Kb )
[1] TRUE
Or you could point the download at an existing location (e.g. the working directory, without the folder prefix) via the saveToDisk argument:
googleCloudStorageR::gcs_load(
  file = 'bucket-folder/iris.rda',
  saveToDisk = "iris.rda",
  bucket = 'our-gcs-bucket'
)
✓ Saved bucket-folder/iris.rda to iris.rda ( 1.1 Kb )
[1] TRUE

Read a .RDS file from remote host using Shiny

I'm hosting my first shiny app from www.shinyapps.io. My R script uses a GLM I created locally and stored as a .RDS file.
How can I read this file into my application directly using a free file host such as dropbox or google drive? (or another better alternative?)
test <- readRDS(gzcon(url("https://www.dropbox.com/s/p3bk57sqvlra1ze/strModel.RDS?dl=0")))
However, I get the error:
Error in readRDS(gzcon(url("https://www.dropbox.com/s/p3bk57sqvlra1ze/strModel.RDS?dl=0"))) :
unknown input format
I assume this is because the URL doesn't lead directly to the file, but rather to a Dropbox landing page?
That being said, I can't seem to find any free file hosting sites that have that functionality.
As always, I'm sure the solution is very obvious, any help is appreciated.
I figured it out. I hosted the file in a GitHub repository, copied the link to the raw file, and placed that link inside the readRDS(gzcon(url())) wrappers.
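For reference, a sketch of that pattern with a hypothetical repository path; a Dropbox share link can also be made a direct download by changing the trailing ?dl=0 to ?dl=1:
# hypothetical raw-file URL; substitute your own user, repo and branch
model <- readRDS(gzcon(url(
  "https://raw.githubusercontent.com/<user>/<repo>/master/strModel.RDS"
)))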
Reading remote files directly with readRDS() can be disappointing. You might want to try this wrapper, which saves the data set to a temporary location before reading it locally:
readRDS_remote <- function(file, quiet = TRUE) {
  if (grepl("^http", file, ignore.case = TRUE)) {
    # temp location
    file_local <- file.path(tempdir(), basename(file))
    # download the data set
    download.file(file, file_local, quiet = quiet, mode = "wb")
    file <- file_local
  }
  readRDS(file)
}
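A possible usage example, again with a hypothetical raw-file URL:
model <- readRDS_remote("https://raw.githubusercontent.com/<user>/<repo>/master/strModel.RDS")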

Attempting to download files from SFTP using R

I'm trying to implement R in the workplace and save a bit of time from all the data churning we do.
A lot of files we receive are sent to us via SFTP as they contain sensitive information.
I've looked around on StackOverflow & Google but nothing seems to work for me. I tried using the RCurl library from an example I found online, but it doesn't allow me to include the port (22) as part of the login details.
library(RCurl)

protocol <- "sftp"
server <- "hostname"
userpwd <- "user:password"
tsfrFilename <- "/Reports/Excelfile.xlsx"   # leading slash so the URL is well formed
outFilename <- "~/Test.xlsx"

url <- paste0(protocol, "://", server, tsfrFilename)
data <- getURL(url = url, userpwd = userpwd)
I end up getting the error code
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string:
Any help would be greatly appreciated as this will save us loads of time!
Thanks,
Shan
Looks like a similar situation here: Using R to download SAS file from ftp-server
I'm no expert in R, but there it looks like getBinaryURL() worked instead of getURL() in the example given.
Hope that helps
M
Note that there are two packages, RCurl and rcurl. For RCurl, I successfully used key files to connect via SFTP:
opts <- list(
  ssh.public.keyfile = pubkey,       # file name
  ssh.private.keyfile = privatekey,  # file name
  keypasswd = keypasswd              # optional passphrase
)
RCurl::getURL(url = uri, .opts = opts, curl = RCurl::getCurlHandle())
For this to work, you need to create the key files first, e.g. via PuTTY or similar.
I too was having problems specifying the port options when using the getURI() and getURL() functions.
In order to specify the port, you simply add the port as port = #### instead of port(####). For example:
data <- getURI(url = url,
               userpwd = userpwd,
               port = 22)
Now, as @MarkThomas pointed out, whenever you get an encoding error, try getBinaryURL() instead of getURI(). In most cases this will let you download SAS files as well as .csv files encoded in UTF-8 or LATIN1!
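Putting the two suggestions together, a rough sketch (untested; host, credentials and path are the placeholders from the question) of fetching the file over SFTP as binary and writing it to disk:
library(RCurl)

userpwd <- "user:password"
url <- "sftp://hostname/Reports/Excelfile.xlsx"

# fetch the raw bytes over SFTP on port 22 and write them to a local file
bin <- getBinaryURL(url, userpwd = userpwd, port = 22)
writeBin(bin, "~/Test.xlsx")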

Importing a csv.gz file from an FTP server with Port and Directory Credentials into R

I want to import datasets into R that come from an FTP server. I am using FileZilla to view the files manually. Currently my data is in an xxxx.csv.gz file on the FTP server, and a new file gets added once a day.
My issue is that I have tried using the following link as guidance and it doesn't seem to work in my case:
Using R to download newest files from ftp-server
When I attempt the following code an error message comes up:
library(RCurl)
url <- "ftp://yourServer"
userpwd <- "yourUser:yourPass"
filenames <- getURL(url, userpwd = userpwd,
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)
Error:
Error in function (type, msg, asError = TRUE) :
Failed to connect to ftp.xxxx.com port 21: Timed out
This happens because the credentials state that I should use port 22, the secure port.
How do I modify my getURL() call so that I can access port 22?
Also, after making this call there is a directory I need to get to in order to access the files. For example purposes, let's say the directory is:
Directory: /xxx/xxxxx/xxxxxx
(I've also tried appending this to the original URL and the same error message comes up.)
Basically I want to get access to this directory, load the individual csv.gz files into R, and then automatically pull the following day's data.
The file names are:
XXXXXX_20160205.csv.gz
(The file names are just dates and each file will correspond to the previous day)
I guess the first step is to just make a connection to the files and download them; later down the road, I'd automatically fetch the previous day's csv.gz file.
Any help would be great, thanks!
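Since port 22 implies SFTP rather than plain FTP, here is a hedged sketch (untested; server, credentials, directory and file-name pattern are the placeholders from the question) of listing the directory and pulling the previous day's file:
library(RCurl)

userpwd <- "yourUser:yourPass"
base <- "sftp://yourServer/xxx/xxxxx/xxxxxx/"  # note the sftp:// scheme and trailing slash

# list the directory, then build yesterday's file name from its date stamp
listing <- getURL(base, userpwd = userpwd, dirlistonly = TRUE)
filenames <- strsplit(listing, "\r?\n")[[1]]
wanted <- sprintf("XXXXXX_%s.csv.gz", format(Sys.Date() - 1, "%Y%m%d"))
stopifnot(wanted %in% filenames)

# download the compressed file and read it straight from the .gz
bin <- getBinaryURL(paste0(base, wanted), userpwd = userpwd)
writeBin(bin, wanted)
dat <- read.csv(gzfile(wanted))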

How to open Excel 2007 File from Password Protected Sharepoint 2007 site in R using RODBC or RCurl?

I am interested in opening an Excel 2007 file in R 2.11.1 using RODBC. The Excel file resides in the shared documents page of a MOSS 2007 website. I currently download the .xlsx file to my hard drive and then import it into R using the following code:
library(RODBC)
con <- odbcConnectExcel2007("C:/file location/file.xlsx")
data <- sqlFetch(con, "worksheet name")
close(con)
When I type the web URL for the document into the odbcConnectExcel2007() connection, an error message pops up with:
ODBC Excel Driver Login Failed: Invalid internet Address.
followed by the following message in my R console:
ERROR: Could not SQLDriverConnect
Any insights you can provide would be greatly appreciated.
Thanks!
**UPDATE**
The site I am attempting to download from is password protected. I tried another method using getURL() from the RCurl package:
x = getURL("http://website.com/file.xlsx", userpwd = "uname:pw")
The error that I receive is:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: 'PK\003\004\024\0\006\0\b\0\0\0!\0dA»ï\001\0\0O\n\0\0\023\0Ò\001[Content_Types].xml ...' (the rest of the string is the binary content of the .xlsx file)
I have no idea what this means. Any help would be appreciated. Thanks!
Two solutions worked for me.
If you do not need to automate the script that pulls the data, you can map a network drive pointing to the sharepoint folder from which you want to extract the Excel document.
If you need to automate a script to pull the Excel file every couple of minutes, I recommend sending your authentication credentials in a request that automatically saves the file to a local drive. From there you can read it into R for further data wrangling.
library("httr")
library("openxlsx")
user <- <USERNAME>
password <- <PASSWORD>
url <- "https://sharepoint.company/file_to_obtain.xlsx"
httr::GET(url,
authenticate(user, password, type="ntlm"),
write_disk("C:/tempfile.xlsx", overwrite = TRUE))
df <- openxlsx::read.xlsx("C:/tempfile.xlsx")
You can obtain the correct URL to the file by clicking on the SharePoint location and removing "?Web=1" after the file ending (xlsx, xlsb, xls, ...). USERNAME and PASSWORD are usually Windows credentials. It helps to store them in a key manager, for example with the keyring package:
library("keyring")
keyring::key_set_with_value(service = "Windows", username = "Key", password = "<PASSWORD>")
and then authenticate via
authenticate(user, keyring::key_get("Windows", "Key"), type = "ntlm")
In some instances it may be sufficient to pass
authenticate(":", ":", type = "ntlm")
if only your Windows credentials are required and the code is running from your machine.
