Using R to download SAS file from ftp-server

I am attempting to download some files from an ftp-server onto my local machine. The following method has worked for moving .txt and .csv files off the server, but not for the .sas7bdat files that I need.
library(RCurl)

protocol <- "sftp"
server <- "ServerName"
userpwd <- "User:Pass"
tsfrFilename <- "/filepath/file.sas7bdat"
ouptFilename <- "out.sas7bdat"

## Download Data
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getURL(url = url, userpwd = userpwd)

## Create File
fconn <- file(ouptFilename)
writeLines(data, fconn)
close(fconn)
When I run the getURL command, however, I am met with the following error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string:
Does anyone know of an alternative way to download a sas7bdat file from an ftp-server to my local machine, or of a way to alter my code above so that it downloads the file successfully? Thanks!

As @MrFlick suggested, I solved this problem by using getBinaryURL() instead of getURL(); getURL() treats the response as text, which is why it fails on the embedded nul bytes that a binary file contains. The file then has to be written back out in binary mode with writeBin() rather than writeLines(). The result is as follows:
library(RCurl)

protocol <- "sftp"
server <- "ServerName"
userpwd <- "User:Pass"
tsfrFilename <- "/filepath/file.sas7bdat"
ouptFilename <- "out.sas7bdat"

## Download Data as a raw vector
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getBinaryURL(url = url, userpwd = userpwd)

## Write the raw bytes to a local file
fconn <- file(ouptFilename, open = "wb")
writeBin(data, fconn)
close(fconn)
Alternatively, to load the downloaded data into an R data frame, one can use the haven package. Note that read_sas() expects a file path, so it reads the file written above:
library(haven)
df_data <- read_sas(ouptFilename)
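For newer setups, the same transfer can be done with the curl package instead of RCurl. This is a minimal sketch using the question's placeholder server, credentials, and paths; whether sftp works depends on your libcurl build:
library(curl)
library(haven)
# Stream the response straight to disk in binary mode, so the embedded
# nul bytes of the binary file never pass through a character string
h <- new_handle(userpwd = "User:Pass")
curl_fetch_disk("sftp://ServerName/filepath/file.sas7bdat", "out.sas7bdat", handle = h)
df_data <- read_sas("out.sas7bdat")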

Related

load corrupted xls file into r without manually changing file type

I am struggling to download an Excel file and then load it into R:
utils::download.file(
url = 'https://servicos.ibama.gov.br/ctf/publico/areasembargadas/downloadListaAreasEmbargadas.php',
destfile = 'C:/users/arthu/Desktop/fines.rar',
mode = "wb"
)
After unzipping and trying to load it into R:
utils::unzip(
  zipfile = './fines.rar',
  exdir = './ibama_data'
)
dados <- readxl::read_xls(
  "./ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls",
  skip = 6,
  col_types = c(rep("guess", 13), "date", "guess", "date")
)
I get libxls error: Unable to open file.
If I rename the file to .xlsx as follows, I get an evaluation error when reading it with readxl::read_excel, saying it is unable to open the file:
file <- file.rename(
from = "./Desktop/ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls",
to = "./Desktop/ibama_data/test.xlsx"
)
However, if I open the file manually, Excel warns that the file's extension does not match its type. After saving it as .xlsx, I can finally load it using read_excel.
How can I solve this, given that I want to write a package with a function that downloads such data from the web and then loads it into R?
The .xls file you are trying to read isn't an Excel document; it's an HTML table.
You could read it using the XML package:
library(XML)

doc <- htmlParse('rel_areas_embargadas_0-65000_2021-01-13_080018.xls')
tableNode <- getNodeSet(doc, '//table')
data <- readHTMLTable(tableNode[[1]])

# Store the report header (rows 1-5)
header <- data[1:5, ]

# Row 6 holds the column names
colnames <- data[6, ]

# Remove the header rows and set the column names
data <- data[-1:-6, ]
colnames(data) <- colnames
head(data)
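If you prefer the more current rvest package over XML, here is a sketch under the same assumption that the .xls is really an HTML table (the file name is the one from the question and may differ on your machine):
library(rvest)
# Parse the HTML file that masquerades as .xls and pull out the first <table>
doc <- read_html("./ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls")
data <- as.data.frame(html_table(doc, header = FALSE)[[1]])
# Rows 1-5 are a report header; row 6 holds the column names
colnames(data) <- as.character(unlist(data[6, ]))
data <- data[-(1:6), ]
head(data)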

How to use read.csv2.sql to read zip file without unzipping it?

I am trying to read a file from inside a zip archive, without unzipping it into my directory, while using read.csv2.sql for specific row filtering.
The zip file can be downloaded here:
I have tried setting up a file connection to read.csv2.sql, but it seems that it does not accept a file connection as the "file" argument.
I have already installed the sqldf package on my machine.
This is my R code for the issue described:
### Name the download file
zipFile <- "Dataset.zip"
### Download it
download.file("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip",zipFile,mode="wb")
## Set up zip file directory
zip_dir <- paste0(workingDirectory,"/Dataset.zip")
### Establish link to "household_power_consumption.txt" inside zip file
data_file <- unz(zip_dir,"household_power_consumption.txt")
### Read file into loaded_df
loaded_df <- read.csv2.sql(data_file , sql="SELECT * FROM file WHERE Date='01/02/2007' OR Date='02/02/2007'",header=TRUE)
### Error Msg
### -Error in file(file) : invalid 'description' argument
This does not use read.csv2.sql, but since there are only ~2 million records in the file, it is feasible to just download it, read it in with read.csv2, and then subset it in R.
# Download the file, creating zipfile
u <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
zipfile <- sub(".*%2F", "", u)
download.file(u, zipfile, mode = "wb")

# Extract fname from zipfile, read it into DF0 and subset it to DF
fname <- sub("\\.zip$", ".txt", zipfile)
DF0 <- read.csv2(unz(zipfile, fname))
DF0$Date <- as.Date(DF0$Date, format = "%d/%m/%Y")
DF <- subset(DF0, Date == '2007-02-01' | Date == '2007-02-02')

# Can optionally free up the memory used by DF0
# rm(DF0)
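If the sqldf-based filtering is still wanted, read.csv2.sql() needs a real file path rather than a connection, so one workaround is to extract the single file into a temporary directory first. A sketch (note that dates in this file are unpadded d/m/yyyy, e.g. '1/2/2007'):
library(sqldf)
# unzip() extracts just this one file and returns its path
fname <- unzip(zipfile, "household_power_consumption.txt", exdir = tempdir())
# read.csv2.sql() can now filter rows while reading
loaded_df <- read.csv2.sql(fname,
                           sql = "SELECT * FROM file WHERE Date = '1/2/2007' OR Date = '2/2/2007'",
                           header = TRUE)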

R import csv files from FTP Server to data-frame

I am trying to import csv files from an ftp server into R, ideally directly into a data frame. I only want to import specific files from the server, not all of them. My issues began with trying to import just one file:
url <- "ftp:servername.de/"
download.file(url, "testdata.csv")
I got this error message:
trying URL 'ftp://servername.de/'
Error in download.file(url, "testdata.csv") :
  cannot open URL 'ftp://servername.de/'
In addition: Warning message:
In download.file(url, "testdata.csv") :
  URL 'ftp://servername.de/': status was 'Couldn't connect to server'
Another way I tried was:
url <- "ftp://servername.de/"
userpwd <- "a:n"
filenames <- getURL(url, userpwd = userpwd
,ftp.use.epsv = FALSE, dirlistonly = TRUE
)
Here I do not understand how to get from the directory listing to importing the files into an R object.
Additionally, it would be great to get a clue on how to handle this process with gzip-compressed data (format: .gz) instead of plain csv files.
Use the curl package to extract the directory listing:
library(curl)

url <- "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/"
h <- new_handle(dirlistonly = TRUE)
con <- curl(url, "r", h)
tbl <- read.table(con, stringsAsFactors = TRUE, fill = TRUE)
close(con)
head(tbl)
V1
1 12-0210_Druart_Uterus_J0N-Co_1a_ORBI856.raw.mzML
2 12-0210_Druart_Uterus_J0N-Co_2a_ORBI857.raw.mzML
3 12-0210_Druart_Uterus_J0N-Co_3a_ORBI858.raw.mzML
4 12-0210_Druart_Uterus_J10N-Co_1a_ORBI859.raw.mzML
5 12-0210_Druart_Uterus_J10N-Co_2a_ORBI860.raw.mzML
6 12-0210_Druart_Uterus_J10N-Co_3a_ORBI861.raw.mzML
Paste the relevant ones onto the url and use:
urls <- paste0(url, tbl[1:5, 1])
fls <- basename(urls)
curl_fetch_disk(urls[1], fls[1])
Reference:
Downloading files from ftp with R
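As for reading a listed csv straight into a data frame, and for the .gz part of the question, a minimal sketch (server name, credentials, and file names are the question's placeholders):
library(curl)
h <- new_handle(userpwd = "a:n")
# Read one csv from the server straight into a data frame
df <- read.csv(curl("ftp://servername.de/testdata.csv", handle = h))
# For gzip-compressed files, wrap the connection in gzcon() to decompress on the fly
con_gz <- gzcon(curl("ftp://servername.de/testdata.csv.gz", handle = h))
df_gz <- read.csv(con_gz)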

Cannot access file on S3 using R

access_key <- "**************"
secret_key <- "****************"
bucket <- "temp"
filename <- "test.csv"
Sys.setenv("AWS_ACCESS_KEY_ID" = access_key,
           "AWS_SECRET_ACCESS_KEY" = secret_key)
buckets <- bucketlist()
getbucket(bucket)
usercsvobj <- get_object(bucket = "", "s3://part112017rscriptanddata/test.csv")
csvcharobj <- rawToChar(usercsvobj)
con <- textConnection(csvcharobj)
data <- read.csv(con)
I am able to see the contents of the bucket, but fail to read the csv into a data frame.
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error>
<Code>PermanentRedirect</Code><Message>The bucket you are attempting to
access must be addressed using the specified endpoint. Please send all
future requests to this endpoint.</Message><Bucket>test.csv</Bucket>
<Endpoint>test.csv.s3.amazonaws.com</Endpoint>
<RequestId>76E9C6B03AC12D8D</RequestId>
<HostId>9Cnfif4T23sJVHJyNkx8xKgWa6/+
Uo0IvCAZ9RkWqneMiC1IMqVXCvYabTqmjbDl0Ol9tj1MMhw=</HostId></Error>"
I am using the CRAN version of the aws.s3 package.
I was able to read from an S3 bucket, both in local R and via RStudio Server, using:
library(RCurl)
data <- read.csv(textConnection(getURL("https://s3-eu-west-1.amazonaws.com/yourbucket/yourFileName")),
                 sep = ",", header = TRUE)
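The PermanentRedirect error itself usually means the request was sent to the wrong regional endpoint. A sketch of the aws.s3 call with the object key and bucket passed as separate arguments (the region value here is an assumption; set it to your bucket's actual region):
library(aws.s3)
Sys.setenv("AWS_ACCESS_KEY_ID" = access_key,
           "AWS_SECRET_ACCESS_KEY" = secret_key,
           "AWS_DEFAULT_REGION" = "eu-west-1")  # assumption: your bucket's region
# Pass the key and the bucket separately instead of an s3:// URI
usercsvobj <- get_object(object = "test.csv", bucket = "temp")
data <- read.csv(text = rawToChar(usercsvobj))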

Using R to download newest files from ftp-server

I have a number of files named
FileA2014-03-05-10-24-12
FileB2014-03-06-10-25-12
where the part "2014-03-05-10-24-12" encodes "Year-Day-Month-Hours-Minutes-Seconds". These files reside on an ftp-server. I would like to use R to connect to the server and download whichever file is newest based on that date.
I have started by trying to list the contents, using RCurl and dirlistonly. The next step will be to parse the listing and find the newest file. Not quite there yet...
library(RCurl)
getURL("ftpserver/",verbose=TRUE,dirlistonly = TRUE)
This should work:
library(RCurl)

url <- "ftp://yourServer"
userpwd <- "yourUser:yourPass"
filenames <- getURL(url, userpwd = userpwd,
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)

# getURL() returns the whole listing as one string; split it into one name per element
filenames <- strsplit(filenames, "\r*\n")[[1]]
# Parse the timestamp out of each file name and convert it to POSIXct
times <- sapply(strsplit(filenames, "[-.]"), function(x) {
  time <- paste(c(substr(x[1], nchar(x[1]) - 3, nchar(x[1])), x[2:6]),
                collapse = "-")
  as.POSIXct(time, format = "%Y-%m-%d-%H-%M-%S", tz = "GMT")
})

# Fetch the file with the latest timestamp
ind <- which.max(times)
dat <- try(getURL(paste(url, filenames[ind], sep = "/"), userpwd = userpwd))
So dat now contains the newest file.
To make it reproducible, others can use the following in place of the directory listing above:
filenames <- c("FileA2014-03-05-10-24-12.csv", "FileB2014-03-06-10-25-12.csv")
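If the newest file is binary, or should simply land on disk rather than in memory, here is a short follow-up sketch with the curl package, reusing ind and the placeholder server and credentials from above:
library(curl)
# Download the newest file straight to disk in binary mode
h <- new_handle(userpwd = "yourUser:yourPass")
curl_download(paste0("ftp://yourServer/", filenames[ind]),
              destfile = filenames[ind], handle = h)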
