Using R to download newest files from ftp-server

I have a number of files named
FileA2014-03-05-10-24-12
FileB2014-03-06-10-25-12
where the part "2014-03-05-10-24-12" means "Year/Day/Month/Hours/Minutes/Seconds". These files reside on an FTP server. I would like to use R to connect to the FTP server and download whichever file is newest based on that date.
I have started by listing the directory contents using RCurl with dirlistonly. The next step will be to parse the listing and find the newest file. Not quite there yet...
library(RCurl)
getURL("ftpserver/", verbose = TRUE, dirlistonly = TRUE)

This should work:
library(RCurl)
url <- "ftp://yourServer/"
userpwd <- "yourUser:yourPass"
filenames <- getURL(url, userpwd = userpwd,
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)
# split the single listing string into one file name per element
filenames <- strsplit(filenames, "\r*\n")[[1]]
times <- lapply(strsplit(filenames, "[-.]"), function(x){
  time <- paste(c(substr(x[1], nchar(x[1]) - 3, nchar(x[1])), x[2:6]),
                collapse = "-")
  as.POSIXct(time, format = "%Y-%m-%d-%H-%M-%S", tz = "GMT")
})
ind <- which.max(times)
dat <- try(getURL(paste(url, filenames[ind], sep = ""), userpwd = userpwd))
So dat now contains the newest file.
To make it reproducible, everyone else can use this vector in place of the FTP listing above:
filenames <- c("FileA2014-03-05-10-24-12.csv", "FileB2014-03-06-10-25-12.csv")
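If the newest file is binary rather than plain text (zipped data, for example), getURL() can mangle it; here is a small sketch of the same final step using getBinaryURL() and writeBin() instead, reusing url, userpwd, filenames and ind from above:
library(RCurl)
# fetch the newest file as raw bytes and write it unchanged to the working directory
bin <- getBinaryURL(paste(url, filenames[ind], sep = ""), userpwd = userpwd,
                    ftp.use.epsv = FALSE)
writeBin(bin, filenames[ind])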

Related

R: Check existence of "today's" file and if it doesn't exist, download it

I'm trying to check whether a user has an up-to-date version of London's Covid prevalence data in their working directory and, if not, download it from here:
fileURL <-"https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=region;areaName=London&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newCasesBySpecimenDateRollingSum%22:%22newCasesBySpecimenDateRollingSum%22,%22newCasesBySpecimenDateRollingRate%22:%22newCasesBySpecimenDateRollingRate%22%7D&format=csv"
Pasting that URL into the browser downloads the csv file. Using download.file(URL, "data_.csv") creates junk. Why?
So far I have:
library(data.table)
library(lubridate)   # for today()
library(magrittr)    # for %>%

#Look for COVID file starting with "data_"
destfile <- list.files(pattern = "data_")
#Find file date
fileDate <- file.info(destfile)$ctime %>% as.Date()
if(!file.exists(destfile) | fileDate != today()){
  res <- tryCatch(download.file(url = fileURL,
                                destfile = paste0("data_", today(), ".csv"),
                                method = "auto"),
                  error = function(e) 1)
  if(res != 1) COVIDdata <- data.table::fread(destfile) #This doesn't read the file
}
The function always downloads a file regardless of the date on it, but it saves it in an unreadable format. I've resorted to downloading the file every time, as follows:
COVIDdata <- data.table::fread(fileURL)
I think this is an issue with how download.file encodes the result; one way around it could be to use fread to get the data and then write it with fwrite:
library(data.table)
library(lubridate)   # for today()
library(magrittr)    # for %>%

#Look for COVID file starting with "data_"
destfile <- list.files(pattern = "data_")
#Find file date
fileDate <- file.info(destfile)$ctime %>% as.Date()
#>[1] "2020-11-06" "2020-11-06" "2020-11-06"
if(!length(destfile) | max(fileDate) != today()){
  COVIDdata <- fread(fileURL)
  fwrite(COVIDdata, file = paste0("data_", today(), ".csv"))
}
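And when the local copy is already from today, the freshest cached file can be read instead of hitting the API again; a sketch reusing destfile and fileDate from above (the "data_" naming scheme is the one assumed in the question):
if(length(destfile) && max(fileDate) == today()){
  # read the most recently created local "data_" file
  newest <- destfile[which.max(file.info(destfile)$ctime)]
  COVIDdata <- fread(newest)
}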

How can I create a data frame in R from a zip file with multiple levels located at a URL?

I have been trying to work this out but I have not been able to do it...
I want to create a data frame with four columns: country, number, year, and the content of the .txt file.
There is a .zip file at the following URL:
https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT
The file contains a folder with 49 folders in it, and each of them contains roughly 150 .txt files.
I first tried to download the zip file with get_dataset, but it did not work:
if (!require("dataverse")) devtools::install_github("iqss/dataverse-client-r")
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu")
"Error in get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu") :
Not Found (HTTP 404)."
Then I tried
temp <- tempfile()
download.file("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT",temp)
UNGDC <- unzip(temp, "UNGDC+1970-2018.zip")
It worked up to a point... I downloaded the .zip file and then created UNGDC, but nothing more happened, because it only contains the following connection information:
UNGDC
A connection with
description "/var/folders/nl/ss_qsy090l78_tyycy03x0yh0000gn/T//RtmpTc3lvX/fileab730f392b3:UNGDC+1970-2018.zip"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
Here I don't know what to do... I have not found relevant information on how to proceed. Can someone please give me some hints, or point me to a resource where I can learn how to do this?
Thanks for your attention and help!
How about this? I used the zip package to unzip, but possibly the base unzip might work as well.
library(zip)
dir.create(temp <- tempfile())
url <- 'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT'
download.file(url, paste0(temp, '/PZUURT.zip'), mode = 'wb')
unzip(paste0(temp, '/PZUURT.zip'), exdir = temp)
Note in particular I had to set the mode = 'wb' as I'm on a Windows machine.
I then saw that the unzipped archive had a _MACOSX folder and a Converted sessions folder. Assuming I don't need the MACOSX stuff, I did the following to get just the files I'm interested in:
root_folder <- paste0(temp, '/Converted sessions/')
filelist <- list.files(path = root_folder, pattern = '\\.txt$', recursive = TRUE)
filenames <- basename(filelist)
'filelist' contains the path to each text file relative to the root folder, while 'filenames' has just each file name, which I'll then break up to get the country, the number and the year:
df <- data.frame(t(sapply(strsplit(filenames, '_'),
                          function(x) c(x[1], x[2], substr(x[3], 1, 4)))))
colnames(df) <- c('Country', 'Number', 'Year')
Finally, I can read the text from each of the files and stick it into the dataframe as a new Text field:
df$Text <- sapply(paste0(root_folder, filelist), function(x) readChar(x, file.info(x)$size))
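A quick sanity check of the assembled data frame (just a usage example; column names as defined above):
dim(df)                                    # one row per speech file
head(df[, c("Country", "Number", "Year")])
nchar(df$Text[1])                          # length of the first speech text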

Using R to download SAS file from ftp-server

I am attempting to download some files onto my local machine from an ftp-server. I have had success using the following method to move .txt and .csv files from the server, but not the .sas7bdat files that I need.
library(RCurl)

protocol <- "sftp"
server <- "ServerName"
userpwd <- "User:Pass"
tsfrFilename <- "/filepath/file.sas7bdat"
ouptFilename <- "out.sas7bdat"
# Run #
## Download Data
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getURL(url = url, userpwd = userpwd)
## Create File
fconn <- file(ouptFilename)
writeLines(data, fconn)
close(fconn)
When I run the getURL command, however, I am met with the following error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string:
Does anyone know of an alternative way to download a sas7bdat file from an ftp-server to my local machine, or of a way to alter my code above to successfully download the file? Thanks!
As @MrFlick suggested, I solved this problem by using getBinaryURL() instead of getURL(). I also had to write the raw result with writeBin() instead of writeLines(). The result is as follows:
library(RCurl)

protocol <- "sftp"
server <- "ServerName"
userpwd <- "User:Pass"
tsfrFilename <- "/filepath/file.sas7bdat"
ouptFilename <- "out.sas7bdat"
# Run #
## Download Data
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getBinaryURL(url = url, userpwd = userpwd)
## Create File (open the connection in binary mode)
fconn <- file(ouptFilename, "wb")
writeBin(data, fconn)
close(fconn)
Alternatively, to read the downloaded file into an R data frame, one can use the haven package, as follows:
library(haven)
df_data <- read_sas(ouptFilename)
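If RCurl keeps giving trouble, the curl package is another option to consider: curl_download() streams the response to disk in binary mode. A minimal sketch using the same placeholder server, path and credentials as above (untested against this particular server, and it assumes your libcurl build supports SFTP):
library(curl)
library(haven)

h <- new_handle(userpwd = "User:Pass")                      # same credentials as above
curl_download("sftp://ServerName/filepath/file.sas7bdat",
              destfile = "out.sas7bdat", mode = "wb", handle = h)
df_data <- read_sas("out.sas7bdat")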

Download zipped files from ftp via RCurl

I had problems downloading zipped files from an FTP server. I have now solved the problem, and since I haven't found a solution to it here, I'm sharing my approach.
First I tried it with
download.file()
but there was the problem that my password ended with a "#". That's why submitting the user and password within the URL wasn't working; the double # was apparently confusing R:
url <- "ftp://user:password##url"
You'll find the solution below.
Maybe someone has some improvements.
Maybe it's useful for someone,
Florian
Here is my solution:
library(RCurl)
url <- "ftp://address/"
# read the file names from the FTP server
filenames <- getURL(url, userpwd = "USER:PASSWORD",
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)
destnames <- filenames <- strsplit(filenames, "\r*\n")[[1]]  # destination files keep the original names
con <- getCurlHandle(ftp.use.epsv = FALSE, userpwd = "USER:PASSWORD")
# download every zipped file and write it into one directory, keeping its name
mapply(function(x, y) writeBin(getBinaryURL(paste0(url, x), curl = con, dirlistonly = FALSE), y),
       x = filenames, y = paste0("C:\\temp\\", destnames))
Hopefully it's useful for somebody!
Regards,
Florian
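A side note on the original "#" problem: instead of avoiding in-URL credentials entirely, the password could also be percent-encoded before it is pasted into the URL. A sketch with RCurl's curlEscape() (placeholder user, password and server; untested against Florian's server):
library(RCurl)
pwd <- "myPass#"                                                     # password ending with "#"
url <- paste0("ftp://user:", curlEscape(pwd), "@ftp.example.com/")   # "#" becomes "%23"
filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)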
If you have no particular reason to stay with RCurl, you can use this wget-based method instead:
URL <- "ftp.server.ca"
USR <- "aUserName"
MDP <- "myPassword"
OUT <- "output.file"
cmd <- paste0("wget -m --ftp-user=", USR, " --ftp-password=", MDP,
              " ftp://", URL, " -O ", OUT)
system(cmd)
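If the user name or password contains characters the shell cares about (spaces, quotes, "&" and so on), wrapping them with shQuote() before building the command is a bit safer; a sketch with the same placeholder variables:
cmd <- paste0("wget -m --ftp-user=", shQuote(USR),
              " --ftp-password=", shQuote(MDP),
              " ftp://", URL, " -O ", shQuote(OUT))
system(cmd)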

Is it possible to install pandoc on windows using an R command?

I would like to download and install pandoc on a Windows 7 machine by running a command in R. Is that possible?
(I know I can do this manually, but when I show this to students, the more steps I can organize within an R code chunk, the better.)
What about simply downloading the most recent version of the installer and starting that from R:
a) Identify the most recent version of Pandoc and grab the URL with the help of the XML package:
library(XML)
page <- readLines('http://code.google.com/p/pandoc/downloads/list', warn = FALSE)
pagetree <- htmlTreeParse(page, error=function(...){}, useInternalNodes = TRUE, encoding='UTF-8')
url <- xpathSApply(pagetree, '//tr[2]//td[1]//a ', xmlAttrs)[1]
url <- paste('http', url, sep = ':')
b) Or apply some regexp magic thanks to @G.Grothendieck instead (no need for the XML package this way):
page <- readLines('http://code.google.com/p/pandoc/downloads/list', warn = FALSE)
pat <- "//pandoc.googlecode.com/files/pandoc-[0-9.]+-setup.exe"
line <- grep(pat, page, value = TRUE); m <- regexpr(pat, line)
url <- paste('http', regmatches(line, m), sep = ':')
c) Or simply check the most recent version manually if you'd feel like that:
url <- 'http://pandoc.googlecode.com/files/pandoc-1.10.1-setup.exe'
Download the file as binary:
t <- tempfile(fileext = '.exe')
download.file(url, t, mode = 'wb')
And simply run it from R:
system(t)
Remove the needless file after installation:
unlink(t)
PS: sorry, only tested on Windows XP
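To check from R that the installation worked, one quick (assumed) verification is to look the binary up on the PATH; a fresh R session may be needed before the PATH update is visible, and rmarkdown::pandoc_available() is an alternative if you have rmarkdown installed:
Sys.which("pandoc")               # full path to pandoc, or "" if it is not found
# rmarkdown::pandoc_available()   # TRUE once pandoc can be located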
