I have this user-defined function that uses the rvest package to get the downloadable files from a web page.
library(magrittr)    # for the %>% pipe
library(stringr)     # str_subset()
library(data.table)  # %like%

GetFluDataFiles <- function(URL = "https://www1.health.gov.au/internet/main/publishing.nsf/Content/ohp-pub-datasets.htm",
                            REMOVE_URL_STRING = "ohp-pub-datasets.htm/",
                            DEBUG = TRUE){

  if(DEBUG) message("GetFluDataFiles: Function initialized \n")

  FUNCTION_OUTPUT <- list()
  FUNCTION_OUTPUT[["URL"]] <- URL

  page <- rvest::read_html(URL)

  if(DEBUG) message("GetFluDataFiles: Get all downloadable files on webpage \n")

  all_downloadable_files <- page %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href") %>%
    str_subset("\\.xlsx")
  # all_downloadable_files

  FUNCTION_OUTPUT[["ALL_DOWNLOADABLE_FILES"]] <- all_downloadable_files

  if(DEBUG) message("GetFluDataFiles: Get all downloadable files on webpage which contain flu data \n")

  influenza_file <- all_downloadable_files[tolower(all_downloadable_files) %like% "influenza"]
  # influenza_file

  FUNCTION_OUTPUT[["FLU_FILE"]] <- influenza_file

  file_path <- file.path(URL, influenza_file)
  # file_path

  FUNCTION_OUTPUT[["FLU_FILE_PATH"]] <- file_path

  if(DEBUG) message("GetFluDataFiles: Collect final path \n")

  if(!is.null(REMOVE_URL_STRING)){
    full_final_path <- gsub(REMOVE_URL_STRING, "", file_path)
  } else {
    full_final_path <- file_path
  }

  FUNCTION_OUTPUT[["FULL_FINAL_PATH"]] <- full_final_path

  if(!is.null(full_final_path) && !is.na(full_final_path)){
    if(DEBUG) message("GetFluDataFiles: Function run completed \n")
    return(FUNCTION_OUTPUT)
  } else {
    stop("GetFluDataFiles: Folders not created \n")
  }
}
I've used this function to extract the data that I want, and everything seems to work... I am able to download the file.
> output <- GetFluDataFiles()
GetFluDataFiles: Function initialized
GetFluDataFiles: Get all downloadable files on webpage
GetFluDataFiles: Get all downloadable files on webpage which contain flu data
GetFluDataFiles: Collect final path
GetFluDataFiles: Function run completed
> output$FULL_FINAL_PATH
[1] "https://www1.health.gov.au/internet/main/publishing.nsf/Content/C4DDC0B448F04792CA258728001EC5D0/$File/x.Influenza-laboratory-confirmed-Public-datset-2008-2019.xlsx"
> download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx")
trying URL 'https://www1.health.gov.au/internet/main/publishing.nsf/Content/C4DDC0B448F04792CA258728001EC5D0/$File/x.Influenza-laboratory-confirmed-Public-datset-2008-2019.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 27134133 bytes (25.9 MB)
downloaded 25.9 MB
And the file exists.
> file.exists("myfile.xlsx")
[1] TRUE
But when I go to import the xlsx file, this error pops up.
> library("readxl")
> my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)
Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo
What is this error? How can I resolve it?
Set the download method to curl:
download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx", method = 'curl')
my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)
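For context, error -103 from unzGetCurrentFileInfo means readxl could not read the file as a valid zip archive (an .xlsx file is a zipped collection of XML files), which usually indicates the downloaded copy is corrupt. Switching to method = 'curl' gives a clean binary transfer; forcing binary mode with the default method should also work (a sketch of the same calls with mode = "wb"):

# alternative: keep the default download method but force a binary transfer
download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx", mode = "wb")
my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)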
Related
I am trying to pull data from a website: https://transtats.bts.gov/PREZIP/
I am interested in downloading the datasets named Origin_and_Destination_Survey_DB1BMarket_1993_1.zip to Origin_and_Destination_Survey_DB1BMarket_2021_3.zip
For this, I am trying to automate the task and put the URL in a loop.
library(tidyverse)  # crossing(), mutate(), str_c(), str_glue(), pull()

# dates of all files
year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>%
  mutate(year_quarter_comb = str_c(year, "_", quarter)) %>%
  pull(year_quarter_comb)

# download all files
for(year_quarter in year_quarter_comb){
  get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_", year_quarter, ".zip"))
}
What I was wondering is how I can exclude 2021 quarter 4, since the data for it is not available yet. Also, is there a better way to automate the task? I was thinking of matching by "DB1BMarket", but R is case-sensitive and the names for certain dates change to "DB1BMARKET".
I can use year_quarter_comb[-c(116)] to remove 2021_4 from the output.
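Alternatively, the quarter can be dropped by value rather than by position, and the case issue handled with ignore.case = TRUE. A sketch (file_names here is a placeholder for whatever vector of scraped file names is being matched against):

# drop 2021_4 by value instead of by position
year_quarter_comb <- setdiff(year_quarter_comb, "2021_4")

# match both "DB1BMarket" and "DB1BMARKET"
db1b_files <- file_names[grepl("db1bmarket", file_names, ignore.case = TRUE)]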
EDIT: I was actually trying to download the files into a specific folder with this code:
path_to_local <- "whatever location" # this is the folder where the raw data is stored

# download data from BTS
get_BTS_data <- function(BTS_url) {
  # INPUT: URL for the zip file with the data
  # OUTPUT: NULL (this just downloads the data)

  # store the download in the path_to_local folder
  # down_file <- str_glue(path_to_local, "QCEW_Hawaii_", BLS_url %>% str_sub(34) %>% str_replace_all("/", "_"))
  down_file <- str_glue(path_to_local, fs::path_file(BTS_url))

  # download data to folder
  QCEW_files <- BTS_url %>%
    # download file
    curl::curl_download(down_file)
}
EDIT2: I edited the code a little from the answer below and it runs:
url <- "http://transtats.bts.gov/PREZIP"
content <- read_html(url)
file_paths <- content %>%
html_nodes("a") %>%
html_attr("href")
origin_destination_paths <-
file_paths[grepl("DB1BM", file_paths)]
base_url <- "https://transtats.bts.gov"
origin_destination_urls <-
paste0(base_url, origin_destination_paths)
h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer=0)
lapply(origin_destination_urls, function(x) {
tmp_file <- tempfile()
curl_download(x, tmp_file, handle = h)
unzip(tmp_file, overwrite = F, exdir = "airfare data")
})
It takes a while to download these datasets as the files are quite large. It downloaded files up to 2007_2, but then I got an error because the curl connection dropped out.
Instead of trying to generate the URLs, you could scrape the file paths from the website. This avoids generating any non-existing files.
Below is a short script that downloads all of the zip files you are looking for and unzips them into your working directory.
The hardest part for me was that the server seems to have a misconfigured SSL certificate. I was able to find help here on SO for turning off SSL certificate verification for read_html() and curl_download(). These solutions are integrated into the script below.
library(tidyverse)
library(rvest)
library(curl)

url <- "http://transtats.bts.gov/PREZIP"

content <-
  httr::GET(url, config = httr::config(ssl_verifypeer = FALSE)) |>
  read_html()

file_paths <-
  content |>
  html_nodes("a") |>
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file)
})
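To cope with the connection dropping out partway through (as reported in EDIT2), each download can be wrapped in tryCatch() so that a single failure does not abort the whole run. A minimal sketch, reusing origin_destination_urls and the handle h from above:

results <- lapply(origin_destination_urls, function(x) {
  tryCatch({
    tmp_file <- tempfile()
    curl_download(x, tmp_file, handle = h)
    unzip(tmp_file, exdir = "airfare data")
    x  # return the URL on success
  }, error = function(e) {
    message("Download failed for ", x, ": ", conditionMessage(e))
    NULL  # keep going with the remaining files
  })
})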
I have been trying to run the code below with the URL https://files.hawaii.gov/dbedt/economic/data_reports/DLIR/LFR_LAUS_LF.xls.
The URL downloads an xls file. However, the read_xls(temp, sheet = 1) command in R is giving me an error. Any solutions? The file is supposed to be stored in the temp folder.
Scrape Employment Data from DBEDT
rm(list = ls(all = TRUE))

# loads necessary packages to run codes
library(dplyr)
library(tidyverse)
library(readxl)
library(stringr)
library(taRifx)
library(lubridate)

# set url link for xls
# url <- "https://www.hirenethawaii.com/admin/gsipub/htmlarea/uploads/LFR_LAUS_LF.xls"
url <- "https://files.hawaii.gov/dbedt/economic/data_reports/DLIR/LFR_LAUS_LF.xls"

# create a temp file to load the xls and download the xls
temp <- tempfile(fileext = "xls")
download.file(url = url, destfile = temp)

# create a function to select only the columns we want
# grab date (col 2), labor force (col 3), employed persons (col 4), and unemployment rate (col 6) columns from state data
keep_cols <- function(data) {
  data <- data %>%
    select(2, 3, 4, 6) %>%
    rename(date = 1, lf = 2, empl = 3, ur = 4)
  return(data)
}

# importing the xls sheets into our r environment
state_xls <- read_xls(temp, sheet = 1) %>% keep_cols() %>% mutate(geo = "HI")
The error is as follows:
Error:
filepath: C:\Users\Jon Doe\AppData\Local\Temp\Rtmp0iFXmt\file8eb8142165d0xls
libxls error: Unable to open file
I am using a Windows laptop and the code was originally written on a MacBook.
It seems that there is a problem on Windows with download.file(): without mode = "wb", the binary xls is downloaded in text mode and gets corrupted. So I added mode = "wb" and it worked. Basically, this part of the code needed the change:
download.file(url = url, destfile = temp, mode = "wb")
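Putting the fix together, the download-and-read step could look like this (a sketch; the leading dot in fileext just gives the temp file a proper .xls extension):

temp <- tempfile(fileext = ".xls")
download.file(url = url, destfile = temp, mode = "wb")  # binary mode avoids corruption on Windows
state_xls <- read_xls(temp, sheet = 1) %>% keep_cols() %>% mutate(geo = "HI")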
I have tried download.file(mainurl, filename, method = "curl"), as well as the "wget" and "auto" methods.
url <- "https://exporter.nih.gov/ExPORTER_Catalog.aspx"
linksu <- getURL(url)
links <- getHTMLLinks(linksu)
filenames <- links[str_detect(links, ".zip")]
filenames_list <- as.list(filenames)
downloadzip <- function (mainurl,filename) {
filedetails <- str_c(mainurl,filename)
print(filename)
require(downloader)
download.file(mainurl,filename,method="curl")
}
# save the files into the current dir
l_ply(filenames, downloadzip, mainurl = "https://exporter.nih.gov/")

# downloading a single csv zip file directly works
download.file("https://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2019_053.zip", destfile = "test.zip")

The download reports success for all files, but they are all corrupted.
downloadzip <- function (mainurl, filename) {
  filedetails <- str_c(mainurl, filename)
  print(filedetails)
  require(downloader)
  download.file(filedetails, filename, method = "curl")
}
This works for me. The original function passed mainurl (just the base URL) to download.file(), so every downloaded "zip" was really that web page saved under a .zip name, which is why the files appeared corrupted; downloading filedetails (the full file URL) fixes it.
I have downloaded one photo of each deputy. In total, I have 513 photos (but I hosted a file with 271 photos). Each photo was named with the ID of the deputy. I want to rename each photo with the deputy's name, so the "66179.jpg" file would become "norma-ayub.jpg".
I have a column with the IDs ("uri") and one with their names ("name_lower"). I tried playing with the "destfile" argument of download.file(), but it only accepts a string. I couldn't figure out how to make file.rename() work.
And rename_r_to_R changes only the file extension.
I am a beginner in working with R.
CSV file:
https://gist.github.com/gabrielacaesar/3648cd61a02a3e407bf29b7410b92cec
Photos:
https://github.com/gabrielacaesar/studyingR/blob/master/chamber-of-deputies-17jan2019-files.zip
(It's not necessary to download the ZIP file; when running the code below, you also get the photos, but it takes some time to download them.)
library(data.table)  # fread()
library(httr)        # GET()

deputados <- fread("dep-legislatura56-14jan2019.csv")

i <- 1
while(i <= 514) {
  tryCatch({
    url <- deputados$uri[i]
    api_content <- rawToChar(GET(url)$content)
    pessoa_info <- jsonlite::fromJSON(api_content)
    pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
    download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
    Sys.sleep(0.5)
  }, error = function(e) return(NULL)
  )
  i <- i + 1
}
I downloaded the files you provided and directly read them into R or unzipped them into a new folder, respectively:
df <- data.table::fread(
  "https://gist.githubusercontent.com/gabrielacaesar/3648cd61a02a3e407bf29b7410b92cec/raw/1d682d8fcdefce40ff95dbe57b05fa83a9c5e723/chamber-of-deputies-17jan2019",
  sep = ",",
  header = TRUE)

download.file("https://github.com/gabrielacaesar/studyingR/raw/master/chamber-of-deputies-17jan2019-files.zip",
              destfile = "temp.zip")

dir.create("photos")
unzip("temp.zip", exdir = "photos")
Then I use list.files() to get the file names of all photos, match them with the dataset, and rename the photos. This runs very fast, and the last bit will report whether renaming each file was successful.
photos <- list.files(
  path = "photos",
  recursive = TRUE,
  full.names = TRUE
)

for (p in photos) {
  id <- basename(p)
  id <- gsub(".jpg$", "", id)
  name <- df$name_lower[match(id, basename(df$uri))]
  fname <- paste0(dirname(p), "/", name, ".jpg")
  success <- file.rename(p, fname)
  # optional: report whether the rename worked
  cat(
    "renaming",
    basename(p),
    "to",
    name,
    "successful:",
    ifelse(success, "Yes", "No"),
    "\n"
  )
}
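Since file.rename() is vectorized, the loop could also be replaced by building the old and new paths up front and renaming everything in one call. A sketch using the photos and df objects from above:

ids <- gsub(".jpg$", "", basename(photos))
new_names <- df$name_lower[match(ids, basename(df$uri))]
# returns a logical vector indicating which renames succeeded
success <- file.rename(photos, file.path(dirname(photos), paste0(new_names, ".jpg")))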
I am working with the AIMS model developed by the APEC Climate Center. The model downloads data from an FTP server and then calls the LoadCmip5DataFromAdss function from datasource.R to load the data into the model.
#do.call("LoadCmip5DataFromAdss", parameters)
On GitHub I found the source code for LoadCmip5DataFromAdss, which gives the path of an FTP server to download data from:
LoadCmip5DataFromAdss <- function(dbdir, NtlCode) {
  fname <- paste("cmip5_daily_", NtlCode, ".zip", sep="")
  if(nchar(NtlCode)==4 && substr(NtlCode,1,2)=="US"){
    adss <- "ftp://cis.apcc21.org/CMIP5DB/US/"
  } else {
    adss <- "ftp://cis.apcc21.org/CMIP5DB/"
  }
I want to get the data from a local directory instead of downloading it, because downloading takes a lot of time. How do I do that?
Where do I find the file containing LoadCmip5DataFromAdss on my PC? In the setup, only datasource.R is given.
All that function does is download the zip file (cmip5_daily_ plus whatever you specified for NtlCode plus .zip) into the directory you specified for dbdir, unzip it there, and remove the zip file. Here's the whole function from rSQM:
LoadCmip5DataFromAdss <- function(dbdir, NtlCode) {
  fname <- paste("cmip5_daily_", NtlCode, ".zip", sep="")
  if(nchar(NtlCode)==4 && substr(NtlCode,1,2)=="US"){
    adss <- "ftp://cis.apcc21.org/CMIP5DB/US/"
  } else {
    adss <- "ftp://cis.apcc21.org/CMIP5DB/"
  }
  srcfname <- paste(adss, fname, sep="")
  dstfname <- paste(dbdir, "/", fname, sep = "")
  download.file(srcfname, dstfname, mode = "wb")
  unzip(dstfname, exdir = dbdir)
  unlink(dstfname, force = T)
  cat("CMIP5 scenario data at", NtlCode, "is successfully loaded.\n")
}
So, instead of using that function, you can just do something like:

unzip(YOUR_LOCAL_NtlCode_ZIP_FILE, exdir = WHERE_YOUR_dbdir_IS)
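If you want to keep the same calling convention, a drop-in replacement that reads from a local folder could look like this (a sketch, not part of rSQM; local_dir is a placeholder for wherever the already-downloaded zip files live):

# Sketch: load an already-downloaded CMIP5 zip from a local folder instead of the FTP server
LoadCmip5DataFromLocal <- function(dbdir, NtlCode, local_dir) {
  fname <- paste0("cmip5_daily_", NtlCode, ".zip")  # same naming scheme as the original
  srcfname <- file.path(local_dir, fname)           # local copy of the zip file
  unzip(srcfname, exdir = dbdir)                    # extract into dbdir; keep the zip
  cat("CMIP5 scenario data at", NtlCode, "is successfully loaded.\n")
}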