How to remove a date written as a string in R?

I am trying to pull data from a website: https://transtats.bts.gov/PREZIP/
I am interested in downloading the datasets from Origin_and_Destination_Survey_DB1BMarket_1993_1.zip through Origin_and_Destination_Survey_DB1BMarket_2021_3.zip.
To do this, I am trying to automate the downloads by building each URL in a loop:
# dates of all files
year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>%
  mutate(year_quarter_comb = str_c(year, "_", quarter)) %>%
  pull(year_quarter_comb)
# download all files
for (year_quarter in year_quarter_comb) {
  get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_", year_quarter, ".zip"))
}
What I was wondering is how I can exclude 2021 quarter 4, since the data for it is not available yet. Also, is there a better way to automate this task? I was thinking of matching on "DB1BMarket", but R is case-sensitive and the names for certain dates change to "DB1BMARKET".
I can use year_quarter_comb[-c(116)] to remove 2021_4 from the output.
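A sketch of a more explicit alternative (untested; it only uses the tidyverse/stringr functions already loaded for the pipeline above): filter the missing quarter out by name instead of by position, and match the file names case-insensitively.
# drop 2021_4 by name rather than by hard-coded index
year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>%
  mutate(year_quarter_comb = str_c(year, "_", quarter)) %>%
  filter(year_quarter_comb != "2021_4") %>%
  pull(year_quarter_comb)
# and, when matching names scraped from the site (file_paths as in the answer below),
# grepl("DB1BMarket", file_paths, ignore.case = TRUE) handles the DB1BMARKET variants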
EDIT: I was actually trying to download the files into a specific folder with this code:
path_to_local <- "whatever location" # this is the folder where the raw data is stored.
# download data from BTS
get_BTS_data <- function(BTS_url) {
  # INPUT: URL for the zip file with the data
  # OUTPUT: NULL (this just downloads the data)
  # store the download in the path_to_local folder
  # down_file <- str_glue(path_to_local, "QCEW_Hawaii_", BLS_url %>% str_sub(34) %>% str_replace_all("/", "_"))
  down_file <- str_glue(path_to_local, fs::path_file(BTS_url))
  # download data to folder
  QCEW_files <- BTS_url %>%
    # download file
    curl::curl_download(down_file)
}
EDIT2: I edited the code from the answer below a little, and it runs:
url <- "http://transtats.bts.gov/PREZIP"
content <- read_html(url)
file_paths <- content %>%
  html_nodes("a") %>%
  html_attr("href")
origin_destination_paths <- file_paths[grepl("DB1BM", file_paths)]
base_url <- "https://transtats.bts.gov"
origin_destination_urls <- paste0(base_url, origin_destination_paths)
h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)
lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
})
It takes a while to download these datasets as the files are quite large. It downloaded files up to 2007_2, but then I got an error because the curl connection dropped.
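One way to make the long run more robust (a rough sketch, not tested against this server; the folder name, retry count and wait time are arbitrary choices) is to keep the downloaded zips on disk, skip the ones that are already there, and retry a download a few times when the connection drops:
dir.create("airfare data zips", showWarnings = FALSE)
for (x in origin_destination_urls) {
  zip_file <- file.path("airfare data zips", fs::path_file(x))
  if (file.exists(zip_file)) next            # already downloaded on an earlier run
  ok <- FALSE
  for (attempt in 1:3) {                     # retry dropped connections up to 3 times
    ok <- tryCatch({
      curl_download(x, zip_file, handle = h) # reuse the handle h with SSL checks off
      TRUE
    }, error = function(e) FALSE)
    if (ok) break
    Sys.sleep(10)
  }
  if (ok) unzip(zip_file, overwrite = FALSE, exdir = "airfare data")
}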

Instead of trying to generate the URLs, you could scrape the file paths from the website. This avoids generating any non-existent file names.
Below is a short script that downloads all of the zip files you are looking for and unzips them into your working directory.
The hardest part for me here was that the server seems to have a misconfigured SSL certificate. I was able to find help here on SO for turning off SSL certificate verification for read_html() and curl_download(). Those solutions are integrated in the script below.
library(tidyverse)
library(rvest)
library(curl)
url <- "http://transtats.bts.gov/PREZIP"
content <- httr::GET(url, config = httr::config(ssl_verifypeer = FALSE)) |>
  read_html()
file_paths <- content |>
  html_nodes("a") |>
  html_attr("href")
origin_destination_paths <- file_paths[grepl("DB1BM", file_paths)]
base_url <- "https://transtats.bts.gov"
origin_destination_urls <- paste0(base_url, origin_destination_paths)
h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)
lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file)
})

Related

How can I make code in R that opens a .csv file outside of RStudio

The code scrapes the two websites, turns the results into a data frame and then into a CSV file, and that file is located at C:/Users/JoséLuiz/Desktop/news. What I want is code that opens those CSV files so they pop up on my screen and "tell me" that there is new, refreshed data, so I don't have to close and reopen the files every single time. I was trying to build a Windows Forms app with the .NET Framework, but it got really complicated.
library(rvest)
library(xml2)
library(WriteXLS)
setwd("C:/Users/JoséLuiz")
setwd("C:/Users/JoséLuiz/Desktop/news")
while (TRUE) {
  ### Broadcast
  time <- Sys.time()
  hora <- format(time, '%H')
  minuto <- format(time, '%M')
  segundo <- format(time, '%S')
  url <- 'http://broadcast.com.br/'
  html <- read_html(url)
  headlines <- html %>%
    html_nodes('.materia :nth-child(1) a') %>%
    html_text()
  write.table(headlines, file = "Headlines.csv", row.names = F, sep = ',')
  # Trading ecconomics
  url <- 'https://www.investing.com/news/economic-indicators'
  endereco <- read_html(url)
  manchete <- endereco %>%
    html_nodes('.title') %>%
    html_text()
  details <- endereco %>%
    html_nodes('p') %>%
    html_text()
  time <- endereco %>%
    html_nodes('.date') %>%
    html_text()
  manchete <- data.frame(manchete)
  write.table(manchete, file = "Manchetes_Trading_Ecconomics.csv", row.names = F, sep = ',')
  setwd("C:/Users/JoséLuiz/Desktop/news")
  Sys.sleep(300)
}
You can open the csv, check the dimensions/number of rows and compare it to the file you are saving to see if there is new or refreshed data. Something like the following should do the trick.
# Add this at the beginning (outside your while loop)
tmp <- rio::import("Manchetes_Trading_Ecconomics.csv")
# Add this after the line manchete <- data.frame(manchete)
if (nrow(manchete) != nrow(tmp)) {
  print("New data added")
} else {
  print("No new data added")
}
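Since the goal is to have the file pop up on screen, you could also open it in its default program whenever new rows are detected. A rough sketch (Windows-only, since shell.exec() only exists in R for Windows; it reuses the tmp/manchete comparison above):
if (nrow(manchete) != nrow(tmp)) {
  # open the refreshed csv in whatever program Windows associates with .csv files
  shell.exec(normalizePath("Manchetes_Trading_Ecconomics.csv"))
}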

Getting an error message when using rvest for webscraping?

I want to scrape all the shapefiles from the following website: https://www.sciencebase.gov/catalog/items?q=&filter0=browseCategory%3DData&community=California+Condor&filter1=browseType%3DMap+Service&filter2=browseType%3DOGC+WMS+Layer&filter3=browseType%3DDownloadable&filter4=facets.facetName%3DShapefile&&filter5=browseType%3DShapefile
I used the following script:
install.packages('rvest')
library(rvest)
library(tidyverse)
## CONSTANT ----
URL <- "https://www.sciencebase.gov/catalog/item/54471eb5e4b0f888a81b82ca"
dir_out <- "~/condor"
## MAIN ----
# Get the webpage content
webpage <- read_html(URL)
# Extract the information of interest from the website
data <- html_nodes(webpage, ".sb-file-get sb-download-link")
# Grab the base URLs to download all the referenced data
url_base <- html_attr(data,"href")
# Filter the zip files
shapefile_base <- grep("*.zip",url_base, value=TRUE)
# Fix the double `//`
shapefile_fixed <- gsub("//", "/", shapefile_base)
# Add the URL prefix
shapefile_full <- paste0("https://www.sciencebase.gov/",shapefile_fixed)
# Create the output directory
dir.create(dir_out, showWarnings = FALSE)
# Create a list of filenames
filenames_full <- file.path(dir_out,basename(shapefile_full))
# Download the files
lapply(shapefile_full, FUN=function(x) download.file(x, file.path(dir_out,basename(x))))
# Unzip the files
unzip(filenames_full, overwrite = TRUE)
However, for the download step I get the following error:
> lapply(shapefile_full, FUN=function(x) download.file(x, file.path(dir_out,basename(x))))
trying URL 'https://www.sciencebase.gov/'
Content type 'text/html' length unknown
downloaded 111 bytes
[[1]]
[1] 0
> # Unzip the files
> unzip(filenames_full, overwrite = TRUE)
Warning message:
In unzip(filenames_full, overwrite = TRUE) :
error 1 in extracting from zip file
This works for me and produces 6 zip files in my working directory.
library(rvest)
URL <- "https://www.sciencebase.gov/catalog/item/54471eb5e4b0f888a81b82ca"
webpage <- read_html(URL)
# Extract the information of interest from the website
data <- html_nodes(webpage, "span.sb-file-get")
# Grab the base URLs to download all the referenced data
url_base <- html_attr(data,"data-url")
# Filter the zip files
shapefile_full <- paste0("https://www.sciencebase.gov",url_base)
Map(download.file, shapefile_full, sprintf('file%d.zip', seq_along(shapefile_full)))
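To also extract the archives, note that unzip() takes one zip file path at a time, so a small follow-up could loop over the files downloaded above (a sketch, reusing the file*.zip names created by the Map() call):
zips <- sprintf('file%d.zip', seq_along(shapefile_full))
lapply(zips, unzip, overwrite = TRUE)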

Object not found in R language

I am a newbie at using R and here's my attempt to play around with code to scrape quotes from multiple pages.
# Load Libraries
library(rvest) # To Scrape
library(tidyverse) # To Manipulate Data
# Scrape Multiple Pages
for (i in 1:4) {
  site_to_scrape <- read_html(paste0("http://quotes.toscrape.com/page/", i))
  temp <- site_to_scrape %>% html_nodes(".text") %>% html_text()
  content <- append(content, temp)
}
#Export Results To CSV File
write.csv(content, file = "content.csv", row.names = FALSE)
I have encountered an "object not found" error for the content variable. How can I overcome this error and define the object so it is reusable in the append line?
Growing a vector in a loop is very inefficient if you are scraping many pages. Instead, you should initialise a list of a specific length, which you know beforehand.
library(rvest)
n <- 4
content = vector('list', n)
# Scrape Multiple Pages
for (i in 1:n) {
  site_to_scrape <- read_html(paste0("http://quotes.toscrape.com/page/", i))
  content[[i]] <- site_to_scrape %>%
    html_nodes(".text") %>%
    html_text()
}
write.csv(unlist(content), file = "content.csv", row.names = FALSE)
Another option, without initialising, is to use sapply/lapply:
all_urls <- paste0("http://quotes.toscrape.com/page/", 1:4)
content <- unlist(lapply(all_urls, function(x)
  x %>% read_html %>% html_nodes(".text") %>% html_text()))
I have searched and found a way: assign an empty object before the loop with content = c().
# Load Libraries
library(rvest) # To Scrape
library(tidyverse) # To Manipulate Data
content = c()
# Scrape Multiple Pages
for (i in 1:4) {
  site_to_scrape <- read_html(paste0("http://quotes.toscrape.com/page/", i))
  temp <- site_to_scrape %>%
    html_nodes(".text") %>%
    html_text()
  content <- append(content, temp)
}
#Export Results To CSV File
write.csv(content, file = "content.csv", row.names = FALSE)

Downloading xls files with a loop through url's gives me corrupted files

I am downloading xls files from this page with R, looping through the URLs (based on this first step):
getURLFilename <- function(url) {
  require(stringi)
  hdr <- paste(curlGetHeaders(url), collapse = '')
  fname <- as.vector(stri_match(hdr, regex = '(?<=filename=\\").*(?=\\")'))
  fname
}
for (i in 8:56) {
  i1 <- sprintf('%02d', i)
  url <- paste0("https://journals.openedition.org/acrh/29", i1, "?file=1")
  file <- paste0("myExcel_", i, ".xls")
  if (!file.exists(file)) download.file(url, file)
}
The files are downloaded but corrupted.
Here is a slightly different approach using rvest, scraping both the URLs and the filenames so that only the XLS files are downloaded and not the PDFs.
library(rvest)
url <- "https://journals.openedition.org/acrh/2906"
#Scrape the nodes which we are interested in
target_nodes <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="annexes"]') %>%
  html_nodes("a")
# Get the indices which end with xls
inds <- target_nodes %>% html_text() %>% grep("xls$", .)
# Get the corresponding URLs for the xls files and paste them onto the prefix
target_urls <- target_nodes %>%
  html_attr("href") %>% .[inds] %>%
  paste0("https://journals.openedition.org/acrh/", .)
# Get the target name to save the file
target_name <- target_nodes %>%
  html_text() %>%
  grep("xls$", ., value = TRUE) %>%
  sub("\\s+", ".", .) %>%
  paste0("/folder_path/to/storefiles/", .)
# Download the files and store them at the target_name location
mapply(download.file, target_urls, target_name)
I manually verified 3-4 files on my system; I am able to open them, and the data also match what I get when I download the files manually from the URL.
You should use mode="wb" in download.file to write the file in binary mode.
library(readxl)
for (i in 8:55) {
  i1 <- sprintf('%02d', i)
  url <- paste0("https://journals.openedition.org/acrh/29", i1, "?file=1")
  if (is.na(format_from_signature(url))) {
    file <- paste0("myPdf_", i, ".pdf")
  } else {
    file <- paste0("myExcel_", i, ".xls")
  }
  if (!file.exists(file)) download.file(url, file, mode = "wb")
}
Now the downloaded Excel files are not corrupted.
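As a quick sanity check, you could read one of the saved workbooks back in with readxl (assuming, for example, that i = 8 was detected as an Excel file by the loop above):
# hedged sanity check: read a downloaded workbook back into R
head(read_excel("myExcel_8.xls"))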

Exporting scraped data to one CSV

I managed to write a scraper for gathering election info in R (rvest), but now I am struggling with saving the data in one CSV file instead of separate CSV files.
Here is my working code, which scrapes pages 11, 12 and 13 separately.
library(rvest)
library(xml2)
do.call(rbind, lapply(11:13,
  function(n) {
    url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
    mi <- read_html(url) %>% html_table(fill = TRUE)
    mi[[8]]
    file <- paste0("election2014_", n, ".csv")
    if (!file.exists(file)) write.csv(mi[[8]], file)
    Sys.sleep(5)
  }))
I tried adding this at the end, but it is not working as I expected:
write.csv(rbind(mi[[8]],url), file="election2014.csv")
Try this one:
library(rvest)
library(tidyverse)
scr <- function(n) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  df <- read_html(url) %>%
    html_table(fill = TRUE) %>%
    .[[8]] %>%
    data.frame()
  colnames(df) <- df[1, ]
  df <- df[-1, ]
}
res <- 11:13 %>%
  map_df(., scr)
write.csv2(res, "odin_tyr.csv")
I wasn't able to get your code to work, but you could try creating an empty data frame before running your code, and then do this before writing a csv file with the complete data:
df = rbind(df,mi[[8]])
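A fuller sketch of that idea, reusing the loop pieces from the question (the data frame has to exist before the first rbind(), and the combined result is written out once at the end):
df <- data.frame()
for (n in 11:13) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  mi <- read_html(url) %>% html_table(fill = TRUE)
  df <- rbind(df, mi[[8]])   # append this page's table to the combined data frame
  Sys.sleep(5)
}
write.csv(df, file = "election2014.csv")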
You could also consider combining your csv files into one using the purrr package:
files <- list.files("folder_name", pattern = "*.csv", full.names = TRUE)
df <- files %>%
  map(read_csv) %>%
  reduce(rbind)
