Exporting scraped data to one CSV - r

I managed to write a scraper for gathering election info in R (rvest), but now I am struggling with how to save the data in one CSV file instead of separate CSV files.
Here is my working code, which scrapes pages 11, 12 and 13 separately:
library(rvest)
library(xml2)

do.call(rbind, lapply(11:13, function(n) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  mi <- read_html(url) %>% html_table(fill = TRUE)
  mi[[8]]
  file <- paste0("election2014_", n, ".csv")
  if (!file.exists(file)) write.csv(mi[[8]], file)
  Sys.sleep(5)
}))
I tried adding this at the end, but it does not work as I expected:
write.csv(rbind(mi[[8]], url), file = "election2014.csv")

Try this one:
library(rvest)
library(tidyverse)

scr <- function(n) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  df <- read_html(url) %>%
    html_table(fill = TRUE) %>%
    .[[8]] %>%
    data.frame()
  colnames(df) <- df[1, ]
  df[-1, ]
}

res <- 11:13 %>%
  map_df(scr)

write.csv2(res, "odin_tyr.csv")

I wasn't able to get your code to work, but you could try creating an empty data frame before running your code, appending to it inside the loop, and only then writing a single CSV file with the complete data (see the sketch below):
df <- rbind(df, mi[[8]])
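A minimal sketch of that pattern, assuming the libraries from the question are loaded and that mi[[8]] is the table of interest on every page:
df <- data.frame()                    # empty accumulator
for (n in 11:13) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  mi <- read_html(url) %>% html_table(fill = TRUE)
  df <- rbind(df, mi[[8]])            # append this page's table
  Sys.sleep(5)                        # be polite to the server
}
write.csv(df, "election2014.csv", row.names = FALSE)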
You could also consider combining your CSV files into one using the purrr package:
files <- list.files("folder_name", pattern = "*.csv", full.names = TRUE)
df <- files %>%
  map(read_csv) %>%
  reduce(rbind)

Related

How to automate the process of opening a file and performing some action in R?

I am very new to R and trying to set up some automation. I have 10-20 JSON files in a folder, and I want to run the R script for each JSON file so that I can extract data from each one and keep appending the extracted data to one data frame.
In the code below, df is the data frame that will store the data.
In my case I was able to extract data from one JSON file and store it in df. How do I do this for all the JSON files and append the extracted data to df?
json_file <- "path_to_file/file.json"
json_data <- fromJSON(json_file)
df <- data.frame(str_split(json_data$data$summary$bullet, pattern = " - ")) %>%
row_to_names(row_number = 1)
My output should be a data frame that contains all the extracted data from each file, in sequence.
I would really appreciate any help.
Something like the following might do what the question asks for. It is untested, since there are no data.
The JSON processing package is just a guess; there are alternatives on CRAN. Change the call to library() at will.
library(jsonlite)

read_and_process_json <- function(x, path) {
  json_file <- file.path(path, x)
  json_data <- fromJSON(json_file)
  json_bullet <- stringr::str_split(json_data$data$summary$bullet, pattern = " - ")
  data.frame(json_bullet) |>
    janitor::row_to_names(row_number = 1)
}

base_path <- "path_to_file"
json_files <- list.files(path = base_path, pattern = "\\.json$")
df_list <- lapply(json_files, read_and_process_json, path = base_path)
df_all <- do.call(rbind, df_list)
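One follow-up note: do.call(rbind, df_list) requires every per-file data frame to have identical column names; if the JSON files are not perfectly uniform, dplyr::bind_rows(df_list) is a more forgiving drop-in that fills missing columns with NA.
df_all <- dplyr::bind_rows(df_list)   # alternative to do.call(rbind, df_list)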

How to remove a date written as a string in R?

I am trying to pull data from a website: https://transtats.bts.gov/PREZIP/
I am interested in downloading the datasets named Origin_and_Destination_Survey_DB1BMarket_1993_1.zip to Origin_and_Destination_Survey_DB1BMarket_2021_3.zip
For this, I am trying to automate the task by putting the URL in a loop:
# dates of all files
year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>%
  mutate(year_quarter_comb = str_c(year, "_", quarter)) %>%
  pull(year_quarter_comb)

# download all files
for (year_quarter in year_quarter_comb) {
  get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_", year_quarter, ".zip"))
}
What I was wondering is how I can exclude 2021 quarter 4, since the data for it is not available yet. Also, is there a better way to automate the task? I was thinking of matching on "DB1BMarket", but R is case-sensitive and the names for certain dates change to "DB1BMARKET".
I can use year_quarter_comb[-c(116)] to remove 2021_4 from the output.
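A less index-dependent alternative (a sketch, assuming the same tidyverse pipeline as above) is to drop the missing quarter by name before pulling the vector:
# keep every year/quarter except 2021 Q4, which is not published yet
year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>%
  filter(!(year == 2021 & quarter == 4)) %>%
  mutate(year_quarter_comb = str_c(year, "_", quarter)) %>%
  pull(year_quarter_comb)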
EDIT: I was actually trying to download the files into a specific folder with this set of code:
path_to_local <- "whatever location" # this is the folder where the raw data is stored.
# download data from BTS
get_BTS_data <- function(BTS_url) {
  # INPUT: URL for the zip file with the data
  # OUTPUT: NULL (this just downloads the data)
  # store the download in the path_to_local folder
  # down_file <- str_glue(path_to_local, "QCEW_Hawaii_", BLS_url %>% str_sub(34) %>% str_replace_all("/", "_"))
  down_file <- str_glue(path_to_local, fs::path_file(BTS_url))
  # download data to folder
  QCEW_files <- BTS_url %>%
    # download file
    curl::curl_download(down_file)
}
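Because some archive names flip between "DB1BMarket" and "DB1BMARKET", a generated URL can fail to download; a rough sketch (the wrapper try_get_BTS_data below is hypothetical, not part of any package) keeps one failed download from aborting the whole loop:
# hypothetical wrapper: log and skip a failed download instead of stopping
try_get_BTS_data <- function(BTS_url) {
  tryCatch(
    get_BTS_data(BTS_url),
    error = function(e) message("skipping ", BTS_url, ": ", conditionMessage(e))
  )
}
for (year_quarter in year_quarter_comb) {
  try_get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_", year_quarter, ".zip"))
}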
EDIT 2:
I edited the code a little, based on the answer below, and it runs:
url <- "http://transtats.bts.gov/PREZIP"
content <- read_html(url)

file_paths <- content %>%
  html_nodes("a") %>%
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"
origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
})
It takes a while to download these datasets as the files are quite large. It downloaded the files up to 2007_2, but then I got an error because the curl connection dropped.
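One way to cope with the connection dropping partway through is to retry each download a few times and skip any file that still fails. A rough sketch, reusing the handle h from the edited script above (download_with_retry is a hypothetical helper, not part of the curl package):
# retry a download a few times before giving up on that file
download_with_retry <- function(url, destfile, handle, tries = 3) {
  for (i in seq_len(tries)) {
    ok <- tryCatch({
      curl_download(url, destfile, handle = handle)
      TRUE
    }, error = function(e) FALSE)
    if (ok) return(TRUE)
    Sys.sleep(5)   # brief pause before retrying
  }
  warning("giving up on ", url)
  FALSE
}
lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  if (download_with_retry(x, tmp_file, h))
    unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
})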
Instead of trying to generate the URLs, you could scrape the file paths from the website. This avoids generating any non-existent file names.
Below is a short script that downloads all of the zip files you are looking for and unzips them into your working directory.
The hardest part for me here was that the server seems to have a misconfigured SSL certificate. I was able to find help here on SO for turning off SSL certificate verification for read_html() and curl_download(). Those solutions are integrated into the script below.
library(tidyverse)
library(rvest)
library(curl)

url <- "http://transtats.bts.gov/PREZIP"

content <-
  httr::GET(url, config = httr::config(ssl_verifypeer = FALSE)) |>
  read_html()

file_paths <-
  content |>
  html_nodes("a") |>
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file)
})
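Note that unzip() extracts into the current working directory by default; pass exdir = "some_folder" (as in the question's second edit) if you want the extracted files in a dedicated folder instead.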

Merging multiple HTML files saved with an .xls extension into one table in R

I have a folder with multiple HTML files that have an .xls extension.
[data sample image]
I need to combine them into a single table.
I have started by reading the files in the folder:
library(rvest)
library(tibble)
file_list <- list.files(pattern = '*.xls')
html_df <- lapply(file_list, function(x) read_html(x))
I do not know how to proceed from here to pull the tables from each file and combine them together.
This should work if all the files have the same format as the sample you've uploaded:
library(rvest)

file_list <- list.files(pattern = '*.xls')

data <-
  purrr::map_dfr(  # use map_dfr() to combine the data frames
    file_list,
    function(x) {
      read_html(x) %>%
        html_node("table") %>%       # read the first 'table' node (the only one in the sample)
        html_table(fill = TRUE) %>%  # fill it because the table is not neat yet
        setNames(.[1, ]) %>%         # use the first row to set the column names
        .[-c(1, nrow(.)), ]          # drop the repeated header row and the trailing total row
    }
  )
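If it later matters which file each row came from, a variation on the same code (a sketch; the column name source_file is arbitrary) names the file list and lets map_dfr() record each name via its .id argument:
data <-
  purrr::map_dfr(
    purrr::set_names(file_list),   # the names become the .id values
    function(x) {
      read_html(x) %>%
        html_node("table") %>%
        html_table(fill = TRUE) %>%
        setNames(.[1, ]) %>%
        .[-c(1, nrow(.)), ]
    },
    .id = "source_file"            # extra column holding the source file name
  )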

Merging multiple csv files with name sequence

I am trying to merge 700+ CSV files in R. I was able to merge them successfully using this code:
library(dplyr)
library(readr)
df <- list.files(full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
Now my problem is that the files are named flux.0, flux.1, flux.2, ..., flux.733. R binds the files in the order flux.0, flux.1, flux.10, flux.100, flux.101, and so on. Since the sequence of the files is important to me, can you suggest how to incorporate this in the above code?
Many thanks for the help!
My pipeline for things like that is to get the list of all the files (as you did), turn it into a tbl/data.frame, and then use map() to read the files and unnest() them. That way I can keep the path/file name for each file I loaded.
require(tidyverse)

df <- list.files(path = "path",
                 full.names = TRUE,
                 recursive = TRUE,
                 pattern = "*.csv") %>%
  tbl_df() %>%
  mutate(data = map(value, read.csv)) %>%
  arrange(value) %>%
  unnest(data)
Here you have another answer using your own approach. I've just added a function that reads each CSV and adds a new column called 'file' with the name of the file without the extension.
library(dplyr)
library(readr)

df <- list.files(full.names = TRUE) %>%
  lapply(function(x) {
    a <- read_csv(x)
    mutate(a, file = tools::file_path_sans_ext(basename(x)))
  }) %>%
  bind_rows()
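To get the rows back into flux.0, flux.1, ..., flux.733 order rather than the alphabetical order returned by list.files(), one option (a sketch, assuming each name ends in a dot followed by its number; adjust the regular expression if the files also carry a .csv extension) is to sort the file list numerically before reading:
files <- list.files(full.names = TRUE)
# extract the trailing number ("flux.12" -> 12) and order the files by it
files <- files[order(as.numeric(sub(".*\\.", "", basename(files))))]
df <- files %>%
  lapply(read_csv) %>%
  bind_rows()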

How to scrape tables from raster files (like .jpeg, .jpg, .png, .gif) and save in excel format?

I have been trying to extract a table from a .jpg image into Excel format. I am aware of how to do it if it is a .pdf or HTML file. Please find my script below. I would be grateful if someone could help me figure this out.
Thanks,
library(httr)
library(magick)
library(tidyverse)

url_template <- "https://www.environment.co.za/wp-content/uploads/2016/05/worst-air-pollution-in-south-africa-table-graph-statistics-1024x864.jpg"
pb <- progress_estimated(n = length(url_template))

sprintf(url_template) %>%
  map(~{
    pb$tick()$print()
    GET(url = .x,
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8",
          referer = "https://www.environment.co.za/pollution/worst-air-pollution-south-africa.html/attachment/worst-air-pollution-in-south-africa-table-graph-statistics",
          authority = "environment.co.za"))
  }) -> store_list_pages

map(store_list_pages, content) %>%
  map(image_read) %>%
  reduce(image_join) %>%
  image_write("SApollution.pdf", format = "pdf")

library(tabulizer)
library(tabulizerjars)
library(XML)
library(XLConnect)  # needed for loadWorkbook(), createSheet(), writeWorksheet(), saveWorkbook()

wbk <- loadWorkbook("~/crap_exercise/img2pdf/randomdata.xlsx", create = TRUE)

# Extract the table from the document
out <- extract_tables("SApollution.pdf")  # check whether which = "the table number" is needed

# Combine these into a single data matrix containing all of the data
final <- do.call(rbind, out[-length(out)])

# Table headers get extracted as rows with bad formatting. Dump them.
final <- as.data.frame(final[1:nrow(final), ])

# Column names
headers <- c('#', 'Urban area', 'Province', 'PM2.5 (mg/m3)')

# Apply custom column names
names(final) <- headers

createSheet(wbk, "pollution")
writeWorksheet(wbk, final, sheet = 'pollution', header = TRUE)
saveWorkbook(wbk)
