R - Get HTML addresses from data frame to rvest

I am new to R, and I have come upon a problem I can't solve. I would like to scrape Swedish election data at the electoral district level. The data are structured as can be seen here: http://www.val.se/val/val2014/slutresultat/K/valdistrikt/25/82/0134/personroster.html
I get the data I want by using this code:
library(rvest)
district.data <- read_html("http://www.val.se/val/val2014/slutresultat/K/kommun/25/82/0134/personroster.html")
prost <- district.data %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
But that is just one district out of 6,227 districts. The districts are identified by the HTML address; in the URL mentioned above the identity is "25/82/0134". I can find the identities of all districts here: http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv
And I read this semicolon-separated file into R by using this code:
valres <- read_csv2("http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv" )
(as a side note, how can I change the encoding so that the Swedish letters (e.g. å, ä, ö) are imported correctly? I managed to do that with read.csv by specifying encoding='utf-8', but not with read_csv)
In this data frame, the columns LAN, KOM and VALDIST give the identities of the districts (note that VALDIST sometimes has only 2 characters). Hence the addresses have the following structure: http://www.val.se/val/val2014/slutresultat/K/kommun/LAN/KOM/VALDIST/personroster.html
So, I would like to use the combination in each row to build the address for each district, scrape the information into R, add a column with the district identity (i.e. LAN, KOM and VALDIST combined into one string), do this for all 6,227 districts, and append the results from all districts into one single data frame. I assume I need some kind of loop, or one of the apply functions, to iterate over the data frame, but I have not figured out how.
UPDATE:
After the help I received (thank you!) in the answer below, the code is now as follows. My remaining problem is that I want to add the district identity (i.e. paste0(LAN, KOM, VALDIST)) for each scraped website as a column in the final data frame. Can someone help me with this final step?
# Read the identities of the districts (w Swedish letters)
districts_url <- "http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv"
valres <- read_csv2(districts_url, locale = locale("sv", encoding = "ISO-8859-1", asciify = FALSE))
# Add a variable to separate the two types of electoral districts
valres$typ <- "valdistrikt"
valres$typ[nchar(valres$VALDIST) == 2] <- "onsdagsdistrikt"
# Create a vector w all the web addresses to the district data
base_url <- "http://www.val.se/val/val2014/slutresultat/K/%s/%s/%s/%s/personroster.html"
urls <- with(valres, sprintf(base_url, typ, LAN, KOM, VALDIST))
# Scrape the data
pb <- progress_estimated(length(urls))
map_df(urls, function(x) {
pb$tick()$print()
# Maybe add Sys.sleep(1)
read_html(x) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
}) -> df
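One possible way to add the district identity to the scraped data (a sketch, not part of the original post): give the URL vector names built from paste0(LAN, KOM, VALDIST) and let map_df() write those names into a column via its .id argument. The column name "district" is illustrative.
# Sketch: carry the district identity through map_df() via named URLs
ids <- with(valres, paste0(LAN, KOM, VALDIST))
urls <- setNames(urls, ids)
map_df(urls, function(x) {
pb$tick()$print()
read_html(x) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
}, .id = "district") -> df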
Any help would be greatly appreciated!
All the best,
Richard

You can use sprintf() to do positional substitution and then use purrr::map_df() to iterate over a vector of URLs and generate a data frame:
library(rvest)
library(readr)
library(purrr)
library(dplyr)
districts_url <- "http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv"
valres <- read_csv2(districts_url, locale=locale("sv",encoding="UTF-8", asciify=FALSE))
base_url <- "http://www.val.se/val/val2014/slutresultat/K/valdistrikt/%s/%s/%s/personroster.html"
urls <- with(valres, sprintf(base_url, LAN, KOM, VALDIST))
pb <- progress_estimated(length(urls))
map_df(urls, function(x) {
pb$tick()$print()
read_html(x) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
}) -> df
HOWEVER, you should add a randomized delay to avoid being blocked as a bot, and you should look at wrapping read_html() with purrr::safely(), since not all of those LAN/KOM/VALDIST combinations are valid URLs (at least in my testing).
That code also provides a progress bar, since it's going to take a while (probably an hour on a moderately decent connection).
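A sketch of both suggestions (the 1-3 second delay range and the NULL-skipping behaviour are illustrative choices, not from the answer): wrap read_html() in purrr::safely() so that an invalid URL yields NULL instead of aborting the whole run, and sleep a random interval between requests.
safe_read <- safely(read_html)
map_df(urls, function(x) {
pb$tick()$print()
Sys.sleep(runif(1, 1, 3)) # random 1-3 second pause between requests
page <- safe_read(x)
if (is.null(page$result)) return(NULL) # skip invalid LAN/KOM/VALDIST combinations
page$result %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
}) -> df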

Related

Scraping Dynamic JSON Data in R

On pgatour.com/stats I am trying to scrape multiple stats over multiple tournaments over multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, PGA's website looked like:
https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html
STAT_ID, YEAR_ID, and TOURNAMENT_ID would all change as you updated the particular stat, year, and tournament ID to correspond with their unique IDs. Because of this, I was able to use a function that sifted through all combinations of stat_id, year_id, and tournament_id to scrape the website.
Now the website URLs don't change except for the particular stat_id being searched. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged. This prevents targeting different tournaments or years.
https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id
@Dave2e has been very helpful in showing me that the PGA site uses JavaScript and how to access some of the JSON data. I combined his teachings with my past code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the JSON structure I see that there are IDs for $tournamentId and $year, but I'm uncertain how to use this info to search for past tournaments and years.
How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be trying to access this data with RSelenium as opposed to a package like rvest?
Code
library(tidyverse)
library(rvest)
library(dplyr)
df23 <- expand.grid(
stat_id = c("02568","02675", "101")
) %>%
mutate(
links = paste0(
"https://www.pgatour.com/stats/detail/",
stat_id
)
) %>%
as_tibble()
get_info <- function(link, stat_id) {
data <- link %>%
read_html() %>%
html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
html_text() %>%
jsonlite::fromJSON()
answer <- data$props$pageProps$statDetails$rows %>%
#NA's in player name stops data from being collected
drop_na(playerName)
# get lists of dataframes into single dataframe, then merge back with original dataframe
answer2 <- answer$stats
answer2 <- bind_rows(answer2, .id = "column_label") %>%
select(-color) %>%
pivot_wider(
values_from = statValue,
names_from = statName)
#All stats combined and unnested
stats2 <- dplyr::bind_cols(answer, answer2)
}
test_stats <- df23 %>%
mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_stats <- test_stats %>%
unnest(everything())
Simplified code courtesy of @Dave2e
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
#find the script with the correct id tag, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
#convert from JSON
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)
#get the main table
answer <-output$props$pageProps$statDetails$rows
If you take a look at the developer tools (F12 key in your browser) and observe the Network tab when you click on a different year, you can see a background request being made to retrieve that year's data. It returns a JSON dataset similar to the one in your original post.
To scrape this you need to replicate this GraphQL POST request in your R program. Note that it sends a JSON document with query details, which includes the tournament codes and the year.
Finally, to ensure that your GraphQL request succeeds, make sure that you match the headers you see in the request inspector, in particular the Origin and Referer headers and the X- prefixed ones (you can probably hardcode these).
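A minimal sketch of replicating such a request with httr. Every value below (the endpoint URL, operation name, variable names, query string and header values) is a placeholder to be copied from the request shown in the Network tab, not a confirmed part of the PGA Tour API.
library(httr)
# Placeholders/assumptions -- copy the real values from the Network tab
graphql_url <- "https://<graphql-endpoint-from-network-tab>/graphql"
payload <- list(
operationName = "statDetails", # assumed operation name
variables = list(
statId = "02675",
year = 2022, # the past year you want
tournamentId = "<id seen in the JSON>" # assumed variable name
),
query = "<GraphQL query string copied from the Network tab>"
)
res <- POST(
graphql_url,
add_headers(
Origin = "https://www.pgatour.com",
Referer = "https://www.pgatour.com/",
`x-api-key` = "<copy from the request headers>" # one of the X- prefixed headers
),
body = payload,
encode = "json"
)
stats <- jsonlite::fromJSON(content(res, as = "text", encoding = "UTF-8"))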

Scraping with Rvest in R Studio: Returns df 0 rows by 32 columns

I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is dynamically generated and thus troublesome for rvest to handle. But the key to scraping this page is realizing that the data is stored as JSON in a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, and some trial and error extracted the JSON data. Finally, the player and team name columns are cleaned up for the final answer.
library(rvest)
library(dplyr)
library(stringr)
url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)
#the data for the page is stored in a script element
scripts <-page %>% html_elements("script")
#get column names
headers <- page %>% html_elements("thead th") %>% html_text()
#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()
#examined the data string and notice the start of the JSON was '[ ['
#end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")
#convert the JSON into data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame
#rename column titles
names(answer) <- headers
#function to clean up html code in columns
cleanhtml <- function(text) {
text %>% read_html() %>% html_text()
}
#remove the html information in columns 1 &3
answer <- answer[ , -32] %>% rowwise() %>%
mutate(Player = cleanhtml(Player), Team=cleanhtml(Team))
answer

Scraping data from finviz with R - Structure for

I am new using R and this is my first question. I apologize if it has been solved before but I haven't found a solution.
By using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library (rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url,"table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
html_table(fill=TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question refers to lists with more than 20 results, like the one I am using in the example. They use &r=1, &r=21, &r=41, &r=61 at the end of each URL.
How could I create the loop structure in this case?
i=0
for(z in ...){
Many thanks in advance for your help.
Update script based on new table number and link:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList<-c("1","21","41","61") # table list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(url,"&r=",tableNo),'\n')
tables<-read_html(stringr::str_c(url,"&r=",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[17] %>%
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # getting all data in form of list
Here is one approach using stringr and lapply:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList<-c("1","21","41","61") # table number list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(url,"&",tableNo),'\n')
tables<-read_html(stringr::str_c(url,"&",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[11] %>% # check
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it will change for these URLs (the ones with &1, &21, etc.). It is working fine for the base URL, but the data is not present at the 11th index for the URLs with &1, &21, etc., so please change it accordingly.
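As a usage note (a sketch, not part of the original answer): because GetData() returns a data frame per page, the resulting list can be stacked into a single data frame, for example with dplyr::bind_rows().
library(dplyr)
AllData_df <- bind_rows(AllData) # combine the per-page results into one data frame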

Web-scraping paginated website with difficult node

I'm scraping the ASN database (http://aviation-safety.net/database/). I've written code to paginate through each of the years (1919-2019) and scrape all relevant nodes except fatalities (represented as "fat."). Selector Gadget tells me the fatalities node is '#contentcolumnfull :nth-child(5)'. For some reason '.list:nth-child(5)' doesn't work.
When I scrape #contentcolumnfull :nth-child(5), the first element is blank, represented as "".
How can I write a function to delete the first empty element for every year/page that's scraped? It's simple to delete the first element when I scrape a single page on its own:
fat <- html_nodes(webpage, '#contentcolumnfull :nth-child(5)')
fat <- html_text(fat)
fat <- fat[-1]
but I'm finding it difficult to write into a function.
I also have a second question regarding date-time formatting. My date data are represented as day-month-year. Several days and months are missing (e.g. ??-??-1985, JAN-??-2004). Ideally, I'd like to transform the dates into a lubridate object, but I can't with missing components, or if I only keep the years.
At this point, I've used gsub() and regex to clean the data (delete "??" and floating dashes), so I have a mixed bag of data formats. However, this makes it difficult to visualize the data. Thoughts on best practice?
# Load libraries
library(tidyverse)
library(rvest)
library(xml2)
library(httr)
years <- seq(1919, 2019, by=1)
pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
paste0(years)
# Leaving out the category, location, operator, etc. nodes for sake of brevity
read_date <- function(url){
az <- read_html(url)
date <- az %>%
html_nodes(".list:nth-child(1)") %>%
html_text() %>%
as_tibble()
}
read_type <- function(url){
az <- read_html(url)
type <- az %>%
html_nodes(".list:nth-child(2)") %>%
html_text() %>%
as_tibble()
}
date <- bind_rows(lapply(pages, read_date))
type <- bind_rows(lapply(pages, read_type))
# Writing to dataframe
aviation_df <- cbind(type, date)
aviation_df <- data.frame(aviation_df)
# Excluding data cleaning
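On the date question, one possible approach (a sketch, not from the original thread; the example strings below are illustrative): parse the complete dates with lubridate::dmy() and keep the year, which every record has, as a separate fallback column.
library(dplyr)
library(stringr)
library(lubridate)
dates_raw <- c("28-JAN-1985", "??-??-1985", "JAN-??-2004") # illustrative values
dates_df <- tibble(raw = dates_raw) %>%
mutate(
date = suppressWarnings(dmy(raw)), # NA where day or month is missing
year = as.integer(str_extract(raw, "\\d{4}")) # the year can always be recovered
)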
It is bad practice to ping the same page more than once in order to extract the requested information. You should read the page once, extract all of the desired information, and then move to the next page.
In this case the individual nodes are all stored in one master table. rvest's html_table() function is handy for converting an HTML table into a data frame.
library(rvest)
library(dplyr)
years <- seq(2010, 2015, by=1)
pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
paste0(years)
# Leaving out the category, location, operator, etc. nodes for sake of brevity
read_table <- function(url){
#add delay so that one is not attacking the host server (be polite)
Sys.sleep(0.5)
#read page
page <- read_html(url)
#extract the table out (the data frame is stored in the first element of the list)
answer<-(page %>% html_nodes("table") %>% html_table())[[1]]
#convert the fatalities column to character to make a standardized column type
answer$fat. <-as.character(answer$fat.)
answer
}
# Writing to dataframe
aviation_df <- bind_rows(lapply(pages, read_table))
There are a few extra columns which will need clean-up.
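A possible clean-up step (a sketch, assuming the unwanted columns come through as entirely empty strings or NA): drop any column with no real content.
library(dplyr)
aviation_df <- aviation_df %>%
select(where(~ !all(is.na(.x) | .x == ""))) # keep only columns that contain data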

Scraping a table made of tables in R

I have a table from a website I am trying to download, and it appears to be made of a bunch of tables. Right now I am using rvest to bring the table in as text, but it brings in a bunch of other tables I am not interested in, and I then coerce the data into a better format, which is not a repeatable process. Here is my code:
library(rvest)
library(tidyr)
library(data.table)
#Auto Download Data
#reads the url of the race
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W")
#reads in the tables, in this code there are too many
race_results <- race_url %>%
html_nodes(".col") %>%
html_text()
race_results <- data.table(race_results) #turns from a factor to a DT
f <- nrow(race_results) #counts the number of rows in the data
#eliminates all rows after 496 (11*45 + 1) since there are never more than 43 racers
race_results <- race_results[-c(496:f)]
#puts the data into a format with 1:11 numbering for unstacking
testDT <- data.frame(X = race_results$race_results, ind = rep(1:11, nrow(race_results)/11))
testDT <- unstack(testDT, X~ind) #unstacking data into 11 columns
colnames(testDT) <- testDT[1, ] #changing the top column into the header
I commented everything so you would know what I am trying to do. If you go to the URL, there is a top table with driver results, which is what I am trying to scrape, but it is pulling the bottom ones too, as I can't seem to get a different html_nodes to work other than with ".col". I also tried html_table() in place of the html_text() but it didn't work. I suppose this can be done either by identifying the table in the css (I can't figure this out) or by using a different type of call or the XML library (which I also can't figure out). Any help or direction is appreciated.
UPDATE:
From the comments below, the correct code to pull this data is as follows:
library(rvest)
library(tidyr)
#Auto Download Data
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W") #reads the url of the race
race_results <- race_url %>% html_nodes("table") #returns a node set with all of the tables on the page
race_results <- race_results[7] %>% html_table()
race_results <- data.frame(race_results) #turns the list returned by html_table() into a data frame
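A possible way to avoid hard-coding the [7] index (a sketch; the "Driver" column name is an assumption about the results table's header): convert every table and keep the one whose header contains the expected column.
library(rvest)
library(purrr)
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W")
all_tables <- race_url %>% html_nodes("table") %>% html_table(fill = TRUE)
# keep the first table that has a "Driver" column (assumed to identify the results table)
race_results <- keep(all_tables, ~ "Driver" %in% names(.x))[[1]]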
