How to scrape a table created with Datawrapper using rvest? - r

I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the Datawrapper page to extract Table 1.
I will be really grateful if someone can point out the mistake I am making, or suggest a solution to scrape this table.
Many thanks.

The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the CSV download URL from the iframe page and then request that CSV directly:
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
#grab the src of the iframe whose title starts with "Table 1"
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
#pull the numeric chart id out of the meta tag on the iframe page
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
#build the csv download url and read it
csv_url <- paste(iframe, id, 'dataset.csv', sep = '/')
data <- vroom(csv_url, show_col_types = FALSE)

Related

Rvest Pulls Empty Tables

The site I use to scrape data has changed, and I'm having issues pulling the data into table format. I used two different approaches below to try to get the tables, but both return blanks instead of tables.
I'm a novice with regard to scraping and would appreciate the expertise of the group. Should I look for other solutions in rvest, or try to learn a package like RSelenium?
https://www.pgatour.com/stats/detail/02675
Scrape for Multiple Links
library("dplyr")
library("purr")
library("rvest")
df23 <- expand.grid(
stat_id = c("02568","02674", "02567", "02564", "101")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/detail/',
stat_id
)
) %>%
as_tibble()
#replaced tournament_id with stat_id
get_info <- function(link, stat_id){
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]]
}
test_main_stats <- df23 %>%
mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_main_stats <- test_main_stats %>%
unnest(everything())
Alternative Code
url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
html_nodes(".css-8atqhb") %>%
html_table
This page uses JavaScript to create the table, so rvest will not work directly. But if you examine the page's source code, all of the data is stored in JSON format inside a "<script>" node.
This code finds that node and converts the JSON to a list. The rows element extracted at the end is the main table, but there is a wealth of other information contained in the JSON data structure.
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
#find the script with the correct id tag, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
#convert from JSON
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)
#get the main table
answer <- output$props$pageProps$statDetails$rows
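To cover the multiple stat ids from the original question, the same extraction can be wrapped in a function and mapped over the ids. This is only a sketch, assuming each stats page exposes the same __NEXT_DATA__ script and statDetails structure (the rows may differ from stat to stat):
#wrap the JSON extraction in a function and map it over the stat ids
library(rvest)
library(purrr)
get_stat_rows <- function(stat_id){
  page <- read_html(paste0("https://www.pgatour.com/stats/detail/", stat_id))
  datascript <- page %>% html_element(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
  jsonlite::fromJSON(datascript)$props$pageProps$statDetails$rows
}
stat_ids <- c("02568", "02674", "02567", "02564", "101")
all_stats <- map(stat_ids, possibly(get_stat_rows, otherwise = NULL)) %>% set_names(stat_ids)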

character(0) after scraping webpage with read_html

I'm trying to scrape "1,335,000" (the employee count shown near the bottom of the page). I wrote the following code in R.
t2<-read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>%
rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed out, the page uses JavaScript, so rvest alone can't get at the data.
url = "https://fortune.com/company/amazon-com/fortune500/"
#launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[@id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
html_text()
[1] "1,335,000"
Data is loaded dynamically from a script tag, so there is no need for the expense of a browser. You could either extract the entire JavaScript object within the script, pass it to jsonlite to handle as JSON and then extract what you want, or, if you are just after the employee count, regex that out of the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
page <- read_html('https://fortune.com/company/amazon-com/fortune500/')
data <- page %>% html_element('#preload') %>% html_text() %>%
stringr::str_match(. , "PRELOADED_STATE__ = (.*);") %>% .[, 2] %>% jsonlite::parse_json()
print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)
#shorter version
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[,2] %>% as.integer() %>% format(big.mark=","))

Data Scraping a Table with Multiple pages

I am currently trying to retrieve a table from the CDC website (https://www.cdc.gov/obesity/data/prevalence-maps.html#states); however, the table in question has multiple pages that must be scrolled through, and I am having difficulty retrieving it and getting it into RStudio. I have tried to use the possibly() function from purrr, but with no luck. Any help is appreciated.
library(rvest)
library(dplyr)
library(purrr)
link <- "https://www.cdc.gov/obesity/data/prevalence-maps.html"
xpaths <- paste0('//*[@id="DataTables_Table_0', 1:9, '"]/table[2]')
scrape_table <- function(link, xpath){
link %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table() %>%
flatten_df %>%
setNames(c("State", "Prevalence", "95 CI"))
}
scrape_table_possibly <- possibly(scrape_table, otherwise = NULL)
scraped_tables <- map(xpaths, ~ scrape_table_possibly(link = link, xpath = .x))
The page source doesn't contain the data; it is loaded externally by JavaScript, so you can't scrape it with rvest anyway. The table you want comes from this file: https://www.cdc.gov/obesity/data/maps/2019-overall.csv
Edit: I scrolled down and saw other tables:
https://www.cdc.gov/obesity/data/maps/2019-white.csv
https://www.cdc.gov/obesity/data/maps/2019-hispanic.csv
https://www.cdc.gov/obesity/data/maps/2019-black.csv
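If those CSV endpoints are still live, they can be read straight into R. A minimal sketch, assuming the URLs above have not moved:
#read each CSV directly; the names label the demographic group
library(vroom)
csv_urls <- c(
  overall  = "https://www.cdc.gov/obesity/data/maps/2019-overall.csv",
  white    = "https://www.cdc.gov/obesity/data/maps/2019-white.csv",
  hispanic = "https://www.cdc.gov/obesity/data/maps/2019-hispanic.csv",
  black    = "https://www.cdc.gov/obesity/data/maps/2019-black.csv"
)
obesity_tables <- lapply(csv_urls, vroom, show_col_types = FALSE)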

Scraping Lineup Data From Football Reference Using R

I always seem to have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
keep = substr(x, 10, 36)
url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
webpage2 = read_html(url2)
home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
#code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my own example, but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
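Applied to one of the boxscore pages from the question, the same comment-extraction approach would look roughly like the sketch below; the 'table#home_starters' selector is an assumption based on the page's div_home_starters wrapper and may need checking against the actual source.
library(rvest)
#the starters tables sit inside HTML comments, so pull the comments out first
'https://www.pro-football-reference.com/boxscores/201609110rav.htm' %>%
read_html() %>%
html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>%                       # extract comment text
paste(collapse = '') %>%              # collapse to single string
read_html() %>%                       # reread as HTML
html_node('table#home_starters') %>%  # assumed id of the home starters table
html_table()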

R - Web Scrape of job board

I am trying to get a list of companies and jobs in a table from indeed.com's job board.
I am using the rvest package with a base URL of http://www.indeed.com/jobs?q=proprietary+trader&
install.packages("gtools")
install.packages('rvest")
library(rvest)
library(gtools)
mydata = read.csv("setup.csv", header=TRUE)
url_base <- "http://www.indeed.com/jobs?q=proprietary+trader&"
names <- mydata$Page
results<-data.frame()
for (name in names){
url <-paste0(url_base,name)
title.results <- url %>%
html() %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- url %>%
html() %>%
html_nodes(".company") %>%
html_text()
results <- smartbind(company.results, title.results)
results3<-data.frame(company=company.results, title=title.results)
}
new <- results(Company=company, Title=title)
and then looping a concatenation. For some reason it is not grabbing all of the jobs, and it is mixing up the companies and jobs.
It might be because you make two separate requests to the page. You should change the middle part of your code to:
page <- url %>%
read_html()
title.results <- page %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- page %>%
html_nodes(".company") %>%
html_text()
When I do that, it seems to give me 10 jobs and companies that match. Can you give an example of a query URL that otherwise doesn't work?
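For completeness, here is a sketch of the full loop with a single request per page, pairing each page's companies and titles in a data frame and stacking the pages. It assumes the .company and .jobtitle selectors still match Indeed's current markup, and that each page returns one company per job title.
library(rvest)
library(dplyr)
url_base <- "http://www.indeed.com/jobs?q=proprietary+trader&"
names <- mydata$Page   #as in the question's setup.csv
results <- lapply(names, function(name){
  page <- read_html(paste0(url_base, name))   #one request per page
  data.frame(
    company = page %>% html_nodes(".company") %>% html_text(),
    title   = page %>% html_nodes(".jobtitle") %>% html_text(),
    stringsAsFactors = FALSE
  )
}) %>% bind_rows()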
