I'm attempting to scrape the second table from
https://fbref.com/en/comps/9/passing/Premier-League-Stats
I have used
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = 2)
which returns
"Error: Couldn't find the table. Try passing (a different) information
to the which argument"
and also
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = "//table[2]")
which returns
"Error in Node[1] : subscript out of bounds"
There are 2 tables on the webpage. If anyone can point me in the right direction here, that would be great.
Thanks.
Edit: I've now realised that there's only 1 table on the webpage, and what I thought was a table is not. Now I'm even more confused as to where to go with this.
Answering my own question here, for anyone who may have the same problem.
On any of the sports-reference websites (Hockey/Basketball/Baseball/Football), anything other than the top table is wrapped inside HTML comments, so rvest can't see it directly.
PremLeague = "https://fbref.com/en/comps/12/stats/La-Liga-Stats"
Prem = PremLeague %>%
read_html %>%
html_nodes(xpath = '//comment()') %>%
html_text() %>%
paste(collapse='') %>%
read_html() %>%
html_node("#stats_standard") %>%
html_table()
This worked for me.
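For the original passing page from the question, the same comment-parsing approach should work. This is only a sketch, and the "#stats_passing" id is an assumption about how fbref names the player passing table, so check the commented-out HTML for the actual id:

library(rvest)
library(dplyr)

passing_url <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"

passing <- passing_url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # the lower tables are wrapped in HTML comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#stats_passing") %>%         # assumed id of the player passing table
  html_table()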
Related
I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the Datawrapper page to extract Table 1.
I would be really grateful if someone could point out the mistake I am making, or suggest a solution for scraping this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the csv download url from the iframe page and then request that csv.
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
# grab the src of the Datawrapper iframe whose title starts with "Table 1"
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
# pull the numeric chart id out of the iframe page's meta tag
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
# build the csv download url and read it with vroom
csv_url <- paste(iframe, id, 'dataset.csv', sep = '/')
data <- vroom(csv_url, show_col_types = FALSE)
Hello, I'm trying to scrape the market table at the end of this page: "https://coinmarketcap.com/currencies/bitcoin/markets/"
This is what I tried
crypto_url <- read_html("https://coinmarketcap.com/currencies/bitcoin/markets/")
Exchanges <- crypto_url %>%
  html_node(xpath = '//*[@id="__next"]/div/div[2]/div/div[3]/div[2]/div[2]/div/table') %>%
  html_text() %>%
  jsonlite::fromJSON()
This is the error
Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed
I don't think the error itself is relevant; I think the real problem is that I don't know how to find the XPath for the table.
If someone manages to find the XPath, can you please explain the process you used to find it, or link some resources?
Thanks
This can be done with the coingecko API.
url <- "https://api.coingecko.com/api/v3/coins/bitcoin/tickers"
Exchanges <- GET(url)
araw_data <- fromJSON(content(Exchanges, as = "text",encoding = "UTF-8"))
araw_data$tickers$market %>% select(name) %>% pull
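If you need more than the exchange names, the same response can be reshaped into a small table. This is a sketch that assumes the tickers endpoint still returns base, target, and last fields alongside the nested market data frame (check names(araw_data$tickers) first):

# assumed field names: base, target, last
markets <- data.frame(
  exchange = araw_data$tickers$market$name,
  pair     = paste(araw_data$tickers$base, araw_data$tickers$target, sep = "/"),
  last     = araw_data$tickers$last
)
head(markets)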
I am currently trying to retrieve a table from the CDC website (https://www.cdc.gov/obesity/data/prevalence-maps.html#states); however, the table in question has multiple pages that must be scrolled through, and I am having difficulty retrieving it and pulling it into RStudio. I have tried to use the possibly() function from purrr, but no luck. Any help is appreciated.
library(rvest)
library(dplyr)
library(purrr)
link <- "https://www.cdc.gov/obesity/data/prevalence-maps.html"
xpaths <- paste0('//*[@id="DataTables_Table_0', 1:9, '"]/table[2]')
scrape_table <- function(link, xpath){
link %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table() %>%
flatten_df %>%
setNames(c("State", "Prevalence", "95 CI"))
}
scrape_table_possibly <- possibly(scrape_table, otherwise = NULL)
scraped_tables <- map(xpaths, ~ scrape_table_possibly(link = link, xpath = .x))
The page source doesn't contain the data; it is loaded externally by JavaScript, so you can't scrape it with rvest anyway. The table you want comes from this file: https://www.cdc.gov/obesity/data/maps/2019-overall.csv
Edit: I scrolled down and saw other tables:
https://www.cdc.gov/obesity/data/maps/2019-white.csv
https://www.cdc.gov/obesity/data/maps/2019-hispanic.csv
https://www.cdc.gov/obesity/data/maps/2019-black.csv
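A minimal sketch of reading those files directly, assuming the URLs above are still live and share the same column layout:

library(vroom)
library(purrr)

csv_urls <- c(
  overall  = "https://www.cdc.gov/obesity/data/maps/2019-overall.csv",
  white    = "https://www.cdc.gov/obesity/data/maps/2019-white.csv",
  hispanic = "https://www.cdc.gov/obesity/data/maps/2019-hispanic.csv",
  black    = "https://www.cdc.gov/obesity/data/maps/2019-black.csv"
)

# read each csv into a named list of data frames
obesity_tables <- map(csv_urls, ~ vroom(.x, show_col_types = FALSE))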
I'm sure a similar question has been answered previously, but I would love to understand why rvest can't extract data from class = "section wrapper". I'm using RStudio and, in short:
anasj_103 = read_html("https://www.hockey-reference.com/boxscores/201810030SJS.html")
ana_table = anasj_103 %>%
html_node(xpath = '//*[#id="ANA_skaters"]') %>%
html_table()
adv_ana = anasj_103 %>%
html_node(xpath = '//*[#id="ANA_adv"]') %>%
html_table()
Error that comes back: Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
The ana_table works fine when using the XPath, but the adv_ana gives an error or returns nothing when using similar code. I run into this issue with all of the data that sits in a div section followed by that class. Since I can't even return basic text from the section wrapper, I'm convinced this is the issue.
Any thoughts or workarounds?
Thanks to QHarr for the assistance.
The above question was solved by using:
table = anasj_103 %>%
  html_nodes(xpath = '//comment()') %>%   # the ANA_adv table is hidden inside an HTML comment
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#ANA_adv') %>%
  html_table()
I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach (R), the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath = '//*[@id="div_home_starters"]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath = '//*[@id="div_home_starters"]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the URLs for all boxscores in 2016, and the for loop goes to each boxscore page in the hope of extracting the tables headed "[Team Name] Starters".
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my own example, but I'm sure you can apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
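Applied to the boxscore pages from the question, the same pattern should expose the starting-lineup tables. This is a sketch; the home_starters and vis_starters table ids are assumptions based on the div ids the question already references, so verify them in the commented-out HTML:

library(rvest)
library(magrittr)

box <- read_html('https://www.pro-football-reference.com/boxscores/201609110rav.htm')

hidden <- box %>%
  html_nodes(xpath = '//comment()') %>%   # the starter tables live inside comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html()

home_starters <- hidden %>% html_node('table#home_starters') %>% html_table()
away_starters <- hidden %>% html_node('table#vis_starters') %>% html_table()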