Trying to extract the links of R packages using rvest

I have been trying to use this question and this tutorial to get the table and the links for the list of available R packages on CRAN.
Getting the HTML table
I got that right by doing this:
library(rvest)
page <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("table") %>%
  html_table(fill = TRUE, header = FALSE)
Trying to get the links
Getting the links is where I run into trouble. I used SelectorGadget on the first column of the table (the package links) and got the node td a, so I tried this:
test2 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("td a") %>%
  html_attr("href")
But I only get the first link. I then thought I could get all the href attributes from the table and tried the following:
test3 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("table") %>%
  html_attr("href")
But I got nothing. What am I doing wrong?

Essentially, an "s" is missing: use html_nodes() instead of html_node():
x <- read_html(paste0(
  "http://cran.r-project.org/web/",
  "packages/available_packages_by_name.html"))

html_nodes(x, "td a") %>%
  sapply(html_attr, "href")
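
If you also want the package names alongside their links, here is a minimal follow-up sketch (it assumes the hrefs on that page are relative, so they are resolved against the page URL with xml2::url_absolute()):

library(rvest)
library(xml2)

url <- "http://cran.r-project.org/web/packages/available_packages_by_name.html"
page <- read_html(url)

links <- page %>%
  html_nodes("td a") %>%
  html_attr("href") %>%
  url_absolute(base = url)   # resolves relative hrefs; absolute ones pass through unchanged

pkgs <- data.frame(
  package = page %>% html_nodes("td a") %>% html_text(),
  link = links,
  stringsAsFactors = FALSE
)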

Related

Error when using a function: Error in df1[[1]] : subscript out of bounds

I'm trying to scrape the gun laws from https://www.statefirearmlaws.org/. However, I keep getting the following error:
Error in df1[[1]] : subscript out of bounds
I used SelectorGadget to copy the nodes for the table.
What can I do to fix it?
library(rvest)
library(tidyverse)
years <- lapply(c(2006:2018), function(x) {
  link <- paste0('https://www.statefirearmlaws.org/national-data/', x)
  df1 <- link %>%
    read_html() %>%
    html_nodes('.js-view-dom-id-cc833ef0290cd127457401b760770f1411daa41fc70df5f12d07744fab0a173c > div > div') %>%
    html_text(trim = TRUE)
  df <- df1[[1]]
  return(df)
})
df1 <- link %>%
  read_html() %>%
  html_nodes('.js-view-dom-id-cc833ef0290cd127457401b760770f1411daa41fc70df5f12d07744fab0a173c > div > div')
This part results in {xml_nodeset (0)}, which later produces an empty list().
Are you selecting the correct thing you want to scrape in html_nodes()? Maybe SelectorGadget can help you choose what you need.
Also, html_text() expects a node as its input, whereas html_table() outputs a list of tibbles, so html_text() fails to parse that output.
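To illustrate that difference with a toy snippet (nothing here is specific to the firearm-laws page):

library(rvest)

doc <- read_html("<table>
  <tr><th>state</th><th>laws</th></tr>
  <tr><td>AK</td><td>3</td></tr>
</table>")

doc %>% html_node("table") %>% html_table()  # parsed into a data frame
doc %>% html_node("table") %>% html_text()   # just the collapsed cell text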

How to scrape a table created using datawrapper using rvest?

I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the Datawrapper page to extract Table 1.
I will be really grateful if someone can point out the mistake I am making, or suggest a solution to scrape this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the CSV download URL from the iframe page and then request that CSV:
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')

iframe <- page %>%
  html_element('iframe[title^="Table 1"]') %>%
  html_attr('src')

id <- read_html(iframe) %>%
  html_element('meta') %>%
  html_attr('content') %>%
  str_match('/(\\d+)/') %>%
  .[, 2]

csv_url <- paste(iframe, id, 'dataset.csv', sep = '/')
data <- vroom(csv_url, show_col_types = FALSE)
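As a note on the approach: Datawrapper charts typically expose their underlying data at <chart URL>/dataset.csv, which is why the numeric id pulled from the chart's meta tag is enough to build the download link; data then holds the same table that is rendered inside the iframe.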

Data Scraping a Table with Multiple pages

I am currently trying to retrieve a table from the CDC website (https://www.cdc.gov/obesity/data/prevalence-maps.html#states). However, the table in question has multiple pages that must be scrolled through, and I am having difficulty retrieving it and putting it into RStudio. I have tried to use the possibly() function from purrr, but no luck. Any help is appreciated.
library(rvest)
library(dplyr)
library(purrr)
link <- "https://www.cdc.gov/obesity/data/prevalence-maps.html"
xpaths <- paste0('//*[@id="DataTables_Table_0', 1:9, '"]/table[2]')
scrape_table <- function(link, xpath){
  link %>%
    read_html() %>%
    html_nodes(xpath = xpath) %>%
    html_table() %>%
    flatten_df() %>%
    setNames(c("State", "Prevalence", "95 CI"))
}
scrape_table_possibly <- possibly(scrape_table, otherwise = NULL)
scraped_tables <- map(xpaths, ~ scrape_table_possibly(link = link, xpath = .x))
The page source doesn't contain the data; it is loaded separately by JavaScript, so you can't scrape it with rvest anyway. The table you want comes from this file: https://www.cdc.gov/obesity/data/maps/2019-overall.csv
Edit: I scrolled down and saw other tables (a sketch for reading the CSVs directly follows these links):
https://www.cdc.gov/obesity/data/maps/2019-white.csv
https://www.cdc.gov/obesity/data/maps/2019-hispanic.csv
https://www.cdc.gov/obesity/data/maps/2019-black.csv
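Since those files are plain CSVs, here is a small sketch for pulling them straight into R (assuming the 2019-*.csv URLs above are still live; column names are taken from the files rather than assumed here):

library(vroom)
library(purrr)
library(dplyr)

base <- "https://www.cdc.gov/obesity/data/maps/"
groups <- c("overall", "white", "hispanic", "black")

# Read each file and stack them, keeping track of which demographic it came from.
cdc_tables <- map_dfr(groups, function(g) {
  vroom(paste0(base, "2019-", g, ".csv"), show_col_types = FALSE) %>%
    mutate(group = g)
})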

Rvest html_nodes span div and Xpath

I am trying to scrape a website by reading XPath code.
When I go in the developer section, I see those lines:
<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">
I would like to scrape all values of data-abc.
Let's say each element on the site is a movie; I would like to scrape the data-abc value for each movie on the page.
I would like to do so using Rvest package with R.
Below are two different attempts that did not work...
website %>% html_nodes("js-bestRate-show") %>% html_text()
website %>%
  html_nodes(xpath = "js-bestRate-show") %>%
  html_nodes(xpath = "//div") %>%
  html_nodes(xpath = "//span") %>%
  html_nodes(xpath = "//data-abc")
Does anyone know how html_nodes and rvest work?
The node is a span with the class js-bestRate-show; everything else is an attribute. So you want something like:
library(rvest)
h <- '<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">'
h %>%
  read_html() %>%
  html_nodes("span.js-bestRate-show") %>%
  html_attr("data-abc")

Scraping Lineup Data From Football Reference Using R

I always seem to have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my own example, but I'm sure you can apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
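
The same comment-extraction trick should carry over to the box-score pages in the question. Here is a sketch for one of them, assuming the hidden starters table keeps the id home_starters (worth confirming in the page source, since pro-football-reference wraps many of its tables in HTML comments):

library(rvest)

url2 <- "https://www.pro-football-reference.com/boxscores/201609110rav.htm"

home_starters <- url2 %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%  # pull out the commented-out HTML
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%                        # re-parse the hidden markup
  html_node('table#home_starters') %>%   # assumed table id -- check the source
  html_table()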
