Extracting data from class = "section wrapper" using Rvest - r

I'm sure a similar question has been answered previously, but I would love to understand why Rvest can't extract data from class = "section wrapper." I'm using R Studio and in short:
library(rvest)

anasj_103 = read_html("https://www.hockey-reference.com/boxscores/201810030SJS.html")
ana_table = anasj_103 %>%
  html_node(xpath = '//*[@id="ANA_skaters"]') %>%
  html_table()
adv_ana = anasj_103 %>%
  html_node(xpath = '//*[@id="ANA_adv"]') %>%
  html_table()
Error that comes back: Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
The ana_table works fine when using the XPath, but adv_ana gives an error or returns nothing when using similar code. I run into this issue with all of the data that is in a div section followed by that class. Since I can't even return basic text from the section wrapper, I'm convinced this is the issue.
Any thoughts or workarounds?

Thanks to QHarr for the assistance.
The above question was solved by using:
table = anasj_103 %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#ANA_adv') %>%
  html_table()
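The trick works because sports-reference pages ship every table below the first one inside an HTML comment, so the comments have to be extracted and re-parsed before html_node() can see the table. Here is a reusable sketch of that idea (the helper name read_commented_table is mine, not from the thread; it assumes the target table's id is known):

library(rvest)

# Hypothetical helper: pull a table that the page hides inside an HTML comment.
read_commented_table <- function(page, table_id) {
  page %>%
    html_nodes(xpath = '//comment()') %>%    # all comment nodes
    html_text() %>%                          # the hidden HTML as text
    paste(collapse = '') %>%                 # stitch into one string
    read_html() %>%                          # re-parse it as a document
    html_node(paste0('table#', table_id)) %>%
    html_table()
}

# Usage, mirroring the fix above:
adv_ana = read_commented_table(anasj_103, "ANA_adv")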

Related

Web page not found in web scraping, how can I find it in R?

I've been working with R for about a year and love it. I've gotten into text mining recently and have had some difficulty. I'm trying to create a data frame with information from a website. I've been scraping the data and have been able to create two variables successfully. In attempting to create the third variable, it's not working. When I view the table that I've made, the content for that variable says "Sorry webpage cannot be found." But I know it's there! Any thoughts? Thanks everyone!
link = "https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391/"
page = read_html(link)
title = page %>% html_nodes(".newsLst_mod a") %>% html_text()
slinks = page %>% html_nodes(".newsLst_mod a") %>%
html_attr("href") %>% paste("https://www.fmprc.gov.cn", ., sep = "")
date = page %>% html_nodes(".newsLst_mod span") %>% html_text()
Somewhere here is where I run into trouble. I get 'p' when using SelectorGadget and put that in the html_nodes() function; however, this doesn't seem to work and I'm coming up empty. If I adjust the scraping a little, the table might have nothing in it when I view it.
get_s = function(slinks) {
speeches_link = read_html(slinks)
speech_words = speeches_link %>% html_nodes("p") %>%
html_text() %>% paste(collapse = ",")
return(speech_words)
}
words = sapply(slinks, FUN = get_s)
speeches = data.frame(title, date, words, stringsAsFactors = FALSE)
The base URL that you need to paste in front of each relative link is https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391, not just the domain.
Try the following -
library(rvest)
slinks = page %>% html_nodes(".newsLst_mod a") %>%
  html_attr("href") %>%
  trimws(whitespace = '\\.') %>%   # drop the leading dot(s) from the relative hrefs
  paste0("https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391", .)

get_s = function(slinks) {
  speeches_link = read_html(slinks)
  speech_words = speeches_link %>% html_nodes("p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_words)
}
words = sapply(slinks, FUN = get_s)
words
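With the corrected slinks, the rest of the question's pipeline should work as written; the final step reuses the question's own title, date, and words objects:

speeches = data.frame(title, date, words, stringsAsFactors = FALSE)
head(speeches)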

html_attr "href" does not extract link

I want to download the file that is in the tab "Dossier" with the text "Modul 4" here:
https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier
First I want to get the link.
My code for that is the following:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(".gba-download__text") %>%
.[[4]] %>%
html_attr("href")
(I know the piece .[[4]] is not really good, this is not my full code.)
This leads to NA and I don't understand why.
Similar questions couldn't help here.
Allan already left a concise answer, but let me offer another way. If you check the page source, you can see that the target is in .gba-download-list (there are actually two of them). So get that part and walk down to the href attribute. Once you have the URLs, you can use grep() to identify the link containing Modul4. I used unique() at the end to remove a duplicate.
read_html("https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier") %>%
html_nodes(".gba-download-list") %>%
html_nodes("a") %>%
html_attr("href") %>%
grep(pattern = "Modul4", value = TRUE) %>%
unique()
[1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
It's easier to get to a specific node if you use XPath:
library(rvest)
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(xpath = "//span[contains(text(),'Modul 4')]/..") %>%
.[[1]] %>%
html_attr("href")
#> [1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
I have another solution now and want to share it:
library(stringr)
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes("a.download-helper") %>%
html_attr("href") %>%
.[str_detect(., "Modul4")] %>%
unique
It is faster to use a CSS attribute selector with the contains operator (*=) to target the href by substring. In addition, only a single node match needs to be returned:
library(rvest)
url <- "https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier"
link <- read_html(url) %>%
html_node("[href*='Modul4']") %>%
html_attr("href") %>% url_absolute(url)
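Since the original goal was to download the Modul 4 file, the resulting absolute link can then be passed to base R's download.file(); the destination filename below is just an illustrative choice:

# Save the PDF locally; mode = "wb" keeps the binary intact (destination name is arbitrary)
download.file(link, destfile = "Modul4A_Apixaban.pdf", mode = "wb")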

extracting table with htmltab R

I'm attempting to scrape the second table from
https://fbref.com/en/comps/9/passing/Premier-League-Stats
I have used
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = 2)
which returns
"Error: Couldn't find the table. Try passing (a different) information
to the which argument"
and also
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = "//table[2]")
which returns
"Error in Node[1] : subscript out of bounds"
There are 2 tables on the webpage. Can anyone point me in the right direction here?
Thanks.
Edit: I've now realised that there's only 1 table on the webpage and what I thought was a table is not. Now I'm even more confused as to where to go with this.
Answering my own question here, for anyone who may have the same problem: anything other than the top table on any of the sports-reference websites (hockey/basketball/baseball) is stored inside HTML comments.
PremLeague = "https://fbref.com/en/comps/12/stats/La-Liga-Stats"
Prem = PremLeague %>%
read_html %>%
html_nodes(xpath = '//comment()') %>%
html_text() %>%
paste(collapse='') %>%
read_html() %>%
html_node("#stats_standard") %>%
html_table()
This worked for me.
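The same pattern should carry over to the passing page from the question. The id of the passing table is not shown in the thread, so the sketch below first lists the ids of all comment-wrapped tables and then grabs one; adjust the selector once the right id is known:

library(rvest)

passing_page = "https://fbref.com/en/comps/9/passing/Premier-League-Stats" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # the lower tables live inside comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html()

# Inspect which table ids exist, then select the passing table by its id.
passing_page %>% html_nodes("table") %>% html_attr("id")
passing_table = passing_page %>% html_node("table") %>% html_table()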

Trying to extract the links of r packages using rvest

I have been trying to use this question and this tutorial to get the table and links for the list of available R packages on CRAN.
Getting the html table
I got that right doing this:
library(rvest)
page <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("table") %>%
  html_table(fill = TRUE, header = FALSE)
trying to get the links
Getting the links is where I run into trouble. I tried using SelectorGadget on the first column of the table (the package links) and got the node td a, so I tried this:
test2 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("td a") %>%
  html_attr("href")
But I only get the first link. Then I thought I could get all the hrefs from the table and tried the following:
test3 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("table") %>%
  html_attr("href")
but got nothing. What am I doing wrong?
Essentially, an "s" is missing: use html_nodes() instead of html_node():
x <- read_html(paste0(
  "http://cran.r-project.org/web/",
  "packages/available_packages_by_name.html"))

html_nodes(x, "td a") %>%
  sapply(html_attr, "href")
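If the goal is to keep the package table and the links together, here is a minimal sketch; it assumes the number of td a links matches the number of table rows, which is worth verifying since the page may also contain non-package links:

library(rvest)

url <- "http://cran.r-project.org/web/packages/available_packages_by_name.html"
doc <- read_html(url)

pkgs  <- doc %>% html_node("table") %>% html_table(fill = TRUE, header = FALSE)
links <- doc %>% html_nodes("td a") %>% html_attr("href")

# Check that the counts line up before binding them into one data frame.
length(links)
nrow(pkgs)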

Scraping Lineup Data From Football Reference Using R

I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table or elements of the table I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This is given my example, but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # select comments
  html_text() %>%                         # extract comment text
  paste(collapse = '') %>%                # collapse to single string
  read_html() %>%                         # reread as HTML
  html_node('table#returns') %>%          # select desired node
  html_table()
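Applied to the original boxscore loop, the same comment trick should recover the starters tables. The table id home_starters is an assumption inferred from the div_home_starters id in the question, so confirm it against the commented HTML:

library(rvest)

url2 = 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'

home_starters2 = read_html(url2) %>%
  html_nodes(xpath = '//comment()') %>%    # the starter tables sit inside comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#home_starters') %>%     # id assumed; check the page source
  html_table()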
