encoding issues with rvest, "Â " appearing instead of a space in character strings - r

I am trying to scrape a table from baseball-reference.com using rvest. My code is:
url="http://www.baseball-reference.com/leagues/NL/2016-standard-batting.shtml"
css=""#players_standard_batting.sortable.stats_table"
read_html(url) %>% html_node(css) %>% html_table()->nlbatting.raw
At this point the table is a bit garbled: there is an 'Â' wherever there should be a space. I have tried
nlbatting.raw %>% mutate(Name=repair_encoding(Name))->nlbatting.raw
which makes everything look ok, but then I get really odd behavior. For instance:
nlbatting.raw$Name[86]=="Yoenis Cespedes"
FALSE
and:
gsub(" ","_",nlbatting.raw$Name[86])
"Yoenis Cespedes"
I have tried different encoding parameters in read_html() but nothing changes. I tried leaving the encoding alone and just gsubbing out the 'Â' but have the same problem. Any help would be great, thanks in advance!
PS: Long-time lurker, first-time poster; sorry if I've missed something obvious.

Edited to fix the html_nodes() selector (from ".class" to ".stats_table"). It worked fine for me. Try this again:
library(rvest)
url <- "http://www.baseball-reference.com/leagues/NL/2016-standard-batting.shtml"
data <- read_html(url) %>% html_nodes(".stats_table") %>% html_table()
head(data[[1]])
head(data[[2]])
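If the stray 'Â' characters still show up in the result, they are usually UTF-8 non-breaking spaces (U+00A0) being displayed as "Â "; replacing them with ordinary spaces typically fixes both the display and the string comparisons from the question. A minimal sketch, assuming the batting table has been read into nlbatting.raw as in the question:
library(dplyr)
# Swap non-breaking spaces (U+00A0) for regular spaces in the Name column
nlbatting.raw <- nlbatting.raw %>%
  mutate(Name = gsub("\u00A0", " ", Name))
# The comparison from the question should now behave as expected
nlbatting.raw$Name[86] == "Yoenis Cespedes"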

Related

Web scraping tables on college basketball stats

I am new to web scraping and working on a test project in which I am trying to scrape every table of data on the following website for this particular team. There should be 15 tables, but when I run my code, it only seems to pull the first 6 of the 15. How do I go about getting the rest of the tables?
Here is the code:
library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
iowa_stats %>% html_table()
Edit: I decided to dig a little deeper into the problem and see if I could get any more insight. I started with the first table that doesn't appear when you call html_table(), which is the 'Totals' table, and followed the path of the HTML all the way down to the table to see if I could figure out what's wrong. To do so, I used the following code.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")
This is as far as I can get before hitting an error. At the next step there should be a div#div_totals.table_container.is_setup node in which the table is stored, but if I add that to the above code it finds nothing. It also comes back empty when I type the following:
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")
Does someone who is better with HTML/CSS have any idea why this is the case?
It looks like this web page stores some of the tables inside HTML comments. To work around this, read and save the web page, remove the lines containing the comment tags, and then process it normally.
library(rvest)
library(dplyr)
iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
# Only save and work with the body
body <- html_node(iowa_stats, "body")
write_xml(body, "temp.xml")
# Find and remove the lines containing the comment markers
lines <- readLines("temp.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "temp2.xml")
# Read the file back in and process normally
body <- read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()

How to log scrape paths rvest used?

Background: Using rvest I'd like to scrape all details of all art pieces for the painter Paolo Uccello on wikiart.org. The endgame will look something like this:
> names(uccello_dt)
[1] title year style genre media imgSRC infoSRC
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful for me in understanding exactly what path the scrape took to get character(0). I'd like to have my scrape attempts output what path it specifically took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector tool to make sure that I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been trial and error that's taking much longer than I think it should. Here's a cleaned-up source of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)
sample_url <-
read_html(
"https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)
imgSrc <-
  sample_url %>%
  html_nodes(".wiki-detailed-item-container") %>%
  html_nodes(".masonry-detailed-artwork-item") %>%
  html_nodes("aside") %>%
  html_nodes(".wiki-layout-artist-image-wrapper") %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  as.character()
title <-
  sample_url %>%
  html_nodes(".masonry-detailed-artwork-title") %>%
  html_text() %>%
  as.character()
Thank you in advance.
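One way to see where a chain like the one above goes empty is to log how many nodes each selector matched before handing the result on; a small sketch with a hypothetical helper, log_nodes(), wrapping html_nodes():
library(rvest)
# Hypothetical helper (not part of rvest): run html_nodes() and report how many
# nodes the selector matched, so the step where the chain drops to zero is visible.
log_nodes <- function(x, css) {
  out <- html_nodes(x, css)
  message(sprintf("selector '%s' matched %d node(s)", css, length(out)))
  out
}
sample_url <- read_html(
  "https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)
imgSrc <- sample_url %>%
  log_nodes(".wiki-detailed-item-container") %>%
  log_nodes(".wiki-layout-artist-image-wrapper") %>%
  log_nodes("img") %>%
  html_attr("src")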

Google Searching in R

This question is not so much about how to Google search in R (discussed many times before) as about why it does not always work.
I found this code in another question posted here, which I recall working perfectly. It would produce all the links in the search.
But now it does not work. For some reason the node is not there anymore when I pull the data into R, yet when I inspect the HTML code in Chrome it is there while I am browsing. The inspector shows the h3 node, but it is not present in what actually gets downloaded.
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//h3/a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
I get the following return:
character(0)
The Google page's display of links depends on your location/preferences, so maybe this is what is causing the issue?
It appears that the format switched very recently, maybe today, and that //h3 is no longer used. This produces what is intended, with one final extraneous result:
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
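If the extraneous entry really is always the final one, as noted above, dropping the last element after the cleanup step is enough; a sketch assuming the links object from the code above:
# Same cleanup as above, kept in a variable so the trailing entry can be dropped
res <- gsub('/url\\?q=', '',
            sapply(strsplit(links[as.vector(grep('url', links))], split = '&'), '[', 1))
head(res, -1)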

Xpath found with Elements but not readable/scrapeable via rvest

I am trying to scrape the dollar amounts listed on a set of donation websites. So in this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The XPath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/form/div/div[1]/div/div/ul/li[2]/label
and the CSS selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)
Then when I add the second part of the xpath, it shows up blank. Same with
X %>% html_nodes("li")
which gives a bunch of things, but all the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT elements come up blank.
I have worked with rvest for a fair bit now, but this one's baffling. And I am not quite sure how RSelenium will help here, although I know how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the Wayback Machine: there's only the background and nothing else.
I have even tried just taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I've tried image_negate and image_resize to see if they helped, but they only showed that relying on OCR is rather a bad idea, as it depends on the screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use a regex to extract the numbers from the script tag. You get a comma-separated character vector:
library(rvest)
library(stringr)
con <- url("https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website", "rb")
page <- read_html(con)
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'preloadedState')]") %>%
  html_text() %>%
  as.character() %>%
  str_match_all('(?<="amounts":\\[)(\\d+,?)+')
print(res[[1]][, 1])
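To turn that comma-separated string into numeric dollar amounts, one more step does it; a sketch assuming res from the code above:
# Split on commas and convert, giving e.g. 3 10 25 100 250 1500 2800
amounts <- as.numeric(strsplit(res[[1]][, 1], ",")[[1]])
amounts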

Applying rvest pipes to a dataframe

I've got a dataframe called base_table with a lot of 311 data and URLs that point to a broader description of each call.
I'm trying to create a new variable called case_desc with a series of rvest functions applied to each URL.
base_table$case_desc <-
read_html(base_table$case_url) %>%
html_nodes("rc_descrlong") %>%
html_text()
But this doesn't work, for what I suppose are obvious reasons that I can't muster right now. I've tried playing around with functions, but can't seem to nail the right format.
Any help would be awesome! Thank you!
It doesn't work because read_html doesn't work with a vector of URLs. It will throw an error if you give it a vector...
> read_html(c("http://www.google.com", "http://www.yahoo.com"))
Error: expecting a single value
You probably have to use an apply function...
library("rvest")
base_table$case_desc <- sapply(base_table$case_url, function(x)
read_html(x) %>%
html_nodes("rc_descrlong") %>%
html_text())
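If some of the URLs are slow or broken, wrapping each request in tryCatch() keeps a single failure from aborting the whole column; a sketch that assumes the same "rc_descrlong" selector as above and collapses multiple matches into one string:
library(rvest)
base_table$case_desc <- sapply(base_table$case_url, function(x) {
  tryCatch(
    read_html(x) %>%
      html_nodes("rc_descrlong") %>%
      html_text() %>%
      paste(collapse = " "),        # guard against pages with more than one match
    error = function(e) NA_character_  # return NA instead of stopping the loop
  )
})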
