Web scraping a table with R rvest

As an exercise to teach myself rvest, I attempted to scrape a website to grab data that is already laid out in a table. The only problem is that I can't get the underlying table data out.
The only thing I really need is the player column.
library(tidyverse)
library(rvest)
base <- "https://www.milb.com/stats/"
base2 <- "?page="
base3 <- "&playerPool=ALL"
html <- read_html(paste0(base,"pacific-coast/","2017",base2,"2",base3))
html2 <- html %>% html_element("#stats-app-root")
# this call fails: html_text()'s second argument is `trim`, not a CSS selector
html3 <- html2 %>% html_text("#stats-body-table player")
Example of the resulting URL: https://www.milb.com/stats/pacific-coast/2017?page=2&playerPool=ALL
"HTML 2" appears to work, but I'm a little stuck about what to do from there. A couple of different attempts just hit a wall.
once this works, I'll replace text with numbers and do a few for loops (which seems pretty simple).

If you "inspect" the page in chrome, you see it's making a call to download a json file. Just do that yourself...
library(jsonlite)
data <- fromJSON("https://bdfed.stitch.mlbinfra.com/bdfed/stats/player?stitch_env=prod&season=2017&sportId=11&stats=season&group=hitting&gameType=R&offset=25&sortStat=onBasePlusSlugging&order=desc&playerPool=ALL&leagueIds=112")
df <- data$stats
head(df)
year playerId playerName type rank playerFullName
1 2017 643256 Adam Cimber player 26 Adam Cimber
2 2017 458547 Vladimir Frias player 27 Vladimir Frias
3 2017 643265 Garrett Cooper player 28 Garrett Cooper
4 2017 542979 Keon Broxton player 29 Keon Broxton
5 2017 600301 Taylor Motter player 30 Taylor Motter
6 2017 624414 Christian Arroyo player 31 Christian Arroyo
...
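Since you mention looping, note the offset= parameter in that URL: page 2 of the site corresponds to offset=25, so you can page through the results and keep just the player column. A minimal sketch under that assumption (the 25-row page size is inferred from the URL above, not a documented API):
library(jsonlite)
# Build the endpoint URL for a given offset; all other query
# parameters are copied verbatim from the request above
get_players <- function(offset) {
  url <- paste0(
    "https://bdfed.stitch.mlbinfra.com/bdfed/stats/player",
    "?stitch_env=prod&season=2017&sportId=11&stats=season&group=hitting",
    "&gameType=R&offset=", offset,
    "&sortStat=onBasePlusSlugging&order=desc&playerPool=ALL&leagueIds=112"
  )
  fromJSON(url)$stats$playerFullName
}
# First four pages of names (offsets 0, 25, 50, 75)
players <- unlist(lapply(seq(0, 75, by = 25), get_players))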

Related

Is there a way to put a wildcard character in a web address when using rvest?

I am new to web scraping and am using R and rvest to try to pull some info for a friend. This project might be a bit much for my first, but hopefully someone can help, or tell me whether it is possible.
I am trying to pull info from https://www.veteranownedbusiness.com/mo such as business name, address, phone number, and description. I started by pulling all the names of the businesses and was going to loop through each page to pull the information by business. The problem I ran into is that the business URLs have numbers assigned to them:
www.veteranownedbusiness.com/business/32216/accel-polymers-llc
Is there a way to tell R to ignore this number or accept any entry in its spot so that I could loop through the business names?
Here is the code I have so far to get and clean the business titles if it helps:
library(rvest)
library(tibble)
library(stringr)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
HTML <- read_html(vet_name_list)
biz_names_html <- html_nodes(HTML, '.top_level_listing a')
biz_names <- html_text(biz_names_html)
biz_names <- biz_names[biz_names != ""]
biz_names_lower <- tolower(biz_names)
biz_names_sym <- gsub("[][!#$&%()*,.:;<=>#^_`|~.{}]", "", biz_names_lower)
biz_names_dub <- str_squish(biz_names_sym)
biz_name_clean <- chartr(" ", "-", biz_names_dub)
No, I'm afraid you can't use wildcards to get a valid URL. What you can do is scrape all the correct URLs from the page, number and all.
To do this, we find all the relevant nodes (I'm using XPath here rather than CSS selectors since it gives a bit more flexibility), then take the href attribute from each node.
This produces a data frame of business names and URLs. Here's a fully reproducible example:
library(rvest)
library(tibble)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
biz <- read_html(vet_name_list) %>%
  html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]")
tibble(business = html_text(biz),
       url = biz %>% html_attr("href"))
#> # A tibble: 550 x 2
#> business url
#> <chr> <chr>
#> 1 Accel Polymers, LLC /business/32216/ac~
#> 2 Beacon Car & Pet Wash /business/35987/be~
#> 3 Compass Quest Joplin Regional Veteran Services /business/21943/co~
#> 4 Financial Assistance for Military Experiencing Divorce /business/20797/fi~
#> 5 Focus Marines Foundation /business/29376/fo~
#> 6 Make It Virtual Assistant /business/32204/ma~
#> 7 Malachi Coaching & Training Ministries Int'l /business/24060/ma~
#> 8 Mike Jackson - Author /business/29536/mi~
#> 9 The Mission Continues /business/14492/th~
#> 10 U.S. Small Business Conference & EXPO /business/8266/us-~
#> # ... with 540 more rows
Created on 2022-08-05 by the reprex package (v2.0.1)
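From there, you can loop over the scraped paths instead of using wildcards to pull the details for each business. A minimal sketch, assuming the detail pages live at those relative paths; the .phone and .description selectors are placeholders, so inspect one business page with SelectorGadget to find the real ones (and the one for the address):
library(rvest)
library(purrr)
base_url <- "https://www.veteranownedbusiness.com"
# `biz` comes from the answer above
detail_urls <- paste0(base_url, biz %>% html_attr("href"))
details <- map(detail_urls, function(u) {
  Sys.sleep(1)  # be polite: one request per second
  page <- read_html(u)
  list(
    # hypothetical selectors -- replace with the real ones
    phone       = page %>% html_node(".phone") %>% html_text(trim = TRUE),
    description = page %>% html_node(".description") %>% html_text(trim = TRUE)
  )
})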

Why can't I read clickable links for webscraping with rvest?

I am trying to scrape this website: https://www.bankofengland.co.uk/news/speeches.
The content I need is available only after clicking on each title. I can get the content I want if I do this, for example (I am using SelectorGadget):
library("rvest")
url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I need to get the text behind each clickable link on the website. So I usually do:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
I get an empty object, though. I tried different variants of the code, all with the same result.
How can I read those links and then apply the code from the first part to all of them?
Can anyone help me?
Thanks!
As @KonradRudolph has noted before, the links are inserted dynamically into the webpage. Therefore, I have produced code using RSelenium and rvest to tackle this issue:
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get link names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
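To finish answering the question, you can now reuse the extraction code from the first part on every link in df$links. A minimal sketch, using the "#output .page-section" selector from the question itself and collapsing multiple matched sections per page into one string:
library(rvest)
df$text <- vapply(as.character(df$links), function(u) {
  Sys.sleep(1)  # small delay between requests
  page <- read_html(u)
  paste(html_text(html_nodes(page, "#output .page-section")), collapse = " ")
}, character(1))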

Could you please help me with web scraping using Rvest?

I am currently trying to webscrape the following website: https://chicago.suntimes.com/crime/archives
I have been relying on SelectorGadget to find the XPath or CSS and to do the web scraping. However, I am unable to use the gadget on this website, so I have to use Inspect Source to find what I need. I have been trying to find the relevant CSS and XPath by scrolling through the source, but I was not able to, due to my limited capabilities.
Could you please help me find the xpath or css for
Title
Author
Date
I am so sorry if this is a laundry list of everything... but I am really stuck. I will really appreciate it if you could give me some help!
Thank you very much.
For each element that you want to extract, find the relevant tag and its respective class using SelectorGadget, and you'll be able to get what you want.
library(rvest)
url <- 'https://chicago.suntimes.com/crime/archives'
webpage <- url %>% read_html()
title <- webpage %>% html_nodes('h2.c-entry-box--compact__title') %>% html_text()
author <- webpage %>% html_nodes('span.c-byline__author-name') %>% html_text()
date <- webpage %>% html_nodes('time.c-byline__item')%>% html_text() %>% trimws()
result <- data.frame(title, author, date)
result
# title author date
#1 Belmont Cragin man charged with carjacking in Little Village: police Sun-Times Wire February 17
#2 Gas station robbed, man carjacked in Horner Park Jermaine Nolen February 17
#3 8 shot, 2 fatally, Tuesday in Chicago Sun-Times Wire February 17
#4 Businesses robbed at gunpoint on the Northwest Side: police Sun-Times Wire February 17
#5 Man charged with carjacking in Aurora Sun-Times Wire February 16
#6 Woman fatally stabbed in Park Manor apartment Sun-Times Wire February 16
#7 Woman critically hurt by gunfire in Woodlawn David Struett February 16
#8 Teen boy, 17, charged with attempted carjacking in Back of the Yards Sun-Times Wire February 16
#...
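Since the question asked for the XPath or the CSS, the same class names translate directly to XPath if you prefer it; here is a sketch of the equivalent calls (the classes are taken from the answer above):
title  <- webpage %>%
  html_nodes(xpath = "//h2[contains(@class, 'c-entry-box--compact__title')]") %>%
  html_text()
author <- webpage %>%
  html_nodes(xpath = "//span[contains(@class, 'c-byline__author-name')]") %>%
  html_text()
date   <- webpage %>%
  html_nodes(xpath = "//time[contains(@class, 'c-byline__item')]") %>%
  html_text() %>%
  trimws()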

Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved; for that I am seeking tips.
library(dplyr)
# Column names that start with a digit, like 2B and 3B, need backticks
df %>%
  filter(teamID == 'NYN') %>%
  select(playerID, yearID, G, `2B`, `3B`, HR) %>%
  group_by(playerID, yearID) %>%
  summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
  arrange(desc(xbh))
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x      1975   10  2  8  20
y      1980    5  5  5  15
z      2000    9  0  4  13
and so on.
I would like the XBH to be in descending order.
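No answer is recorded for this one, but two pointers: Lahman's Batting table is season-level, so the "by game 95" cut can't be computed from it alone (you would need game logs, e.g. from Retrosheet); and a player's rookie season can be approximated as his first season in the table. A minimal sketch under those assumptions, with df as the Batting table read via read.csv("Batting.csv", check.names = FALSE) so the 2B/3B names survive:
library(dplyr)
# Approximate each player's rookie season as his first in the data
rookie_years <- df %>%
  group_by(playerID) %>%
  summarise(rookie = min(yearID))
df %>%
  inner_join(rookie_years, by = "playerID") %>%
  filter(yearID == rookie, teamID == "NYN") %>%
  group_by(playerID, yearID) %>%
  summarise(`2B` = sum(`2B`), `3B` = sum(`3B`), HR = sum(HR),
            XBH = `2B` + `3B` + HR, .groups = "drop") %>%
  arrange(desc(XBH))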

Text Mining with R

I need help with text mining in R. Here is my data:
Title Date Content
Boy May 13 2015 "She is pretty", Tom said. Tom is handsome.
Animal June 14 2015 The penguin is cute, lion added.
Human March 09 2015 Mr Koh predicted that every human is smart...
Monster Jan 22 2015 Ms May, a student, said that John has $10.80. May loves you.
I just want to get the opinions from what the people said.
I would also like help with getting percentages (e.g. 9.8%) out intact, because when I split the sentences on the full stop ("."), I get "His result improved by 0." instead of "His result improved by 0.8%".
Below is the output that I would like to obtain:
Title Date Content
Boy May 13 2015 she is pretty
Animal June 14 2015 the penguin is cute
Human March 09 2015 every human is smart
Monster Jan 22 2015 john has $10.80
Below is the code that I tried, but it didn't give the desired output:
list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]
You're close, but your regular expression for splitting is wrong. This gave the correct arrangement for the data, modulo your request to extract opinions more exactly:
txt <- '
Title Date Content
Boy May 13 2015 "She is pretty", Tom said. Tom is handsome.
Animal June 14 2015 The penguin is cute, lion added.
Human March 09 2015 Mr Koh predicted that every human is smart...
Monster Jan 22 2015 Ms May, a student, said that John has $10.80. May loves you.
'
txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
dataframe <- read.table(sep = "|", text = txt, header = TRUE)
list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
content <- strsplit(dataframe$Content, '\\.(?= )', perl=TRUE)
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
result <- stack(opinions)
In your sample data, all full stops followed by spaces are sentence-ending, so that's what the regular expression \.(?= ) matches. However that will break up sentences like "I was born in the U.S.A. but I live in Canada", so you might have to do additional pre-processing and checking.
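That lookahead also takes care of the percentage problem from the question: a full stop followed by a digit (as in 0.8% or $10.80) is not followed by a space, so it is not treated as a sentence boundary. A quick check:
strsplit("His result improved by 0.8%. He was happy.", '\\.(?= )', perl = TRUE)
#> [[1]]
#> [1] "His result improved by 0.8%" " He was happy."
(The space after the split survives because the lookahead doesn't consume it, so you may want to trimws() the pieces.)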
Then, assuming the Titles are unique identifiers, you can just merge to add the dates back in:
result <- merge(dataframe[c("Title", "Date")], result, by = "Title")
As mentioned in the comments, the NLP task itself has more to do with text parsing than R programming. You can probably get some mileage out of searching for a pattern like
<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...
which would match your sample data, but I'm far from an expert here. You'd also need a dictionary with lexical categories. A Google search for "extract opinion text" yielded a lot of helpful results on the first page, including this site run by Bing Liu. From what I can tell, Professor Liu literally wrote the book on sentiment analysis.
