Web Scraping Across Multiple Pages in R

I have been working on some R code. The purpose is to collect the average word length and other stats about the words in a section of a website with 50 pages. Collecting the stats is no problem and is the easy part. However, getting my code to collect the stats over all 50 pages is the hard part; it only ever seems to output information from the first page. See the code below (and please ignore the poor indentation).
install.packages(c('tidytext', 'tidyverse'))
library(tidyverse)
library(tidytext)
library(rvest)
library(stringr)
websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
textSort <- websitePage %>%
  html_nodes('.product_pod a') %>%
  html_text()
for (page_result in seq(from = 1, to = 50, by = 1)) {
  link = paste0('http://books.toscrape.com/catalogue/page-', page_result, '.html')
  page = read_html(link)
  # Creates a tibble
  textSort.tbl <- tibble(text = textSort)
  textSort.tidy <- textSort.tbl %>%
    unnest_tokens(word, text)
}
# Finds the average word length
textSort.tidy %>%
  map(nchar) %>%
  map(mean)
# Finds the most common words
textSort.tidy %>%
  count(word, sort = TRUE)
# Removes the stop words and then finds most common words
textSort.tidy %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
# Counts the number of times the word "Girl" is in the text
textSort.tidy %>%
  count(word) %>%
  filter(word == "Girl")

You can use lapply/map to extract the text from multiple links.
library(rvest)
link <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
result <- lapply(link, function(x) x %>%
  read_html() %>%
  html_nodes('.product_pod a') %>%
  html_text())
You can continue using lapply if you want to apply other functions to text.
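For example, here is a minimal sketch (building on the result list above, not part of the original answer) that collapses the scraped text from all 50 pages into one tibble and reuses the tidytext steps from the question:
library(tidyverse)
library(tidytext)
# combine the text from all 50 pages into a single tibble
textSort.tbl <- tibble(text = unlist(result))
# tokenise once; the stats pipelines from the question then run over every page
textSort.tidy <- textSort.tbl %>%
  unnest_tokens(word, text)
# average word length across all pages
textSort.tidy %>%
  mutate(length = nchar(word)) %>%
  summarise(avg_length = mean(length))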

Related

Trouble mapping a function to a list of scraped links using rvest

I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage, where I am applying the get_injury_data function to the links, but I have been having issues successfully executing this. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)
# create a function to grab the team links
get_team_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>% # remove rows containing only the # string
    paste0('https://www.transfermarkt.com', .) %>% # paste the website domain onto the url strings
    unique() %>% # keep only unique links
    as_tibble() %>% # turn the strings into a tibble dataset
    rename("links" = "value") %>% # rename the value column
    filter(!grepl('profil', links)) %>% # remove player profile links
    filter(!grepl('spielplan', links)) %>% # remove links to additional team pages
    mutate(links = gsub("startseite", "kader", links)) # change the link to point to the detailed squad page
}
# create a function to grab the player links
get_player_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>% # remove rows containing only the # string
    paste0('https://www.transfermarkt.com', .) %>% # paste the website domain onto the url strings
    unique() %>% # keep only unique links
    as_tibble() %>% # turn the strings into a tibble dataset
    rename("links" = "value") %>% # rename the value column
    filter(grepl('profil', links)) %>% # keep only player profile links
    mutate(links = gsub("profil", "verletzungen", links)) # change the link to point to the injury page
}
# create a function to get the injury dataset
get_injury_data <- function(url){
  url %>%
    read_html() %>%
    html_nodes('#yw1') %>%
    html_table()
}
# get team links and save them as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get player links by mapping the function onto the team_links dataset
# and then unnest the list of lists into a long list
player_injury_links <- team_links %>%
  mutate(links = map(team_links$links, get_player_links)) %>%
  unnest(links)
# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
Solution
So the issue that I was having was that some of the links that I was scraping did not have any data.
To overcome this issue, I used the possibly() function from the purrr package. This helped me create a new, error-free function.
The line of code that was giving me trouble now looks as follows:
player_injury_data <- player_injury_links$links %>%
  purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
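From there, a hedged follow-up sketch (not part of the original answer): drop the links that returned NULL and row-bind the remaining tables, assuming the injury tables share the same columns:
library(purrr)
library(dplyr)
player_injury_tables <- player_injury_data %>%
  compact() %>%  # remove the NULL entries returned by possibly()
  flatten() %>%  # html_table() returns a list of tables per page, so unwrap one level
  bind_rows()    # assumes the tables have matching columns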

Rvest scraping google news with different number of rows

I am using rvest to scrape Google News.
However, I encounter missing values in the "Time" element from time to time with different keywords. Since some values are missing, I end up with a "different number of rows" error when building the data frame from the scraping result.
Is there any way to fill in NA for these missing values?
Below is an example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))
dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))
news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the first Google News page.
In rvest, html_node (without the s) will always return a value, even if it is NA. Therefore, to keep the vectors the same length, one needs to find the common parent node for all of the desired data nodes and then parse the desired information from each of those nodes.
Assuming the Title node is the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of Description nodes; this didn't work. Going two levels up with xml_parent() %>% xml_parent() does.
library(rvest)
library(xml2) # for xml_parent()
url <- "https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()
Link <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/", Link)
# Find the common parent node
# (this was trial and error: tried the parent, then the grandparent)
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description <- Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time <- Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)
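As a side note, here is a minimal, self-contained illustration of the html_node() vs html_nodes() behaviour described above, using made-up markup rather than the Google News page:
library(rvest)
doc <- minimal_html('<div class="item"><span class="t">A</span></div>
                     <div class="item"></div>')
# html_node() returns one result per parent, giving NA where the child is missing
doc %>% html_nodes('.item') %>% html_node('.t') %>% html_text()
#> [1] "A" NA
# html_nodes() silently drops the missing one, so the vector lengths no longer match
doc %>% html_nodes('.item') %>% html_nodes('.t') %>% html_text()
#> [1] "A"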

Creating a new column in RVest based on a cascading variable

I'm writing a data scraper using rvest which looks like this:
library(tidyverse)
library(rvest)
library(magrittr)
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")
targets <- map_df(targets_url, ~ .x %>%
  read_html() %>%
  html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" ") %>%
  matrix(ncol = 4, byrow = T) %>%
  as.data.frame())
df_structure <- apply(targets,2,as.character)
df_targets <- as.data.frame(df_structure)
You'll notice that it creates a dataframe with four variables and 53 rows.
But now go to the URL itself. You'll notice that the 53 rows correspond to certain subcategorizations: Top Target, High Choice, and Interested. Here's a picture showing an example:
What I'm trying to do is create a fifth column which contains the subcategory. So, for example, the three individuals who fall under "Top Target" will have a fifth column which lists them as "Top Target". Then the next 20 rows will have that fifth column reading "High Choice", and so on. The reason I'm here is that I have no clue how to do that. What makes it even harder is that not every page has the same numbers; here's an example of that. You'll see that while the picture from above only lists Top Target (3), this page has Top Target (24). It varies for each page.
Would it be possible to alter my original script so that it:
A) Creates that fifth column with the subcategory I mentioned above
B) Knows when it's supposed to switch to the next subcategory
C) Is agnostic to the total number of people in each subcategory
EDITED script, partially based on @Dave2e's answer:
library(rvest)
library(dplyr)
library(stringr)
library(purrr) # needed for map_df()
teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")
targets_url <- paste0("https://247sports.com/college/", teams, "/Season/2021-Football/Targets/")
# read the web page once! then extract the information requested
targets <- map_df(targets_url, ~ .x %>%
  read_html() %>%
  html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" ") %>%
  matrix(ncol = 4, byrow = T) %>%
  as.data.frame())
# find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
# find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()
# extract the number of repeats from each header
nrepeats <- as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]
# create the combined dataframe
answer <- cbind(categories, targets)
The headings are located in "li" nodes with the class "ri-page__list-item list-header". Conveniently, each heading also contains the number of players listed underneath it.
This script finds the heading nodes, extracts the number of players, and then creates a vector of repeated headings to merge with the targets dataframe.
library(rvest)
library(dplyr)
library(stringr)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")
# read the web page once! then extract the information requested
page <- read_html(targets_url)
targets <- page %>%
  html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" ") %>%
  matrix(ncol = 4, byrow = T) %>%
  as.data.frame()
# find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
# find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()
# extract the number of repeats from each header
nrepeats <- as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]
# create the combined dataframe
answer <- cbind(categories, targets)
Update - finding the hidden data
The webpage dynamically hides some information if the list is too long. The code can now handle that information. The code below finds the hidden JSON data (contained in a 'script' node) and parses it. It does return a list of players, but not all of the same information.
# another option
# find the hidden JSON data
jsons <- page %>% html_nodes(xpath = '//*[@type="application/ld+json"]')
allplayers <- jsonlite::fromJSON(html_text(jsons[2]))
# similar list; provides the URL to each player's webpage
answer2 <- cbind(rep(category, nrepeats), allplayers$athlete)
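Returning to the multi-team case from the edited script above, one option (a sketch, not part of the original answer) is to wrap the single-page logic in a function and map it over the team URLs:
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
get_targets <- function(url){
  page <- read_html(url)
  targets <- page %>%
    html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
    html_text() %>%
    str_trim() %>%
    str_split(" ") %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame()
  # repeat the header logic from the answer for this page
  items <- page %>% html_nodes("li.ri-page__list-item")
  headers <- which(html_attr(items, "class") == "ri-page__list-item list-header")
  category <- items[headers] %>% html_node("b.name") %>% html_text()
  nrepeats <- as.integer(str_extract(category, "[0-9]+"))
  categories <- rep(category, nrepeats)[1:nrow(targets)]
  cbind(categories, targets)
}
teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")
targets_url <- paste0("https://247sports.com/college/", teams, "/Season/2021-Football/Targets/")
all_targets <- map_dfr(targets_url, get_targets)
Note that the caveat from the update above still applies: pages that hide part of the list will only expose the visible players this way.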

Loop over list in R, conduct analysis specific to element in list, save results in element dataframe?

I am trying to replicate an analysis using tidytext in R, except using a loop. The specific example comes from Julia Silge and David Robinson's Text Mining with R, a Tidy Approach. The context for it can be found here: https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join.
In the text, they give an example of how to do sentiment analysis using the NRC lexicon, which has eight different sentiments, including joy, anger, and anticipation. I'm not doing an analysis for a specific book like the example, so I commented out that line, and it still works:
nrc_list <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
wordcount_joy <- wordcount %>%
  # filter(book == "Emma") %>%
  inner_join(nrc_list) %>%
  count(word, sort = TRUE)
As I said before, this works. I now want to modify it to loop over all eight emotions and save the results in dataframes labeled with the emotion. Here is how I tried to modify it:
emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticip', 'fear', 'sadness', 'trust')
for (i in emotion) {
  nrc_list <- get_sentiments("nrc") %>%
    filter(sentiment == "i")
  wcount[[i]] <- wordcount %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)
}
I get an "Error: object 'wcount' not found" message when I do this. I have googled this and it seems like the answers to this question is to use wcount[[i]] but clearly something is off when I tried adapting it. Do you have any suggestions?
The code below will do the trick. Note that you are referring to wordcount in your loop, while the example uses tidy_books. The code follows the steps in the tidytextmining link you are referring to.
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticip', 'fear', 'sadness', 'trust')
# initialize the list with the length of the emotion vector
wcount <- vector("list", length(emotion))
# name the list entries
names(wcount) <- emotion
# run the loop
for (i in emotion) {
  nrc_list <- get_sentiments("nrc") %>%
    filter(sentiment == i)
  wcount[[i]] <- tidy_books %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)
}
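If you then want a single data frame rather than a list of eight (not part of the original answer, just a common follow-up), the named list makes that easy:
# inspect one emotion
head(wcount[["joy"]])
# or combine all eight results, keeping the emotion as a column
wcount_all <- dplyr::bind_rows(wcount, .id = "sentiment")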

Scraping Lineup Data From Football Reference Using R

I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal xpath approach (Python) or Rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my example, but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
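Applied to the boxscore page from the question, a sketch along the same lines might look like this (the table id 'home_starters' is an assumption based on the 'div_home_starters' wrapper in the question, so verify it against the page source):
library(rvest)
boxscore_url <- 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
home_starters <- boxscore_url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>% # the starters tables are hidden inside HTML comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#home_starters') %>%  # assumed id; check the page source
  html_table()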
