rvest package seems to collect data in random order - r

I have the following question.
I am trying to harvest data from the Booking website (for myself only, in order to learn the rvest package). On the whole it works: the package seems to collect what I want and puts everything into a table (dataframe).
Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))
So in this chunk I build the URLs for the first 60 result pages, after first manually feeding the Booking search engine the country of my choice (Spain), the dates I am interested in (just some arbitrary interval) and the number of people (I used the defaults here).
Then, I add this code to select the properties I want:
read_hotel <- function(url){   # collecting hotel names
  ho <- read_html(url)
  headline <- ho %>%
    html_nodes("span.sr-hotel__name") %>%  # the node I want to read
    html_text() %>%
    as_tibble()
}
hotels <- map_dfr(page_booking, read_hotel)
read_pr <- function(url){      # collecting price tags
  pr <- read_html(url)
  full_pr <- pr %>%
    html_nodes("div.bui-price-display__value") %>%  # the node I want to read
    html_text() %>%
    as_tibble()
}
fullprice <- map_dfr(page_booking, read_pr)
... and eventually I save all the data in a dataframe:
dfr <- tibble(hotels = hotels,
              price_fact = fullprice)
I collect more parameters, but that doesn't matter here. The final dataframe of 1500 rows and two columns is then created. The problem is that the data in the second column does not correspond to the data in the first one, which is really strange and renders my dataframe useless.
I don't really understand how the package works in the background or why it behaves that way. I also noticed that the first rows in the first column of the dataframe (hotel name) do not correspond to the first hotels I see on the website, so rvest seems to be using different search/sort/filter criteria.
Could you please explain what happens while rvest hops between nodes?
I would really appreciate at least some explanation, just to better understand the tool we work with.

You shouldn't scrape the hotels' names and prices separately like that. Instead, first get the node of each item (hotel), then scrape the name and price relative to each of those nodes. With this method you can't mess up the order.
library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))
hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      map_dfr(
        items,
        function(item) {
          data.frame(
            hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
            price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
          )
        }
      )
    }
  )
(The dot at the start of each XPath expression represents the current node, which here is the hotel item.)
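To see what that leading dot changes, here is a tiny self-contained sketch (the HTML is made up for illustration):

library(rvest)

# a throwaway document with two items, each containing one name
doc <- xml2::read_html('
  <div class="item"><span class="name">A</span></div>
  <div class="item"><span class="name">B</span></div>
')
items <- doc %>% html_nodes(".item")

# relative XPath (leading dot): searches inside the current node only
items[2] %>% html_node(xpath = ".//span") %>% html_text()  # "B"

# absolute XPath (no dot): jumps back to the document root
items[2] %>% html_node(xpath = "//span") %>% html_text()   # "A"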
Update:
Here is an updated version that I think is faster but still does the job. html_node() is vectorized over a nodeset, returning exactly one (possibly NA) result per item, so the inner map_dfr() is unnecessary:
hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      data.frame(
        hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
        price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
      )
    }
  )

Related

Rvest Pulls Empty Tables

The site I scrape data from has changed and I'm having issues pulling the data into table format. I tried the two different approaches below, but both return blanks instead of tables.
I'm a novice at scraping and would appreciate the group's expertise. Should I look for other solutions in rvest, or try to learn a package like RSelenium?
https://www.pgatour.com/stats/detail/02675
Scrape for Multiple Links
library("dplyr")
library("purr")
library("rvest")
df23 <- expand.grid(
stat_id = c("02568","02674", "02567", "02564", "101")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/detail/',
stat_id
)
) %>%
as_tibble()
#replaced tournament_id with stat_id
get_info <- function(link, stat_id){
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]]
}

test_main_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_main_stats <- test_main_stats %>%
  unnest(everything())
Alternative Code
url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
  html_nodes(".css-8atqhb") %>%
  html_table()
This page uses JavaScript to create the table, so rvest will not work directly. But if one examines the page's source code, all of the data is stored in JSON format in a <script> node.
The code below finds that node and converts the JSON to a list. The rows variable holds the main table, but there is a wealth of other information contained in the JSON data structure.
#read the page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")

#find the script node with the correct id tag and strip the html code
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()

#convert from JSON
output <- jsonlite::fromJSON(datascript)

#explore the output
str(output)

#get the main table
answer <- output$props$pageProps$statDetails$rows

Rvest scraping google news with different number of rows

I am using rvest to scrape Google News.
However, I occasionally encounter missing values in the "Time" element for different keywords. Since values are missing, I end up with a "different number of rows" error when building the data frame from the scraping results.
Is there any way to fill in NA for these missing values?
Below is an example of the code I am using.
# `Search` holds the query keyword
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the first Google News page.
In rvest, html_node (without the s) always returns exactly one value per input, even if it is NA. Therefore, in order to keep the vectors the same length, one needs to find the common parent node of all the desired data nodes, and then parse the desired information out of each of those parents.
Assuming the Title nodes are the most complete set, I went up one level with xml_parent() and attempted to retrieve the same number of description nodes; this didn't work. Going two levels up with xml_parent() %>% xml_parent() does seem to work.
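As a minimal sketch of that html_node behavior (with made-up HTML):

library(rvest)

# a throwaway page: the second article has no time stamp
doc <- xml2::read_html('
  <article><h3>Title A</h3><time>1 hour ago</time></article>
  <article><h3>Title B</h3></article>
')
articles <- doc %>% html_nodes("article")

articles %>% html_nodes("time") %>% html_text()  # length 1: the gap is silently dropped
articles %>% html_node("time") %>% html_text()   # length 2: "1 hour ago", NA

Applied to the Google News page: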
library(rvest)
library(xml2)   # for xml_parent()

url <- "https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)

Title <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()
Link <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/", Link)

#find the common parent node
#(this was trial and error: tried the parent, then the grandparent)
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description <- Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time <- Titlenodes %>% html_node('.WW6dff') %>% html_text()

answer <- data.frame(Title, Time, Description, Link)

Creating a new column in rvest based on a cascading variable

I'm writing a data scraper using rvest, which looks like this:
library(tidyverse)
library(rvest)
library(magrittr)
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")

targets <- map_df(targets_url, ~.x %>% read_html %>%
                    html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
                    html_text() %>%
                    str_trim %>%
                    str_split(" ") %>%
                    matrix(ncol = 4, byrow = TRUE) %>%
                    as.data.frame)

df_structure <- apply(targets, 2, as.character)
df_targets <- as.data.frame(df_structure)
You'll notice that it creates a dataframe with four variables and 53 rows.
But now go to the URL itself. You'll notice that the 53 rows correspond to certain subcategories: Top Target, High Choice, and Interested.
What I'm trying to do is create a fifth column that contains the subcategory. So, for example, the three individuals who fall under "Top Target" will get a fifth column reading "Top Target", then the next 20 rows will have that fifth column reading "High Choice", and so on. The reason I'm here is that I have no clue how to do that. What makes it even harder is that not every page has the same numbers: while one page lists Top Target (3), another team's page has Top Target (24). It varies for each page.
Would it be possible to alter my original script so that it:
A) creates that fifth column with the subcategory mentioned above,
B) knows when it's supposed to switch to the next subcategory, and
C) is agnostic to the total number of people in each subcategory?
EDITED script, partially based on @Dave2e's answer:
library(rvest)
library(purrr)
library(dplyr)
library(stringr)

teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")
targets_url <- paste0("https://247sports.com/college/", teams, "/Season/2021-Football/Targets/")

# read each web page once! then extract both the players and the headings from it
answer <- map_df(targets_url, function(url) {
  page <- read_html(url)
  targets <- page %>%
    html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
    html_text() %>%
    str_trim() %>%
    str_split(" ") %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame()
  #find the headings and the players
  list <- page %>% html_nodes("li.ri-page__list-item")
  headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
  #find the category
  category <- list[headers] %>% html_node("b.name") %>% html_text()
  #extract repeats from header
  nrepeats <- as.integer(str_extract(category, "[0-9]+"))
  categories <- rep(category, nrepeats)[1:nrow(targets)]
  #create combined dataframe
  cbind(categories, targets)
})
The headings are located in "li" nodes with the class "ri-page__list-item list-header". Conveniently, each heading also contains the number of players listed underneath it.
This script finds the heading nodes, extracts the number of players, and then creates a vector of repeated headings to merge with the targets dataframe.
library(rvest)
library(dplyr)
library(stringr)

targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")

# read the web page once! then extract the information requested
page <- read_html(targets_url)

targets <- page %>%
  html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" ") %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  as.data.frame()

#find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")

#find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()

#extract repeats from header
nrepeats <- as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]

#create combined dataframe
answer <- cbind(categories, targets)
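The key step is rep() with a vectorized times argument: each heading is repeated as many times as the player count parsed from its own text. A small worked example (the counts are made up so that 3 + 20 + 30 = 53, matching the question):

library(stringr)

category <- c("Top Target (3)", "High Choice (20)", "Interested (30)")
nrepeats <- as.integer(str_extract(category, "[0-9]+"))  # 3 20 30
categories <- rep(category, nrepeats)                    # 53 labels in total
head(categories, 5)
# "Top Target (3)" "Top Target (3)" "Top Target (3)" "High Choice (20)" "High Choice (20)"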
Update - finding the hidden data
The webpage dynamically hides some information if the list is too long. The code below can now handle that case: it finds the hidden JSON data (contained in a 'script' node) and parses it. It does return the full list of players, but not all of the same information.
#another option
#find the hidden JSON data
jsons <- page %>% html_nodes(xpath = '//*[@type="application/ld+json"]')
allplayers <- jsonlite::fromJSON(html_text(jsons[2]))

#similar list; also provides the URL of each player's webpage
answer2 <- cbind(rep(category, nrepeats), allplayers$athlete)

What would be the best practice for merging additional variables into data based on specific row information when web scraping in R using 'rvest'?

I'm currently web scraping the IMDB website to extract movie data.
I would like to know how you would solve this problem.
library(tidyverse)
library(data.table)
library(rvest)
library(janitor)
#top rated movies website
url <- 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
# extract the title of the movies using rvest
titles <- url %>%
  read_html() %>%
  html_nodes('.titleColumn a') %>%
  html_text() %>%
  as.data.table() %>%
  setnames(., old = colnames(.), new = 'title')

# extract links to each of the titles, this will be the reference
links <- url %>%
  read_html() %>%
  html_nodes('.titleColumn a') %>%
  html_attr('href') %>%
  as.data.table() %>%
  setnames(., old = colnames(.), new = 'links')

# creating a DT with the data
movies <- cbind(titles, links)
I now have a movies data.table with title and links as columns.
Next, I would like to extract additional data for each movie using the links.
I will continue using the first row as an example.
#the first link in movies
link <- 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=NJ52X0MM1V9FKSPBT46G&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
budget <- link %>%
  read_html() %>%
  html_nodes(select) %>%
  html_text() %>%
  gsub('\\n', '', .) %>%
  str_split(., '\\:') %>%
  as.data.table() %>%
  janitor::row_to_names(row_number = 1) %>%
  setnames(., old = colnames(.), new = tolower(gsub(' ', '_', str_trim(colnames(.)))))

budget[, (colnames(budget))] <- lapply(budget, function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
Now I have a 1x4 table with budget information.
I would like to pull the budget data for each link in movies and merge it in, to end up with a final data.table of six columns: 'title', 'links', plus the four budget variables. I tried creating a function containing the budget code, taking each row's link as a parameter, and then using 'lapply', but I don't think this is the correct approach.
I would like to see if you have an efficient solution to this.
Thanks so much for your help.
I think this would solve your problem:
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'

# get budget data, wrapped as a function
get_budget <- function(link, select){
  budget <- link %>%
    read_html() %>%
    html_nodes(select) %>%
    html_text() %>%
    gsub('\\n', '', .) %>%
    str_split(., '\\:') %>%
    as.data.table() %>%
    janitor::row_to_names(row_number = 1) %>%
    setnames(., old = colnames(.), new = tolower(gsub(' ', '_', str_trim(colnames(.)))))
  budget[, (colnames(budget))] <- lapply(budget, function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
  return(budget)
}

# as your code is slow, I'll subset movies to 10 rows:
movies <- movies[1:10, ]

tmp <- lapply(movies[, links], function(x)
    get_budget(link = paste0("https://www.imdb.com/", x), select = select)) %>%
  rbindlist(., fill = TRUE)

movies <- cbind(movies, tmp)
And your result would be the movies table with the four budget columns appended.
Finally, a little advice that I think would make your code cleaner:
setnames doesn't need the . from magrittr; it understands this kind of piped code automatically.
When possible, avoid setnames(., old = colnames(.), new = 'links'). In your case setnames('links') is enough, since you are renaming all of the variables.
setnames(dt, old = oldnames, new = newnames) is only necessary when oldnames is not equal to names(dt).
Since DT is another popular R library, completely unrelated to data.table, I think it is better to refer to a data.table as what it is: a data.table.
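For instance, a toy data.table just to show the two equivalent renaming calls:

library(data.table)

dt <- data.table(V1 = c("a", "b"))
setnames(dt, old = colnames(dt), new = "links")  # works, but verbose
setnames(dt, "title")                            # enough when renaming every column
names(dt)
# [1] "title"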

Scraping Lineup Data From Football Reference Using R

I always seem to have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)

url <- 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage <- read_html(url)

table_links <- webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links <- subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links <- as.list(boxscore_links)

for(x in boxscore_links){
  keep <- substr(x, 10, 36)
  url2 <- paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 <- read_html(url2)

  home_team <- webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team <- webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters <- webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 <- webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_table()

  #code that will bind the lineup tables to some master table -- to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the URLs for all 2016 boxscores, and the for loop goes to each boxscore page hoping to extract the tables headed "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my own example, but I'm sure you can apply it to yours. The trick is that the tables on this site are embedded inside HTML comments, so you have to extract the comment text and re-parse it:
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
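Applying the same idea to one of your boxscore pages might look like this (a sketch: the 'div_home_starters' id comes from your own code, so verify it against the live page):

library(rvest)

url2 <- "https://www.pro-football-reference.com/boxscores/201609110rav.htm"
home_starters <- url2 %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # the starters tables live inside comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#div_home_starters table') %>%
  html_table()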
