Creating a new column in RVest based on a cascading variable - r

I'm writing a data scraper using rvest, which looks like this:
library(tidyverse)
library(rvest)
library(magrittr)
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")
targets <- map_df(targets_url, ~.x %>% read_html %>%
html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame)
df_structure <- apply(targets,2,as.character)
df_targets <- as.data.frame(df_structure)
You'll notice that it creates a dataframe with four variables and 53 rows.
But now go to the URL itself. You'll notice that the 53 rows correspond to certain subcategorizations: Top Target, High Choice, and Interested. Here's a picture showing an example:
What I'm trying to do is create a fifth column, which contains the subcategory. So for example, the three individuals who fall under "Top Target" will be assigned another column which lists them as "Top Target". Then the next 20 rows will have that fifth column reading as "High Choice" and so on. The reason I'm here is because I have no clue how to do that. What makes it even harder is that not every page will have the same numbers, here's an example of that. You'll see that while the picture from above only lists Top Target (3), this page now has Top Target (24). It varies for each page.
Would it be possible to alter my original script so that it:
A) Creates that fifth column with the subcategory I mentioned above
B) Knows when it's supposed to switch to the next subcategory
C) Is agnostic to the total number of people in each subcategory
EDITED script partially based on @Dave2e's answer:
library(rvest)
library(dplyr)
library(stringr)
teams <- c("ohio-state","penn-state","michigan","michigan-state")
targets_url <- paste0("https://247sports.com/college/", teams, "/Season/2021-Football/Targets/")
# read the web page once! then extract the information requested
targets <- map_df(targets_url, ~.x %>% read_html %>%
html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame)
#find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
#find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()
#extract repeats from header
nrepeats<-as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]
#create combined dataframe
answer <- cbind(categories, targets)

The headings are located at "li" nodes with the class "ri-page__list-item list-header". Conveniently, each heading also contains the number of players listed under it.
This script finds the heading nodes, extracts the number of players, and then creates the vector of repeating headings to merge with the targets dataframe.
library(rvest)
library(dplyr)
library(stringr)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")
# read the web page once! then extract the information requested
page <- read_html(targets_url)
targets <- page %>%
html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame
#find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
#find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()
#extract repeats from header
nrepeats<-as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]
#create combined dataframe
answer <- cbind(categories, targets)
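As a variant that does not depend on the counts printed in the headers, the category can also be assigned by position: every player row belongs to the most recent header above it. The sketch below is only an illustration under assumptions. The inline HTML is a hypothetical, simplified stand-in for the real page (player names are made up), and it only covers rows that are actually present in the HTML.

```r
library(rvest)

# Hypothetical, simplified stand-in for the targets page
html <- '<ul>
  <li class="ri-page__list-item list-header"><b class="name">Top Target (2)</b></li>
  <li class="ri-page__list-item"><a class="ri-page__name-link">Player A</a></li>
  <li class="ri-page__list-item"><a class="ri-page__name-link">Player B</a></li>
  <li class="ri-page__list-item list-header"><b class="name">High Choice (1)</b></li>
  <li class="ri-page__list-item"><a class="ri-page__name-link">Player C</a></li>
</ul>'
page <- read_html(html)

# all list items, headers and players, in document order
items <- html_nodes(page, "li.ri-page__list-item")
is_header <- grepl("list-header", html_attr(items, "class"), fixed = TRUE)

# header text per item; NA for player rows, which have no b.name node
header_text <- html_text(html_node(items, "b.name"))

# cumsum(is_header) gives every row the index of the most recent header
category <- header_text[is_header][cumsum(is_header)]

players <- data.frame(
  category = category[!is_header],
  name = html_text(html_node(items[!is_header], ".ri-page__name-link")),
  stringsAsFactors = FALSE
)
print(players)
```

Because the grouping comes from document order, it keeps working when the number of players under each heading changes from page to page.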
Update - finding the hidden data
The webpage dynamically hides some information if the list is too long. The code can now handle that information. The code below finds the hidden JSON data (contained in a 'script' node) and parses that data. It does return a list of players, but not all of the same information.
#another option
#find the hidden JSON data
jsons <- page %>% html_nodes(xpath = '//*[@type ="application/ld+json"]')
allplayers <- jsonlite::fromJSON( html_text(jsons[2]))
#Similar list, provide URL to each players webpage
answer2 <- cbind(rep(category, nrepeats), allplayers$athlete)

Related

Web Scraping Across multiple pages R

I have been working on some R code. The purpose is to collect the average word length and other stats about the words in a section of a website with 50 pages. Collecting the stats is no problem; that's the easy part. However, getting my code to collect the stats over 50 pages is the hard part: it only ever seems to output information from the first page. See the code below and ignore the poor indentation.
install.packages(c('tidytext', 'tidyverse'))
library(tidyverse)
library(tidytext)
library(rvest)
library(stringr)
websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
textSort <- websitePage %>%
html_nodes('.product_pod a') %>%
html_text()
for (page_result in seq(from = 1, to = 50, by = 1)) {
link = paste0('http://books.toscrape.com/catalogue/page-',page_result,'.html')
page = read_html(link)
# Creates a tibble
textSort.tbl <- tibble(text = textSort)
textSort.tidy <- textSort.tbl %>%
unnest_tokens(word, text)
}
# Finds the average word length
textSort.tidy %>%
map(nchar) %>%
map(mean)
# Finds the most common words
textSort.tidy %>%
count(word, sort = TRUE)
# Removes the stop words and then finds most common words
textSort.tidy %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
# Counts the number of times the word "Girl" is in the text
textSort.tidy %>%
count(word) %>%
filter(word == "Girl")
You can use lapply/map to extract the text from multiple links.
library(rvest)
link <- paste0('http://books.toscrape.com/catalogue/page-',1:50,'.html')
result <- lapply(link, function(x) x %>%
read_html %>%
html_nodes('.product_pod a') %>%
html_text)
You can continue using lapply if you want to apply other functions to text.
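To show what "continue" could look like, here is a minimal sketch that flattens the per-page list and computes the word stats in base R. The `result` object below is a small hypothetical stand-in for the list returned by the lapply call above (one character vector per page), so it runs without hitting the site. The word splitting is a simple assumption and not what tidytext does internally.

```r
# Hypothetical stand-in for `result`: one character vector of link texts per page
result <- list(
  c("A Light in the Attic", "Tipping the Velvet"),
  c("Soumission", "Sharp Objects")
)

# flatten all pages into one character vector
all_text <- unlist(result)

# split on anything that is not a letter, lower-case, and drop empty strings
words <- tolower(unlist(strsplit(all_text, "[^[:alpha:]]+")))
words <- words[nzchar(words)]

# average word length across all pages
mean_length <- mean(nchar(words))

# most common words across all pages
word_counts <- sort(table(words), decreasing = TRUE)

print(mean_length)
print(head(word_counts))
```

The key point is that the stats are computed once, after all pages have been collected, instead of inside the loop.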

What would be the best practice to merge additional variables to data based on specific row information when web scraping in R using 'rvest'?

I'm currently web scraping the IMDB website to extract movie data.
I would like to know how you would solve this problem.
library(tidyverse)
library(data.table)
library(rvest)
library(janitor)
#top rated movies website
url <- 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
# extract the title of the movies using rvest
titles <- url %>%
read_html() %>%
html_nodes(' .titleColumn a') %>%
html_text() %>%
as.data.table() %>%
setnames(. ,old = colnames(.), new='title')
# extract links to each of the titles, this will be the reference
links <- url %>%
read_html() %>%
html_nodes('.titleColumn a') %>%
html_attr('href') %>%
as.data.table() %>%
setnames(. ,old = colnames(.), new='links')
# creating a DT with the data
movies <- cbind(titles,links)
I will have a movies DT with title and links as columns.
Now, I would like to extract additional data for each movie using the links.
I will continue using the first row as an example.
#the first link in movies
link <- 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=NJ52X0MM1V9FKSPBT46G&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
budget <- link %>%
read_html() %>%
html_nodes(select) %>%
html_text() %>%
gsub('\\n','',.) %>%
str_split(.,'\\:')%>%
as.data.table() %>%
janitor::row_to_names(row_number = 1) %>%
setnames(.,old=colnames(.),new= tolower(gsub(' ','_' , str_trim(colnames(.)))))
budget[,(colnames(budget))] <- lapply(budget,function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
Now I have a 1x4 table with budget information
I would like to pull data for each link in movies and merge it into the DT, to have a final DT with 6 columns: 'title', 'link' + four budget variables. I was trying to create a function that includes the code to get the budget data, using each row's link as a parameter and then using 'lapply', but I don't think this is the correct approach.
I would like to see if you have a solution to this in an efficient way.
Thanks so much for your help.
I think this would solve your problem:
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
## As function
get_budget = function(link,select){
budget <- link %>%
read_html() %>%
html_nodes(select) %>%
html_text() %>%
gsub('\\n','',.) %>%
str_split(.,'\\:')%>%
as.data.table() %>%
janitor::row_to_names(row_number = 1) %>%
setnames(.,old=colnames(.),new= tolower(gsub(' ','_' , str_trim(colnames(.)))))
budget[,(colnames(budget))] <- lapply(budget,function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
return(budget)
}
#As your code is slow I'll subset movies to have 10 rows:
movies = movies[1:10,]
tmp =
lapply(movies[, links], function(x)
get_budget(link = paste0("https://www.imdb.com/",x),select=select )) %>%
rbindlist(., fill = T)
movies = cbind(movies, tmp)
Your result would look like this: movies_result
Finally, I think these small pieces of advice would make your code look cooler:
setnames doesn't need the . from magrittr; it automatically understands this kind of code.
When possible, avoid setnames(. ,old = colnames(.), new='links'). In your case setnames('links') is enough, since you are renaming all your variables.
setnames(dt,old = oldnames, new=newnames) is only necessary when oldnames is not equal to names(dt).
Since DT is another popular R library, completely unrelated to data.table, I think it is better to refer to a data.table as a data.table.
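The lapply + rbindlist(fill = TRUE) + cbind pattern can be tried without any scraping. In the sketch below, `get_details()` is a hypothetical stand-in for get_budget, and the titles, links and values are made-up placeholders. It returns one row per movie, with a column missing for one of them, to show that rbindlist(fill = TRUE) pads with NA and keeps the rows aligned with movies.

```r
library(data.table)

# placeholder table with the same shape as `movies` (title + links)
movies <- data.table(
  title = c("Movie A", "Movie B", "Movie C"),
  links = c("/title/a/", "/title/b/", "/title/c/")
)

# Hypothetical stand-in for get_budget(): always returns exactly one row,
# but the set of columns can differ from movie to movie
get_details <- function(link) {
  if (link == "/title/b/") {
    data.table(budget = "$1,000")
  } else {
    data.table(budget = "$2,000", gross_usa = "$3,000")
  }
}

# one row per link, in the same order as movies; fill = TRUE pads missing columns with NA
details <- rbindlist(lapply(movies$links, get_details), fill = TRUE)

# safe to cbind because both tables have one row per movie in the same order
movies_full <- cbind(movies, details)
print(movies_full)
```

This only stays aligned as long as the function returns exactly one row for every link, so a page that fails or returns nothing would need to be turned into a one-row table of NAs first.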

RVEST package seems to collect data in random order

I have the following question.
I am trying to harvest data from the Booking website (for me only, in order to learn the functionality of the rvest package). Everything's good and fine, the package seems to collect what I want and to put everything in the table (dataframe).
Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
So in this chunk I collect the data from the first 60 pages, after first manually feeding the Booking search engine with the country of my choice (Spain), the dates I am interested in (just some arbitrary interval) and the number of people (I used defaults here).
Then, I add this code to select the properties I want:
read_hotel <- function(url){ # collecting hotel names
ho <- read_html(url)
headline <- ho %>%
html_nodes("span.sr-hotel__name") %>% # the node I want to read
html_text() %>%
as_tibble()
}
hotels <- map_dfr(page_booking, read_hotel)
read_pr <- function(url){ # collecting price tags
pr <- read_html(url)
full_pr <- pr %>%
html_nodes("div.bui-price-display__value") %>% #the node I want to read
html_text() %>%
as_tibble()
}
fullprice <- map_dfr(page_booking, read_pr)
... and eventually save the whole data in the dataframe:
dfr <- tibble(hotels = hotels,
price_fact = fullprice)
I collect more parameters but this doesn't matter. The final dataframe of 1500 rows and two columns is then created. But the problem is that the data in the second column does not correspond to the data in the first one, which is really strange and renders my dataframe useless.
I don't really understand how the package works in the background and why it behaves that way. I also noticed that the first rows in the first column of the dataframe (hotel name) do not correspond to the first hotels I see on the website. So it seems the rvest package uses different search/sort/filter criteria.
Could you please explain the processes that take place during rvest node hopping?
I would really appreciate at least some explanation, just to better understand the tool we work with.
You shouldn't scrape the hotels' names and prices separately like that. What you should do is get the nodes of all items (hotels), then scrape the name and price relative to each hotel. With this method, you can't mess up the order.
library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
hotels <-
map_dfr(
page_booking,
function(url) {
pg <- read_html(url)
items <- pg %>%
html_nodes(".sr_item")
map_dfr(
items,
function(item) {
data.frame(
hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = T),
price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = T)
)
}
)
}
)
(The dot at the start of each XPath expression represents the current node, which is the hotel item.)
Update:
Here is updated code that I think is faster and still does the job:
hotels <-
map_dfr(
page_booking,
function(url) {
pg <- read_html(url)
items <- pg %>%
html_nodes(".sr_item")
data.frame(
hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = T),
price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = T)
)
}
)
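To see why scraping relative to each item keeps the columns aligned, here is a small offline sketch. The inline HTML is a hypothetical, stripped-down stand-in for a results page in which the second hotel has no price. It uses CSS selectors instead of the XPath above, purely to keep the example short.

```r
library(rvest)

# Hypothetical, simplified stand-in for a search results page;
# "Hotel B" deliberately has no price node
html <- '<div class="sr_item"><span class="sr-hotel__name">Hotel A</span><div class="bui-price-display__value">US$100</div></div>
<div class="sr_item"><span class="sr-hotel__name">Hotel B</span></div>
<div class="sr_item"><span class="sr-hotel__name">Hotel C</span><div class="bui-price-display__value">US$300</div></div>'
pg <- read_html(html)

# page-wide scraping: 3 names but only 2 prices, so the vectors cannot be paired by position
names_all <- pg %>% html_nodes(".sr-hotel__name") %>% html_text()
prices_all <- pg %>% html_nodes(".bui-price-display__value") %>% html_text()

# item-relative scraping: html_node() returns exactly one result per item,
# and html_text() yields NA where the item has no match
items <- html_nodes(pg, ".sr_item")
hotels <- data.frame(
  hotel = items %>% html_node(".sr-hotel__name") %>% html_text(trim = TRUE),
  price = items %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)
print(hotels)
```

With the page-wide vectors, the price of Hotel C would end up next to Hotel B. In the item-relative version, Hotel B gets an NA price instead.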

How do I add a loop when using R to scrape data?

I'm trying to create a database of crime data by zip code based on Trulia.com's data. I have the code below but so far it only produces 1 line of data. In the code below, Zipcodes is just a list of US zip codes. Can anyone tell me what I need to add to make this run through my entire list "i" ?
Here is a link to one of the Trulia pages for reference: https://www.trulia.com/real_estate/20004-Washington/crime/
UPDATE:
Here are zip codes for download: https://www.dropbox.com/s/uxukqpu0v88d7tf/Zip%20Code%20Database%20wo%20Boston.xlsx?dl=0
I also changed the code a bit this time after realizing the crime stats appear in different orders depending on the zip code. Is it possible to have the loop produce 4 lines per zipcode? This currently works but only produces the last zip code in the dataset. I can't figure out how to make sure each zip code's data is recorded on separate lines, so it doesn't overwrite and only leave one line of the last zip code.
Please help!!
library(rvest)
data=data.frame(Zipcodes)
for(i in data$Zip.Code)
{
site <- paste("https://www.trulia.com/real_estate/",i,"-Boston/crime/", sep="")
site <- html(site)
crime<- data.frame(zip =i,
type =site %>% html_nodes(".brs") %>% html_text() ,
stringsAsFactors=FALSE)
}
View(crime)
If that code doesn't work, try this:
data=data.frame(Zillow_Data_for_R_Test)
for(i in data$Zip.Code)
site <- paste("https://www.trulia.com/real_estate/",i,"-Boston/crime/", sep="")
site <- read_html(site)
crime<- data.frame(zip =i,
theft =site %>% html_nodes(".crime-text-0") %>% html_text() ,
assault =site %>% html_nodes(".crime-text-1") %>% html_text() ,
arrest =site %>% html_nodes(".crime-text-2") %>% html_text() ,
vandalism =site %>% html_nodes(".crime-text-3") %>% html_text() ,
robbery =site %>% html_nodes(".crime-text-4") %>% html_text() ,
type =site %>% html_nodes(".clearfix") %>% html_text() ,
stringsAsFactors=FALSE)
View(crime)
The comment by @r2evans already provides an answer. Since @ShanCham asked how to actually implement this, I wanted to provide the following code, which is just more verbose than the comment and therefore could not be posted as an additional comment.
library(rvest)
#only two exemplary zipcodes, could be more, of course
zipcodes <- c("02110", "02125")
crime <- lapply(zipcodes, function(z) {
site <- read_html(paste0("https://www.trulia.com/real_estate/",z,"-Boston/crime/"))
#for illustrative purposes:
#introduced as.numeric to numeric columns
#excluded some of your other columns and shortened the current text in type
data.frame(zip = z,
theft = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric() ,
type = site %>% html_nodes(".clearfix") %>% html_text() %>% paste(collapse = " ") %>% substr(1, 50) ,
stringsAsFactors=FALSE)
})
class(crime)
#list
#Output are lists that can be bound together to one data.frame
crime <- do.call(rbind, crime)
#crime is a data.frame, hence, classes/types are kept
class(crime$type)
# [1] "character"
class(crime$assault)
# [1] "numeric"
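If you prefer to keep a for loop, the same idea works as long as each iteration stores its result in its own slot instead of overwriting `crime`. The sketch below is only an illustration: `scrape_zip()` is a hypothetical stand-in for the read_html/html_nodes step, the third zip code is a made-up value that is forced to fail, and the tryCatch shows one possible way to skip pages that cannot be read.

```r
zipcodes <- c("02110", "02125", "00000")

# Hypothetical stand-in for the scraping step; fails for one zip code on purpose
scrape_zip <- function(z) {
  if (z == "00000") stop("page not found")
  data.frame(zip = z, n_chars = nchar(z), stringsAsFactors = FALSE)
}

# pre-allocate one slot per zip code so nothing gets overwritten
results <- vector("list", length(zipcodes))
for (k in seq_along(zipcodes)) {
  # on error, store NULL for this zip code and carry on with the next one
  results[k] <- list(tryCatch(scrape_zip(zipcodes[k]), error = function(e) NULL))
}

# drop the failed pages and bind the rest into one data.frame
crime <- do.call(rbind, Filter(Negate(is.null), results))
print(crime)
```

The assignment uses `results[k] <- list(...)` rather than `results[[k]] <- ...` because assigning NULL with `[[` would delete the slot instead of storing NULL.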

How to scrape the data when there's missing values in selector nodes

Hi, I am trying to scrape data from eBay in R. I used the code below, but I ran into a problem: some listings have missing values for particular selector elements. To get around it I used a for loop, as shown (inspecting each listing and hard-coding the positions where data was missing). Since only a small amount of data was scraped, inspecting it by hand was possible, but how can I do this when there is a large amount of data to scrape?
Thanks in advance
library(rvest)
url<-"https://www.ebay.in/sch/i.html_from=R40&_sacat=0&LH_ItemCondition=4&_ipg=100&_nkw=samsung+j7"
web<- read_html(url)
subdescp<- html_nodes(web, ".lvsubtitle+ .lvsubtitle")
subdescp1<-html_text(subdescp)
head(subdescp1)
library(stringr)
subdescp1<- str_replace_all(subdescp1, "[\t\n\r]" , "")
head(subdescp1)
for (i in c(5,6,10,19,33,34,35)){
a<-subdescp1[1:(i-1)]
b<-subdescp1[i:length(subdescp1)]
subdescp1<-append(a,list("NA"))
subdescp1<-append(subdescp1,b)
}
Z<-as.character(subdescp1)
Z
webpage <- read_html(url)
Descp_data_html <- html_nodes(webpage,'.vip')
Descp_data <- html_text(Descp_data_html)
head(Descp_data)
price_data_html <- html_nodes(web,'.prc .bold')
price_data <- html_text(price_data_html)
head(price_data)
library(stringr)
price_data<-str_replace_all(price_data, "[\t\n]" , "")
price_data<-gsub("Rs. ","",price_data)
price_data<-gsub(",","",price_data)
price_data<- as.numeric(price_data)
price_data
Desc_data_html <- html_nodes(webpage,'.lvtitle+ .lvsubtitle')
Desc_data <- html_text(Desc_data_html, trim = TRUE)
head(Desc_data)
j7_f2<-data.frame(Title = Descp_data, Description= Desc_data, Sub_Description= Z, Pirce = price_data)
For instance you can use something like this.
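library(rvest)
library(xml2)
data <- read_html("url.xml")
# "//node" and "./var_missing" are XPath expressions, so pass them via xpath =
var <- data %>% html_nodes(xpath = "//node") %>% xml_text()
# observations that don't have certain nodes - fill them with NA
var_pair <- data %>% html_nodes("node_var_pair")
var_missing_clean <- sapply(var_pair, function(x) {
# html_node() returns a missing node when nothing matches; xml_text() of it is NA
xml_text(html_node(x, xpath = "./var_missing"))
})
df <- data.frame(var, var_pair = xml_text(var_pair), var_missing = var_missing_clean)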
Here there are three types of nodes that you may consider. var gathers the nodes that do not have missing data. var_pair includes the nodes that you want to pair with the nodes that contain missing observations, and var_missing refers to the nodes with missing information. You can create variables and aggregate them in a data frame (df).
The process here is simple and in two steps -- First extract all nodes at the block level (not each element and don't convert to text). This is a list of length equal to the number of blocks. Second from this extracted list extract each element as text and clean it. Since this is being done from a list, NA's where applicable are automatically coerced in the right places. See an example from the same ebay India site:
library(rvest)
library(stringr)
# specify the url
url <-"https://www.ebay.in/sch/Mobile-Phones"
# read the page
web <- read_html(url)
# define the supernode that has the entire block of information
super_node <- '.li'
# read as vector of all blocks of supernode (imp: use html_nodes function)
super_node_read <- html_nodes(web, super_node)
# define each node element that you want
node_model_details <- '.lvtitle'
node_description_1 <- '.lvtitle+ .lvsubtitle'
node_description_2 <- '.lvsubtitle+ .lvsubtitle'
node_model_price <- '.prc .bold'
node_shipping_info <- '.bfsp'
# extract the output for each as cleaned text (imp: use html_node function)
model_details <- html_node(super_node_read, node_model_details) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_1 <- html_node(super_node_read, node_description_1) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_2 <- html_node(super_node_read, node_description_2) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
model_price <- html_node(super_node_read, node_model_price) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
shipping_info <- html_node(super_node_read, node_shipping_info) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
# create the data.frame
mobile_phone_data <- data.frame(
model_details,
description_1,
description_2,
model_price,
shipping_info
)
