How to extract max page number from html using rvest - r

I am having trouble extracting the max/last page number shown at the bottom of the page. I have tried the following code and don't understand what I am doing wrong. My goal is to extract only the number (currently 96); if that isn't possible, then at least the href that contains the last page number, since I could get the number from that.
#Example Page
page <- read_html("https://www.yachtworld.com/boats-for-sale/condition-used/type-sail/sort-price:asc/?currency=CAD&year=1990-2018&length=38-45&price=0-200000")
page %>% html_nodes(".nav-next") %>% html_attr("href") #my attempt to extract the number 96
page %>% html_nodes(".search-page-nav a") %>% html_attr("href") #my attempt to extract the href
{xml_nodeset (0)} #this is what is returned in both cases

The pagination links are only added to the page once JavaScript has run, and rvest does not execute JavaScript, so those nodes are missing from the HTML it receives. One alternative is to calculate the number of pages from text that is present in the static HTML: the number of results per page and the total result count.
library(magrittr)
library(rvest)
library(stringr)
page <- read_html('https://www.yachtworld.com/boats-for-sale/condition-used/type-sail/sort-price:asc/?currency=CAD&year=1990-2018&length=38-45&price=0-200000&page=1')
result_text <- page %>% html_node('.page-selector-text') %>% html_text()
results_per_page <- result_text %>% stringr::str_match('- (\\d+)') %>% .[2] %>% as.integer()
num_results <- result_text %>% stringr::str_match('of ([0-9,]+)') %>% .[2] %>% gsub(',','', .) %>% as.integer()
num_pages <- ceiling(num_results/results_per_page)
print(num_pages)
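The arithmetic can be checked offline with a small sketch. The string below is a made-up stand-in for the text of '.page-selector-text'; the wording on the real page may differ, so the patterns may need adjusting.

```r
library(stringr)

# Hypothetical text standing in for html_text() of '.page-selector-text'
result_text <- "Viewing 1 - 15 of 1,433"

# Upper bound of the displayed range, i.e. results per page on page 1
results_per_page <- as.integer(str_match(result_text, "- (\\d+)")[, 2])

# Total result count, with the thousands separator removed
num_results <- as.integer(gsub(",", "", str_match(result_text, "of ([0-9,]+)")[, 2]))

# Round up so a partly filled last page still counts
num_pages <- ceiling(num_results / results_per_page)
print(num_pages)
```

With these made-up numbers, 1433 results at 15 per page round up to 96 pages. Note that this only works on the first page, where the upper bound of the range equals the page size.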

Related

Efficiency in extracting data from webscraping in R

This is no doubt very simple so apologies but I am new to webscraping and am trying to extract multiple datapoints in one call using rvest. Let's take for example the following code (NB I have not used the actual website which I have replaced in this code snippet with xxxxxx.com):
univsalaries <- lapply(paste0('https://xxxxxx.com/job/p', 1:20,'/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
function(url_base){
url_base %>% read_html() %>%
html_nodes('.salary') %>%
html_text()
})
Let's say there is another html node I want to scrape (.company). Obviously I can make a separate call and fetch that data, but I want to understand the syntax of how I could extract the information in the same call.
I tried to put it in the following structure, but the code sent me to the debugger
.... function(url_base){
url_base %>% read_html() %>%
Salary <- univsalaries %>%
html_nodes('.salary') %>% html_text()
Company <- univsalaries %>%
html_nodes('.company') %>% html_text()
dt<-tibble(Salary,Company)
})
Read the webpage once and then you can extract multiple values from the same page.
library(purrr)
library(rvest)
univsalaries <- map(paste0('https://xxxxxx.com/job/p', 1:20,'/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
function(url_base){
webpage <- url_base %>% read_html()
data.frame(Salary = webpage %>% html_nodes('.salary') %>% html_text(),
Company = webpage %>% html_nodes('.company') %>% html_text())
})
This would give you a list of dataframes (one for every link), if you need one combined dataframe then use map_df instead of map.
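The difference between the two can be tried offline with a minimal sketch. The HTML strings and the helper name scrape_page are made up for illustration, and map_df needs the dplyr package installed.

```r
library(rvest)
library(purrr)

# Hypothetical pages given as HTML strings so the sketch runs offline
pages <- c(
  '<div><span class="salary">50k</span><span class="company">Acme</span></div>',
  '<div><span class="salary">60k</span><span class="company">Globex</span><span class="salary">70k</span><span class="company">Initech</span></div>'
)

scrape_page <- function(html) {
  webpage <- read_html(html)  # parse once, query twice
  data.frame(
    Salary  = webpage %>% html_nodes(".salary") %>% html_text(),
    Company = webpage %>% html_nodes(".company") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

per_page <- map(pages, scrape_page)     # list with one data frame per page
combined <- map_df(pages, scrape_page)  # single data frame, rows bound together
```

This assumes every page has the same number of '.salary' and '.company' nodes; if they differ, data.frame() will stop with an error about differing row counts.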

How can I extract a specific table from a website that has multiple tables in R?

I am trying to extract a table from https://www.basketball-reference.com/leagues/NBA_2018.html. The table I want is "Team Per Game Stats". The webpage has multiple tables, but when I try to extract them I only get the first two tables on the page.
How can I get the table I want using R? The code I used is below.
library(rvest)
url <- "https://www.basketball-reference.com/leagues/NBA_2018.html"
# read the link
html <-read_html(url)
tables <- html %>% html_table(fill =TRUE)
View(tables)
The table is commented out in the page source, so html_table() never sees it. You can select the comment nodes with XPath, re-parse their text as HTML, and then grab the table you want.
library(rvest)
page <- read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')
df <- page %>% html_nodes(xpath = '//comment()') %>%
html_text() %>%
paste(collapse = '') %>%
read_html() %>%
html_node('#team-stats-per_game') %>%
html_table()
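The same technique can be tried without a network connection. In this sketch the HTML string and the id 'team-stats' are made up for illustration; only the comment-extraction steps mirror the answer above.

```r
library(rvest)

# Hypothetical page: the only table is wrapped in an HTML comment
html <- '<html><body>
<div id="wrapper">
<!--
<table id="team-stats"><tr><th>Team</th><th>PTS</th></tr>
<tr><td>A</td><td>110.5</td></tr>
<tr><td>B</td><td>104.2</td></tr></table>
-->
</div></body></html>'

page <- read_html(html)

# The parser sees no table elements, only a comment node
visible_tables <- length(html_nodes(page, "table"))

df <- page %>%
  html_nodes(xpath = "//comment()") %>%  # select the comment nodes
  html_text() %>%                        # comment bodies as strings
  paste(collapse = "") %>%
  read_html() %>%                        # re-parse the comment text as HTML
  html_node("#team-stats") %>%
  html_table()
```

Here visible_tables is 0, while df holds the two data rows recovered from the comment.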

Rvest scraping multiple data in one function

I know how to loop over a paginated site, but I want to scrape several pieces of information (several html_nodes selectors) in one function, and I am not sure whether that can be set up. So far I have tried the following. It is a job search website, and I want the company name, the company description and the number of open positions.
I use sprintf to get page 1-14.
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a loop, which works to scrape one data source.
virksomhed <- function(company){
company %>% read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
}
virk <- lapply(urlingtek, virksomhed)
But I wish to scrape all of the fields at once if possible.
I have so far tried using
jobvirksom <- function(alt){
alt %>%
read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
html_nodes('.jix_active a') %>%
html_text()
}
So far without any luck. Would be a lot better if I could scrape it all at once, press lapply and turn into one list.
Here is the start of a solution. With only 14 web pages to parse, it is sometimes easier to just use a loop; at this size the difference in run time between a for loop and lapply is insignificant.
Note that the web pages are not consistently formatted, so this solution will need additional work to handle missing or inconsistent data. It works for the first 2 pages and fails on the third, where an overview is missing and the three vectors no longer have the same length.
library(rvest)
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
#define empty data frame to store all data
alllistings<-data.frame()
for (i in urlingtek){
print(i)
#read the page just once
page<-read_html(i)
#parse company name
company<-page%>%html_nodes('.jix_company_name_link a') %>% html_text()
#remove blank company names
company<-trimws(company)
company<-company[nchar(company)>1]
#parse company overview
overv<-page %>% html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
#parse active information
active<-page %>% html_nodes('.jix_active a') %>% html_text()
#create temporary dataframe to store data from this loop
tempdf<-data.frame(company, overv, active)
#combine temp with all data
alllistings<-rbind(alllistings, tempdf)
}
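One way to cope with missing fields is to select one container node per listing and then use html_node() (singular) inside each container, which yields NA where nothing matches. The sketch below uses made-up markup; the class names '.listing', '.name', '.overview' and '.active' are assumptions and not the real jobindex.dk selectors.

```r
library(rvest)

# Hypothetical markup: '.listing' wraps one company; the second has no overview
html <- '<div>
<div class="listing"><a class="name">Alpha</a><p class="overview">Builds bridges</p><span class="active">3</span></div>
<div class="listing"><a class="name">Beta</a><span class="active">1</span></div>
</div>'

listings <- read_html(html) %>% html_nodes(".listing")

# html_node() returns exactly one result per listing, NA where nothing matches,
# so the three columns always have the same length
result <- data.frame(
  company = listings %>% html_node(".name") %>% html_text(),
  overv   = listings %>% html_node(".overview") %>% html_text(),
  active  = listings %>% html_node(".active") %>% html_text(),
  stringsAsFactors = FALSE
)
```

Applying this to the real site would require finding an element that wraps each company entry, which has not been verified here.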

using rvest to scrape data from a long table with pagination

I am new to data scraping and trying to use rvest to scrape all the salary data from the long table on this website:
https://www.fedsdatacenter.com/federal-pay-rates/
As expected, the following code gives me the variable names of the data:
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
names <- url %>%
read_html() %>%
html_node('thead') %>%
html_text()
However, why does this code give me no data?
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
dat <- url %>%
read_html() %>%
html_node('tbody') %>%
html_text()
I followed an example in this article: http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
sal <- url %>%
read_html() %>%
html_node('#table-example') %>%
html_table(fill=TRUE)
Again, it produces only column names with no data.
Also, how should I read through all the tens of thousands of pages to get all the data from the table? I suspect that I need to use the information in "#table-example_wrapper > div:nth-child(2) > div", but don't know how. Could anyone help?

Extracting multiple pieces of text from multiple web pages

The first part of this code (up to "pages") successfully retrieves the pages from which I want to scrape. I'm then struggling to find a way to extract pieces of article text, with the associated dates, as a data frame.
I get:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
Any guidance on elegance, clarity and efficiency also welcome as this is personal learning.
library(rvest)
library(tidyverse)
library(plyr)
library(stringr)
llply(1:2, function(i) {
read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
html_nodes(".Headline--regular a") %>%
html_attr("href") %>%
url_absolute("http://www.thetimes.co.uk")
}) -> links
pages <- links %>% unlist() %>% map(read_html)
map_df(pages, function(x) {
text = read_html(x) %>%
html_nodes(".Article-content p") %>%
html_text() %>%
str_extract(".+skills.+")
date = read_html(x) %>%
html_nodes(".Dateline") %>%
html_text()
}) -> article_df
Nice, you were nearly there! There are two mistakes here:
The variable pages already contains the parsed HTML documents. Applying read_html again to a single page (i.e. inside map_df) therefore doesn't work, which is what causes the error message you get.
The function inside map_df isn't correct. As there is no explicit return, the last evaluated value is returned, which is date; the variable text is discarded. You have to pack the two variables into a data frame.
The following contains the fixed code.
article_df <- map_df(pages, function(x) {
tibble(
text = x %>%
html_nodes(".Article-content p") %>%
html_text() %>%
str_extract(".+skills.+"),
date = x %>%
html_nodes(".Dateline") %>%
html_text()
)
})
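The corrected pattern can be tried offline with a minimal sketch. The HTML strings are made up, and html_node() (singular) is used for the date on the assumption that each article has a single dateline, which differs slightly from the code above.

```r
library(rvest)
library(purrr)
library(tibble)
library(stringr)

# Hypothetical article pages as HTML strings so the sketch runs offline
raw <- c(
  '<div class="Dateline">1 May</div><div class="Article-content"><p>Tech skills matter.</p><p>Other text.</p></div>',
  '<div class="Dateline">2 May</div><div class="Article-content"><p>More on skills here.</p></div>'
)

# Parse each page once; the function below works on the parsed documents
pages <- map(raw, read_html)

article_df <- map_df(pages, function(x) {
  tibble(
    # One row per paragraph; NA where the paragraph does not mention "skills"
    text = x %>% html_nodes(".Article-content p") %>% html_text() %>% str_extract(".+skills.+"),
    # Single date per page, recycled across that page's rows
    date = x %>% html_node(".Dateline") %>% html_text()
  )
})
```

Paragraphs that do not match the pattern end up as NA rows; they can be dropped afterwards with article_df[!is.na(article_df$text), ] if only matching text is wanted.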
Also a few comments on the code itself:
I think it is better to use <- instead of ->. This way one can more easily find where the variable is assigned and if one uses 'speaking variable names' it is much easier to understand the code.
I'd prefer using the package purrr instead of plyr. purrr is part of the tidyverse package. So, instead of the function llply you could simply use map. There is a nice article on purrr vs plyr.
links <- map(1:2, function(i) {
read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
html_nodes(".Headline--regular a") %>%
html_attr("href") %>%
url_absolute("http://www.thetimes.co.uk")
})
