Extracting multiple pieces of text from multiple web pages in R

The first part of this code (up to "pages") successfully retrieves the pages from which I want to scrape. I'm then struggling to find a way to extract pieces of article text, with the associated dates, as a data frame.
I get:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
Any guidance on elegance, clarity and efficiency is also welcome, as this is for personal learning.
library(rvest)
library(tidyverse)
library(plyr)
library(stringr)
llply(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
    html_nodes(".Headline--regular a") %>%
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
}) -> links
pages <- links %>% unlist() %>% map(read_html)
map_df(pages, function(x) {
  text = read_html(x) %>%
    html_nodes(".Article-content p") %>%
    html_text() %>%
    str_extract(".+skills.+")
  date = read_html(x) %>%
    html_nodes(".Dateline") %>%
    html_text()
}) -> article_df

Nice, you were nearly there! There are two mistakes here:
The variable pages already contains the parsed HTML, so applying read_html again to a single page (i.e. inside map_df) doesn't work; that is the error message you get.
The function inside map_df isn't correct: because there is no explicit return, only the last computed value, date, is returned and the variable text is discarded. You have to pack both variables into a data frame.
Here is the fixed code:
article_df <- map_df(pages, function(x) {
  data_frame(
    text = x %>%
      html_nodes(".Article-content p") %>%
      html_text() %>%
      str_extract(".+skills.+"),
    date = x %>%
      html_nodes(".Dateline") %>%
      html_text()
  )
})
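Note that str_extract() returns NA for paragraphs that don't mention "skills", so non-matching paragraphs still produce rows with text = NA. If you only want the matching paragraphs (an assumption about what you're after), you could drop those rows afterwards:
article_df <- article_df %>% filter(!is.na(text))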
Also a few comments on the code itself:
I think it is better to use <- instead of ->. This way you can more easily find where a variable is assigned, and with descriptive variable names the code becomes much easier to understand.
I'd prefer the package purrr over plyr. purrr is part of the tidyverse, so instead of llply you can simply use map. There is a nice article on purrr vs plyr.
links <- map(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
    html_nodes(".Headline--regular a") %>%
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
})

Related

Efficiency in extracting data from webscraping in R

This is no doubt very simple, so apologies, but I am new to web scraping and am trying to extract multiple data points in one call using rvest. Take for example the following code (NB I have not used the actual website, which I have replaced in this snippet with xxxxxx.com):
univsalaries <- lapply(paste0('https://xxxxxx.com/job/p', 1:20, '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.salary') %>%
      html_text()
  })
Let's say there is another html node I want to scrape (.company). Obviously I can make a separate call and fetch that data, but I want to understand the syntax of how I could extract the information in the same call.
I tried to put it in the following structure, but the code sent me to the debugger
.... function(url_base){
url_base %>% read_html() %>%
Salary <- univsalaries %>%
html_nodes('.salary') %>% html_text()
Company <- univsalaries %>%
html_nodes('.company') %>% html_text()
dt<-tibble(Salary,Company)
})
Read the webpage once and then you can extract multiple values from the same page.
library(purrr)
library(rvest)
univsalaries <- map(paste0('https://xxxxxx.com/job/p', 1:20, '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    webpage <- url_base %>% read_html()
    data.frame(Salary = webpage %>% html_nodes('.salary') %>% html_text(),
               Company = webpage %>% html_nodes('.company') %>% html_text())
  })
This gives you a list of data frames (one for every link); if you need one combined data frame, use map_df instead of map, as sketched below.
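For example, a minimal sketch of the map_df variant (same URLs and selectors as above, so the per-page data frames are row-bound into one):
urls <- paste0('https://xxxxxx.com/job/p', 1:20, '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2')
univsalaries <- map_df(urls, function(url_base){
  webpage <- url_base %>% read_html()
  data.frame(Salary = webpage %>% html_nodes('.salary') %>% html_text(),
             Company = webpage %>% html_nodes('.company') %>% html_text())
})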

How to extract max page number from html using rvest

I am having issues extracting the max/last page number at the bottom of the page. I have tried the following code but don't understand what I am doing wrong. My goal is to extract only the number (currently 96), and if that isn't possible then at least to extract the href that contains the last page number (and I could just get the number from that).
#Example Page
page <- read_html("https://www.yachtworld.com/boats-for-sale/condition-used/type-sail/sort-price:asc/?currency=CAD&year=1990-2018&length=38-45&price=0-200000")
page %>% html_nodes(".nav-next") %>% html_attr("href") #my attempt to extract the number 96
page %>% html_nodes(".search-page-nav a") %>% html_attr("href") #my attempt to extract the href
{xml_nodeset (0)} #this is what is returned in both cases
[Screenshot: SelectorGadget highlighting the desired node.]
[Screenshot: the relevant chunk from the page source.]
The pagination links require JavaScript to run on the page in order to be present, and rvest does not execute JavaScript. Instead, one option is to calculate the number of pages from text that is present on the page: the number of results per page and the total result count.
library(magrittr)
library(rvest)
library(stringr)
page <- read_html('https://www.yachtworld.com/boats-for-sale/condition-used/type-sail/sort-price:asc/?currency=CAD&year=1990-2018&length=38-45&price=0-200000&page=1')
result_text <- page %>% html_node('.page-selector-text') %>% html_text()
results_per_page <- result_text %>% stringr::str_match('- (\\d+)') %>% .[2] %>% as.integer()
num_results <- result_text %>% stringr::str_match('of ([0-9,]+)') %>% .[2] %>% gsub(',','', .) %>% as.integer()
num_pages <- ceiling(num_results/results_per_page)
print(num_pages)
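If you then want to visit every results page, one option (a sketch, assuming the site accepts the same page= query parameter for each page) is to rebuild the URLs from num_pages and read them in turn:
base_url <- 'https://www.yachtworld.com/boats-for-sale/condition-used/type-sail/sort-price:asc/?currency=CAD&year=1990-2018&length=38-45&price=0-200000&page='
page_urls <- paste0(base_url, seq_len(num_pages))
# read each listing page once; further node extraction would go here
pages <- lapply(page_urls, read_html)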

Mutating a new column on a dataframe inside a list

Sorry if this question has already been answered; I have searched without success.
I scraped 10 seasons of NBA data and stored the datasets inside a list, but the main problem is that the datasets don't contain a column with the season year, which makes it difficult to identify which season each dataset comes from.
So what I'm looking to do is mutate a new column based on a vector of seasons, so that each dataset carries its season year.
This is what I have tried:
library(tidyverse)
library(rvest)
library(xml2)
season_scrape <- c(2010:2019)
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", season_scrape, "_totals.html")
scrape_function <- function(url){
  season_stats <- url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table() %>%
    mutate(season_year = season_scrape)
}
season_data <- lapply(url, scrape_function)
What would you recommend: mutating inside scrape_function, or adding the column after getting the datasets into the list?
Thanks in advance.
You can handle this in multiple ways. One way is to pass an additional year parameter in the function and apply the function using Map instead of lapply.
library(dplyr)
library(rvest)
scrape_function <- function(url, year){
  url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table() %>%
    mutate(season_year = year)
}
season_data <- Map(scrape_function, url, season_scrape)
If you need to bind the data together into one dataframe, you can also use map2_df from purrr.
season_data <- purrr::map2_df(url, season_scrape, scrape_function)
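If you prefer to keep the Map() version, the resulting list can also be bound into one data frame afterwards; a short sketch with dplyr:
# season_data is the list returned by Map() above
season_df <- dplyr::bind_rows(season_data)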

R: Converting characters to numbers in an R data.frame

A question regarding this data extraction I did: I would like to create a bar chart with the data, but unfortunately I am unable to convert the extracted characters to numbers inside R. If I edit the file in a text editor there's no problem at all, but I'd like to do the whole process in R. Here is the code:
install.packages("rvest")
library(rvest)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table()
str(corporatetax)
As a result, corporatetax contains a data.frame with 3 variables, all of them character. My question, which I've not been able to resolve, is how I should proceed to convert the second and third columns to numbers in order to create a bar chart. I've tried with sapply() and dplyr but did not find a correct way to do it.
Thanks!
You might try to clean up the table like this
library(rvest)
library(stringr)
library(dplyr)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  # your xpath matches a single table, so you can use html_node() instead of html_nodes()
  html_node(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table() %>%
  as_tibble() %>%
  setNames(c("country", "corporate_tax", "combined_tax"))
corporatetax %>%
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", ""))/100,
         combined_tax = as.numeric(str_replace(combined_tax, "%", ""))/100)
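From there the bar chart is straightforward. A minimal sketch with ggplot2 (assuming you first assign the mutated table back to corporatetax; only the corporate rate is plotted here):
library(ggplot2)
corporatetax <- corporatetax %>%
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", ""))/100,
         combined_tax = as.numeric(str_replace(combined_tax, "%", ""))/100)
ggplot(corporatetax, aes(x = reorder(country, corporate_tax), y = corporate_tax)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Corporate tax rate")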

Rvest: scraping multiple pieces of data in one function

I know how to loop when a page is paginated, but I would like to scrape several pieces of information (several html_nodes) in one loop function, and I am not sure whether that can be set up. So far I have tried the following. It's basically a job-search website where I want the company name, company description and number of open positions.
I use sprintf to get pages 1-14.
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a loop, which works to scrape one data source.
company <- function(virksomhed){
  virksomhed %>% read_html() %>%
    html_nodes('.jix_company_name_link a') %>%
    html_text()
}
virk <- lapply(urlingtek, company)
But I wish to scrape all of these pieces of information at once if possible.
I have so far tried:
jobvirksom <- function(alt){
alt %>%
read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
html_nodes('.jix_active a') %>%
html_text()
}
So far without any luck. It would be a lot better if I could scrape it all at once, pass it to lapply and turn it into one list.
Here is the start of a solution. In this case, with only 14 web pages to parse, it is sometimes easier to just use a loop; at this number of pages the time difference between a for loop and lapply is insignificant.
I notice the web pages are not consistently formatted, so this solution will need additional work where data is missing or inconsistent. It will work for the first two pages and fail on the third, where the overview is missing (see the sketch after the code below).
library(rvest)
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
#define empty data frame to store all data
alllistings<-data.frame()
for (i in urlingtek){
  print(i)
  # read the page just once
  page <- read_html(i)
  # parse company name
  company <- page %>% html_nodes('.jix_company_name_link a') %>% html_text()
  # remove blank company names
  company <- trimws(company)
  company <- company[nchar(company) > 1]
  # parse company overview
  overv <- page %>% html_nodes('.jix_companyindex_overview_ad_content') %>% html_text()
  # parse active information
  active <- page %>% html_nodes('.jix_active a') %>% html_text()
  # create temporary data frame to store data from this loop
  tempdf <- data.frame(company, overv, active)
  # combine temp with all data
  alllistings <- rbind(alllistings, tempdf)
}
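One simple way to keep the loop from failing on the inconsistent pages, as a sketch: only bind pages where the three node counts agree and flag the rest (properly matching overviews to the right company on those pages would need per-listing parsing):
# inside the loop, replace the last two steps with a length check
if (length(company) == length(overv) && length(active) == length(company)) {
  tempdf <- data.frame(company, overv, active)
  alllistings <- rbind(alllistings, tempdf)
} else {
  message("Skipping ", i, ": node counts differ on this page")
}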
