Efficiency in extracting data from web scraping in R

This is no doubt very simple, so apologies in advance, but I am new to web scraping and am trying to extract multiple data points in one call using rvest. Take, for example, the following code (NB: I have not used the actual website, which I have replaced in this snippet with xxxxxx.com):
library(rvest)

univsalaries <- lapply(paste0('https://xxxxxx.com/job/p', 1:20,
                              '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
                       function(url_base){
                         url_base %>% read_html() %>%
                           html_nodes('.salary') %>%
                           html_text()
                       })
Let's say there is another HTML node I want to scrape (.company). Obviously I could make a separate call and fetch that data, but I want to understand the syntax for extracting both pieces of information in the same call.
I tried to put it into the following structure, but the code sent me to the debugger:
.... function(url_base){
  url_base %>% read_html() %>%
  Salary <- univsalaries %>%
    html_nodes('.salary') %>% html_text()
  Company <- univsalaries %>%
    html_nodes('.company') %>% html_text()
  dt <- tibble(Salary, Company)
})

Read the webpage once and then you can extract multiple values from the same page.
library(purrr)
library(rvest)
univsalaries <- map(paste0('https://xxxxxx.com/job/p', 1:20,
                           '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
                    function(url_base){
                      webpage <- url_base %>% read_html()
                      data.frame(Salary  = webpage %>% html_nodes('.salary') %>% html_text(),
                                 Company = webpage %>% html_nodes('.company') %>% html_text())
                    })
This would give you a list of dataframes (one for every link), if you need one combined dataframe then use map_df instead of map.
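For instance, a minimal sketch of the combined version (same placeholder URL as above; the all_salaries name is just illustrative), where map_df() row-binds the per-page data frames into one:
library(purrr)
library(rvest)

# map_df() applies the function to every URL and row-binds the results
all_salaries <- map_df(paste0('https://xxxxxx.com/job/p', 1:20,
                              '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
                       function(url_base){
                         webpage <- url_base %>% read_html()
                         data.frame(Salary  = webpage %>% html_nodes('.salary') %>% html_text(),
                                    Company = webpage %>% html_nodes('.company') %>% html_text())
                       })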

Related

Mutating a new column on a dataframe inside a list

Sorry if this question has already been solved; I have searched without success to resolve this doubt.
I scraped 10 seasons of NBA data and stored the datasets inside a list, but the main problem is that the datasets have no column with the year of the season, which makes it difficult to identify which season each dataset comes from.
So what I am looking to do is mutate a new column, based on a vector of seasons, that identifies the year of the season.
This is what I have tried:
library(tidyverse)
library(rvest)
library(xml2)
season_scrape <- c(2010:2019)
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", season_scrape, "_totals.html")
scrape_function <- function(url){
  season_stats <- url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table() %>%
    mutate(season_year = season_scrape)
}
season_data <- lapply(url, scrape_function)
What would you recommend: mutate inside scrape_function, or after getting the datasets into the list?
Thanks in advance.
You can handle this in multiple ways. One way is to pass an additional year parameter in the function and apply the function using Map instead of lapply.
library(dplyr)
library(rvest)
scrape_function <- function(url, year){
  url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table() %>%
    mutate(season_year = year)
}
season_data <- Map(scrape_function, url, season_scrape)
If you need to bind the data together into one dataframe, you can also use map2_df from purrr.
season_data <- purrr::map2_df(url, season_scrape, scrape_function)

How can I extract a specific table from a website that has multiple tables in R?

I am trying to extract a table from https://www.basketball-reference.com/leagues/NBA_2018.html. The table I want is "Team Per Game Stats". This webpage has multiple tables, and when I try to extract them I only get the first two tables from the page.
How can I get the table I want using R? The code I used is below:
library(rvest)
url <- "https://www.basketball-reference.com/leagues/NBA_2018.html"
# read the link
html <- read_html(url)
tables <- html %>% html_table(fill = TRUE)
View(tables)
The table you want is commented out in the page source. You can grab the comments with XPath and then pull out the table you want:
library(rvest)
page <- read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')
df <- page %>% html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#team-stats-per_game') %>%
  html_table()

Web scraping in R with rvest and xml2: extract a table

I wish to extract the table with the ranks and returns from this sample URL:
https://www.valueresearchonline.com/funds/fundSelector/returns.asp?cat=10&exc=susp%2Cclose&rettab=st
So far I have tried rvest:
# Reading the HTML code from the website
webpage <- read_html(urlString)

# Using CSS selectors to scrape the section
tables <- webpage %>% html_node("tr") %>% html_text()
tables <- html_node(".fundtool_cat") %>% html_text()
I need a dataframe/table with the name of the scheme along with the ranks and returns for all the periods mentioned.
library(rvest)
urlString <- "https://www.valueresearchonline.com/funds/fundSelector/returns.asp?cat=10&exc=susp%2Cclose&rettab=st"
urlString %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="fundCatData"]/table[1]') %>%
  html_table(fill = TRUE)
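Since html_nodes() returns a node set, html_table() here yields a list of data frames (one per matched node) rather than a single data frame. A small follow-up sketch (the returns_list and returns_df names are just illustrative):
returns_list <- urlString %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="fundCatData"]/table[1]') %>%
  html_table(fill = TRUE)

# the node set yields a list of tables; take the first (and only) match
returns_df <- returns_list[[1]]
head(returns_df)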

Rvest scraping multiple data in one function

I know how to loop when a page is paginated, but I wish to scrape multiple pieces of information/html_nodes in one loop function, and I am not sure whether that can be set up. So far I have tried the following. It is basically a job-search website, where I want the company name, the company description, and the number of open positions.
I use sprintf to get pages 1-14:
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a loop, which works to scrape one data source.
virksomhed <- function(url){
  url %>% read_html() %>%
    html_nodes('.jix_company_name_link a') %>%
    html_text()
}
virk <- lapply(urlingtek, virksomhed)
But I wish to scrape all of the details at once if possible. I have so far tried using:
jobvirksom <- function(alt){
  alt %>%
    read_html() %>%
    html_nodes('.jix_company_name_link a') %>%
    html_text()
    html_nodes('.jix_companyindex_overview_ad_content') %>%
    html_text()
    html_nodes('.jix_active a') %>%
    html_text()
}
So far without any luck. It would be a lot better if I could scrape it all at once, run lapply, and turn it into one list.
Here is the start of a solution. In this case, with only 14 web pages to parse, it is sometimes easier to just use a loop; with this number of pages the time difference between a for loop and lapply is insignificant.
I notice the web pages are not consistently formatted, so this solution will need additional work where the data is missing or inconsistent. It will work for the first two pages and fail on the third, where the overview is missing (see the sketch after the code below).
library(rvest)
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
# define an empty data frame to store all of the data
alllistings <- data.frame()

for (i in urlingtek){
  print(i)
  # read the page just once
  page <- read_html(i)
  # parse the company names
  company <- page %>% html_nodes('.jix_company_name_link a') %>% html_text()
  # remove blank company names
  company <- trimws(company)
  company <- company[nchar(company) > 1]
  # parse the company overview
  overv <- page %>% html_nodes('.jix_companyindex_overview_ad_content') %>%
    html_text()
  # parse the active information
  active <- page %>% html_nodes('.jix_active a') %>% html_text()
  # create a temporary data frame to store this page's data
  tempdf <- data.frame(company, overv, active)
  # combine with the full data set
  alllistings <- rbind(alllistings, tempdf)
}
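A minimal sketch of one way to keep the data.frame() call from erroring when the three node sets come back with different lengths (for example, a page where a listing has no overview): pad the shorter vectors with NA before binding. Note that this only avoids the error; it does not realign rows with the right companies, which would require walking each listing's own container node instead.
# pad a vector with NA up to length n (extending an atomic vector fills with NA)
pad_na <- function(x, n) { length(x) <- n; x }

# replace the data.frame() call inside the loop with a padded version
n <- max(length(company), length(overv), length(active))
tempdf <- data.frame(company = pad_na(company, n),
                     overv   = pad_na(overv, n),
                     active  = pad_na(active, n))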

using rvest to scrape data from a long table with pagination

I am new to data scraping and trying to use rvest to scrape all the salary data from the long table on this website:
https://www.fedsdatacenter.com/federal-pay-rates/
As expected, the following code gives me the variable names of the data:
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
names <- url %>%
  read_html() %>%
  html_node('thead') %>%
  html_text()
However, why does this code give me no data?
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
dat <- url %>%
  read_html() %>%
  html_node('tbody') %>%
  html_text()
I followed an example in this article: http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
sal <- url %>%
  read_html() %>%
  html_node('#table-example') %>%
  html_table(fill = TRUE)
Again, it produces only column names with no data.
Also, how should I read through all the tens of thousands of pages to get all the data from the table? I suspect that I need to use the information in "#table-example_wrapper > div:nth-child(2) > div", but don't know how. Could anyone help?
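No answer was recorded for this question, but one hedged possibility: the empty tbody suggests the table rows are filled in client-side by JavaScript (the #table-example id and the _wrapper selector are typical of a DataTables widget), so read_html() only ever sees the static header. Under that assumption, a minimal sketch using read_html_live() from recent versions of rvest (1.0.4 or later, which requires the chromote package) to render the page in a headless browser before extracting the table:
library(rvest)

url <- "https://www.fedsdatacenter.com/federal-pay-rates/"

# read_html_live() loads the page in a headless browser, so rows that are
# built by JavaScript are present in the DOM when the table is extracted
page <- read_html_live(url)

sal <- page %>%
  html_element('#table-example') %>%
  html_table()

head(sal)
Note that this only returns the rows currently rendered (the first page of the table); paging through the full data set would still require driving the page or locating the site's underlying data request, so treat this as a starting point rather than a complete answer.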
