rvest: scraping data with different lengths - R

As a practice project, I am trying to scrape property data from a website. (I only intend to practice my web scraping skills, with no intention of taking further advantage of the scraped data.) But I found that some properties don't have a price listed, which causes a length-mismatch error when I try to combine the vectors into one data frame.
Here is the code for scraping:
library(tidyverse)
library(rvest)

web_page <- read_html("https://wx.fang.anjuke.com/loupan/all/a1_p2/")

community_name <- web_page %>%
  html_nodes(".items-name") %>%
  html_text()
length(community_name)

listed_price <- web_page %>%
  html_nodes(".price") %>%
  html_text()
length(listed_price)

property_data <- data.frame(
  name = community_name,
  price = listed_price
)
How can I identify the properties with no listed price and fill the price variable with NA when no value is scraped?

Inspection of the web page shows that the class is .price when price has a value, and .price-txt when it does not. So one solution is to use an XPath expression in html_nodes() and match classes that start with "price":
listed_price <- web_page %>%
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>%
  html_text()
length(listed_price)
[1] 60
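An alternative is to iterate over the listing containers and pull the price per item, since html_node() returns a missing node (and hence NA from html_text()) when a selector doesn't match. This is only a sketch: the ".item-mod" wrapper class below is an assumption and should be checked against the page.
# NOTE: ".item-mod" is an assumed wrapper class for each listing; inspect the
# page and adjust the selector if it differs.
items <- web_page %>% html_nodes(".item-mod")

property_data <- data.frame(
  name  = items %>% html_node(".items-name") %>% html_text(),
  price = items %>% html_node(".price") %>% html_text(),  # NA when a listing has no .price node
  stringsAsFactors = FALSE
)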

Related

Scraping Dynamic JSON Data in R

On pgatour.com/stats I am trying to scrape multiple stats over multiple tournaments over multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, PGA's website looked like:
https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html
STAT_ID, YEAR_ID, and TOURNAMENT_ID would all change as you updated the particular stat, year, and tournament to correspond with their unique IDs. Because of this, I was able to use a function that sifted through all combinations of stat_id, year_id, and tournament_id to scrape the website.
Now the website URLs don't change except for the particular stat_id being searched. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged. This prevents targeting different tournaments or years.
https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id
@Dave2e has been very helpful in showing me that the PGA site uses JavaScript and how to access some of the JSON data. I combined his teachings with my past code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the JSON structure I see that there are IDs for $tournamentId and $year, but I'm uncertain how to use this info to search for past tournaments and years.
How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be trying to access this data with RSelenium as opposed to a program like rvest?
Code
library(tidyverse)
library(rvest)
library(dplyr)
df23 <- expand.grid(
  stat_id = c("02568", "02675", "101")
) %>%
  mutate(
    links = paste0(
      "https://www.pgatour.com/stats/detail/",
      stat_id
    )
  ) %>%
  as_tibble()

get_info <- function(link, stat_id) {
  data <- link %>%
    read_html() %>%
    html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON()
  answer <- data$props$pageProps$statDetails$rows %>%
    # NAs in player name stop data from being collected
    drop_na(playerName)
  # collapse the list of data frames into a single data frame,
  # then merge back with the original data frame
  answer2 <- answer$stats
  answer2 <- bind_rows(answer2, .id = "column_label") %>%
    select(-color) %>%
    pivot_wider(
      values_from = statValue,
      names_from = statName
    )
  # all stats combined and unnested
  stats2 <- dplyr::bind_cols(answer, answer2)
  stats2
}

test_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_stats <- test_stats %>%
  unnest(everything())
Simplified code courtesy of @Dave2e
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")

#find the script with the correct id tag, strip the html code
datascript <- page %>%
  html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()

#convert from JSON
output <- jsonlite::fromJSON(datascript)

#explore the output
str(output)

#get the main table
answer <- output$props$pageProps$statDetails$rows
If you take a look at the developer tools (F12 key in your browser) and observe the Network tab when you click on a different year, you can see a background request being made to retrieve that year's data. It returns a JSON dataset similar to the one in your original post.
To scrape this you need to replicate this GraphQL POST request in your R program. Note that it sends a JSON document with query details, which includes the tournament codes and the year.
Finally, to ensure that your GraphQL request succeeds, make sure that you match the headers you see in the inspector in your R program, in particular Origin, Referer, and the X-prefixed ones (you can probably hardcode these).
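A rough sketch of that POST request with httr is below. The endpoint URL, operation name, variable names, and header values are placeholders inferred from what typically appears in the Network tab, not verified values; copy the real endpoint, headers, full GraphQL query string, and any API key from your own inspector.
library(httr)
library(jsonlite)

# All values below are placeholders: copy the real endpoint, headers, query
# string, and variables from the request shown in the Network tab.
res <- POST(
  "https://orchestrator.pgatour.com/graphql",      # assumed endpoint; verify in devtools
  add_headers(
    "Content-Type" = "application/json",
    "Origin"       = "https://www.pgatour.com",
    "Referer"      = "https://www.pgatour.com/",
    "x-api-key"    = "KEY_COPIED_FROM_DEVTOOLS"    # placeholder
  ),
  body = toJSON(list(
    operationName = "StatDetails",                 # assumed operation name
    variables = list(
      tourCode = "R",
      statId   = "02675",
      year     = 2022                              # the past year you want
    ),
    query = "PASTE_THE_FULL_GRAPHQL_QUERY_FROM_DEVTOOLS_HERE"
  ), auto_unbox = TRUE)
)

past_stats <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
If the request is replicated correctly, the rows should sit in a structure similar to statDetails$rows in the __NEXT_DATA__ JSON above.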

Scraping with rvest in RStudio: returns a df of 0 rows by 32 columns

I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is dynamically generated and thus troublesome for rvest to handle, but the key to scraping it is realizing that the data is stored as JSON inside a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, then used some trial and error to extract the JSON data, and finally cleaned up the player and team name columns for the final answer.
library(rvest)
library(dplyr)
library(stringr)

url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)

#the data for the page is stored in a script element
scripts <- page %>% html_elements("script")

#get column names
headers <- page %>% html_elements("thead th") %>% html_text()

#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()

#examined the data string and noticed the start of the JSON was '[ ['
#and the end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")

#convert the JSON into a data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame()

#rename column titles
names(answer) <- headers

#function to clean up html code in columns
cleanhtml <- function(text) {
  text %>% read_html() %>% html_text()
}

#remove the html information in the Player and Team columns, drop column 32
answer <- answer[ , -32] %>%
  rowwise() %>%
  mutate(Player = cleanhtml(Player), Team = cleanhtml(Team))
answer
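As an optional refinement, you can locate the right script node programmatically instead of hardcoding scripts[31]. This is only a sketch and assumes the '[ [' marker occurs only in the data script; verify that on the page first.
#find the script node whose text contains the JSON marker, instead of scripts[31]
idx  <- which(str_detect(html_text(scripts), fixed("[ [")))[1]
data <- scripts[idx] %>% html_text()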

R Web Scraping Multiple Levels of a Website

I am a beginner at web scraping with R, so as a first step I have tried a simple scrape.
The task is to sort out the staff member details from this website (https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff), and this is the code I have used:
library(rvest)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
url %>% html_nodes(".sppb-addon-content") %>% html_text()
The above code works and all the sorted data is displayed.
When you click on each staff member you can see further details such as Research Interests, Areas of Specialization, Profile, etc. How can I get these data and attach them to the data set above for each staff member?
The code below will get you all the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.
Most importantly, giving credit where it's due, @hrbrmstr:
R web scraping across multiple pages
The linked answer is subtly different in that it maps across a set of numbers, as opposed to mapping across a vector of URLs as in the code below.
library(rvest)
library(purrr)
library(stringr)
library(dplyr)

url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")

#extract the names
names <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_text()

#drop the head of department and blank space
names <- names[-c(3, 4)]

#create a list of names separated by dashes, which should be identical to the link names
names <- names %>%
  tolower() %>%
  str_extract_all("[:alnum:]+") %>%
  sapply(paste, collapse = "-")

content <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_text()

#drop the "+" from the content
content <- content[! content %in% "+"]

#make a df with the content and the names; the prof_name column matches the one
#built below, which allows for joining later on
content_names <- data.frame(prof_name = names, content = content)

#create a vector of href links
links <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_nodes("a") %>%
  html_attr("href")

#create a vector of urls for the professors' pages
url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)

#anonymous function to pull the data from each professor's page
prof_info <- map_df(urls, function(x) {
  #extract the prof's name from the url
  prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  #read each page in the urls vector
  page <- read_html(x)
  #extract the section titles
  sections <- page %>%
    html_nodes(".sppb-panel-title") %>%
    html_text()
  #extract the info from each section
  info <- page %>%
    html_nodes(".sppb-panel-body") %>%
    html_nodes(".sppb-addon-content") %>%
    html_text()
  #return a dataframe with the section titles, the info, and the prof's name
  data.frame(sections = sections, info = info, prof_name = prof_name)
})
#note this returns a dataframe; change map_df to map if you want a list of tibbles instead

#join the content from the first page to all the individual pages
prof_info <- inner_join(content_names, prof_info, by = "prof_name")
Not sure this is the cleanest or most efficient way to do this, but I think this is what you're after.
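If you want each section (Profile, Research Interests, and so on) as its own column per staff member, a small follow-up sketch with tidyr is below. It assumes the section titles are unique and consistent across pages; duplicated titles for the same professor would produce list-columns.
library(tidyr)

#one row per professor, one column per section title
prof_wide <- prof_info %>%
  pivot_wider(names_from = sections, values_from = info)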

Rvest scraping multiple data in one function

I know how to loop when a page is paginated, but I wish to scrape multiple pieces of information (multiple html_nodes) in one loop function, and I am not sure whether that can be set up. So far I have tried the following. It's basically a job-search website, where I want the company name, the company description, and the number of open positions.
I use sprintf to get pages 1-14.
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a loop, which works to scrape one data source.
company <- function(virksomhed){
  virksomhed %>% read_html() %>%
    html_nodes('.jix_company_name_link a') %>%
    html_text()
}
virk <- lapply(urlingtek, company)
But I wish to scrape all three pieces of information at once if possible.
I have so far tried using
jobvirksom <- function(alt){
alt %>%
read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
html_nodes('.jix_active a') %>%
html_text()
}
So far without any luck. It would be a lot better if I could scrape it all at once, pass it to lapply, and turn it into one list.
Here is the start of a solution. In this case with only 14 webpages to parse through it is sometimes easier to just use a loop. With this number of pages the time between a for loop and lapply is insignificant.
I notice the web pages are not consistently formatted so this solution will need additional work when the data is missing or inconsistent. This will work for the first 2 pages and fail on the third where the overview is missing.
library(rvest)

urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)

#define an empty data frame to store all data
alllistings <- data.frame()

for (i in urlingtek) {
  print(i)
  #read the page just once
  page <- read_html(i)

  #parse company name
  company <- page %>% html_nodes('.jix_company_name_link a') %>% html_text()
  #remove blank company names
  company <- trimws(company)
  company <- company[nchar(company) > 1]

  #parse company overview
  overv <- page %>% html_nodes('.jix_companyindex_overview_ad_content') %>% html_text()

  #parse active information
  active <- page %>% html_nodes('.jix_active a') %>% html_text()

  #create temporary data frame to store data from this loop
  tempdf <- data.frame(company, overv, active)
  #combine temp with all data
  alllistings <- rbind(alllistings, tempdf)
}
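For completeness, here is a hedged lapply variant that returns one data frame via bind_rows and pads missing fields with NA, so a page without an overview doesn't stop the run. Note the padding only keeps the columns the same length; it does not guarantee the rows line up correctly on pages where the blocks are inconsistent.
library(rvest)
library(dplyr)

scrape_page <- function(url) {
  page    <- read_html(url)
  company <- page %>% html_nodes('.jix_company_name_link a') %>% html_text(trim = TRUE)
  company <- company[nchar(company) > 1]
  overv   <- page %>% html_nodes('.jix_companyindex_overview_ad_content') %>% html_text()
  active  <- page %>% html_nodes('.jix_active a') %>% html_text()

  #pad the shorter vectors with NA so data.frame() doesn't fail on uneven lengths
  n   <- max(length(company), length(overv), length(active))
  pad <- function(x) c(x, rep(NA, n - length(x)))
  data.frame(company = pad(company), overv = pad(overv), active = pad(active))
}

alllistings <- bind_rows(lapply(urlingtek, scrape_page))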

using rvest to scrape data from a long table with pagination

I am new to data scraping and trying to use rvest to scrape all the salary data from the long table on this website:
https://www.fedsdatacenter.com/federal-pay-rates/
As expected, the following code gives me the variable names of the data:
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
names <- url %>%
  read_html() %>%
  html_node('thead') %>%
  html_text()
However, why does this code give me no data?
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
dat <- url %>%
  read_html() %>%
  html_node('tbody') %>%
  html_text()
I followed an example in this article: http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
sal <- url %>%
  read_html() %>%
  html_node('#table-example') %>%
  html_table(fill = TRUE)
Again, it produces only column names with no data.
Also, how should I read through all the tens of thousands of pages to get all the data from the table? I suspect that I need to use the information in "#table-example_wrapper > div:nth-child(2) > div", but don't know how. Could anyone help?
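The rows are not in the downloaded HTML at all: the page fills the tbody with DataTables, which fetches each page of rows with a background AJAX request, so rvest only sees the empty table shell. One hedged approach, analogous to the answers above, is to find that request in the browser's Network tab and page through it with httr. The endpoint name and the draw/start/length parameters below are typical server-side DataTables placeholders, not verified values for this site.
library(httr)
library(jsonlite)

#Placeholders throughout: copy the real request URL and parameter names from
#the Network tab (a typical server-side DataTables request looks like this).
res <- GET(
  "https://www.fedsdatacenter.com/federal-pay-rates/output.php",  #assumed endpoint
  query = list(
    draw   = 1,
    start  = 0,      #row offset; increase by `length` to page through the table
    length = 100     #rows per request
  ),
  add_headers(Referer = "https://www.fedsdatacenter.com/federal-pay-rates/")
)

page_data <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
#for DataTables responses the rows are usually in page_data$data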
