R Web Scraping Multiple Levels of a Website

I am a beginner to R web scraping. As a first exercise I tried a simple scrape: pulling the staff member details from this website (https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff). This is the code that I have used:
library(rvest)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
url %>% html_nodes(".sppb-addon-content") %>% html_text()
The above code works and all of the extracted data is displayed. When you click on each staff member you can see further details such as Research Interests, Areas of Specialization, Profile, etc. How can I get that data and attach it to the data set above, matched to each staff member?

The code below will get you all the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.
Most importantly, giving credit where it's due, @hrbrmstr:
R web scraping across multiple pages
The linked answer is subtly different in that it maps across a set of numbers, as opposed to mapping across a vector of URLs as in the code below.
library(rvest)
library(purrr)
library(stringr)
library(dplyr)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
names <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_text()
#extract the names
names <- names[-c(3,4)]
#drop the head of department and blank space
names <- names %>%
  tolower() %>%
  str_extract_all("[:alnum:]+") %>%
  sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names
content <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_text()
content <- content[! content %in% "+"]
#drop the "+" from the content
content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on
links <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_nodes("a") %>%
  html_attr("href")
#create a vector of href links
url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages
prof_info <- map_df(urls, function(x) {
  #anonymous function to pull the data for one professor
  prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  #extract the prof's name from the url
  page <- read_html(x)
  #read each page in the urls vector
  sections <- page %>%
    html_nodes(".sppb-panel-title") %>%
    html_text()
  #extract the section titles
  info <- page %>%
    html_nodes(".sppb-panel-body") %>%
    html_nodes(".sppb-addon-content") %>%
    html_text()
  #extract the info from each section
  data.frame(sections = sections, info = info, prof_name = prof_name)
  #return a long dataframe with one row per section, holding the section title,
  #the section's text, and the professor's name
})
#note this returns a dataframe. Change map_df to map if you want a list
#of tibbles instead
prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages
Not sure this is the cleanest or most efficient way to do this, but I think this is what you're after.
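If you want each section (Research Interests, Areas of Specialization, etc.) as its own column next to each staff member, a minimal sketch with tidyr::pivot_wider could look like the code below; it assumes the section titles are spelled consistently across the individual pages, and a repeated title on the same page would produce list columns and a warning.
library(tidyr)
#sketch: one row per staff member, one column per section title
prof_info_wide <- prof_info %>%
  pivot_wider(names_from = sections, values_from = info)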

Related

Scraping with Rvest in R Studio: Returns df 0 rows by 32 columns

I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is generated dynamically, which makes it troublesome for rvest to handle. The key to scraping it is realizing that the data is stored as JSON in a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, then with some trial and error extracted the JSON data, and finally cleaned up the player and team name columns for the final answer.
library(rvest)
library(dplyr)
library(stringr)
url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)
#the data for the page is stored in a script element
scripts <- page %>% html_elements("script")
#get column names
headers <- page %>% html_elements("thead th") %>% html_text()
#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()
#examined the data string and notice the start of the JSON was '[ ['
#end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")
#convert the JSON into data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame
#rename column titles
names(answer) <- headers
#function to clean up html code in columns
cleanhtml <- function(text) {
  text %>% read_html() %>% html_text()
}
#drop the extra 32nd column and strip the html markup from the Player and Team columns
answer <- answer[ , -32] %>% rowwise() %>%
mutate(Player = cleanhtml(Player), Team=cleanhtml(Team))
answer
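If the position of that script node ever shifts, a hedged alternative (assuming the JSON payload really does begin with '[ [') is to locate the node by its content rather than hard-coding index 31:
#sketch: find the script node whose text contains the '[ [' JSON opener
script_texts <- scripts %>% html_text()
json_idx <- which(str_detect(script_texts, fixed("[ [")))[1]
data <- script_texts[json_idx]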

scraping data from web page within <div> tag with R

I would like to scrape the product name and the rating from a webpage. Upon inspecting the element, I know I need to get the data from product__title and attraqt-star-rating-stars__bar, but I am not sure how to do it, as these are embedded within multiple layers of tags. I've tried the following to no avail; any suggestions are welcome.
library(rvest)
library(dplyr)
url = 'https://www.chemistwarehouse.com.au/shop-online/159/oral-hygiene-and-dental-care'
stores <- read_html(url)
stores %>% html_nodes('body') %>%
html_nodes('.product__title') %>%
rvest::html_text()
stores %>% html_nodes('body') %>%
html_nodes('.attraqt-star-rating-stars__bar') %>%
rvest::html_text()
Data is pulled dynamically from an API call. As the json returned is nested you need to extract the desired info e.g., by writing a couple of user-defined functions.
I first extract the listings (list of products), then have a function get_info, which takes an individual product listing and extracts the title and rating and returns a tibble. As the index at which the rating may appear can vary, I have an additional helper function get_rating_index, which retrieves dynamically the correct index for the rating. This function passes the index back to get_info.
I apply get_info over the list of product info, listings, using map_dfr to combine the per-product tibbles into a final data frame.
library(jsonlite)
library(purrr)
library(dplyr)
data <- jsonlite::read_json("https://www.chemistwarehouse.com.au/searchapi/webapi/search/category?category=159&index=0&sort=")
listings <- data$universes$universe[[1]]$`items-section`$items$item
get_info <- function(listing) {
  tibble(
    title = listing$attribute[[2]]$value[[1]]$value,
    rating = listing$attribute[[get_rating_index(listing$attribute)]]$value[[1]]$value %>% as.numeric()
  )
}
get_rating_index <- function(attribute){
  match(TRUE, map_lgl(attribute, ~ .x$name == 'bv_star_rating'))
}
dental_product_ratings <- purrr::map_dfr(listings, get_info)
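One caveat: if a listing has no bv_star_rating attribute at all, match() returns NA and the [[ subscript in get_info will error. A hedged variant that falls back to NA (reusing the same assumed JSON structure) could look like this:
#sketch: return NA instead of erroring when a listing has no rating
get_info_safe <- function(listing) {
  idx <- get_rating_index(listing$attribute)
  tibble(
    title = listing$attribute[[2]]$value[[1]]$value,
    rating = if (is.na(idx)) NA_real_ else as.numeric(listing$attribute[[idx]]$value[[1]]$value)
  )
}
dental_product_ratings <- purrr::map_dfr(listings, get_info_safe)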

rvest scraping data with different length

As a practice project, I am trying to scrape property data from a website (I only intend to practice my web scraping skills, not to make further use of the scraped data). I found that some properties don't have a price available, which creates a length-mismatch error when I try to combine the scraped vectors into one data frame.
Here is the code for the scraping:
library(tidyverse)
library(rvest)
web_page <- read_html("https://wx.fang.anjuke.com/loupan/all/a1_p2/")
community_name <- web_page %>%
  html_nodes(".items-name") %>%
  html_text()
length(community_name)
listed_price <- web_page %>%
  html_nodes(".price") %>%
  html_text()
length(listed_price)
property_data <- data.frame(
  name = community_name,
  price = listed_price
)
How can I identify the properties with no listed price and fill the price variable with NA when there is no value scraped?
Inspection of the web page shows that the class is .price when the price has a value, and .price-txt when it does not. So one solution is to use an XPath expression in html_nodes() and match classes that start with "price":
listed_price <- web_page %>%
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>%
  html_text()
length(listed_price)
[1] 60
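Another common pattern is to select each listing's container node first and then call html_node() (singular) inside it; html_node() returns a missing node when nothing matches, so html_text() yields NA and the name and price vectors stay aligned. In the sketch below the container selector .item-mod is only a guess at this page's markup; swap in whatever class wraps each listing.
#sketch: one container node per listing, NA price when the listing has none
items <- web_page %>% html_nodes(".item-mod")
property_data <- data.frame(
  name = items %>% html_node(".items-name") %>% html_text(),
  price = items %>% html_node(".price") %>% html_text()
)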

Scraping data from finviz with R - Structure for loop

I am new using R and this is my first question. I apologize if it has been solved before but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library (rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url,"table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
html_table(fill=TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question concerns lists with more than 20 entries, like the one in the example; the site appends &r=1, &r=21, &r=41, &r=61 to the end of each URL for the successive pages.
How could I create the loop structure for this case?
i=0
for(z in ...){
Many thanks in advance for your help.
Updated script based on the new table number and link:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList <- c("1","21","41","61") # table list
GetData <- function(URL, tableNo){
  cat('\n', "Running For table", tableNo, '\n', 'Weblink Used:', stringr::str_c(url, "&r=", tableNo), '\n')
  tables <- read_html(stringr::str_c(url, "&r=", tableNo)) #get data from webpage based on table numbers
  screener <- tables %>%
    html_nodes("table") %>%
    .[17] %>%
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}
AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # getting all data in form of list
Here is one approach using stringr and lapply:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList <- c("1","21","41","61") # table number list
GetData <- function(URL, tableNo){
  cat('\n', "Running For table", tableNo, '\n', 'Weblink Used:', stringr::str_c(url, "&", tableNo), '\n')
  tables <- read_html(stringr::str_c(url, "&", tableNo)) #get data from webpage based on table numbers
  screener <- tables %>%
    html_nodes("table") %>%
    .[11] %>% # check
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}
AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it changes for these URLs (the ones with &1, &21, etc. appended). It works fine for the base URL, but for the URLs with &1, &21, etc. the data is not at the 11th index, so adjust it accordingly.
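Since lapply() returns a list of data frames, a minimal way to stack them into one screener table (assuming every page yields the same column layout) is dplyr::bind_rows:
library(dplyr)
#combine the per-page data frames into a single table
AllDataDF <- bind_rows(AllData)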

Rvest scraping multiple data in one function

I know how to loop when a page is paginated, but I wish to scrape several pieces of information/html_nodes in one loop function, and I am not sure whether that can be set up. So far I have tried the following. It's basically a job-search website, where I want the company name, the company description, and the number of open positions.
I use sprintf to get page 1-14.
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a loop, which works to scrape one data source.
company <- function(virksomhed){
  virksomhed %>% read_html() %>%
    html_nodes('.jix_company_name_link a') %>%
    html_text()
}
virk <- lapply(urlingtek, company)
But I wish to scrape all of this information at once if possible.
So far I have tried using
jobvirksom <- function(alt){
alt %>%
read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
html_nodes('.jix_active a') %>%
html_text()
}
So far without any luck. It would be a lot better if I could scrape it all at once, apply lapply, and turn the result into one list.
Here is the start of a solution. In this case, with only 14 web pages to parse, it is sometimes easier to just use a loop; with this number of pages the time difference between a for loop and lapply is insignificant.
I notice the web pages are not consistently formatted, so this solution will need additional work when the data is missing or inconsistent. It will work for the first 2 pages and fail on the third, where the overview is missing (see the padding sketch after the loop).
library(rvest)
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
#define empty data frame to store all data
alllistings<-data.frame()
for (i in urlingtek){
  print(i)
  #read the page just once
  page <- read_html(i)
  #parse company name
  company <- page %>% html_nodes('.jix_company_name_link a') %>% html_text()
  #remove blank company names
  company <- trimws(company)
  company <- company[nchar(company) > 1]
  #parse company overview
  overv <- page %>% html_nodes('.jix_companyindex_overview_ad_content') %>% html_text()
  #parse active information
  active <- page %>% html_nodes('.jix_active a') %>% html_text()
  #create temporary dataframe to store data from this loop
  tempdf <- data.frame(company, overv, active)
  #combine temp with all data
  alllistings <- rbind(alllistings, tempdf)
}
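For pages where one of the vectors comes back shorter (e.g. a missing overview), one stopgap sketch is to pad the shorter vectors with NA inside the loop, in place of the tempdf line above. This keeps the loop running, but it does not guarantee the padded values line up with the right company, so treat it as a starting point only.
#sketch: pad company/overv/active to a common length with NA before building tempdf
pad_na <- function(x, n) { length(x) <- n; x }
n <- max(length(company), length(overv), length(active))
tempdf <- data.frame(company = pad_na(company, n),
                     overv = pad_na(overv, n),
                     active = pad_na(active, n))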
