Scraping data from finviz with R - Structure for - r

I am new to R and this is my first question. I apologize if it has been solved before, but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library (rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url,"table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
html_table(fill=TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question is about result lists with more than 20 rows, like the one in my example. They are split across pages, with &r=1, &r=21, &r=41, &r=61 appended to the end of each URL.
How could I create the loop structure in this case?
i=0
for(z in ...){
Many thanks in advance for your help.

Updated script based on the new table number and link:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList<-c("1","21","41","61") # table list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(URL,"&r=",tableNo),'\n')
tables<-read_html(stringr::str_c(URL,"&r=",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[17] %>%
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # getting all data in form of list
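Since the position of the results table has already shifted once, looking the table up by its column names may be less brittle than a hard-coded index. The following is a minimal sketch on inline HTML rather than the live site. find_table is a hypothetical helper, and the column names "Ticker" and "Price" are assumptions about the screener's header row that you would need to check against the real page.

```r
library(rvest)

# Hypothetical helper: return the first table that has all the required column names.
find_table <- function(page, required_cols) {
  tables <- page %>% html_elements("table") %>% html_table()
  hits <- Filter(function(tbl) all(required_cols %in% names(tbl)), tables)
  if (length(hits) == 0) NULL else hits[[1]]
}

# Inline stand-in for a downloaded page: a layout table followed by the data table.
# The rows are made up for illustration.
page <- read_html('<html><body>
<table><tr><td>menu</td><td>login</td></tr></table>
<table>
<tr><th>No.</th><th>Ticker</th><th>Price</th></tr>
<tr><td>1</td><td>AAA</td><td>10.5</td></tr>
<tr><td>2</td><td>BBB</td><td>20.1</td></tr>
</table>
</body></html>')

screener <- find_table(page, c("Ticker", "Price"))
print(screener)
```

On the live page the header row may not use th cells, in which case the column names would not be detected this way and html_table(header = TRUE) might be needed. That part is untested here.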
Here is one approach using stringr and lapply:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList<-c("1","21","41","61") # table number list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(URL,"&r=",tableNo),'\n')
tables<-read_html(stringr::str_c(URL,"&r=",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[11] %>% # check
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it may be different for the paginated URLs (the ones ending in &r=1, &r=21, etc.). It works fine for the base URL, but the data was not present at the 11th index for the paginated URLs. Please change it accordingly.
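To avoid typing the offsets by hand, the page URLs can also be generated from the row count. This is a minimal sketch, assuming 20 rows per page as described in the question. build_page_urls and total_rows are hypothetical names, and the real row count would have to be read off the screener page.

```r
# Hypothetical helper: build one URL per results page.
# Assumes the screener shows page_size rows per page and takes the
# 1-based offset of the first row in the r= query parameter.
build_page_urls <- function(base_url, total_rows, page_size = 20) {
  offsets <- seq(from = 1, to = total_rows, by = page_size)
  paste0(base_url, "&r=", offsets)
}

base_url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"

# total_rows = 61 is an assumed value that reproduces the offsets from the question.
page_urls <- build_page_urls(base_url, total_rows = 61)
print(page_urls)

# With AllData from the answer above, the pages could then be stacked
# (this only works if every page yields the same columns):
# screener_all <- do.call(rbind, AllData)
```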

Related

Scraping with Rvest in R Studio: Returns df 0 rows by 32 columns

I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is dynamically generated, which makes it troublesome for rvest to handle. The key to scraping it is to realize that the data is stored as JSON in a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, then used some trial and error to extract the JSON data. Finally, I cleaned up the player and team name columns for the final answer.
library(rvest)
library(dplyr)
library(stringr)
url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)
#the data for the page is stored in a script element
scripts <-page %>% html_elements("script")
#get column names
headers <- page %>% html_elements("thead th") %>% html_text()
#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()
#examined the data string and notice the start of the JSON was '[ ['
#end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")
#convert the JSON into data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame
#rename column titles
names(answer) <- headers
#function to strip html markup from a string
cleanhtml <- function(text) {
text %>% read_html() %>% html_text()
}
#remove the html markup in columns 1 & 3 (column 32 is dropped first)
answer <- answer[ , -32] %>% rowwise() %>%
mutate(Player = cleanhtml(Player), Team=cleanhtml(Team))
answer
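The hard-coded scripts[31] will break if the site adds or removes a script tag, so it may be safer to pick the script by its content. Here is a minimal sketch of the same idea on inline HTML, so it runs without the live page. The rows are made up, and the assumption that only one script contains '[ [' would need to be checked on the real site.

```r
library(rvest)
library(stringr)
library(jsonlite)

# Inline stand-in for the downloaded page (made-up rows, not real KHL data).
page <- read_html('<html><body>
<table><thead><tr><th>Player</th><th>Team</th><th>GP</th></tr></thead></table>
<script>var unrelated = 1;</script>
<script>var rows = [ ["A. Example","Team X","10"],["B. Sample","Team Y","12"]];</script>
</body></html>')

# Column names come from the table header, as in the answer above.
headers <- page %>% html_elements("thead th") %>% html_text()

# Pick the script by its content rather than by its position.
script_text <- page %>% html_elements("script") %>% html_text()
data <- script_text[str_detect(script_text, fixed("[ ["))][1]

# Same pattern as above: from the opening '[ [' to the closing ']]'.
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")

# An array of equal-length arrays parses to a matrix, one row per player.
stats <- as.data.frame(fromJSON(jsonstring))
names(stats) <- headers
print(stats)
```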

R Web Scraping Multiple Levels of a Website

I am a beginner at web scraping with R. As a first step, I tried a simple scrape, and this is what I have done so far.
To extract the staff member details from this website (https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff), I used this code:
library(rvest)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
url %>% html_nodes(".sppb-addon-content") %>% html_text()
The above code works and shows all the extracted data.
When you click on each staff member, you get further details such as Research Interests, Areas of Specialization, Profile, etc. How can I get this data and add it to the above data set for each staff member?
The code below will get you all the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.
Most importantly, giving credit where it's due to @hrbrmstr:
R web scraping across multiple pages
The linked answer is subtly different in that it maps across a set of numbers, as opposed to mapping across a vector of URLs as in the code below.
library(rvest)
library(purrr)
library(stringr)
library(dplyr)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
names <- url %>%
html_nodes(".sppb-addon-content") %>%
html_nodes("strong") %>%
html_text()
#extract the names
names <- names[-c(3,4)]
#drop the head of department and blank space
names <- names %>%
tolower() %>%
str_extract_all("[:alnum:]+") %>%
sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names
content <- url %>%
html_nodes(".sppb-addon-content") %>%
html_text()
content <- content[! content %in% "+"]
#drop the "+" from the content
content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on
links <- url %>%
html_nodes(".sppb-addon-content") %>%
html_nodes("strong") %>%
html_nodes("a") %>%
html_attr("href")
#create a vector of href links
url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages
prof_info <- map_df(urls, function(x) {
#create an anonymous function to pull the data
prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
#extract the prof's name from the url
page <- read_html(x)
#read each page in the urls vector
sections <- page %>%
html_nodes(".sppb-panel-title") %>%
html_text()
#extract the section title
info <- page %>%
html_nodes(".sppb-panel-body") %>%
html_nodes(".sppb-addon-content") %>%
html_text()
#extract the info from each section
data.frame(sections = sections, info = info, prof_name = prof_name)
#create a dataframe with one row per section: the section title, the section
#text, and the prof_name used for joining
})
#note this returns a dataframe. Change map_df to map if you want a list
#of tibbles instead
prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages
Not sure this is the cleanest or most efficient way to do this, but I think this is what you're after.
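One possible simplification is to take the display name, the join key and the link from the same anchor nodes, so they stay aligned without dropping entries by position. This is a minimal sketch on inline HTML. The staff names and the href layout are made up, and it assumes every staff entry is an a element inside strong, which may not hold for the whole page.

```r
library(rvest)

# Inline stand-in for the staff listing page (made-up names and links).
listing <- read_html('<html><body>
<div class="sppb-addon-content"><strong><a href="/depts/im/index.php/jane-doe">Dr. Jane Doe</a></strong></div>
<div class="sppb-addon-content"><strong><a href="/depts/im/index.php/john-roe">Mr. John Roe</a></strong></div>
</body></html>')

# One node per staff member; name, key and link all come from the same node.
anchors <- listing %>% html_elements(".sppb-addon-content strong a")
hrefs <- html_attr(anchors, "href")

staff <- data.frame(
  display_name = html_text(anchors),
  prof_name = basename(hrefs),  # last path segment, used as the join key
  url = xml2::url_absolute(hrefs, "https://science.kln.ac.lk/"),
  stringsAsFactors = FALSE
)
print(staff)
```

The url column could then be passed to map_df in place of the urls vector above.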

Mutating a new column on a dataframe inside a List / dplyr / mutate / list / Rstudio

Sorry if this question has already been answered; I have searched without success.
I scraped 10 seasons of NBA data and stored the datasets in a list. The main problem is that the datasets have no column with the season year, which makes it difficult to tell which season each dataset comes from.
So what I am looking to do is mutate a new column, based on a vector of seasons, that records the year of each season.
This is what I have tried:
library(tidyverse)
library(rvest)
library(xml2)
season_scrape <- c(2010:2019)
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", season_scrape, "_totals.html")
scrape_function <- function(url){
season_stats <- url %>%
read_html() %>%
html_nodes("table") %>%
.[[1]] %>%
html_table() %>%
mutate(season_year = season_scrape)
}
season_data <- lapply(url, scrape_function)
What would you recommend: mutating inside scrape_function, or after the datasets are already in the list?
Thanks in advance.
You can handle this in multiple ways. One way is to pass an additional year parameter in the function and apply the function using Map instead of lapply.
library(dplyr)
library(rvest)
scrape_function <- function(url, year){
url %>%
read_html() %>%
html_nodes("table") %>%
.[[1]] %>%
html_table() %>%
mutate(season_year = year)
}
season_data <- Map(scrape_function, url, season_scrape)
If you need to bind the data together into one dataframe, you can also use map2_df from purrr.
season_data <- purrr::map2_df(url, season_scrape, scrape_function)
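If the list has already been scraped without the year, the column can also be added afterwards. Here is a minimal sketch of that second option, using small made-up data frames in place of the scraped tables so it runs offline.

```r
library(dplyr)
library(purrr)

season_scrape <- 2010:2012

# Stand-in for the scraped list: one small made-up data frame per season.
season_data <- list(
  data.frame(Player = c("A", "B"), PTS = c(100, 200)),
  data.frame(Player = c("A", "C"), PTS = c(150, 50)),
  data.frame(Player = c("B", "C"), PTS = c(300, 75))
)

# Pair each data frame with its season (by position) and add the column.
tagged <- map2(season_data, season_scrape,
               function(df, year) mutate(df, season_year = year))

# Stack the tagged data frames into one.
all_seasons <- bind_rows(tagged)
print(all_seasons)
```

This relies on the list and the season vector being in the same order, which holds when the list was built from URLs generated from that same vector.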

Web Scraping a table into R

I'm new to trying to web scrape, and am sure there's a very obvious answer I'm missing here, but have exhausted every post I can find on using rvest, XML, xml2, etc on reading a table from the web into R, and I've had no success.
An example of the table I'm looking to scrape can be found here:
https://www.eliteprospects.com/iframe_player_stats.php?player=364033
I've tried
EXAMPLE <- read_html("http://www.eliteprospects.com/iframe_player_stats.php?player=364033")
EXAMPLE
URL <- 'http://www.eliteprospects.com/iframe_player_stats.php?player=364033'
table <- URL %>%
read_html %>%
html_nodes("table")
But am unsure what to do with these results to get them into a dataframe, or anything usable.
You need to extract the correct html_nodes, and then convert them into a data.frame. The code below is an example of how to go about doing something like this. I find Selector Gadget very useful for finding the right CSS selectors.
library(tidyverse)
library(rvest)
# read the html
html <- read_html('http://www.eliteprospects.com/iframe_player_stats.php?player=364033')
# function to read columns
read_col <- function(x){
col <- html %>%
# CSS nodes to select by using selector gadget
html_nodes(paste0("td:nth-child(", x, ")")) %>%
html_text()
return(col)
}
# apply the function
col_list <- lapply(c(1:8, 10:15), read_col)
# collapse into matrix
mat <- do.call(cbind, col_list)
# put data into dataframe
df <- as.data.frame(mat[2:nrow(mat), , drop = FALSE])
# assign names
names(df) <- mat[1, ]
df
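When the page contains a regular HTML table, html_table() may be a shorter route from the nodes in the question to a data frame. This is a minimal sketch on an inline table with made-up rows. Whether it parses the live eliteprospects table cleanly is not something tested here.

```r
library(rvest)

# Inline stand-in for the downloaded page (made-up rows).
page <- read_html('<html><body><table>
<tr><th>Season</th><th>Team</th><th>GP</th></tr>
<tr><td>2016-17</td><td>Example HC</td><td>40</td></tr>
<tr><td>2017-18</td><td>Sample HC</td><td>52</td></tr>
</table></body></html>')

# html_table() on a set of table nodes returns a list with one data frame per table.
tables <- page %>% html_elements("table") %>% html_table()

stats <- tables[[1]]
print(stats)
```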

R - Get html-addresses from data frame to rvest

I am new to R, and I have come upon a problem I can't solve. I would like to scrape Swedish election data at electoral district level. They are structured as can be found here http://www.val.se/val/val2014/slutresultat/K/valdistrikt/25/82/0134/personroster.html
I get the data I want by using this code:
library(rvest)
district.data <- read_html("http://www.val.se/val/val2014/slutresultat/K/kommun/25/82/0134/personroster.html")
prost <- district.data %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
But that is just for one district out of 6,227 districts. The districts are identified by the html address. In the website mentioned above it is identified by "25/82/0134". I can find the identities of all districts here http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv
And I read this semi-colon separated file into R by using this code:
valres <- read_csv2("http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv" )
(as a side note, how can I change the encoding so that the Swedish letters (e.g. å, ä, ö) are imported correctly? I manage to do that with read.csv and specifying encoding='utf-8' but not with read_csv)
In this data frame, the columns LAN, KOM and VALDIST give the identities of the districts (note that VALDIST sometimes has just 2 characters). Hence the addresses have the following structure http://www.val.se/val/val2014/slutresultat/K/kommun/LAN/KOM/VALDIST/personroster.html
So, I would like to use the combination in each row to get the identity of district, scrape the information into R, add a column with the district identity (i.e. LAN, KOM and VALDIST combined into one string), and do so over all 6,227 districts and append the information from each of those districts into one single data frame. I assume I need to use some kind of loop or some of those apply functions, to iterate over the data frame, but I have not figured out how.
UPDATE:
After the help I received (thank you!) in the answer below, the code now is as follows. My remaining problem is that I want to add the district identity (i.e. paste0(LAN, KOM, VALDIST)) for each website that is scraped to a column in the final data frame. Can someone help me with this final step?
# Read the identities of the districts (with Swedish letters)
districts_url <- "http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv"
valres <- read_csv2(districts_url, locale=locale("sv",encoding="ISO-8859-1", asciify=FALSE))
# Add a variable to separate the two types of electoral districts
valres$typ <- "valdistrikt"
valres$typ[nchar(valres$VALDIST) == 2] <- "onsdagsdistrikt"
# Create a vector with all the web addresses to the district data
base_url <- "http://www.val.se/val/val2014/slutresultat/K/%s/%s/%s/%s/personroster.html"
urls <- with(valres, sprintf(base_url, typ, LAN, KOM, VALDIST))
# Scrape the data
pb <- progress_estimated(length(urls))
map_df(urls, function(x) {
pb$tick()$print()
# Maybe add Sys.sleep(1)
read_html(x) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
}) -> df
Any help would be greatly appreciated!
All the best,
Richard
You can use sprintf() to do positional substitution and then use purrr::map_df() to iterate over a vector of URLs and generate a data frame:
library(rvest)
library(readr)
library(purrr)
library(dplyr)
districts_url <- "http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv"
valres <- read_csv2(districts_url, locale=locale("sv",encoding="UTF-8", asciify=FALSE))
base_url <- "http://www.val.se/val/val2014/slutresultat/K/valdistrikt/%s/%s/%s/personroster.html"
urls <- with(valres, sprintf(base_url, LAN, KOM, VALDIST))
pb <- progress_estimated(length(urls))
map_df(urls, function(x) {
pb$tick()$print()
read_html(x) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
}) -> df
However, you should add a randomized delay to avoid being blocked as a bot, and you should look at wrapping read_html() with purrr::safely(), since not all of those LAN/KOM/VALDIST combinations are valid URLs (at least in my testing).
That code also provides a progress bar, since it's going to take a while (probably an hour on a moderately decent connection).
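The delay, the safely() wrapper and the district identity column asked about in the update could be combined roughly as follows. This is a minimal sketch, not tested against val.se. scrape_district and its delay argument are hypothetical, the delay range is an arbitrary choice, and the demo inputs are an inline page plus a deliberately missing file so that it runs offline.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical helper: read one page, take its second table and tag it with the district id.
scrape_district <- function(page, district_id, delay = runif(1, 0.5, 1.5)) {
  Sys.sleep(delay)  # randomized pause between requests
  read_html(page) %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table() %>%
    mutate(district_id = district_id)
}

# safely() returns list(result, error) instead of stopping on a bad URL.
safe_scrape <- safely(scrape_district)

# Offline stand-ins: one inline page with two tables, and one input that fails to load.
pages <- c(
  "<html><body><table><tr><td>layout</td></tr></table><table><tr><th>Name</th><th>Votes</th></tr><tr><td>A</td><td>10</td></tr></table></body></html>",
  "no_such_page.html"
)
ids <- c("25820134", "00000000")

# delay = 0 only to keep the demo fast.
results <- map2(pages, ids, function(p, id) safe_scrape(p, id, delay = 0))

# Keep the successful results; failures have a NULL result and a non-NULL error.
df <- bind_rows(map(results, "result"))
failed <- ids[!map_lgl(results, function(r) is.null(r$error))]
print(df)
print(failed)
```

With the real data, the call would presumably look like map2(urls, with(valres, paste0(LAN, KOM, VALDIST)), safe_scrape), keeping the default delay.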
