Make a loop to scrape a website to create multiple dataframes - r

I'm working on a project where I can see two ways to potentially solve my problem. I'm scraping a website, using a loop to save each page locally as an HTML file. The problem is that when I open the files in my local folder, they are basically blank pages. I'm not sure why. I've used this same code on other sites for this project with success.
This is the code I'm using.
#scrape playoff teams for multiple seasons and saved html to local folder
for(i in 2002:2021){
playoff_url <- read_html(paste0("https://www.espn.com/nfl/stats/player/_/season/",i,"/seasontype/3"))
playoff_stats <- playoff_url %>%
write_html(paste0("playoff",i,".HTML"))
}
My second option is to scrape individual seasons into a data frame, but I would like to do it in a loop so that I don't have to run this code 20 different times. I also don't want to scrape the site again every time I run the code. It doesn't matter whether all the data ends up in 1 large data frame for all 20 seasons or in 20 separate ones. I can export the data to a local file and then import it when I need it.
#read in code for playoff QBs from ESPN and added year column
playoff_url <- read_html("https://www.espn.com/nfl/stats/player/_/season/2015/seasontype/3")
play_QB2015 <-playoff_url %>%
html_nodes("table") %>%
html_table()
#combine list from QB playoff data to convert to dataframe
play_QB2015 <- c(play_QB2015[[1]], play_QB2015[[2]])
# Convert list to dataframe using data.frame()
play_QB2015 <- data.frame(play_QB2015)
play_QB2015$Year = 2015

Not sure what happens to your files, but downloading and storing the pages with httr2 first, and then parsing the saved files with rvest, works fine for me (sorry for the tidyverse overuse):
library(fs)
library(dplyr)
library(httr2)
library(rvest)
library(purrr)
library(stringr)
dest_dir <- path_temp("playoffs")
dir_create(dest_dir)
years <- 2002:2012
# collect all years to a list
playoff_lst <- map(
set_names(years),
~ {
dest_file <- path(dest_dir, str_glue("{.x}.html"))
# only download if local copy is not present
if (!file_exists(dest_file)){
request(str_glue("https://www.espn.com/nfl/stats/player/_/season/{.x}/seasontype/3")) %>%
req_perform(dest_file)
}
read_html(dest_file) %>%
html_elements("table") %>%
html_table() %>%
bind_cols()
}
)
Results:
names(playoff_lst)
#> [1] "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011"
#> [11] "2012"
head(playoff_lst$`2002`)
#> # A tibble: 6 × 17
#> RK Name POS GP CMP ATT `CMP%` YDS AVG `YDS/G` LNG TD
#> <int> <chr> <chr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int> <int>
#> 1 1 Rich Gan… QB 3 73 115 63.5 841 7.3 280. 50 7
#> 2 2 Brad Joh… QB 3 53 98 54.1 670 6.8 223. 71 5
#> 3 3 Tommy Ma… QB 2 51 89 57.3 633 7.1 316. 40 5
#> 4 4 Steve Mc… QB 2 48 80 60 532 6.7 266 39 3
#> 5 5 Jeff Gar… QB 2 49 85 57.6 524 6.2 262 76 3
#> 6 6 Donovan … QB 2 46 79 58.2 490 6.2 245 42 1
#> # … with 5 more variables: INT <int>, SACK <int>, SYL <int>, QBR <lgl>,
#> # RTG <dbl>
dir_tree(dest_dir)
#> ... /RtmpcjLFJe/playoffs
#> ├── 2002.html
#> ├── 2003.html
#> ├── 2004.html
#> ├── 2005.html
#> ├── 2006.html
#> ├── 2007.html
#> ├── 2008.html
#> ├── 2009.html
#> ├── 2010.html
#> ├── 2011.html
#> └── 2012.html
Created on 2023-02-16 with reprex v2.0.2
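Since the question says one large data frame for all seasons would be fine, the named list can be stacked into a single table and cached to disk. The following is only a minimal sketch on a stand-in list: the two small tibbles, their values, and the cache file name are made up for illustration, and with real data playoff_lst would take the place of demo_lst.

```r
library(dplyr)

# Stand-in for playoff_lst: per-season tables in a list named by year
# (names and values are made up for illustration).
demo_lst <- list(
  `2002` = tibble(Name = c("Player A", "Player B"), YDS = c(841L, 670L)),
  `2003` = tibble(Name = "Player C", YDS = 500L)
)

# Stack the tables; the list names become the Year column
all_seasons <- bind_rows(demo_lst, .id = "Year") %>%
  mutate(Year = as.integer(Year))

# Cache the combined table so later runs do not need the website
cache_file <- file.path(tempdir(), "playoffs_all.rds")  # hypothetical file name
saveRDS(all_seasons, cache_file)
restored <- readRDS(cache_file)
```

Reading the .rds file back gives the same table, so the scraping step only has to run when the cache is missing.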

Related

Creating Multiple Tables then Combining All of the Tables into One in R

I have scraped multiple tables from a basketball site using a for loop.
years <- c(2016:2021)
final_table <- {}
for(i in 1:length(years)){
url <- paste0("https://www.basketball-reference.com/friv/free_agents.cgi?year=",years[i])
past_free_agency_page <- read_html(url)
past_free_agency_webtable<- html_nodes(past_free_agency_page, "table")
past_free_agency_table <- html_table(past_free_agency_webtable, header = T)[[1]]
final_table <- rbind(final_table, past_free_agency_table)
}
This retrieves everything correctly, but I am trying to combine all of these tables as they are created. Note that there are 6 tables in total (years 2016 to 2021).
There is one error I am getting: when I try to combine the tables with rbind() at the end of the loop, it fails with "the names do not match". I do not know of a clever way to fix this because I am new to working with loops, and turning the scraped table into a data frame first did not help.
My next issue has to do with how the tables are combined. On the website, each table contains header rows within it that repeat the master header exactly. The code treats each of these as another data row, so they appear in every table. I want them to be ignored.
The last issue has to do with making each row unique: I want the respective year of each table to be a column of its own. For example, the 2016 table should have a column that says 2016. I have tried something inside the loop, such as past_free_agency_table[,1] <- c(years[i]). I want this because some of these tables contain the same players, and I need to be able to tell which table each row came from.
Sort of a loop, but in purrr way.
library(tidyverse)
library(rvest)
get_df <- function(year) {
"https://www.basketball-reference.com/friv/free_agents.cgi?year=" %>%
paste0(., year) %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
mutate(years = year) %>%
select(Rk, years, everything())
}
df <- map_dfr(2016:2020, get_df)
# A tibble: 1,161 × 16
Rk years Player Pos Age Type OTm `2015-16 Stats` WS NTm
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2016 Kevin Du… F-G 33-2… UFA OKC 28.2 Pts, 8.2 … 14.5 GSW
2 2 2016 LeBron J… F-G 37-1… UFA CLE 25.3 Pts, 7.4 … 13.6 CLE
3 3 2016 Hassan W… C 33-0… UFA MIA 14.2 Pts, 11.8… 10.3 MIA
4 4 2016 DeMar De… G-F 32-3… UFA TOR 23.5 Pts, 4.5 … 9.9 TOR
5 5 2016 Al Horfo… C-F 36-0… UFA ATL 15.2 Pts, 7.3 … 9.4 BOS
6 6 2016 Marvin W… F 36-0… UFA CHO 11.7 Pts, 6.4 … 7.8 CHA
7 7 2016 Andre Dr… C 28-3… RFA DET 16.2 Pts, 14.8… 7.4 DET
8 8 2016 Pau Gasol C-F 41-3… UFA CHI 16.5 Pts, 11.0… 7.1 SAS
9 9 2016 Dirk Now… F 44-0… UFA DAL 18.3 Pts, 6.5 … 6.8 DAL
10 10 2016 Dwight H… C 36-1… UFA HOU 13.7 Pts, 11.8… 6.6 ATL
# … with 1,151 more rows, and 6 more variables: Terms <chr>, Notes <chr>,
# `2016-17 Stats` <chr>, `2017-18 Stats` <chr>, `2018-19 Stats` <chr>,
# `2019-20 Stats` <chr>
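Two points from the question are worth spelling out: the header rows repeated inside each table, and the "names do not match" error from rbind(). Here is a minimal sketch of one way to handle both, using small made-up tables (the player names and stats are invented; only the Rk and Player column names and the season-labelled stats columns follow the output above). It relies on dplyr::bind_rows() matching columns by name and filling the missing ones with NA, whereas rbind() requires identical names.

```r
library(dplyr)

# Made-up tables imitating two scraped years; every column is character
# because a repeated header row sits among the data rows.
t2016 <- tibble(Rk = c("1", "Rk", "2"),
                Player = c("Player A", "Player", "Player B"),
                `2015-16 Stats` = c("10 Pts", "2015-16 Stats", "8 Pts"))
t2017 <- tibble(Rk = c("1", "Rk"),
                Player = c("Player C", "Player"),
                `2016-17 Stats` = c("12 Pts", "2016-17 Stats"))

combined <- bind_rows(`2016` = t2016, `2017` = t2017, .id = "years") %>%
  filter(Rk != "Rk") %>%                       # drop the repeated header rows
  mutate(Rk = as.integer(Rk), years = as.integer(years))
```

The season-specific stats columns differ between years, which is the likely reason rbind() complained; here they simply become separate columns with NA where a year has no value.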

R combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City added to Dickinson, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
[Screenshot: code chunk output]
You could do:
cases_deaths %>%
filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
mutate(COUNTY = "Wayne") %>%
group_by(COUNTY, CASE_STATUS) %>%
summarize_all(sum) %>%
bind_rows(cases_deaths %>%
filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
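An alternative that avoids splitting the data and binding it back together is to rename the Detroit rows in place and then aggregate everything once. This is only a sketch on made-up numbers in the same shape as cases_deaths (the column names COUNTY, CASE_STATUS, Cases and Deaths are taken from the output above; the values are invented).

```r
library(dplyr)

# Made-up numbers in the same shape as cases_deaths
demo <- tibble(
  COUNTY = c("Alcona", "Alcona", "Detroit City", "Detroit City", "Wayne", "Wayne"),
  CASE_STATUS = rep(c("Confirmed", "Probable"), 3),
  Cases = c(10, 2, 100, 20, 200, 30),
  Deaths = c(1, 0, 5, 1, 8, 2)
)

merged <- demo %>%
  # relabel Detroit City as Wayne, leave every other county untouched
  mutate(COUNTY = if_else(COUNTY == "Detroit City", "Wayne", COUNTY)) %>%
  group_by(COUNTY, CASE_STATUS) %>%
  # counties with a single row per status are unchanged by the sum
  summarise(across(c(Cases, Deaths), sum), .groups = "drop")
```

One difference from the approach above is the row order: the result is sorted by county rather than having Wayne first.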

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to Pivot tables in Excel, and trying to replicate a pivot I have done in Excel in R. I have spent a long time searching the internet/ YouTube, but I just can't get it to work.
I am looking to produce a table in which the left-hand column lists a number of locations, and the top of the table lists the different pages that have been viewed. In the body of the table I want to show the number of views from each location for each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard Usage.xlsx", sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits but can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
Let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us, so I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard. You should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
data.frame(
month = sample(1:12, size = n, replace = TRUE),
team = letters[1:5],
value = sample(1:100, size = n, replace = TRUE)
)
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
filter(month == 10) %>%
group_by(team) %>%
summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
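In current versions of tidyr, spread() is superseded by pivot_wider(). A sketch of the same reshaping with the newer function, on the same kind of fake data (the seed is an addition here so that the example is reproducible):

```r
library(dplyr)
library(tidyr)

set.seed(1)  # fixed seed, added only for reproducibility
n <- 100
fake <- data.frame(
  month = sample(1:12, size = n, replace = TRUE),
  team  = letters[1:5],
  value = sample(1:100, size = n, replace = TRUE)
)

wide <- fake %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value), .groups = "drop") %>%
  # one column per team, one row per month; absent combinations become NA
  pivot_wider(names_from = team, values_from = total_value)
```

names_from picks the column whose values become the new column headers, and values_from picks the column that fills the cells, which maps closely onto the column and value fields of an Excel pivot table.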
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
specificreportsLocal %>% filter(Month == "October") %>%
rpivotTable(rows="Employee Team", cols="page", vals="views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>% filter(Month == "October") %>%
group_by_at(c("Employee Team", "page")) %>%
summarise(nr_views = sum(views, na.rm=TRUE))
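A likely reason the original group_by("Employee Team") did not behave as expected is that a quoted string is treated as a constant value, not as a column name. The sketch below illustrates the difference on made-up rows; the column names Month and Employee Team come from the question, while page and views are the assumed names used in this answer.

```r
library(dplyr)

# Made-up rows; `page` and `views` are assumed column names
demo <- tibble(
  Month = c("October", "October", "October", "September"),
  `Employee Team` = c("North", "North", "South", "North"),
  page = c("Home", "Sales", "Home", "Home"),
  views = c(3, 5, 2, 7)
)

# A quoted string is a constant, so every row lands in one group
one_group <- demo %>% group_by("Employee Team") %>% n_groups()

# Backticks refer to the column itself
by_team <- demo %>%
  filter(Month == "October") %>%
  group_by(`Employee Team`, page) %>%
  summarise(nr_views = sum(views, na.rm = TRUE), .groups = "drop")
```

Renaming the column to something without a space, as suggested earlier in this thread, removes the need for backticks altogether.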

Is rvest the best tool to collect information from this table?

I have used rvest package to extract a list of companies and the a.href elements in each company, which I need to proceed with the data collection process. This is the link of the website: http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market.
I have used the following code to extract the table but nothing comes out. I used other approaches as those posted in "Scraping table of NBA stats with rvest" and similar links, but I cannot obtain what I want. Any help would be greatly appreciated.
my code:
link.main <-
"http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market/"
web <- read_html(link.main) %>%
html_nodes("table#bm_equities_prices_table")
# it does not work even when I write html_nodes("table"),
# ".table" or "#bm_equities_prices_table"
web <- read_html(link.main) %>%
html_nodes(".bm_center.bm_dataTable")
# not working
web <- link.main %>% read_html() %>% html_table()
# to inspect the position of table in this website
The page generates the table using JavaScript, so you need a tool that drives a real browser session, such as RSelenium, so that the JavaScript can run. An HTML parser on its own (rvest, or Python's Beautiful Soup) only sees the page source and does not execute JavaScript.
Another alternative is to use the awesome package by @hrbrmstr called decapitated, which basically runs a headless Chrome browser session in the background.
#devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)
res <- chrome_read_html(link.main)
main_df <- res %>%
rvest::html_table() %>%
.[[1]] %>%
as_tibble()
This outputs the content of the table alright. If you want to get to the elements underlying the table (the href attributes behind the table text), you will need to do a bit more list gymnastics. Some of the cells in the table have no link at all, which makes extracting the hrefs with a single CSS selector difficult.
library(dplyr)
library(purrr)
href_lst <- res %>%
html_nodes("table td") %>%
as_list() %>%
map("a") %>%
map(~attr(.x, "href"))
# we need every third element starting from second element
idx <- seq.int(from=2, by=3, length.out = nrow(main_df))
href_df <- tibble(
market_href=as.character(href_lst[idx]),
company_href=as.character(href_lst[idx+1])
)
bind_cols(main_df, href_df)
#> # A tibble: 800 x 5
#> No `Company Name` `Company Website` market_href company_href
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 7-ELEVEN MALAYS~ http://www.7elev~ /market/list~ http://www.~
#> 2 2 A-RANK BERHAD [~ http://www.arank~ /market/list~ http://www.~
#> 3 3 ABLEGROUP BERHA~ http://www.gefun~ /market/list~ http://www.~
#> 4 4 ABM FUJIYA BERH~ http://www.abmfu~ /market/list~ http://www.~
#> 5 5 ACME HOLDINGS B~ http://www.suppo~ /market/list~ http://www.~
#> 6 6 ACOUSTECH BERHA~ http://www.acous~ /market/list~ http://www.~
#> 7 7 ADVANCE SYNERGY~ http://www.asb.c~ /market/list~ http://www.~
#> 8 8 ADVANCECON HOLD~ http://www.advan~ /market/list~ http://www.~
#> 9 9 ADVANCED PACKAG~ http://www.advan~ /market/list~ http://www.~
#> 10 10 ADVENTA BERHAD ~ http://www.adven~ /market/list~ http://www.~
#> # ... with 790 more rows
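With rvest 1.0 or later, the missing-link problem can also be approached row by row: html_element() (singular) returns exactly one result per input node and a missing value where nothing matches, so the hrefs stay aligned with the rows. A minimal sketch on a made-up two-row table (the HTML, the names, and the URLs are invented and only imitate the column layout shown above):

```r
library(rvest)

# Made-up miniature of the table: the second row has no market link
page <- read_html('
<table>
<tr><td>1</td><td><a href="/market/a">A BERHAD</a></td><td><a href="http://www.a.example">http://www.a.example</a></td></tr>
<tr><td>2</td><td>B BERHAD</td><td><a href="http://www.b.example">http://www.b.example</a></td></tr>
</table>')

rows <- html_elements(page, "tr")

# One result per row; NA where the cell contains no <a>
market_href  <- rows %>% html_element("td:nth-child(2) a") %>% html_attr("href")
company_href <- rows %>% html_element("td:nth-child(3) a") %>% html_attr("href")
```

On the real page a header row would also be among the rows and would yield NA in both vectors, so it would have to be dropped before binding the columns to the table.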
Another option without using browser:
library(httr)
library(jsonlite)
library(XML)
r <- httr::GET(paste0(
"http://ws.bursamalaysia.com/market/listed-companies/list-of-companies/list_of_companies_f.html",
"?_=1532479072277",
"&callback=jQuery16206432131784246533_1532479071878",
"&alphabet=",
"&market=main_market",
"&_=1532479072277"))
l <- rawToChar(r$content)
m <- gsub("jQuery16206432131784246533_1532479071878(", "", substring(l, 1, nchar(l)-1), fixed=TRUE)
tbl <- XML::readHTMLTable(jsonlite::fromJSON(m)$html)$bm_equities_prices_table
output:
> head(tbl)
# No Company Name Company Website
#1 1 7-ELEVEN MALAYSIA HOLDINGS BERHAD http://www.7eleven.com.my
#2 2 A-RANK BERHAD [S] http://www.arank.com.my
#3 3 ABLEGROUP BERHAD [S] http://www.gefung.com.my
#4 4 ABM FUJIYA BERHAD [S] http://www.abmfujiya.com.my
#5 5 ACME HOLDINGS BERHAD [S] http://www.supportivetech.com/
#6 6 ACOUSTECH BERHAD [S] http://www.acoustech.com.my/
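The gsub() call above depends on the exact callback name in the request. A slightly more general way to unwrap a JSONP response is to strip everything up to the first opening parenthesis and the final closing parenthesis. The sketch below does that on a made-up response string of the same shape (a callback wrapping JSON with an html field); the callback name and table content are invented, and rvest stands in for the XML package to parse the table.

```r
library(jsonlite)
library(rvest)

# Made-up JSONP response: callback(<JSON with an "html" field>)
txt <- 'jQuery123_456({"html":"<table id=\\"t\\"><tr><th>No</th><th>Company Name</th></tr><tr><td>1</td><td>A BERHAD</td></tr></table>"})'

# Strip "callback(" from the front and ")" (plus an optional ";") from the end
json <- sub("^[^(]*\\((.*)\\)[[:space:]]*;?[[:space:]]*$", "\\1", txt)

parsed <- fromJSON(json)

# Parse the embedded HTML fragment and pull out the table
tbl <- read_html(parsed$html) %>%
  html_element("table") %>%
  html_table()
```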

rvest: follow different links with same tag

I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data:
http://www.sports-reference.com/cfb/years/2007-schedule.html.
As you can see, there is a "Date" column with the dates hyperlinked, this hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, a lot of games take place on the same dates, which means their hyperlinks are the same. So if I scrape the hyperlinks from the table (which I have done) and then do something like:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
for (i in 1:length(links)) {
stats = html_session(url) %>%
follow_link(links[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
it will scrape from the first link corresponding to each date. For example there were 11 games that took place on Aug 30, 2007, but if I put that in the follow_link function, it grabs data from the first game (Boise St. Weber St.) every time. Is there any way I can specify that I want it to move down the table?
I have already found a workaround by finding out the formula for the urls to which the date hyperlinks take you, but it's a pretty convoluted process, so I thought I'd see if anyone knew how to do it this way.
This is a complete example:
library(rvest)
library(dplyr)
library(pbapply)
# Get the main page
URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)
# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")
# I'm only limiting to 10 since I rly don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free
scoring_games <- pblapply(links[1:10], function(x) {
game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))
scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
colnames(scoring) <- scoring[1,]
filter(scoring[-1,], !Player %in% c("", "Player"))
})
# you can bind_rows them all together but you should
# probably add a column for the game then
bind_rows(scoring_games)
## Source: local data frame [27 x 11]
##
## Player School Cmp Att Pct Yds Y/A AY/A TD Int Rate
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Taylor Tharp Boise State 14 19 73.7 184 9.7 10.7 1 0 172.4
## 2 Nick Lomax Boise State 1 5 20.0 5 1.0 1.0 0 0 28.4
## 3 Ricky Cookman Boise State 1 2 50.0 9 4.5 -18.0 0 1 -12.2
## 4 Ben Mauk Cincinnati 18 27 66.7 244 9.0 8.9 2 1 159.6
## 5 Tony Pike Cincinnati 6 9 66.7 57 6.3 8.6 1 0 156.5
## 6 Julian Edelman Kent State 17 26 65.4 161 6.2 3.5 1 2 114.7
## 7 Bret Meyer Iowa State 14 23 60.9 148 6.4 3.4 1 2 111.9
## 8 Matt Flynn Louisiana State 12 19 63.2 128 6.7 8.8 2 0 154.5
## 9 Ryan Perrilloux Louisiana State 2 3 66.7 21 7.0 13.7 1 0 235.5
## 10 Michael Henig Mississippi State 11 28 39.3 120 4.3 -5.4 0 6 32.4
## .. ... ... ... ... ... ... ... ... ... ... ...
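The reason collecting hrefs sidesteps the problem in the question is that games on the same date share their link text but not their link target. A minimal sketch on a made-up two-row table (the HTML and the boxscore paths are invented for illustration) shows the idea, using xml2::url_absolute() as an alternative to sprintf() for building the full URLs:

```r
library(rvest)
library(xml2)

# Made-up miniature of the schedule table: two games on the same date,
# so the link text is identical but the hrefs differ.
page <- read_html('
<table><tbody>
<tr><td>1</td><td>1</td><td><a href="/cfb/boxscores/game-a.html">Aug 30, 2007</a></td></tr>
<tr><td>1</td><td>2</td><td><a href="/cfb/boxscores/game-b.html">Aug 30, 2007</a></td></tr>
</tbody></table>')

anchors   <- html_elements(page, "table a")
link_text <- html_text(anchors)          # identical for both games
hrefs     <- html_attr(anchors, "href")  # distinct for each game

# Resolve the relative hrefs against the page they were found on
base_url  <- "http://www.sports-reference.com/cfb/years/2007-schedule.html"
game_urls <- url_absolute(hrefs, base_url)
```

Each element of game_urls can then be passed to read_html() directly, with no need to follow links by their text.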
You are looping, but assigning to the same variable every time, so each iteration overwrites the previous result. Store each result in its own list element instead:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
stats = vector("list", length(links))
for (i in seq_along(links)) {
stats[[i]] = html_session(url) %>%
follow_link(links[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
