I'm working on scraping data from a table on the following website
https://fantasy.nfl.com/research/scoringleaders?position=1&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1
I want to create a scraper that takes all 17 weeks, all four positions (QB, RB, WR, TE), and the first 4 pages of each, to get the first 100 rows (only 25 are shown on a page at a time).
library(tidyverse)
library(rvest)
library(glue)
scrape_19 <- function(week, position, page) {
  Sys.sleep(3)
  cat(".")
  url <- glue("https://fantasy.nfl.com/research/scoringleaders?{page}position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
  read_html(url) %>%
    html_nodes("table") %>%
    html_table(header = TRUE) %>%
    simplify() %>%
    first() %>%
    setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
    slice(-1) %>%
    list()
}
Here are all the iterations of each call in glue:
week = 1:17;
position = 1:4;
page = c("", "offset=26&", "offset=51&", "offset=76&")
The problem I run into is when I try to make one df with all the data for every week, position, and page. Here is code that works for week and position but does not extend to the additional nesting needed for page.
scaffold <- tibble(week = weeks,
                   position = list(positions)) %>%
  tidyr::unnest()
scaffold
tbl_data <- scaffold %>%
  mutate(data = purrr::map2(week, position, ~scrape_19(.x, .y)[[1]]))
Basically, I need help in crafting the scaffold and turning that scaffold into the final total data set with all weeks, positions and pages.
Here is my attempt. I am not sure if glue() is the way to go. See below.
first_name <- c("Fred", "Ana", "Bob")
last_name <- c("Johnson", "Trump")
glue('My name is {first_name} {last_name}.')
Error: Variables must be length 1 or 3
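For the scaffold itself, one option is tidyr::crossing() to enumerate every week/position/page combination (17 x 4 x 4 = 272 rows) and purrr::pmap() to call the scraper on each row. A minimal sketch, assuming scrape_19() as defined above (it returns a one-element list, hence the [[1]]):
scaffold <- crossing(week = 1:17,
                     position = 1:4,
                     page = c("", "offset=26&", "offset=51&", "offset=76&"))
tbl_data <- scaffold %>%
  mutate(data = purrr::pmap(list(week, position, page),
                            ~ scrape_19(..1, ..2, ..3)[[1]])) %>%
  tidyr::unnest(data)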
Your case is similar to the glue() example above. So I tried to create all possible links using loops with map(), and then checked whether all of the URLs exist. I used map_dfr() to loop through all the URLs and bind the resulting data frames, adding the week and position information along the way. If position is 1, it is QB; replace these numbers yourself if necessary. Note that I scraped only four URLs in this demonstration.
library(httr)
library(rvest)
library(tidyverse)
# Create all URLs.
# Create 4 base URLs
paste("https://fantasy.nfl.com/research/scoringleaders?",
      c("", "offset=26&", "offset=51&", "offset=76&"),
      "position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}",
      sep = "") -> mytemp
# For each base URL, create 4 URLs (4 x 4 = 16 URLs)
map(.x = 1:4,
    .f = function(x){gsub(x = mytemp, pattern = "\\{position\\}", replacement = x)}) %>%
  unlist -> mytemp
# For each of the 16 URLs, create 17 URLs (16 x 17 = 272 URLs)
map(.x = 1:17,
    .f = function(x){gsub(x = mytemp, pattern = "\\{week\\}", replacement = x)}) %>%
  unlist -> myurls
# Check that all URLs exist
# (url_success() is deprecated in recent httr; !http_error() is the modern equivalent)
sapply(myurls, url_success) %>% table
# TRUE
# 272
# Scrape the tables
map_dfr(.x = myurls[1:4],
        .f = function(x){read_html(x) %>%
            html_nodes("table") %>%
            html_table() %>%
            simplify() %>%
            first() %>%
            setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
            slice(-1) %>%
            mutate(position = str_extract(string = x, pattern = "(?<=position=)\\d+(?=&)"),
                   week = str_extract(string = x, pattern = "(?<=statWeek=)\\d+"))},
        .id = "url") -> foo
url Rank Player Opp PassingYds PassingTD PassingInt RushingYds RushingTD ReceivingRec ReceivingYds
1 1 1 Lamar Jackson QB - BAL #MIA 324 5 - 6 - - -
2 1 2 Dak Prescott QB - DAL NYG 405 4 - 12 - - -
3 1 3 Deshaun Watson QB - HOU #NO 268 3 1 40 1 - -
4 1 4 Matthew Stafford QB - DET #ARI 385 3 - 22 - - -
5 1 5 Patrick Mahomes QB - KC #JAX 378 3 - 2 - - -
ReceivingTD RetTD MiscFumTD Misc2PT FumLost FantasyPoints position week
1 - - - - - 33.56 1 1
2 - - - - - 33.40 1 1
3 - - - - - 30.72 1 1
4 - - - - 1 27.60 1 1
5 - - - - - 27.32 1 1
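The position column comes back as the raw query code ("1" through "4"). A sketch of recoding it to labels, assuming the site's ordering of 1 = QB, 2 = RB, 3 = WR, 4 = TE (worth verifying against the query URLs):
foo <- foo %>%
  mutate(position = recode(position,
                           "1" = "QB", "2" = "RB", "3" = "WR", "4" = "TE"))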
Related
I am trying to scrape a website with R. I need the table and the links from that table, associated with the correct rows. I can get the table and the links, but the web table has two columns with links, some rows don't have any links, and the links can't simply be sorted and joined on file names. I can't figure out how to create a data frame with the table columns plus the links matched to the correct rows.
library(rvest)
# Read HTML from the EPA website
content <- read_html("https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys")
tables <- content %>%
  html_table(fill = TRUE)
EPA_table <- tables[[1]]
# Get links from the table
web <- content %>%
  html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
  html_attr("href")
Use the xpath= argument of html_nodes() to select individual columns.
## Data links
web <- content %>%
  html_nodes("table tr") %>%
  html_nodes(xpath = "//td[3]") %>%  ## xpath
  html_nodes("a") %>%
  html_attr("href")
EPA_table$web1 <- web  ## add Data links column
## Metadata links accordingly
web2 <- content %>%
  html_nodes("table tr") %>%
  html_nodes(xpath = "//td[4]") %>%  ## xpath
  html_nodes("a") %>%
  html_attr("href")
The empty Metadata cells can be set to NA; the metadata links then fit exactly into the rows where Metadata is not NA.
EPA_table[EPA_table$Metadata %in% "", "Metadata"] <- NA
EPA_table[!is.na(EPA_table$Metadata), "web2"] <- web2 ## add metadata column
Result
head(EPA_table)
# Survey Indicator
# 1 Lakes 2007 All
# 2 Lakes 2007 Landscape Data
# 3 Lakes 2007 Water Chemistry
# 4 Lakes 2007 Visual Assessment
# 5 Lakes 2007 Site Information
# 6 Lakes 2007 Notes
# Data
# 1 NLA 2007 All Data (ZIP)(1 pg, 5 MB)
# 2 NLA 2007 Basin Landuse Metrics - Data 20061022 (CSV)(1 pg, 307 K)
# 3 NLA 2007 Profile - Data 20091008 (CSV)(1 pg, 888 K)
# 4 NLA 2007 Visual Assessment - Data 20091015 (CSV)(1 pg, 813 K)
# 5 NLA 2007 Site Information - Data 20091113 (CSV)(1 pg, 980 K)
# 6 National Lakes Assessment 2007 Final Data Notes
# Metadata
# 1 <NA>
# 2 NLA 2007 Basin Landuse Metrics - Metadata 20091022 (TXT)(1 pg, 4 K)
# 3 NLA 2007 Profile - Metadata 20091008 (TXT)(1 pg, 650 B)
# 4 NLA 2007 Visual Assessment - Metadata 10091015 (TXT)(1 pg, 7 K)
# 5 NLA 2007 Site Information - Metadata 20091113 (TXT)(1 pg, 8 K)
# 6 <NA>
# web1
# 1 /sites/production/files/2017-02/nla2007_alldata.zip
# 2 /sites/production/files/2013-09/nla2007_basin_landuse_metrics_20061022.csv
# 3 /sites/production/files/2013-09/nla2007_profile_20091008.csv
# 4 /sites/production/files/2014-01/nla2007_visualassessment_20091015.csv
# 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv
# 6 /national-aquatic-resource-surveys/national-lakes-assessment-2007-final-data-notes
# web2
# 1 <NA>
# 2 /sites/production/files/2013-09/nla2007_basin_landuse_metrics_info_20091022.txt
# 3 /sites/production/files/2013-09/nla2007_profile_info_20091008_0.txt
# 4 /sites/production/files/2014-01/nla2007_visualassessment_info_20091015.txt
# 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_info_20091113.txt
# 6 <NA>
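The extracted hrefs are site-relative; if full URLs are needed, xml2::url_absolute() can expand them. A sketch, applied only where links exist:
EPA_table$web1 <- xml2::url_absolute(EPA_table$web1, "https://www.epa.gov")
i <- !is.na(EPA_table$web2)
EPA_table$web2[i] <- xml2::url_absolute(EPA_table$web2[i], "https://www.epa.gov")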
I would have gone with CSS selectors and :nth-child to separate out individual columns, looping over the table rows. By using tbody in the selector I exclude the header row and process only the table body rows, passing that list to map_df().
library(rvest)
library(purrr)
library(dplyr)  # for if_else()
url <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
rows <- read_html(url) %>% html_nodes("#narsdata tbody tr")
df <- map_df(rows, function(x) {
data.frame(
Survey = x %>% html_node("td:nth-child(1)") %>% html_text(),
Indicator = x %>% html_node("td:nth-child(2)") %>% html_text(),
Data = x %>% html_node("td:nth-child(3) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
Metadata = x %>% html_node("td:nth-child(4) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
stringsAsFactors = FALSE
)
})
I don't think you really need the file names in addition to the URLs, but if so, you can expand the data.frame with two additional columns and extract the html_text rather than the html_attr, e.g.
Data_Name = x %>% html_node("td:nth-child(3) a") %>% html_text(),
Metadata_Name = x %>% html_node("td:nth-child(4) a") %>% html_text()
I am trying to grab links from a page for subsequent analysis and can only grab about half of them, which may be due to filtering. I'm trying to extract the episode links from the index page.
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)
#initiate session
session <- html_session("https://www.backlisted.fm/episodes")
#collect links for all episodes from the index page:
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  tibble(link_temp = .) %>%
  filter(str_detect(link_temp, pattern = "episodes/")) %>%
  distinct()
#css:
#.underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
#result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.
Try this:
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows
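If you would rather keep the original .underline-body-links selector from the question, the mixed relative/absolute hrefs can be normalized with xml2::url_absolute() before filtering. A sketch using the question's session:
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  xml2::url_absolute("https://www.backlisted.fm") %>%
  tibble(link = .) %>%
  filter(str_detect(link, "episodes/")) %>%
  distinct()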
I have a data frame as shown below:
test_df <- data.frame("SN" = c("ABC123","ABC123","ABC123","MNO098","MNO098","MNO098"),
                      "code" = c("ABC1111","DEF222","GHI133","","MNO1123","MNO567"),
                      "d_time" = c("2220-08-27","2220-05-27","2220-02-27","2220-11-27","2220-02-27",""))
I am trying to do two things:
1) create two new columns (p_id, v_id) by stripping the letters from columns SN and code and retaining only the first 9 digits
2) create a lag column (p_vid) based on v_id for each person, sorted by his/her d_time
t_df <- test_df %>% group_by(SN)
t_df %>% arrange(d_time, .by_group = TRUE) ->> sorted_df  # sorted based on d_time
transform_ids = function(DF){  # this function creates person and visit_occurrence ids
  DF %>%
    mutate(p_id = as.integer(str_remove_all(.$SN, "[a-z]|[A-Z]") %>%  # retain only the numeric part
                               str_sub(1, 9))) %>%
    mutate(v_id = as.integer(str_remove_all(.$code, "[a-z]|[A-Z]") %>%
                               str_sub(1, 9))) %>%
    group_by(p_id) %>%
    mutate(pre_vid = lag(v_id)) %>%
    ungroup
}
transform_ids(sorted_df)
But when I do this I encounter the error below:
Error: Column `p_id` must be length 3 (the group size) or one, not 6
Basically, I am trying to link each v_id of a person to the v_id of his or her previous visit, which becomes p_vid.
To generate the p_id and v_id columns, just use gsub():
t_df$p_id <- gsub("[A-Z]+", "", t_df$SN)
t_df$v_id <- gsub("[A-Z]+", "", t_df$code)
For the p_vid column, use lag() from the dplyr package:
t_df %>%
group_by(p_id) %>%
mutate(p_vid = lag(v_id, order_by=d_time, default=0))
The output from the above actually gives you a tibble. If you want a data frame, just use:
t_df <- as.data.frame(t_df)
Output:
SN code d_time p_id v_id p_vid
<fct> <fct> <fct> <chr> <chr> <chr>
1 ABC123 ABC1111 2220-08-27 123 1111 222
2 ABC123 DEF222 2220-05-27 123 222 133
3 ABC123 GHI133 2220-02-27 123 133 0
4 MNO098 "" 2220-11-27 098 "" 1123
5 MNO098 MNO1123 2220-02-27 098 1123 567
6 MNO098 MNO567 "" 098 567 0
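As a side note, the error in the question comes from using .$SN inside a grouped mutate(): .$SN extracts the full six-row column while each group is only three rows long, hence "must be length 3 (the group size) or one, not 6". Referring to the bare column names makes the original transform_ids() work as written. A sketch, assuming dplyr and stringr are loaded as in the question:
transform_ids <- function(DF){
  DF %>%
    mutate(p_id = as.integer(str_sub(str_remove_all(SN, "[a-z]|[A-Z]"), 1, 9)),
           v_id = as.integer(str_sub(str_remove_all(code, "[a-z]|[A-Z]"), 1, 9))) %>%
    group_by(p_id) %>%
    mutate(pre_vid = lag(v_id)) %>%
    ungroup()
}
transform_ids(sorted_df)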
I am a beginner in R. I need to translate some EViews code to R. In EViews, a few lines of loop code can add 10 or more columns/variables to the data using a formula. Here is an EViews example that estimates deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate() function, but adding many variables this way is very tedious.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df <- data.frame(nominal_gdp, nominal_inv,
                 nominal_gov, nominal_exp,
                 real_gdp, real_inv, real_gov, real_exp)
df <- df %>% mutate(deflator_gdp = nominal_gdp / real_gdp * 100,
                    deflator_inv = nominal_inv / real_inv * 100,
                    deflator_gov = nominal_gov / real_gov * 100,
                    deflator_exp = nominal_exp / real_exp * 100)
print(df)
Please help me do this in R with a loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
                 nominal_inv = rnorm(4),
                 nominal_gov = rnorm(4),
                 real_gdp = rnorm(4),
                 real_inv = rnorm(4),
                 real_gov = rnorm(4))
df <- df %>%
  mutate(obs_id = 1:n()) %>%
  select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to that nice dplyr-compatible data frame? You need to gather your data using tidyr::gather(). However, because you have two sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps; there may be a better way, though (a one-step alternative is sketched after the merge below).
real_vals <- df %>%
  select(obs_id, starts_with("real")) %>%
  # the line below is where the magic happens
  tidyr::gather(variable, real, starts_with("real")) %>%
  # extract the variable name (by erasing up to the underscore)
  mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
  select(obs_id, starts_with("nominal")) %>%
  tidyr::gather(variable, nominal, starts_with("nominal")) %>%
  mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
  full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
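As for the "better way": on tidyr 1.0 or later, both gather() steps collapse into a single pivot_longer() call using the special ".value" sentinel. A sketch, assuming the same df with its obs_id column:
df2 <- df %>%
  tidyr::pivot_longer(-obs_id,
                      names_to = c(".value", "variable"),
                      names_sep = "_")
# df2 now has columns obs_id, variable, nominal, real
# (column order may differ slightly from the df2 shown above)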
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
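To attach the results back to df as named deflator columns, a sketch in the same base-R style:
nom <- sort(x[grepl("^nominal", x)])
rea <- sort(x[grepl("^real", x)])
deflators <- df[nom] / df[rea] * 100
names(deflators) <- sub("^nominal", "deflator", nom)
df <- cbind(df, deflators)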
We can loop over the column names using purrr::map_dfc(), then apply a custom function over the selected columns (i.e. the columns that match the current name from nms).
library(dplyr)
library(purrr)
# Strip everything up to the underscore
nms <- unique(sub('.*_', '', names(df)))
# Use map() if you need the output as a list instead of a data frame
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: quo_name(), !!, and ensym() are tools for programming with dplyr; see the tidy evaluation documentation to learn more.
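For completeness, a sketch of the same helper written with the newer .data pronoun and glue-style name injection (dplyr 1.0+), which avoids ensym() and quo_name():
deflator_fun2 <- function(df, x){
  # .data[[...]] looks columns up by their string names inside the data mask
  select(df, matches(x)) %>%
    mutate("deflator_{x}" := .data[[paste0("nominal_", x)]] /
             .data[[paste0("real_", x)]] * 100)
}
deflator_fun2(df, "gdp")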
I am trying to create a data frame that has the columns: First name, Last name, Party, State, Member ID. Here is my code:
library('rvest')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
candidate_list <- html_text(candidate_nodes)
My main issue is getting the member IDs. An example ID is A000009. When I use the gsub function I lose the leading A in this example. The A is from this candidate's last name (Abercrombie), but I do not know how to add the A back into the member ID. Of course if there's a better way I am open to any suggestions.
Since you've got an HTML table, use html_table to extract it to a data.frame. You'll need fill = TRUE, because the table has extra empty rows inserted between each entry, which you can easily drop afterwards with tidyr::drop_na.
library(tidyverse)
library(rvest)
page <- 'https://www.congress.gov/help/field-values/member-bioguide-ids' %>%
read_html()
members <- page %>%
html_node('table') %>%
html_table(fill = TRUE) %>%
set_names('member', 'bioguide') %>%
drop_na(member) %>% # remove empty rows inserted in the table
tbl_df() # for printing
members
#> # A tibble: 2,243 x 2
#> member bioguide
#> * <chr> <chr>
#> 1 Abdnor, James (Republican - South Dakota) A000009
#> 2 Abercrombie, Neil (Democratic - Hawaii) A000014
#> 3 Abourezk, James (Democratic - South Dakota) A000017
#> 4 Abraham, Ralph Lee (Republican - Louisiana) A000374
#> 5 Abraham, Spencer (Republican - Michigan) A000355
#> 6 Abzug, Bella S. (Democratic - New York) A000018
#> 7 Acevedo-Vila, Anibal (Democratic - Puerto Rico) A000359
#> 8 Ackerman, Gary L. (Democratic - New York) A000022
#> 9 Adams, Alma S. (Democratic - North Carolina) A000370
#> 10 Adams, Brock (Democratic - Washington) A000031
#> # ... with 2,233 more rows
The member column could be further extracted, if you like (see the sketch below). There are also many other useful sources for this data, some of which are well-structured, updated regularly, and correlate it with other useful variables.
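As for splitting member, a sketch with tidyr::extract(), assuming every entry follows the "Last, First (Party - State)" layout shown above:
members <- members %>%
  extract(member,
          into = c("name", "party", "state"),
          regex = "^(.*?)\\s*\\((.+?)\\s*-\\s*(.+)\\)$",
          remove = FALSE)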
Give this a try. I have updated it to separate out the different fields.
library('rvest')
library('dplyr')
library('tidyr')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
df.candidates <- as.data.frame(html_table(candidate_nodes, header = TRUE, fill = TRUE),
                               stringsAsFactors = FALSE)
df.candidates <- df.candidates[!is.na(df.candidates$Member), ]
df.candidates <- df.candidates %>%
  # regexpr() yields one match per row; gregexpr(...)[[1]] would recycle row 1's match
  mutate(Party.State = gsub("[()]", "", regmatches(Member, regexpr("\\(.*?\\)", Member)))) %>%
  separate(Party.State, into = c("Party", "State"), sep = " - ") %>%
  mutate(Full.name = trimws(regmatches(Member, regexpr("^[^(]+", Member)))) %>%
  separate(Full.name, into = c("Last.Name", "First.Name", "Suffix"), sep = ",", fill = "right") %>%
  select(First.Name, Last.Name, Suffix, Party, State, Member.ID)
This is a bit hackish, but if you want to extract the variables using regex here are a few pointers.
candidate_list <- unlist(candidate_list)
ID <- regmatches(candidate_list,
                 gregexpr("[a-zA-Z]{1}[0-9]{6}", candidate_list))
party_state <- regmatches(candidate_list,
                          gregexpr("(?<=\\()[^)]+(?=\\))", candidate_list, perl = TRUE))
names_etc <- strsplit(candidate_list, "[a-zA-Z]{1}[0-9]{6}")
names <- sapply(names_etc, function(x) sub(" \\([^)]*\\)", "", x))
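To assemble the pieces into the data frame the question asks for, a rough sketch, assuming each member string contains exactly one ID and one parenthesized party/state so that the extracted vectors line up:
ps <- strsplit(party_state[[1]], " - ")
candidates <- data.frame(member_id = ID[[1]],
                         party = sapply(ps, `[`, 1),
                         state = sapply(ps, `[`, 2),
                         stringsAsFactors = FALSE)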