Web Scraping Pro Football Reference in R

I am interested in web scraping Pro Football Reference. I need to set up a function that lets me scrape multiple pages. So far I have code that looks like it should work, but I keep getting an error.
library(XML)  # readHTMLTable() comes from the XML package

scrapeData <- function(urlprefix, urlend, startyr, endyr) {
  master <- data.frame()
  for (i in startyr:endyr) {
    cat('Loading Year', i, '\n')
    URL <- paste(urlprefix, as.character(i), urlend, sep = "")
    table <- readHTMLTable(URL, stringsAsFactors = FALSE)[[1]]
    table$Year <- i
    master <- rbind(table, master)
  }
  return(master)
}

drafts <- scrapeData('http://www.pro-football-reference.com/years/', '/draft.htm', 2010, 2010)
When I run it, I get:
Error: failed to load external entity "http://www.pro-football-reference.com/years/2010/draft.htm"
Any advice would be helpful. Thank you.

The "failed to load external entity" error most likely occurs because readHTMLTable() from the XML package cannot fetch https URLs, and pro-football-reference.com now redirects http requests to https. A simpler approach is rvest, which handles https, combined with purrr::map_dfr() to loop over the years:
library(tidyverse)
library(rvest)
get_football <- function(year) {
  str_c("https://www.pro-football-reference.com/years/",
        year,
        "/draft.htm") %>%
    read_html() %>%
    html_table() %>%
    pluck(1) %>%                   # the draft table is the first table on the page
    janitor::row_to_names(1) %>%   # the table has a two-row header; promote the second row to names
    janitor::clean_names() %>%
    mutate(year = year)
}
map_dfr(2010:2015, get_football)
# A tibble: 1,564 × 30
rnd pick tm player pos age to ap1 pb st w_av dr_av g cmp att
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 STL Sam B… QB 22 2018 0 0 5 44 25 83 1855 2967
2 1 2 DET Ndamu… DT 23 2022 3 5 12 100 59 196 0 0
3 1 3 TAM Geral… DT 22 2021 1 6 10 69 65 140 0 0
4 1 4 WAS Trent… T 22 2022 1 9 11 78 58 160 0 0
5 1 5 KAN Eric … DB 21 2018 3 5 5 50 50 89 0 0
6 1 6 SEA Russe… T 21 2020 0 2 9 56 31 131 0 0
7 1 7 CLE Joe H… DB 21 2021 0 3 10 62 39 158 0 0
8 1 8 OAK Rolan… LB 21 2015 0 0 5 25 15 65 0 0
9 1 9 BUF C.J. … RB 23 2017 0 1 3 34 32 90 0 0
10 1 10 JAX Tyson… DT 23 2022 0 0 7 44 33 188 0 0
# … with 1,554 more rows, and 15 more variables
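If you pull several seasons in one run, it can be polite to pause between requests. A minimal sketch using purrr::slowly() (the one-second delay is an arbitrary choice on my part, not a documented requirement of the site):
library(purrr)
# Wrap get_football() so each call waits one second before the next request fires
get_football_slowly <- slowly(get_football, rate = rate_delay(1))
drafts <- map_dfr(2010:2015, get_football_slowly)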

Related

R: Calculate linear regression and get slope for "a subset of data"

My goal is to find the half-life (from the terminal phase, if anyone is familiar with pharmacokinetics).
I have data with the following structure:
1500 rows, with ID as the main key and 15 rows per ID. There are also columns TIME and CONCENTRATION. For each ID, I want to remove the first TIME row (which equals "000" (numeric)), run lm() on the remaining 14 rows, take the absolute value of the slope with abs(), and save it to a new column named THALF. (If anyone is familiar with pharmacokinetics, maybe there is a better way to do this?)
I have not been able to do this with my limited knowledge of R.
Here is what I've come up with so far:
data_new <- data %>%
  dplyr::group_by(data$ID) %>%
  dplyr::filter(data$TIME != 10) %>%
  dplyr::mutate(THAFL = abs(lm$coefficients[2](data$CONC ~ data$TIME)))
From what I've understood from other Stack Overflow answers, lm$coefficients[2] will extract the slope.
However, I have not been able to make this work. Running the code gives this error:
Error: Problem with `mutate()` input `..1`.
x Input `..1` can't be recycled to size 15.
i Input `..1` is `data$ID`.
i Input `..1` must be size 15 or 1, not 1500.
i The error occurred in group 1: data$ID = "pat1".
Any suggestions on how to solve this? If you need more info, please let me know.
(Also, for anyone familiar with pharmacokinetics: when they ask for the half-life from the terminal phase, should I run lm() starting from the maximum concentration? I have a column giving the highest observed concentration and the time at which it occurred.)
If, after fitting the models, you still need the observations with TIME == 10, you can summarise after grouping by ID and then use a right join:
data %>%
  filter(TIME != 10) %>%
  group_by(ID) %>%
  summarise(THAFL = abs(lm(CONC ~ TIME)$coefficients[2])) %>%
  right_join(data, by = "ID")
# A tibble: 30 x 16
ID THAFL Sex Weight..kg. Height..cm. Age..yrs. T134A A443G G769C G955C A990C TIME CONC LBM `data_combine$ID` CMAX
<chr> <dbl> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 pat1 0.00975 F 50 135 47 0 2 1 2 0 10 0 Under pat1 60
2 pat1 0.00975 F 50 135 47 0 2 1 2 0 20 6.93 Under pat1 60
3 pat1 0.00975 F 50 135 47 0 2 1 2 0 30 12.2 Under pat1 60
4 pat1 0.00975 F 50 135 47 0 2 1 2 0 45 14.8 Under pat1 60
5 pat1 0.00975 F 50 135 47 0 2 1 2 0 60 15.0 Under pat1 60
6 pat1 0.00975 F 50 135 47 0 2 1 2 0 90 12.4 Under pat1 60
7 pat1 0.00975 F 50 135 47 0 2 1 2 0 120 9.00 Under pat1 60
8 pat1 0.00975 F 50 135 47 0 2 1 2 0 150 6.22 Under pat1 60
9 pat1 0.00975 F 50 135 47 0 2 1 2 0 180 4.18 Under pat1 60
10 pat1 0.00975 F 50 135 47 0 2 1 2 0 240 1.82 Under pat1 60
# ... with 20 more rows
If, after fitting the models, you don't want the rows with TIME == 10 to appear in your dataset, you can use mutate instead:
data %>%
  filter(TIME != 10) %>%
  group_by(ID) %>%
  mutate(THAFL = abs(lm(CONC ~ TIME)$coefficients[2]))
# A tibble: 28 x 16
# Groups: ID [2]
ID Sex Weight..kg. Height..cm. Age..yrs. T134A A443G G769C G955C A990C TIME CONC LBM `data_combine$ID` CMAX THAFL
<chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 pat1 F 50 135 47 0 2 1 2 0 20 6.93 Under pat1 60 0.00975
2 pat2 M 75 175 29 0 2 0 0 0 20 6.78 Under pat2 60 0.00835
3 pat1 F 50 135 47 0 2 1 2 0 30 12.2 Under pat1 60 0.00975
4 pat2 M 75 175 29 0 2 0 0 0 30 11.6 Above pat2 60 0.00835
5 pat1 F 50 135 47 0 2 1 2 0 45 14.8 Under pat1 60 0.00975
6 pat2 M 75 175 29 0 2 0 0 0 45 13.5 Under pat2 60 0.00835
7 pat1 F 50 135 47 0 2 1 2 0 60 15.0 Under pat1 60 0.00975
8 pat2 M 75 175 29 0 2 0 0 0 60 13.1 Above pat2 60 0.00835
9 pat1 F 50 135 47 0 2 1 2 0 90 12.4 Under pat1 60 0.00975
10 pat2 M 75 175 29 0 2 0 0 0 90 9.77 Under pat2 60 0.00835
# ... with 18 more rows
You can use broom:
library(broom)
library(dplyr)
library(tidyr)  # for unnest()

# Fit one model per ID, tidy the coefficients, and keep the absolute TIME slope
data %>%
  group_by(ID) %>%
  filter(TIME != 10) %>%
  do(fit = tidy(lm(CONC ~ TIME, data = .))) %>%
  unnest(fit) %>%
  filter(term == 'TIME') %>%
  mutate(estimate = abs(estimate))
Output:
# A tibble: 2 x 6
ID term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 pat1 TIME 0.00975 0.00334 -2.92 0.0128
2 pat2 TIME 0.00835 0.00313 -2.67 0.0204
If joining with the original data is needed, try:
# Code 2
data <- data %>%
  left_join(data %>%
              group_by(ID) %>%
              filter(TIME != 10) %>%
              do(fit = tidy(lm(CONC ~ TIME, data = .))) %>%
              unnest(fit) %>%
              filter(term == 'TIME') %>%
              mutate(estimate = abs(estimate)) %>%
              select(c(ID, estimate)))
Similar to @RicS.
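A side note on the pharmacokinetics part of the question: the terminal half-life is usually computed as t1/2 = ln(2)/λz, where λz is the absolute slope of a regression of log(CONC) on TIME over the terminal (post-peak) samples, rather than the slope of CONC itself. A rough sketch along those lines (it assumes every retained row belongs to the terminal phase, which you would need to check against your CMAX column):
library(dplyr)
data %>%
  filter(TIME != 10, CONC > 0) %>%   # drop the pre-dose row and any zero concentrations
  group_by(ID) %>%
  summarise(lambda_z = abs(coef(lm(log(CONC) ~ TIME))["TIME"]),
            THALF    = log(2) / lambda_z)   # terminal half-life = ln(2) / lambda_z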

How to reshape data frame from a row level to person level in R

I have the following code for a Netflix experiment that reduces the price of Netflix to see whether people watch more or less TV. Each time someone uses Netflix, it records what they watched and how long they watched it for.
library(tidyverse)

sample_size <- 10000
set.seed(853)

viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE))
I then want to write some code that randomly assigns people to one of two groups: treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R so that I can assign each person to be either treated or not. A person should not be both treated and not treated, yet each person appears many times in tv_show. Does anyone know how to reshape the dataset in this case?
library(dplyr)

treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>%
  left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
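For instance, a minimal sketch of unequal assignment probabilities (the 30/70 split here is purely illustrative):
treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = n(), replace = TRUE,
                          prob = c(0.3, 0.7)))   # illustrative 30/70 split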
You can do the following, which groups your observations by person id and assigns a single "treated"/"control" label per group:
library(dplyr)

viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results: every id is assigned exactly one of treated/control:
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

tapply(newdata$group, newdata$unique_person_id, n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(100),                    # in case the ids are not truly random
         group = ifelse(group %% 2 == 0, 0, 1))  # works if there are only two groups

Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = complete_ra(N = 100, m = 50))

Persons %>% count(group)  # check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows

Using R to determine if errors are normally distributed

Say I have a dataset called wage that looks like this:
wage
# A tibble: 935 x 17
wage hours iq kww educ exper tenure age married black south urban sibs brthord meduc
<int> <int> <int> <int> <int> <int> <int> <int> <fctr> <fctr> <fctr> <fctr> <int> <int> <int>
1 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8
2 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14
3 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14
4 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12
5 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6
6 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8
7 600 40 91 24 10 13 0 30 0 0 0 1 1 2 8
8 1081 40 114 50 18 8 14 38 1 0 0 1 2 3 8
9 1154 45 111 37 15 13 1 36 1 0 0 0 2 3 14
10 1000 40 95 44 12 16 16 36 1 0 0 1 1 1 12
# ... with 925 more rows, and 2 more variables: feduc <int>, lwage <dbl>
Say I then look at a simple linear regression between wage and IQ:
m_wage_iq = lm(wage ~ iq, data = wage)
m_wage_iq$coefficients
which gives me:
## (Intercept) iq
## 116.991565 8.303064
I want to check that the errors satisfy
$\epsilon_i \sim N(0, \sigma^2)$
How do I check this using R?
There are a number of ways you can check this.
One way is the Shapiro-Wilk test (shapiro.test()) applied to the residuals. A p-value greater than your alpha level (typically up to 10%) means that the null hypothesis (i.e. that the errors are normally distributed) cannot be rejected. However, the test is sensitive to sample size, so you may want to reinforce the result by looking at a QQ plot.
You can see that by plotting the model (plot(m_wage_iq)) and looking at the second graph. If the points lie approximately on the reference line, that suggests the errors follow a normal distribution.
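A minimal sketch putting both checks together, assuming the wage data and model from the question:
# Fit the model and extract its residuals
m_wage_iq <- lm(wage ~ iq, data = wage)
res <- residuals(m_wage_iq)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
shapiro.test(res)

# QQ plot of the residuals (the second plot produced by plot(m_wage_iq))
qqnorm(res)
qqline(res)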

Add a new column based on data in between zeroes

I have power data (Power) collected every second (Sample). My data.frame is therefore structured as follows:
Test <- data.frame(Sample = c(1:20),
                   Power = c(0,0,0,0,0,50,67,100,92,0,0,0,36,89,36,0,0,0,89,90))
The number of power entries depends on a human performing efforts on a bike and resting sporadically, so the power values do not appear in an ordered fashion. As there are no markers to indicate when an effort starts and stops, I want to add this detail. An effort can be characterised as any run of power > 0, and the start/stop of each effort can be identified from how the data are grouped together.
I now wish to include a new column (Marker) that looks for power data grouped together and separated by zeroes. For example, my anticipated output would be:
Test$Marker <- c("Rest","Rest","Rest","Rest","Rest","Effort 1","Effort 1","Effort 1","Effort 1",
                 "Rest","Rest","Rest","Effort 2","Effort 2","Effort 2","Rest","Rest","Rest",
                 "Effort 3","Effort 3")
Unfortunately my raw data are > 3000 rows long, so doing this manually would be tedious! How do I go about doing this in R?
An option with base R:
# indx1: TRUE/FALSE for "in an effort", expanded back to the original length
indx1 = with(rle(Test$Power > 0), rep(values, lengths))
# indx2: running effort number; cumsum(values) increments at the start of each effort run
indx2 = with(rle(Test$Power > 0), rep(cumsum(values), lengths))
Test$Effort[indx1] = paste0("Effort", indx2[indx1])
Test$Effort[!indx1] = "Rest"
Output:
Sample Power Effort
1 1 0 Rest
2 2 0 Rest
3 3 0 Rest
4 4 0 Rest
5 5 0 Rest
6 6 50 Effort1
7 7 67 Effort1
8 8 100 Effort1
9 9 92 Effort1
10 10 0 Rest
11 11 0 Rest
12 12 0 Rest
13 13 36 Effort2
14 14 89 Effort2
15 15 36 Effort2
16 16 0 Rest
17 17 0 Rest
18 18 0 Rest
19 19 89 Effort3
20 20 90 Effort3
About 0.0038 seconds for 3,000 rows ;) Hope this helps!
An alternative base R version using cumsum:
# mrk is TRUE for rest rows; cumsum(mrk) is constant within each effort block,
# so factor()/as.numeric() relabels the blocks as 1, 2, 3, ...
mrk <- Test$Power == 0
Test$New[!mrk] <- paste("effort", as.numeric(factor(cumsum(mrk)[!mrk])))
Test$New[mrk] <- "rest"
# Sample Power Marker New
#1 1 0 Rest rest
#2 2 0 Rest rest
#3 3 0 Rest rest
#4 4 0 Rest rest
#5 5 0 Rest rest
#6 6 50 Effort 1 effort 1
#7 7 67 Effort 1 effort 1
#8 8 100 Effort 1 effort 1
#9 9 92 Effort 1 effort 1
#10 10 0 Rest rest
#11 11 0 Rest rest
#12 12 0 Rest rest
#13 13 36 Effort 2 effort 2
#14 14 89 Effort 2 effort 2
#15 15 36 Effort 2 effort 2
#16 16 0 Rest rest
#17 17 0 Rest rest
#18 18 0 Rest rest
#19 19 89 Effort 3 effort 3
#20 20 90 Effort 3 effort 3
An option with dplyr from the tidyverse:
library(dplyr)

Test <- data.frame(Sample = c(1:20),
                   Power = c(0,0,0,0,0,50,67,100,92,0,0,0,36,89,36,0,0,0,89,90))

Test_df <- Test %>%
  mutate(
    Marker = case_when(
      Power > 0  ~ "Effort",
      Power == 0 ~ "Rest"),
    # rleid numbers each run of identical Marker values; since the data start at rest,
    # effort runs get rleid 2, 4, 6, ..., so rleid %/% 2 yields 1, 2, 3, ...
    rleid = cumsum(Marker != lag(Marker, 1, default = "NA")),
    Marker = case_when(
      Marker == "Effort" ~ paste0(Marker, rleid %/% 2),
      TRUE ~ "Rest"),
    rleid = NULL
  )

Test_df
#> Sample Power Marker
#> 1 1 0 Rest
#> 2 2 0 Rest
#> 3 3 0 Rest
#> 4 4 0 Rest
#> 5 5 0 Rest
#> 6 6 50 Effort1
#> 7 7 67 Effort1
#> 8 8 100 Effort1
#> 9 9 92 Effort1
#> 10 10 0 Rest
#> 11 11 0 Rest
#> 12 12 0 Rest
#> 13 13 36 Effort2
#> 14 14 89 Effort2
#> 15 15 36 Effort2
#> 16 16 0 Rest
#> 17 17 0 Rest
#> 18 18 0 Rest
#> 19 19 89 Effort3
#> 20 20 90 Effort3
Another option, a one-liner using data.table:
library(data.table)

Test <- data.frame(Sample = c(1:20),
                   Power = c(0,0,0,0,0,50,67,100,92,0,0,0,36,89,36,0,0,0,89,90))

setDT(Test)
Test[, Marker := ifelse(Power > 0, paste0("Effort", rleidv(Power > 0) %/% 2), "Rest")]
Test
#> Sample Power Marker
#> 1: 1 0 Rest
#> 2: 2 0 Rest
#> 3: 3 0 Rest
#> 4: 4 0 Rest
#> 5: 5 0 Rest
#> 6: 6 50 Effort1
#> 7: 7 67 Effort1
#> 8: 8 100 Effort1
#> 9: 9 92 Effort1
#> 10: 10 0 Rest
#> 11: 11 0 Rest
#> 12: 12 0 Rest
#> 13: 13 36 Effort2
#> 14: 14 89 Effort2
#> 15: 15 36 Effort2
#> 16: 16 0 Rest
#> 17: 17 0 Rest
#> 18: 18 0 Rest
#> 19: 19 89 Effort3
#> 20: 20 90 Effort3

How to remove rows with NULL / Zero (0) in R

I want to remove every row where LocationName is "NULL" (and LocationID is 0). My data look like this:
CampActID AccountID LocationName LocationID
<int> <chr> <chr> <int>
1 12 3 Mark + Brandy 3
2 12 15 NULL 0
3 12 102 Spuntino 100
4 12 126 NULL 0
5 12 128 Intersport Concept Store 312
6 12 15 NULL 0
7 12 48 Aspeli Dame 46
8 12 75 Albert Bistro 73
9 12 126 NULL 0
10 12 128 Intersport Concept Store 312
We can try:
library(dplyr)

df1 %>%
  filter(LocationName != "NULL" & LocationID != 0)
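A base R equivalent, for reference (assuming the data frame is called df1 as above):
subset(df1, LocationName != "NULL" & LocationID != 0)
If the NULL entries are actual missing values (NA) rather than the literal string "NULL", filter on !is.na(LocationName) instead.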
