Finding a Weighted Average Based on Years - R

I want to create a weighted average of the baseball statistic WAR from 2017 to 2019.
The weights would be as follows:
2019: 57.14%
2018: 28.57%
2017: 14.29%
However, some players only played in 2018 and 2019, and some only in 2017 and 2019.
If they've only played in two years the split would be 67/33, and a single year would obviously be 100%.
I was wondering if there was an easy way to do this.
My data set looks like this
Name Season G PA HR BB_pct K_pct ISO wOBA wRC_plus Def WAR
337 A.J. Pollock 2017 112 466 14 7.5 15.2 0.205 0.340 103 2.6 2.2
357 A.J. Pollock 2018 113 460 21 6.7 21.7 0.228 0.338 111 0.9 2.6
191 Aaron Altherr 2017 107 412 19 7.8 25.2 0.245 0.359 120 -7.9 1.4
162 Aaron Hicks 2017 88 361 15 14.1 18.6 0.209 0.363 128 6.4 3.4
186 Aaron Hicks 2018 137 581 27 15.5 19.1 0.219 0.360 129 2.3 5.0
464 Aaron Hicks 2019 59 255 12 12.2 28.2 0.208 0.325 102 1.3 1.1
The years vary from person to person, but I was wondering if anyone had a way to do this weighted average depending on the years they played. I also don't want anyone who only played in 2017, if that makes sense.

There may well be an easier way to do this; unfortunately my approach is a little more involved. I'm using dplyr and purrr.
First I put the weights into a list (the percentages in the question correspond to 4/7, 2/7 and 1/7, applied from the most recent season backwards):
library(dplyr)
library(purrr)

one_year <- 1
two_years <- c(2/3, 1/3)
three_years <- c(4/7, 2/7, 1/7)  # 57.14%, 28.57%, 14.29%
weights <- list(one_year, two_years, three_years)
Next I split the dataset into a list by the number of seasons each player took part in, sorting each player's seasons from most recent to oldest so that the weights line up:
df %>%
  group_by(Name) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  arrange(n, Name, desc(Season)) %>%  # most recent season first within each player
  group_split(n) -> my_list
Now I define a function that calculates the average using the weights:
WAR_average <- function(i) {
  my_list[[i]] %>%
    group_by(Name) %>%
    mutate(WAR_average = sum(WAR * weights[[i]]))
}
And finally I apply the function WAR_average on my_list and filter/select the data:
my_list %>%
  seq_along() %>%
  lapply(WAR_average) %>%               # apply the function to each element
  reduce(rbind) %>%                     # bind the data frames into one
  filter(Season != 2017 | n != 1) %>%   # drop players whose only season was 2017
  select(Name, WAR_average) %>%         # keep player and WAR_average
  distinct()                            # remove duplicate rows
This whole process returns
# A tibble: 2 x 2
# Groups: Name [2]
Name WAR_average
<chr> <dbl>
1 A.J. Pollock 2.47
2 Aaron Hicks 2.54
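For comparison, here is a more compact dplyr-only sketch of the same idea: sort each player's seasons from most recent to oldest, pick the weight vector by the number of seasons, and drop anyone whose only season was 2017. It assumes df is the data frame shown in the question and reuses the weight scheme from above; treat it as a sketch rather than a drop-in replacement.
library(dplyr)
weights <- list(1, c(2/3, 1/3), c(4/7, 2/7, 1/7))
df %>%
  group_by(Name) %>%
  filter(!all(Season == 2017)) %>%             # drop players whose only season is 2017
  arrange(desc(Season), .by_group = TRUE) %>%  # most recent season first
  summarise(WAR_average = sum(WAR * weights[[n()]]))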

Related

Creating Multiple Tables then Combining All of the Tables into One in R

I have scraped multiple tables from a basketball site using a for loop.
library(rvest)

years <- c(2016:2021)
final_table <- {}
for(i in 1:length(years)){
  url <- paste0("https://www.basketball-reference.com/friv/free_agents.cgi?year=", years[i])
  past_free_agency_page <- read_html(url)
  past_free_agency_webtable <- html_nodes(past_free_agency_page, "table")
  past_free_agency_table <- html_table(past_free_agency_webtable, header = T)[[1]]
  final_table <- rbind(final_table, past_free_agency_table)
}
This retrieves each table correctly, but I am trying to combine all of these tables as they are created. Note that this is 6 tables in total (years 2016-2021).
The first error I am getting: when I try to combine the tables with rbind() at the end of the loop, it fails with "names do not match". I don't know of a clever way to fix this because I am new to working with loops, and turning the scraped table into a data frame first hasn't helped.
My next issue has to do with how the tables are combined. On the linked pages you can see that the table contains header rows that repeat the master header exactly. The code treats them as ordinary rows, so they show up inside each scraped table. I want these to be ignored.
The last issue is about making each row unique: I want the year of each table to be a column of its own. For example, for the year 2016 I want the table to have a column containing 2016. I have tried something inside the loop, such as past_free_agency_table[,1] <- c(years[i]). I want this because some of these tables contain the same players, and I need to be able to identify which table each row came from.
It is sort of a loop, but done the purrr way.
library(tidyverse)
library(rvest)
get_df <- function(year) {
  "https://www.basketball-reference.com/friv/free_agents.cgi?year=" %>%
    paste0(., year) %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    mutate(years = year) %>%
    select(Rk, years, everything())
}
df <- map_dfr(2016:2020, get_df)
# A tibble: 1,161 × 16
Rk years Player Pos Age Type OTm `2015-16 Stats` WS NTm
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2016 Kevin Du… F-G 33-2… UFA OKC 28.2 Pts, 8.2 … 14.5 GSW
2 2 2016 LeBron J… F-G 37-1… UFA CLE 25.3 Pts, 7.4 … 13.6 CLE
3 3 2016 Hassan W… C 33-0… UFA MIA 14.2 Pts, 11.8… 10.3 MIA
4 4 2016 DeMar De… G-F 32-3… UFA TOR 23.5 Pts, 4.5 … 9.9 TOR
5 5 2016 Al Horfo… C-F 36-0… UFA ATL 15.2 Pts, 7.3 … 9.4 BOS
6 6 2016 Marvin W… F 36-0… UFA CHO 11.7 Pts, 6.4 … 7.8 CHA
7 7 2016 Andre Dr… C 28-3… RFA DET 16.2 Pts, 14.8… 7.4 DET
8 8 2016 Pau Gasol C-F 41-3… UFA CHI 16.5 Pts, 11.0… 7.1 SAS
9 9 2016 Dirk Now… F 44-0… UFA DAL 18.3 Pts, 6.5 … 6.8 DAL
10 10 2016 Dwight H… C 36-1… UFA HOU 13.7 Pts, 11.8… 6.6 ATL
# … with 1,151 more rows, and 6 more variables: Terms <chr>, Notes <chr>,
# `2016-17 Stats` <chr>, `2017-18 Stats` <chr>, `2018-19 Stats` <chr>,
# `2019-20 Stats` <chr>
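Two things this does not cover yet: the question's year range goes up to 2021, and the repeated in-table headers the question mentions usually come back as rows whose Rk column holds the literal text "Rk" (an assumption worth checking against the scraped tables). A sketch building on get_df():
df <- map_dfr(2016:2021, get_df) %>%  # all six requested years
  filter(Rk != "Rk")                  # drop repeated header rows, assuming they carry "Rk" in the Rk column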

Inflation rate with the CPI for multiple countries, with R

I have to calculate the inflation rate from 2015 to 2019. I have to do this with the CPI, which I have for each month over those years. This means that I have to calculate the percentage growth rate relative to the same month of the previous year.
I am asked to do this for several countries and then calculate or show the average for the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr and lubridate packages, but I can't get the CPI conversion to work.
I tried this, but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
UPDATE:
My teacher gave us a hint on how to get the result, but when I try to add it to my code, I get an error.
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate((change=(x-lag(x,1))/lag(x,1)*100))
| name            | value |
|-----------------|-------|
| Antigua_Barbuda | -1.55 |
| Argentina       |  1.03 |
| Aruba           | -1.52 |
| Bahamas         | -1.56 |
| Barbados        | -1.38 |
where "value" corresponds to the average inflation for each country during the entire period 2015-2019
We can use data.table methods (note that the adjustment factor has to be computed within each country, i.e. grouped by variable):
library(data.table)
melt(fread("https://pastebin.com/raw/6cetukKb"),
     id.var = c('date', 'year', 'period', 'periodName'))[,
  .(value = mean(value)), .(variable, year)][,
  adj_factor := value/value[year == 2014], variable][]
# variable year value adj_factor
# 1: Antigua_Barbuda 2014 96.40000 1.0000000
# 2: Antigua_Barbuda 2015 96.55833 1.0016425
# 3: Antigua_Barbuda 2016 96.08333 0.9967151
# 4: Antigua_Barbuda 2017 98.40833 1.0208333
# 5: Antigua_Barbuda 2018 99.62500 1.0334544
# 6: Antigua_Barbuda 2019 101.07500 1.0484959
# 7: Argentina 2014 56.60000 1.0000000
# ..
You should read your data with header = TRUE since the first row contains the column names. Then get your data into long format, which makes the calculation easy.
After this you can perform whichever calculation you want. For example, to perform the same steps as your attempt, i.e. divide all the values by the value in the year 2014 for each country, you can do:
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate(adj_factor = value/value[year == 2014])
# name year value adj_factor
# <chr> <int> <dbl> <dbl>
# 1 Antigua_Barbuda 2014 96.4 1
# 2 Antigua_Barbuda 2015 96.6 1.00
# 3 Antigua_Barbuda 2016 96.1 0.997
# 4 Antigua_Barbuda 2017 98.4 1.02
# 5 Antigua_Barbuda 2018 99.6 1.03
# 6 Antigua_Barbuda 2019 101. 1.05
# 7 Argentina 2014 56.6 1
# 8 Argentina 2015 64.0 1.13
# 9 Argentina 2016 89.9 1.59
#10 Argentina 2017 113. 2.00
# … with 20 more rows
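If what is ultimately needed is the year-over-year inflation rate averaged over 2015-2019 (as in the teacher's hint) rather than a 2014-based adjustment factor, the same long format can be carried one step further. A sketch, assuming the annual mean CPI is an acceptable basis for the yearly rate:
data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(cpi = mean(value), .groups = "drop_last") %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(change = (cpi - lag(cpi)) / lag(cpi) * 100) %>%  # % change vs the previous year
  filter(year >= 2015) %>%
  summarise(avg_inflation = mean(change))                 # one average per country over 2015-2019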

Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach: filter for the required parties and then summarise. After that you can reshape to wide format so that both parties appear side by side for each term start date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv',stringsAsFactors = F)
#Code
newdf <- df %>% filter(party %in% c('R','D')) %>%
group_by(termstart,party) %>% summarise(MeanAge=mean(age,na.rm=T)) %>%
pivot_wider(names_from = party,values_from=MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows

Aggregate with multiple duplicates and calculate their mean

Assume we have a data frame with duplicates in their respective user IDs but with different names, which of course can be duplicated as well.
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
The aim is to aggregate and calculate the mean and standard deviation of the UserID's and their names respectively. A desired output example:
UserID Name Class Scoring_mean Scoring_std
101 Ed Junior 12.5 3
101 Hank Junior 24.67 11.62
102 Sandy High 24.75 6.29
102 Jessica Mid 24.25 1.5
Hence my question:
What are the options to aggregate the names based on the UserID, without loss of information (e.g. Hank being coerced into Ed, as happens with summarise() or mutate())?
The way I think about it, R has to check which Name corresponds to which UserID and, for each match, aggregate and calculate the mean and standard deviation, but I'm not able to get this working in R with dplyr.
The related posts I could find don't quite cover this question, e.g.:
How to calculate the mean of specific rows in R?
Subtract pairs of columns based on matching column
Calculating mean when 2 conditions need met in R
average between duplicated rows in R
Here's a tidyverse option that uses some reshaping to create one column of scores and then some grouping in order to get the summary stats:
DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26),
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)
library(tidyverse)
DF %>%
gather(score_type, score, Scoring, Other_Scores) %>% # reshape score columns
group_by(ID, Name, Class) %>% # group by combinations
summarise(scoring_mean = mean(score), # get summary stats
scoring_sd = sd(score)) %>%
ungroup() # forget the grouping
# # A tibble: 4 x 5
# ID Name Class scoring_mean scoring_sd
# <dbl> <fct> <fct> <dbl> <dbl>
# 1 101. Ed Junior 12.5 3.00
# 2 101. Hank Junior 24.7 11.6
# 3 102. Jessica Mid 24.2 1.50
# 4 102. Sandy High 24.8 6.29
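gather() still works but has been superseded; with tidyr 1.0 or later the reshaping step can equivalently be written with pivot_longer():
DF %>%
  pivot_longer(c(Scoring, Other_Scores),
               names_to = "score_type", values_to = "score") %>%
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(score),
            scoring_sd = sd(score)) %>%
  ungroup()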
What about computing your summary stats and then joining the results back to your initial data frame? Like so:
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
DF2 <- DF %>% group_by(Name) %>%
summarise(scoring_mean=mean(Scoring), scoring_sd = sd(Scoring)) %>%
left_join(DF[,c(1,2,3)], by="Name")
Giving:
# A tibble: 9 x 5
Name scoring_mean scoring_sd ID Class
<fct> <dbl> <dbl> <dbl> <fct>
1 Ed 13.0 2.83 101. Junior
2 Ed 13.0 2.83 101. Junior
3 Hank 16.0 3.46 101. Junior
4 Hank 16.0 3.46 101. Junior
5 Hank 16.0 3.46 101. Junior
6 Jessica 25.5 0.707 102. Mid
7 Jessica 25.5 0.707 102. Mid
8 Sandy 21.0 1.41 102. High
9 Sandy 21.0 1.41 102. High
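Note that this variant only uses the Scoring column and groups by Name alone, so the numbers differ from the desired output. If the goal is the combined mean and sd of Scoring and Other_Scores per ID/Name, a sketch along the same dplyr lines:
DF %>%
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(c(Scoring, Other_Scores)),
            scoring_sd = sd(c(Scoring, Other_Scores))) %>%
  ungroup()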

How to control number of decimal digits in write.table() output?

When working with data (e.g., in a data.frame) you can control the number of digits displayed by using
options(digits=3)
and then printing the data.frame like this:
ttf.all
But when you need to paste the data into Excel like this:
write.table(ttf.all, 'clipboard', sep='\t',row.names=F)
the digits option is ignored and the numbers are not rounded.
The console output looks fine:
> ttf.all
year V1.x.x V1.y.x ratio1 V1.x.y V1.y.y ratioR V1.x.x V1.y.x ratioAL V1.x.y V1.y.y ratioRL
1 2006 227 645 35.2 67 645 10.4 150 645 23.3 53 645 8.22
2 2007 639 1645 38.8 292 1645 17.8 384 1645 23.3 137 1645 8.33
3 2008 1531 3150 48.6 982 3150 31.2 755 3150 24.0 235 3150 7.46
4 2009 1625 3467 46.9 1026 3467 29.6 779 3467 22.5 222 3467 6.40
But what ends up in Excel (via the clipboard) is not rounded. How can I control this in write.table()?
You can use the function format() as in:
write.table(format(ttf.all, digits=2), 'clipboard', sep='\t',row.names=F)
format() is a generic function that has methods for many classes, including data.frames. Unlike round(), it won't throw an error if your dataframe is not all numeric. For more details on the formatting options, see the help file via ?format
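One thing to keep in mind is that format() returns character columns, which is fine for writing to the clipboard but means the values are no longer numeric inside R; a quick check using the question's ttf.all:
str(format(ttf.all, digits = 2))  # every column is now character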
Adding a solution for a data frame with mixed character and numeric columns. We first use mutate_if() to select the numeric columns and then apply round() to them.
# install.packages('dplyr', dependencies = TRUE)
library(dplyr)
df <- read.table(text = "id year V1.x.x V1.y.x ratio1
a 2006 227.11111 645.11111 35.22222
b 2007 639.11111 1645.11111 38.22222
c 2008 1531.11111 3150.11111 48.22222
d 2009 1625.11111 3467.11111 46.22222",
header = TRUE, stringsAsFactors = FALSE)
df %>%
mutate_if(is.numeric, round, digits = 2)
#> id year V1.x.x V1.y.x ratio1
#> 1 a 2006 227.11 645.11 35.22
#> 2 b 2007 639.11 1645.11 38.22
#> 3 c 2008 1531.11 3150.11 48.22
#> 4 d 2009 1625.11 3467.11 46.22
### dplyr v1.0.0+
df %>%
mutate(across(where(is.numeric), ~ round(., digits = 2)))
#> id year V1.x.x V1.y.x ratio1
#> 1 a 2006 227.11 645.11 35.22
#> 2 b 2007 639.11 1645.11 38.22
#> 3 c 2008 1531.11 3150.11 48.22
#> 4 d 2009 1625.11 3467.11 46.22
Created on 2019-03-17 by the reprex package (v0.2.1.9000)
