Apologies for the not-so-clear title (could use help) - hopefully the example below will clarify things. I have the following dataframe of basketball shot results (1 row == 1 basketball shot):
> dput(zed)
structure(list(shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE",
"DUKE", "DUKE", "DUKE", "DUKE", "DUKE", "BC", "BC", "BC", "DUKE",
"BC", "BC", "DUKE", "DUKE", "DUKE", "BC", "DUKE"), distanceCategory = c("sht2",
"sht2", "sht3", "atr2", "mid2", "sht2", "lng3", "sht3", "atr2",
"sht3", "sht3", "sht2", "mid2", "sht3", "sht3", "sht3", "atr2",
"atr2", "sht2", "mid2"), eventType = c("twopointmiss", "twopointmade",
"threepointmade", "twopointmade", "twopointmiss", "twopointmade",
"threepointmiss", "threepointmiss", "twopointmade", "threepointmiss",
"threepointmade", "twopointmiss", "twopointmade", "threepointmiss",
"threepointmade", "threepointmiss", "twopointmade", "twopointmade",
"twopointmade", "twopointmade")), row.names = c(NA, 20L), class = "data.frame")
> zed
shooterTeamAlias distanceCategory eventType
1 DUKE sht2 twopointmiss
2 DUKE sht2 twopointmade
3 BC sht3 threepointmade
4 DUKE atr2 twopointmade
5 DUKE mid2 twopointmiss
6 DUKE sht2 twopointmade
7 DUKE lng3 threepointmiss
8 DUKE sht3 threepointmiss
9 DUKE atr2 twopointmade
10 BC sht3 threepointmiss
11 BC sht3 threepointmade
12 BC sht2 twopointmiss
13 DUKE mid2 twopointmade
14 BC sht3 threepointmiss
15 BC sht3 threepointmade
16 DUKE sht3 threepointmiss
17 DUKE atr2 twopointmade
18 DUKE atr2 twopointmade
19 BC sht2 twopointmade
20 DUKE mid2 twopointmade
This dataframe is currently in a tidy-ish format, and I need to group_by team and then fatten it big time. The full data has 6 distanceCategory values: atr2, sht2, mid2, lng2, sht3, lng3 (the example above has only 5), as well as 2 categories that are a function of the other 6: all2 is atr2 + sht2 + mid2 + lng2, and all3 is sht3 + lng3. For each of these 8 categories, I would like a column for makes, attempts, pct, and attempt frequency. I use the eventType column to determine whether a shot was made. I am currently doing so with the following:
fat.data <- {zed %>%
dplyr::group_by(shooterTeamAlias) %>%
dplyr::summarise(
shotsCount = n(),
# Shooting By Distance Stats
atr2Made = sum(distanceCategory == "atr2" & eventType == "twopointmade"),
atr2Att = sum(distanceCategory == "atr2" & eventType %in% c("twopointmiss", "twopointmade")),
atr2AttFreq = atr2Att / shotsCount,
atr2Pct = ifelse(atr2Att > 0, atr2Made / atr2Att, 0),
sht2Made = sum(distanceCategory == "sht2" & eventType == "twopointmade"),
sht2Att = sum(distanceCategory == "sht2" & eventType %in% c("twopointmiss", "twopointmade")),
sht2AttFreq = sht2Att / shotsCount,
sht2Pct = ifelse(sht2Att > 0, sht2Made / sht2Att, 0),
mid2Made = sum(distanceCategory == "mid2" & eventType == "twopointmade"),
mid2Att = sum(distanceCategory == "mid2" & eventType %in% c("twopointmiss", "twopointmade")),
mid2AttFreq = mid2Att / shotsCount,
mid2Pct = ifelse(mid2Att > 0, mid2Made / mid2Att, 0),
lng2Made = sum(distanceCategory == "lng2" & eventType == "twopointmade"),
lng2Att = sum(distanceCategory == "lng2" & eventType %in% c("twopointmiss", "twopointmade")),
lng2AttFreq = lng2Att / shotsCount,
lng2Pct = ifelse(lng2Att > 0, lng2Made / lng2Att, 0),
all2Made = sum(atr2Made, sht2Made, mid2Made, lng2Made),
all2Att = sum(atr2Att, sht2Att, mid2Att, lng2Att),
all2AttFreq = all2Att / shotsCount,
all2Pct = ifelse(all2Att > 0, all2Made / all2Att, 0),
sht3Made = sum(distanceCategory == "sht3" & eventType == "threepointmade"),
sht3Att = sum(distanceCategory == "sht3" & eventType %in% c("threepointmiss", "threepointmade")),
sht3AttFreq = sht3Att / shotsCount,
sht3Pct = ifelse(sht3Att > 0, sht3Made / sht3Att, 0),
lng3Made = sum(distanceCategory == "lng3" & eventType == "threepointmade"),
lng3Att = sum(distanceCategory == "lng3" & eventType %in% c("threepointmiss", "threepointmade")),
lng3AttFreq = lng3Att / shotsCount,
lng3Pct = ifelse(lng3Att > 0, lng3Made / lng3Att, 0),
all3Made = sum(sht3Made, lng3Made),
all3Att = sum(sht3Att, lng3Att),
all3AttFreq = all3Att / shotsCount,
all3Pct = ifelse(all3Att > 0, all3Made / all3Att, 0))}
...for the 6 categories that appear in the data (all but all2 and all3), the 4 columns are all computed in the same manner. As you'll see, for all2 and all3 the calculations are a bit different.
Not worrying for the time being about the all2 and all3 categories, is there a better way to compute the makes, attempts, pct, and attempt frequencies for the 6 categories in the data? For the 8 categories * 4 column-types == 32 columns here, it's not so bad, but I have another, similar instance with 21 categories * 4 column-types, and I have to do this multiple times in my code.
Not sure if dplyr::group_by + dplyr::summarise is my best option (obviously it's what I'm using currently), or if there's a better way to go about this. Improving this code, and potentially speeding it up, is pivotally important for my project, and any help is appreciated. I'll try to remember to bounty this post even if it's answered in the next 2 days.
Edit !!! : I've just realized that grouping by distanceCategory first, computing the 4 stats for each distanceCategory, and then re-structuring that dataframe into this fat format may be easier... this is something I'm working on currently. Something along these lines:
zed %>%
dplyr::group_by(shooterTeamAlias, distanceCategory) %>%
dplyr::summarise(
attempts = ...,
makes = ...,
pct = ...,
attfreq = ...
) %>%
tidyr::spread(...)
Thanks!!
This looks like it could be made simpler by grouping by distanceCategory and then applying the same logic to each group:
library(tidyverse)
zed %>%
group_by(shooterTeamAlias, distanceCategory) %>%
summarize(att = n(), # n() counts how many rows in this group
made = sum(eventType %>% str_detect("made")),
pct = if_else(att > 0, made / att, 0)) %>%
mutate(freq = att / sum(att))
# A tibble: 7 x 6
# Groups: shooterTeamAlias [2]
shooterTeamAlias distanceCategory att made pct freq
<chr> <chr> <int> <int> <dbl> <dbl>
1 BC sht2 2 1 0.5 0.286
2 BC sht3 5 3 0.6 0.714
3 DUKE atr2 4 4 1 0.308
4 DUKE lng3 1 0 0 0.0769
5 DUKE mid2 3 2 0.667 0.231
6 DUKE sht2 3 2 0.667 0.231
7 DUKE sht3 2 0 0 0.154
If you want that in wide format, you could first gather the calculations above, unite the distance with the stat, and then spread by that:
[same code as above] %>%
gather(stat, value, -distanceCategory, -shooterTeamAlias) %>%
unite(stat, distanceCategory, stat) %>%
spread(stat, value)
# A tibble: 2 x 21
# Groups: shooterTeamAlias [2]
shooterTeamAlias atr2_att atr2_freq atr2_made atr2_pct lng3_att lng3_freq lng3_made lng3_pct mid2_att mid2_freq mid2_made mid2_pct sht2_att sht2_freq sht2_made sht2_pct sht3_att sht3_freq sht3_made sht3_pct
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BC NA NA NA NA NA NA NA NA NA NA NA NA 2 0.286 1 0.5 5 0.714 3 0.6
2 DUKE 4 0.308 4 1 1 0.0769 0 0 3 0.231 2 0.667 3 0.231 2 0.667 2 0.154 0 0
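On tidyr 1.0 or later, the gather()/unite()/spread() steps above are superseded by pivot_longer() plus pivot_wider(), which handles the uniting step via names_sep. Here is a sketch of the same reshape (same logic as the answer, with minor substitutions such as grepl() in place of str_detect() to keep it self-contained):

```r
library(dplyr)
library(tidyr)

# zed rebuilt from the dput() in the question
zed <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", "DUKE", "DUKE", "DUKE",
                       "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", "BC", "BC",
                       "DUKE", "DUKE", "DUKE", "BC", "DUKE"),
  distanceCategory = c("sht2", "sht2", "sht3", "atr2", "mid2", "sht2", "lng3",
                       "sht3", "atr2", "sht3", "sht3", "sht2", "mid2", "sht3",
                       "sht3", "sht3", "atr2", "atr2", "sht2", "mid2"),
  eventType = c("twopointmiss", "twopointmade", "threepointmade", "twopointmade",
                "twopointmiss", "twopointmade", "threepointmiss", "threepointmiss",
                "twopointmade", "threepointmiss", "threepointmade", "twopointmiss",
                "twopointmade", "threepointmiss", "threepointmade", "threepointmiss",
                "twopointmade", "twopointmade", "twopointmade", "twopointmade"))

wide <- zed %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarise(att  = n(),
            made = sum(grepl("made", eventType)),
            pct  = made / att,
            .groups = "drop_last") %>%
  mutate(freq = att / sum(att)) %>%   # per-team attempt frequency
  ungroup() %>%
  pivot_longer(att:freq, names_to = "stat") %>%
  pivot_wider(names_from = c(distanceCategory, stat),
              values_from = value, names_sep = "_")
```

This yields the same two-row wide table, with columns named like atr2_att, atr2_made, etc.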
Okay, I hope I manage to sum up what I need to achieve. I am running experiments in which I obtain data from two different sources, with date_time being the matching unifying variable. The data from the two separate sources have the same structure (in csv or txt); the distinction is in the filenames. I provide an example below:
list_of_files <- structure(
list
(
solid_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46",
"20/07/2022 13:56",
"20/07/2022 14:06"),
frequency = c("30000",
"31000",
"32000"),
index = c("1", "2", "3")
),
solid_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10",
"20/07/2022 13:20",
"20/07/2022 14:30"),
frequency = c("20000",
"21000",
"22000"),
index = c("1", "2", "3")
),
water_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46",
"20/07/2022 13:56",
"20/07/2022 14:06"),
temperature = c("22.3",
"22.6",
"22.5")
),
water_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10",
"20/07/2022 13:20",
"20/07/2022 14:30"),
temperature = c("24.5",
"24.6",
"24.8")
)
)
)
First I want to read the data files into two separate lists: one for files that contain the keyword "solid" in the file name, and one for files that contain "water".
Then I need to create new columns in each data frame from the filename, split by "_" (e.g. paint = "epoxy1", thickness = "10"), so that I can do an inner join by the date_time column, paint, thickness, etc. Basically, what I struggle with so far is creating a function that loads the files into two separate lists. This is what I've tried so far:
load_files <-
function(list_of_files) {
all.files.board <- list()
all.files.temp <- list()
for (i in 1:length(list_of_files))
{
if (exists("board")) {
all.files.board[[i]] = fread(list_of_files[i])
}
else{
all.files.temp[[i]] = fread(list_of_files[i])
}
return(list(all.files.board, all.files.temp))
}
}
But it doesn't do what I need. I hope I made it as clear as possible. I'm pretty comfortable with the tidyverse package, but I'm still a newbie at writing custom functions. Any ideas welcomed.
Regarding the question in the title:
the first issue, calling return() too early and thus breaking the for-loop, was already mentioned in the comments, so that should be sorted.
The next one is the condition itself: if (exists("board")) {} checks if there is an object called board. In the provided sample it would evaluate to TRUE only if something was assigned to a global board object before calling the load_files() function, and to FALSE only if there were no such assignment or board was explicitly removed. I.e. with
board <- "something"; dataframes <- load_files(file_list) that check will be TRUE, while with
rm(board); dataframes <- load_files(file_list) it will be FALSE. There's nothing in the function itself that would change the "existence" of board, so the result is actually determined before calling the function.
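To see this scoping behavior in isolation, here is a minimal sketch (check_board is a made-up name, just for the demo):

```r
# exists() looks up the name in the function's own environment and then
# in its enclosing environments (ultimately the global environment), so
# the result depends entirely on what exists *outside* the function.
check_board <- function() exists("board")

if (exists("board")) rm(board)  # ensure a clean slate
no_board <- check_board()       # FALSE: no `board` anywhere

board <- "something"
with_board <- check_board()     # TRUE: the global `board` is visible
```

Nothing inside check_board() ever creates or removes board, so its return value is fixed before the call, which is exactly the problem with the original condition.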
If the core of the question is about joining 2 somewhat different datasets and splitting the result by groups, I'd just drop the loops, conditions and most of the involved lists, and go with something like this in the tidyverse:
library(fs)
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
# prepare input files for sample ------------------------------------------
sample_dfs <- structure(
list
(
solid_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
frequency = c("30000", "31000", "32000"),
index = c("1", "2", "3")
),
solid_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
frequency = c("20000", "21000", "22000"),
index = c("1", "2", "3")
),
water_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
temperature = c("22.3", "22.6", "22.5")
),
water_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
temperature = c("24.5", "24.6", "24.8")
)
)
)
tmp_path <- file_temp("reprex")
dir_create(tmp_path)
sample_filenames <- str_glue("{1:length(sample_dfs)}_{names(sample_dfs)}.csv")
for (i in seq_along(sample_dfs)) {
write_csv(sample_dfs[[i]], path(tmp_path, sample_filenames[i]))
}
dir_ls(tmp_path, type = "file")
#> Temp/RtmpqUoct8/reprex5cc517f177b/1_solid_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/2_solid_otherpaint_20.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/3_water_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/4_water_otherpaint_20.csv
# read files --------------------------------------------------------------
t_solid <- dir_ls(tmp_path, glob = "*solid*.csv", type = "file") %>%
read_csv(id = "filename") %>%
extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_solid
#> # A tibble: 6 × 5
#> paint thickness date_time frequency index
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 30000 1
#> 2 epoxy1 10 20/07/2022 13:56 31000 2
#> 3 epoxy1 10 20/07/2022 14:06 32000 3
#> 4 otherpaint 20 20/07/2022 13:10 20000 1
#> 5 otherpaint 20 20/07/2022 13:20 21000 2
#> 6 otherpaint 20 20/07/2022 14:30 22000 3
t_water <- dir_ls(tmp_path, glob = "*water*.csv", type = "file") %>%
read_csv(id = "filename") %>%
extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_water
#> # A tibble: 6 × 4
#> paint thickness date_time temperature
#> <chr> <chr> <chr> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 22.3
#> 2 epoxy1 10 20/07/2022 13:56 22.6
#> 3 epoxy1 10 20/07/2022 14:06 22.5
#> 4 otherpaint 20 20/07/2022 13:10 24.5
#> 5 otherpaint 20 20/07/2022 13:20 24.6
#> 6 otherpaint 20 20/07/2022 14:30 24.8
# or implement as a function ----------------------------------------------
load_files <- function(csv_path, glob = "*.csv") {
return(
dir_ls(csv_path, glob = glob, type = "file") %>%
# store filenames in filename column
read_csv(id = "filename", show_col_types = FALSE) %>%
# extract each regex group to its own column
extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv"))
}
# join / group / split ----------------------------------------------------
t_solid <- load_files(tmp_path, "*solid*.csv")
t_water <- load_files(tmp_path, "*water*.csv")
# either join by multiple columns or select only required cols
# to avoid x.* & y.* columns in result
inner_join(t_solid, t_water, by = c("date_time", "paint", "thickness")) %>%
group_by(paint) %>%
group_split()
Final result as a list of tibbles:
#> <list_of<
#> tbl_df<
#> paint : character
#> thickness : character
#> date_time : character
#> frequency : double
#> index : double
#> temperature: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 3 × 6
#> paint thickness date_time frequency index temperature
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 30000 1 22.3
#> 2 epoxy1 10 20/07/2022 13:56 31000 2 22.6
#> 3 epoxy1 10 20/07/2022 14:06 32000 3 22.5
#>
#> [[2]]
#> # A tibble: 3 × 6
#> paint thickness date_time frequency index temperature
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 otherpaint 20 20/07/2022 13:10 20000 1 24.5
#> 2 otherpaint 20 20/07/2022 13:20 21000 2 24.6
#> 3 otherpaint 20 20/07/2022 14:30 22000 3 24.8
I am looking to scrape the following web page:
https://kubears.com/sports/football/stats/2021/assumption/boxscore/11837
... specifically, the "Play-by-Play" tab in the top menu. Getting the information was pretty simple to do:
library(tidyverse)
library(rvest)
url <- "https://kubears.com/sports/football/stats/2021/assumption/boxscore/11837"
page <- rvest::read_html(url)
page %>%
html_table() %>%
pluck(27)
... and the result of that being:
# A tibble: 7 x 2
`Assumption at 15:00` `Assumption at 15:00`
<chr> <chr>
1 Down & Distance Play
2 Assumption at 15:00 Assumption at 15:00
3 1st and 10 at ASM19 Assumption drive start at 15:00.
4 1st and 10 at ASM19 Turner,Easton rush for 2 yards loss to the ASM17 (Justice,Amani).
5 2nd and 12 at ASM17 Exum-Strong,Khaleed rush for 9 yards gain to the ASM26 (Justice,Amani).
6 3rd and 3 at ASM26 Turner,Easton pass incomplete to Collier,Bailey broken up by Meyers,Ryan.
7 4th and 3 at ASM26 Bertolazzo,Gabriel punt 46 yards to the KTZ28.
This is where I am getting lost though. I would like to take that information and manipulate it into various columns. As an example, this is a screenshot from Excel where I lay out what I would like the finished output to look like:
As you can see, I am taking the information from each individual play and translating the information to specific columns. Was it a rush play? Was it a pass play? How many yards were gained? Down, distance, yardline_100, etc.
And then doing that process for the entire play-by-play of the game.
Any suggestions and/or pointers on how to start the process would be appreciated. Scraping is certainly not my core strength when it comes to R.
I've had a go by creating an extract_plays function, which basically parses the Play column using a series of stringr::str_detect(), stringr::str_extract() and if_else() calls.
The tricky thing is that there is some inconsistency in the tables at the start of each quarter, which requires special attention; otherwise the code should be quite self-explanatory.
I'm not an American Football follower, so please check some of the assumptions I've made.
library(tidyverse)
library(rvest)
url <- "https://kubears.com/sports/football/stats/2021/assumption/boxscore/11837"
page <- rvest::read_html(url)
# set index for teams
teams <- tibble(team = c("Assumption", "Kutztown"),
posteam = c("ASM", "KTZ"),
defteam = c("KTZ","ASM"))
# grab the play pages from table
plays <- page %>%
html_table() %>%
.[27:47] %>%
# use colnames to extract teams later
map(~rbind(colnames(.x),.x))
## note that 7th and 17th elements of plays are the 2nd and 4th quarter starts - different format
## 11th element is 2nd half start, incompatible table (not included in result)
# create plays extraction function
extract_plays <- function(df){
df <- df %>%
set_names(c("Downs","Play")) %>%
mutate(team = str_extract(first(Downs), "^\\S+"),
drive_start = str_extract(first(Downs),"\\S+$")) %>%
filter(str_detect(Downs, "and")) %>%
inner_join(teams) %>%
extract(Downs,
into = c("down","ydstogo","yardline_100"),
regex = "(^\\d..) and (\\d+) at (.*)") %>%
mutate(yardline_100 = ifelse(str_detect(yardline_100,defteam),
parse_number(yardline_100),
100 - parse_number(yardline_100))) %>%
mutate(pass = +str_detect(Play, "pass"),
rush = +str_detect(Play, "rush"),
punt = +str_detect(Play, "punt"),
special = ifelse(pass + rush == 0,1,0)) %>%
mutate(play_type = case_when(str_detect(Play, "pass") ~ "pass",
str_detect(Play, "rush") ~ "rush",
str_detect(Play, "punt") ~ "punt",
str_detect(Play, "field goal") ~ "fieldgoal",
str_detect(Play, "kickoff") ~ "kickoff",
TRUE ~ "other")) %>%
mutate(passer = str_extract(Play, "(.*)(?=\\spass)"),
rusher = str_extract(Play, "(.*)(?=\\srush)"),
punter = str_extract(Play, "(.*)(?=\\spunt)"),
tackle = NA,
tackle = ifelse(play_type == "pass",
str_extract(Play, "(?<=\\().+?(?=\\))"), tackle),
tackle = ifelse(play_type == "rush",
str_extract(Play, "(?<=\\().+?(?=\\))"), tackle),
yrds = ifelse(play_type == "rush", str_extract(Play, "(\\d+\\syards\\s\\w{4})"),NA),
yrds = ifelse(play_type == "pass", str_extract(Play, "for\\s\\d+\\syards"),yrds),
yrdsgained = ifelse(str_detect(yrds,"loss"),-1*parse_number(yrds),parse_number(yrds)),
complete_pass = case_when(str_detect(Play, "completed") ~ 1,
str_detect(Play, "incomplete") ~ 0,
TRUE ~ NA_real_),
pass_breakup = str_extract(Play,"((?<=broken\\sup\\sby\\s).*$)")
) %>%
select(drive_start,posteam,defteam,yardline_100,down,ydstogo,play_type,pass,rush,special,passer,rusher,yrdsgained,tackle,complete_pass,pass_breakup,punt,punter,Play)
return(df)
}
## Grab for each quarter
Q1 <- plays %>%
.[1:6] %>%
map_df(extract_plays) %>%
mutate(Quarter = "1st", .before = drive_start)
Q2 <- plays %>%
.[8:10] %>%
map_df(extract_plays) %>%
mutate(Quarter = "2nd", .before = drive_start)
Q3 <- plays %>%
.[12:16] %>%
map_df(extract_plays) %>%
mutate(Quarter = "3rd", .before = drive_start)
Q4 <- plays %>%
.[18:21] %>%
map_df(extract_plays) %>%
mutate(Quarter = "4th", .before = drive_start)
## Grab the special cases for 2nd and 4th quarter starts
Q2.1 <- plays %>%
.[[7]] %>% rbind(c("Kutztown at 15:00","Play"),.) %>%
extract_plays() %>%
mutate(Quarter = "2nd", .before = drive_start)
Q4.1 <- plays %>%
.[[17]] %>% rbind(c("Kutztown at 15:00","Play"),.) %>%
extract_plays()%>%
mutate(Quarter = "4th", .before = drive_start)
## Add to Q2 and Q4
Q2 <- bind_rows(Q2.1,Q2)
Q4 <- bind_rows(Q4.1,Q4)
#final table
result <- bind_rows(list(Q1,Q2,Q3,Q4))
result
Giving:
# A tibble: 170 × 20
Quarter drive_start posteam defteam yardline_100 down ydstogo play_type pass rush
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1st 15:00 ASM KTZ 81 1st 10 other 0 0
2 1st 15:00 ASM KTZ 81 1st 10 rush 0 1
3 1st 15:00 ASM KTZ 83 2nd 12 rush 0 1
4 1st 15:00 ASM KTZ 74 3rd 3 pass 1 0
5 1st 15:00 ASM KTZ 74 4th 3 punt 0 0
6 1st 13:30 KTZ ASM 72 1st 10 other 0 0
7 1st 13:30 KTZ ASM 72 1st 10 rush 0 1
8 1st 13:30 KTZ ASM 70 2nd 8 rush 0 1
9 1st 13:30 KTZ ASM 66 3rd 4 rush 0 1
10 1st 13:30 KTZ ASM 61 1st 10 rush 0 1
# … with 160 more rows, and 10 more variables: special <dbl>, passer <chr>, rusher <chr>,
# yrdsgained <dbl>, tackle <chr>, complete_pass <chr>, pass_breakup <chr>, punt <dbl>,
# punter <chr>, Play <chr>
Here's a way to achieve your result using functions from the tidyverse. There are a lot of different ways to get the same result; this is just one way. The code is structured in three main parts: first, build one big dataframe by binding the rows of the list elements; second, remove the useless rows that were in the original data; and third, create all the variables.
The tab dataframe is also slightly different from your original page input, see the code in the data and function part. I basically changed the column names so that they are not duplicated, renaming them col1 and col2.
Only a few distinct functions are actually used. I created extract_digit, which extracts the nth occurrence of a number from a string. str_extract and str_match extract the specified pattern from a string, while str_detect only detects it (and returns a logical, TRUE or FALSE). word gets the nth word from a string.
library(tidyverse)
tab %>%
# Bind rows of the list to make a big data frame
bind_rows() %>%
# Define the quarter
mutate(quarter = str_extract(paste(col1, col2), "\\w+(?=\\s+quarter)")) %>%
fill(quarter) %>%
# Keep only rows that starts with a number
filter(str_detect(col1, '^[0-9]')) %>%
# Create some variables
mutate(drive_start = str_extract(col2, '[0-9][0-9]:[0-9][0-9]')) %>%
fill(drive_start) %>%
# Remove duplicated rows in the first column, keeping the ones with the longest string
group_by(col1) %>%
slice_max(nchar(col2)) %>%
ungroup() %>%
# ordering the rows
arrange(quarter, desc(drive_start)) %>%
# Mutating part, creation of new variables
mutate(posteam = ifelse(str_detect(col1, "ASM"), "ASM", "KTZ"),
defteam = ifelse(str_detect(col1, "ASM"), "KTZ", "ASM"),
yardline_100 = 100 - extract_digit(col1, 3),
down = extract_digit(col1, 1),
ydstogo = extract_digit(col1, 2),
ydsgained = -c(diff(ydstogo), 0),
penalty = str_detect(col2, "PENALTY"),
end = str_detect(col2, "End"),
play_type = case_when(penalty ~ "penalty",
end ~ "end",
T ~ word(col2, 2)),
rush = +(play_type == "rush"),
pass = +(play_type == "pass"),
special = +(!play_type %in% c("rush", "pass")),
passer = ifelse(pass == 1, word(col2, 1), NA),
rusher = ifelse(rush == 1, word(col2, 1), NA),
tackle = str_match(col2, "(?<=\\().+?(?=\\))")[,1], #Get word between parentheses
complete_pass = case_when(str_detect(col2, "incomplete") & pass == 1 ~ 0,
pass == 1 ~ 1,
TRUE ~ NA_real_),
pass_breakup = word(col2, 2, sep = "broken up by "),
punt = +(play_type == "punt"),
punter = ifelse(punt == 1, word(col2, 1), NA)) %>%
select(-c(col1, col2, penalty, end))
output
# A tibble: 128 x 19
quarter drive_start posteam defteam yardline_100 down ydstogo ydsgained play_type rush pass special passer rusher tackle complete_pass pass_breakup punt punter
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int> <int> <chr> <chr> <chr> <dbl> <chr> <int> <chr>
1 1st 15:00 ASM KTZ 81 1 10 -2 rush 1 0 0 NA Turner,Easton Justice,Amani NA NA 0 NA
2 1st 15:00 ASM KTZ 83 2 12 9 rush 1 0 0 NA Exum-Strong,Khaleed Justice,Amani NA NA 0 NA
3 1st 15:00 ASM KTZ 74 3 3 0 pass 0 1 0 Turner,Easton NA NA 0 Meyers,Ryan. 0 NA
4 1st 15:00 ASM KTZ 74 4 3 -7 punt 0 0 1 NA NA NA NA NA 1 Berto~
5 1st 13:30 KTZ ASM 72 1 10 0 rush 1 0 0 NA Nickel,Eric Borguet,Keelan NA NA 0 NA
6 1st 13:30 KTZ ASM 61 1 10 -7 rush 1 0 0 NA Davis,Jordan Allen,Edward NA NA 0 NA
7 1st 13:30 KTZ ASM 68 2 17 10 pass 0 1 0 Nickel,Eric NA Malm,Daniel 1 NA 0 NA
8 1st 13:30 KTZ ASM 58 2 7 -1 penalty 0 0 1 NA NA Omokaro,Akugbe NA NA 0 NA
9 1st 13:30 KTZ ASM 70 2 8 6 rush 1 0 0 NA Davis,Jordan Omokaro,Akugbe; Wright,Tr~ NA NA 0 NA
10 1st 13:30 KTZ ASM 53 3 2 0 pass 0 1 0 Nickel,Eric NA NA 0 NA 0 NA
# ... with 118 more rows
data and function
extract_digit <- function(x, n) as.numeric(sapply(str_extract_all(x, "[0-9]+"), `[`, n))
# This function is a wrapper to get the nth occurrence of a number in a string.
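As a quick sanity check of that helper, here is a base-R equivalent (regmatches() plus gregexpr() instead of stringr) applied to a typical Down & Distance string from the table above:

```r
# Base-R version of extract_digit: pull the nth run of digits from a string.
extract_digit_base <- function(x, n) {
  as.numeric(sapply(regmatches(x, gregexpr("[0-9]+", x)), `[`, n))
}

x <- "1st and 10 at ASM19"
down     <- extract_digit_base(x, 1)  # 1  (from "1st")
ydstogo  <- extract_digit_base(x, 2)  # 10
yardline <- extract_digit_base(x, 3)  # 19 (from "ASM19")
```

Note that the ordinal "1st" counts as the first number, which is exactly why the down, distance and yardline land at positions 1, 2 and 3.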
tab <- page %>%
html_table() %>%
.[26:47] %>%
map(~ .x %>%
tibble(.name_repair = "universal") %>%
rename_with(~ str_c("col", 1:2)))
[[1]]
# A tibble: 4 x 1
col1
<chr>
1 Play
2 Kutztown wins toss and defers; ASM will receive; KTZ will defend East end-zone.
3 Start of 1st quarter, clock 15:00.
4 Krcic,Dean kickoff 65 yards to the ASM00 Easton,Brevin return 19 yards to the ASM19 (~
[[2]]
# A tibble: 7 x 2
col1 col2
<chr> <chr>
1 Down & Distance Play
2 Assumption at 15:00 Assumption at 15:00
3 1st and 10 at ASM19 Assumption drive start at 15:00.
4 1st and 10 at ASM19 Turner,Easton rush for 2 yards loss to the ASM17 (Justice,Amani).
5 2nd and 12 at ASM17 Exum-Strong,Khaleed rush for 9 yards gain to the ASM26 (Justice,A~
6 3rd and 3 at ASM26 Turner,Easton pass incomplete to Collier,Bailey broken up by Meye~
7 4th and 3 at ASM26 Bertolazzo,Gabriel punt 46 yards to the KTZ28.
# ...
I have data in long format, and I'm trying to test each row against the mean of a certain grouping combination, in order to generate a new column with the conclusion from that test.
Example
In this toy example, I have data about 20 cars. Each car could be of one of three possible makers. We have mpg data for each car, measured 8 times: in the city or highway, in the morning or evening, during the winter or spring.
library(tidyr)
set.seed(2021)
df_id_and_makers <-
data.frame(id = 1:20,
maker = sample(c("toyota", "audi", "bmw"), size = 20, replace = TRUE))
df <- tidyr::expand_grid(df_id_and_makers,
road_type = c("city", "highway"),
time_of_day = c("morning", "evening"),
season = c("winter", "spring"))
df$mpg_val <- sample(15:40, size = nrow(df), replace = TRUE)
df
#> # A tibble: 160 x 6
#> id maker road_type time_of_day season mpg_val
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 bmw city morning winter 28
#> 2 1 bmw city morning spring 22
#> 3 1 bmw city evening winter 40
#> 4 1 bmw city evening spring 18
#> 5 1 bmw highway morning winter 19
#> 6 1 bmw highway morning spring 36
#> 7 1 bmw highway evening winter 30
#> 8 1 bmw highway evening spring 16
#> 9 2 audi city morning winter 33
#> 10 2 audi city morning spring 18
#> # ... with 150 more rows
Created on 2021-07-07 by the reprex package (v2.0.0)
I want to analyze this data to test my hypothesis that mpg in city is larger than mpg in highway. To this end, I want to create a new column that tests whether the value in mpg_val when road_type is city is larger than the mean of mpg_val across rows where road_type is highway. Furthermore, I want to compare just among cars of the same makers.
So, for example, id = 1 is bmw, and therefore the new column I want to compute should test each value of mpg_val in rows where road_type == city (i.e., rows 1-4, but not 5-6), and see whether mpg_val is larger than mean(mpg_val) in rows where road_type == highway and maker == bmw.
Expected output
Here's the manual and dumb way of doing this. I'll show only how I do this for maker = bmw for the sake of demonstration.
library(dplyr)
# step 1 -- calculate the mean of `mpg_val` for `road_type = highway` and only across bmw
mean_bmw_highway_mpg <-
df %>%
filter(maker == "bmw",
road_type == "highway") %>%
pull(mpg_val) %>%
mean()
mean_bmw_highway_mpg
## [1] 26.22222
# step 2 -- compare each row where `maker = "bmw"` and `road_type = "city"` for its `mpg_val` against `mean_bmw_highway_mpg`
result_bmw_only <-
df %>%
mutate(is_mpg_city_larger_than_mpg_highway = case_when(maker != "bmw" ~ "not_relevant",
road_type != "city" ~ "not_relevant",
mpg_val > mean_bmw_highway_mpg ~ "yes",
TRUE ~ "no"))
result_bmw_only
## # A tibble: 160 x 7
## id maker road_type time_of_day season mpg_val is_mpg_city_larger_than_mpg_highway
## <int> <chr> <chr> <chr> <chr> <int> <chr>
## 1 1 bmw city morning winter 28 yes ## because 28 > 26.222
## 2 1 bmw city morning spring 22 no ## because 22 < 26.222
## 3 1 bmw city evening winter 40 yes
## 4 1 bmw city evening spring 18 no
## 5 1 bmw highway morning winter 19 not_relevant
## 6 1 bmw highway morning spring 36 not_relevant
## 7 1 bmw highway evening winter 30 not_relevant
## 8 1 bmw highway evening spring 16 not_relevant
## 9 2 audi city morning winter 33 not_relevant
## 10 2 audi city morning spring 18 not_relevant
## # ... with 150 more rows
How could I achieve the same result as result_bmw_only (but applied to the entire df) in a more elegant way? Hopefully using dplyr approach, because this is what I'm used to, but otherwise any method will do.
Thanks!
EDIT 1
One solution I could think of involves purrr, but I can't get this done yet.
library(purrr)
solution_purrr <-
df %>%
group_by(maker) %>%
nest(data = -maker) %>%
mutate(tbl_with_desired_new_col = map(.x = data,
.f = ~ .x %>%
mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(road_type != "city" ~ "not_relevant",
mpg_val > mean(mpg_val) ~ "yes",
TRUE ~ "no"))))
It seems that solution_purrr gets close to the desired output, but not quite. This is because the second condition in case_when (i.e., mpg_val > mean(mpg_val) ~ "yes") is not what I want. I want to compare mpg_val to mean(mpg_val) where that mean is computed based only on rows where road_type == "highway", but here mean(mpg_val) is computed across all rows.
EDIT 2
Based on #Till's answer below, I'd like to clarify that I'm looking for a solution that avoids a separate calculation of the mean we want to test against. What I did above with mean_bmw_highway_mpg is the undesired way of working towards the output. I showed mean_bmw_highway_mpg only for demonstrating the kind of mean I need to calculate.
What you tried is already close. Take a look at the documentation of dplyr::group_by():
it is designed for these kinds of operations.
Below is how you can expand your BMW-only solution to the full dataset using group_by().
library(tidyverse)
mean_highway_mpg_df <-
df %>%
filter(road_type == "highway") %>%
group_by(maker) %>%
summarise(mean_highway_mpg = mean(mpg_val))
result_df <-
df %>%
filter(road_type == "city") %>%
group_by(maker) %>%
left_join(mean_highway_mpg_df) %>%
mutate(mpg_city_higher_highway = mpg_val > mean_highway_mpg)
#> Joining, by = "maker"
result_df %>%
select(-(time_of_day:season))
#> # A tibble: 80 x 6
#> # Groups: maker [3]
#> id maker road_type mpg_val mean_highway_mpg mpg_city_higher_highway
#> <int> <chr> <chr> <int> <dbl> <lgl>
#> 1 1 bmw city 28 26.2 TRUE
#> 2 1 bmw city 22 26.2 FALSE
#> 3 1 bmw city 40 26.2 TRUE
#> 4 1 bmw city 18 26.2 FALSE
#> 5 2 audi city 33 28.1 TRUE
#> 6 2 audi city 18 28.1 FALSE
#> 7 2 audi city 35 28.1 TRUE
#> 8 2 audi city 36 28.1 TRUE
#> 9 3 audi city 25 28.1 FALSE
#> 10 3 audi city 32 28.1 TRUE
#> # … with 70 more rows
I think I got this. The following solution is based on both my EDIT 1 above, as well as #MrFlick's comment here.
First, we define a helper function:
is_x_larger_than_mean_y <- function(x, y) {
x > mean(y)
}
Then, we run:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
group_by(maker) %>%
nest(data = -maker) %>%
mutate(tbl_with_desired_new_col = map(.x = data,
.f = ~ .x %>%
mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(road_type != "city" ~ "not_relevant",
is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes",
TRUE ~ "no")))) %>%
select(-data) %>%
unnest(cols = tbl_with_desired_new_col)
This way, the line within case_when() that reads is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes" ensures that we compute the mean of mpg_val based only on rows in which road_type == "highway".
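As a footnote, the nest()/map() round-trip can be avoided entirely with a grouped mutate, since the subsetted mean can be computed inside mutate() per group. A sketch, rebuilding df as in the question:

```r
library(dplyr)
library(tidyr)

# Rebuild the toy data from the question
set.seed(2021)
df_id_and_makers <- data.frame(
  id = 1:20,
  maker = sample(c("toyota", "audi", "bmw"), size = 20, replace = TRUE))
df <- expand_grid(df_id_and_makers,
                  road_type = c("city", "highway"),
                  time_of_day = c("morning", "evening"),
                  season = c("winter", "spring"))
df$mpg_val <- sample(15:40, size = nrow(df), replace = TRUE)

# Grouped mutate: the highway mean is computed per maker on the fly,
# so no nesting or joining is needed.
result <- df %>%
  group_by(maker) %>%
  mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(
    road_type != "city" ~ "not_relevant",
    mpg_val > mean(mpg_val[road_type == "highway"]) ~ "yes",
    TRUE ~ "no")) %>%
  ungroup()
```

The subsetting expression mpg_val[road_type == "highway"] is evaluated within each maker group, which is what makes the one-step version work.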
I am trying to figure out a way to aggregate the levels of a group, creating a new level based on a threshold applied to the value being aggregated.
Create some data:
library(tidyr)
library(dplyr)
demo_data <- as_tibble(VADeaths) %>%
mutate(age_bucket = row.names(VADeaths)) %>%
pivot_longer(-age_bucket) %>%
arrange(name)
Here are a bunch of values below our threshold (say 15 here)
demo_data %>%
filter(value < 15)
#> # A tibble: 5 x 3
#> age_bucket name value
#> <chr> <chr> <dbl>
#> 1 50-54 Rural Female 8.7
#> 2 55-59 Rural Female 11.7
#> 3 50-54 Rural Male 11.7
#> 4 50-54 Urban Female 8.4
#> 5 55-59 Urban Female 13.6
Now I can use some logic to do this with case_when but this seems fragile because it is so specific. This does, however, illustrate what I am after:
demo_data %>%
mutate(age_bucket_agg = case_when(
age_bucket %in% c("50-54", "55-59") & name == "Rural Female" ~ "50-59",
age_bucket %in% c("50-54", "55-59") & name == "Urban Female" ~ "50-59",
age_bucket %in% c("50-54", "55-59") & name == "Rural Male" ~ "50-59",
TRUE ~ age_bucket
)
) %>%
group_by(age_bucket_agg, name) %>%
summarise(value = sum(value))
#> `summarise()` regrouping output by 'age_bucket_agg' (override with `.groups` argument)
#> # A tibble: 17 x 3
#> # Groups: age_bucket_agg [6]
#> age_bucket_agg name value
#> <chr> <chr> <dbl>
#> 1 50-54 Urban Male 15.4
#> 2 50-59 Rural Female 20.4
#> 3 50-59 Rural Male 29.8
#> 4 50-59 Urban Female 22
#> 5 55-59 Urban Male 24.3
#> 6 60-64 Rural Female 20.3
#> 7 60-64 Rural Male 26.9
#> 8 60-64 Urban Female 19.3
#> 9 60-64 Urban Male 37
#> 10 65-69 Rural Female 30.9
#> 11 65-69 Rural Male 41
#> 12 65-69 Urban Female 35.1
#> 13 65-69 Urban Male 54.6
#> 14 70-74 Rural Female 54.3
#> 15 70-74 Rural Male 66
#> 16 70-74 Urban Female 50
#> 17 70-74 Urban Male 71.1
My question is: can anyone think of an automated way of doing this? How can I tell dplyr (or R in general) to take all values below a threshold, add them to the next age_bucket, and then recode that grouping level so it spans from the lowest value to the highest value of the merged range?
I think your example is a bit too minimal for this really challenging question. I added some challenges to your data which I think the approaches of the other answers can't tackle yet. My approach is quite verbose. Essentially, it checks every logical combination / direction in which age buckets could be merged and then recursively merges the age buckets until the threshold is met or until there are no other age buckets left to merge together. With a bit more work we could turn this into a more general function.
library(tidyverse)
demo_data <- as_tibble(VADeaths) %>%
mutate(age_bucket = row.names(VADeaths)) %>%
pivot_longer(-age_bucket) %>%
arrange(name) %>%
# lets add more challenges to the data
mutate(value = case_when(
age_bucket == "55-59" & name == "Rural Female" ~ 2,
age_bucket == "70-74" & name == "Rural Male" ~ 13,
age_bucket == "65-69" & name == "Urban Female" ~ 8,
age_bucket == "70-74" & name == "Urban Male" ~ 3,
TRUE ~ value))
# function that implements merging age buckets
merge_impl <- function(x) {
if(any(x$first)) {
e <- filter(x, first == 1)
if (e$id & !is.na(e$age_max_lead)) {
out <- mutate(x,
age_max = if_else(first,
age_max_lead,
age_max),
value = if_else(first,
value + value_lead,
value))
out <- filter(out, !lag(first, default = FALSE))
} else if (e$id & is.na(e$age_max_lead) & !is.na(e$age_min_lag)) {
out <- mutate(x,
age_min = if_else(first,
age_min_lag,
age_min),
value = if_else(first,
value + value_lag,
value))
out <- filter(out, !lead(first, default = FALSE))
} else if (e$id & is.na(e$age_max_lead) & is.na(e$age_min_lag)) {
out <- x
} else if (!e$id & !is.na(e$age_min_lag)) {
out <- mutate(x,
age_min = if_else(first,
age_min_lag,
age_min),
value = if_else(first,
value + value_lag,
value))
out <- filter(out, !lead(first, default = FALSE))
} else if (!e$id & is.na(e$age_min_lag) & !is.na(e$age_max_lead)) {
out <- mutate(x,
age_max = if_else(first,
age_max_lead,
age_max),
value = if_else(first,
value + value_lead,
value))
out <- filter(out, !lag(first, default = FALSE))
} else if (!e$id & is.na(e$age_min_lag) & is.na(e$age_max_lead)) {
out <- x
}
} else {
out <- x
}
select(out,
-contains("lead"), -contains("lag"),
-first, -id)
}
merge_age_buckets <- function(x, threshold) {
# initialize
data_ls <-
x %>%
separate(age_bucket,
c("age_min", "age_max"),
convert = TRUE) %>%
group_by(name) %>%
mutate(across(c(age_min, age_max, value),
list(lead = ~ lead(.x),
lag = ~ lag(.x))
)
) %>%
mutate(id = age_min %% 10 == 0,
first = value < threshold & cumsum(value < threshold) == 1) %>%
group_split
# check & proceed
if(any(map_lgl(data_ls, ~ any(.x$first & nrow(.x) > 1)))) {
res <- map_dfr(data_ls, merge_impl) %>%
mutate(age_bucket = paste0(age_min, "-", age_max)) %>%
select(- c(age_min, age_max))
# if result still needs adjustment repeat
if(any(res$value < threshold)) {
merge_age_buckets(res, threshold = threshold)
} else {
return(res)
}
} else {
out <- reduce(data_ls, bind_rows) %>%
mutate(age_buckets = paste0(age_min, "-", age_max)) %>%
select(- c(age_min, age_max))
return(out)
}
}
merge_age_buckets(demo_data, 15)
#> # A tibble: 13 x 3
#> name value age_bucket
#> <chr> <dbl> <chr>
#> 1 Rural Female 31 50-64
#> 2 Rural Female 30.9 65-69
#> 3 Rural Female 54.3 70-74
#> 4 Rural Male 29.8 50-59
#> 5 Rural Male 26.9 60-64
#> 6 Rural Male 54 65-74
#> 7 Urban Female 22 50-59
#> 8 Urban Female 27.3 60-69
#> 9 Urban Female 50 70-74
#> 10 Urban Male 15.4 50-54
#> 11 Urban Male 24.3 55-59
#> 12 Urban Male 37 60-64
#> 13 Urban Male 57.6 65-74
Created on 2020-06-23 by the reprex package (v0.3.0)
Here is an unnecessarily complicated way using dplyr and stringr (note that stringr needs to be loaded):
library(stringr)
demo_data %>%
group_by(name) %>%
mutate(csum = cumsum(value),
min_split = ifelse(value<15, as.numeric(str_split(age_bucket[value<15], "-", simplify = TRUE))[1], NA),
max_split = ifelse(value<15, as.numeric(str_split(age_bucket[min(which(csum>15))], "-", simplify = TRUE))[2], NA),
age_bucket = ifelse(value<15, str_c(min_split, "-", max_split), age_bucket),
value = ifelse(value<15, csum[min(which(csum>15))], value)) %>%
select(-min_split, -max_split, -csum) %>%
distinct() %>%
arrange(age_bucket)
which yields
# A tibble: 18 x 3
# Groups: name [4]
age_bucket name value
<chr> <chr> <dbl>
1 50-54 Urban Male 15.4
2 50-59 Rural Female 20.4
3 50-59 Rural Male 29.8
4 50-59 Urban Female 22
5 55-59 Rural Male 18.1
6 55-59 Urban Male 24.3
7 60-64 Rural Female 20.3
8 60-64 Rural Male 26.9
9 60-64 Urban Female 19.3
10 60-64 Urban Male 37
11 65-69 Rural Female 30.9
12 65-69 Rural Male 41
13 65-69 Urban Female 35.1
14 65-69 Urban Male 54.6
15 70-74 Rural Female 54.3
16 70-74 Rural Male 66
17 70-74 Urban Female 50
18 70-74 Urban Male 71.1
Not sure if I understand the requirements correctly after TimTeaFan's comments, but here is an approach in data.table:
library(data.table)
DT <- setDT(reshape2::melt(VADeaths, id.vars=NULL))
DT[, c("low", "high") := lapply(tstrsplit(Var1, "-"), as.integer)]
DT[value < 15, c("low","high") := .(min(low), max(high)), Var2]
DT[, sum(value), .(low, high, Var2)]
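For readers unfamiliar with tstrsplit(): it is essentially strsplit() with the result transposed, which is what lets both bounds be assigned in one step. A base-R sketch of what the second line computes (toy strings, not the full data):

```r
# base-R equivalent of tstrsplit(x, "-") followed by as.integer()
x <- c("50-54", "55-59", "60-64")
parts <- strsplit(x, "-")
# "transpose": collect all first pieces, then all second pieces
low  <- as.integer(vapply(parts, `[`, character(1), 1))
high <- as.integer(vapply(parts, `[`, character(1), 2))
low   # 50 55 60
high  # 54 59 64
```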
With a mix of cumsum and rle (here using data.table::rleid, but you can also use base::rle):
library(tidyr)
library(dplyr)
demo_data <- as_tibble(VADeaths) %>%
mutate(age_bucket = as.factor(row.names(VADeaths))) %>% #factorise to get the levels right
pivot_longer(-age_bucket) %>%
arrange(name, age_bucket) #added this to sort
thresh <- 15
demo_data %>%
group_by(name) %>%
mutate(rle_val = data.table::rleid(value < thresh),
min_nonconsec = which.min(c(1, diff(rle_val) != 1)),
newbuck = cumsum(row_number() > min_nonconsec)) %>%
group_by(name, newbuck) %>%
summarise(newname = paste(age_bucket, collapse = "-"),
newbucket = paste(unlist(strsplit(newname, "-"))[1], tail(unlist(strsplit(newname, "-")),1), sep = "-"),
newval = sum(value)
) %>%
select(-newname)
#> `summarise()` regrouping output by 'name' (override with `.groups` argument)
#> # A tibble: 15 x 4
#> # Groups: name [4]
#> name newbuck newbucket newval
#> <chr> <int> <chr> <dbl>
#> 1 Rural Female 0 50-64 40.7
#> 2 Rural Female 1 65-69 30.9
#> 3 Rural Female 2 70-74 54.3
#> 4 Rural Male 0 50-59 29.8
#> 5 Rural Male 1 60-64 26.9
#> 6 Rural Male 2 65-69 41
#> 7 Rural Male 3 70-74 66
#> 8 Urban Female 0 50-64 41.3
#> 9 Urban Female 1 65-69 35.1
#> 10 Urban Female 2 70-74 50
#> 11 Urban Male 0 50-54 15.4
#> 12 Urban Male 1 55-59 24.3
#> 13 Urban Male 2 60-64 37
#> 14 Urban Male 3 65-69 54.6
#> 15 Urban Male 4 70-74 71.1
Created on 2020-06-20 by the reprex package (v0.3.0)
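If you'd rather not pull in data.table just for rleid(), the same run-length ids can be built from base::rle(); a minimal stand-in (hypothetical helper name):

```r
# a new id is assigned every time the value changes
rleid_base <- function(x) with(rle(x), rep(seq_along(lengths), lengths))
rleid_base(c(TRUE, TRUE, FALSE, TRUE))  # 1 1 2 3
```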
I'm looking at COVID-19 data to calculate estimates of the reproductive number R0.
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(TTR)
# Get COVID cases, available from:
url <- "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv"
DoubleCOV <- read.csv(url, stringsAsFactors = FALSE)
names(DoubleCOV)[1] <- "countyFIPS"
DoubleCovid <- pivot_longer(DoubleCOV, cols=starts_with("X"),
values_to="cases",
names_to=c("X","date_infected"),
names_sep="X") %>%
mutate(infected = as.Date(date_infected, format="%m.%d.%y"),
countyFIPS = str_pad(as.character(countyFIPS), 5, pad="0"))
#data is by county, summarise for the state of interest
stateData <- DoubleCovid %>% filter(State == "AL") %>% filter(cases != 0) %>%
group_by(infected) %>% summarise(sum(cases)) %>%
mutate(DaysSince = infected - min(infected))
names(stateData)[2] <- "cumCases"
#3 day moving average to smooth a little
stateData <- stateData %>% mutate(MA = runMean(cumCases,3))
#calculate doubling rate (DR) and then R0 infectious period/doubling rate
for(j in 4:nrow(stateData)){
stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
stateData$R0[j] <- 14/stateData$DR[j]
}
CDplot <- stateData %>%
ggplot(mapping = aes(x = as.numeric(DaysSince), y = R0)) +
geom_line(color = "firebrick")
print(CDplot)
So in the above the state of interest is Alabama, hence filter(State == "AL") and this works.
But if I change the state to "NY" I get
Error in `$<-.data.frame`(`*tmp*`, "DR", value = c(NA, NA, NA, 0.733907206043719 :
replacement has 4 rows, data has 39
head(stateData) yields
infected cumCases DaysSince MA
<date> <int> <drtn> <dbl>
1 2020-03-02 1 0 days NA
2 2020-03-03 2 1 days NA
3 2020-03-04 11 2 days 4.67
4 2020-03-05 23 3 days 12
5 2020-03-06 25 4 days 19.7
6 2020-03-07 77 5 days 41.7
The moving-average values in rows 3 and 4 (4.67 and 12) would yield a doubling rate of 0.734, which matches the value in the error message value = c(NA, NA, NA, 0.733907206043719, but why does it throw an error after that?
Bonus question: I know loops are frowned upon in R...is there a way to get the moving average and R0 calculation without one?
You have to initialise the new variables before you can assign to them with the `j` index. Alabama's data has 28 rows, which is divisible by 4, so the first length-4 replacement recycles cleanly and you only get warnings about the uninitialised columns. New York's data has 39 rows, which is not divisible by 4, so recycling fails and R throws an error. You shouldn't ignore warnings; sometimes you can get away with it, but it's not a good idea.
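To see that recycling behaviour in isolation, here is a toy demonstration with made-up one-column data frames (nothing to do with the covid data):

```r
df4 <- data.frame(x = 1:4)
df4$new[4] <- 99   # df4$new is NULL, so this builds a length-4 vector:
df4$new            # NA NA NA 99 -- length 4 fits the 4 rows exactly

df8 <- data.frame(x = 1:8)
df8$new[4] <- 99   # length 4 recycles evenly into 8 rows
df8$new            # NA NA NA 99 NA NA NA 99

df5 <- data.frame(x = 1:5)
# 4 does not divide 5, so recycling fails with the same error as the question:
try(df5$new[4] <- 99)
```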
Try this to see what R (and you) are actually asking for:
stateData[4]
You get all rows of the 4th column, not the 4th row.
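The same single-bracket rule on a small made-up data frame:

```r
df <- data.frame(a = 1:5, b = 6:10)
df[2]     # single bracket, no comma: the 2nd COLUMN (still a data frame)
df[2, ]   # with a comma: the 2nd ROW
df[[2]]   # double bracket: the 2nd column as a plain vector
```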
Solution: initialise your DR and R0 columns first.
stateData$DR <- NA
stateData$R0 <- NA
for(j in 4:nrow(stateData)){
stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
stateData$R0[j] <- 14/stateData$DR[j]
}
For the bonus question, you can use lag in the same mutate with MA:
stateData <- stateData %>% mutate(MA = runMean(cumCases,3),
DR = log(2)/log(MA/lag(MA)),
R0 = 14 / DR)
stateData
# A tibble: 28 x 6
infected cumCases DaysSince MA DR R0
<date> <int> <drtn> <dbl> <dbl> <dbl>
1 2020-03-13 5 0 days NA NA NA
2 2020-03-14 11 1 days NA NA NA
3 2020-03-15 22 2 days 12.7 NA NA
4 2020-03-16 29 3 days 20.7 1.42 9.89
5 2020-03-17 39 4 days 30 1.86 7.53
6 2020-03-18 51 5 days 39.7 2.48 5.64
7 2020-03-19 78 6 days 56 2.01 6.96
8 2020-03-20 106 7 days 78.3 2.07 6.78
9 2020-03-21 131 8 days 105 2.37 5.92
10 2020-03-22 167 9 days 135. 2.79 5.03
# ... with 18 more rows
I'm using Alabama's data.
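As a quick sanity check of the vectorised arithmetic, here is a base-R sketch on invented numbers, where c(NA, head(x, -1)) stands in for dplyr::lag():

```r
MA    <- c(10, 20, 40, 80)     # made-up series that doubles each step
MAlag <- c(NA, head(MA, -1))   # base-R stand-in for dplyr::lag(MA)
DR <- log(2) / log(MA / MAlag)
R0 <- 14 / DR
DR  # NA 1 1 1   -- doubling every step, so the doubling time is 1
R0  # NA 14 14 14
```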