Find average incidents per business day - R

I have a dataset as below:
+----+-------+---------------------+
| ID | SUBID | date |
+----+-------+---------------------+
| A | 1 | 2021-01-01 12:00:00 |
| A | 1 | 2021-01-02 01:00:00 |
| A | 1 | 2021-01-02 02:00:00 |
| A | 1 | 2021-01-03 03:00:00 |
| A | 2 | 2021-01-05 16:00:00 |
| A | 2 | 2021-01-06 13:00:00 |
| A | 2 | 2021-01-07 06:00:00 |
| A | 2 | 2021-01-08 08:00:00 |
| A | 2 | 2021-01-08 10:00:00 |
| A | 2 | 2021-01-08 11:00:00 |
| A | 3 | 2021-01-09 09:00:00 |
| A | 3 | 2021-01-10 19:00:00 |
| A | 3 | 2021-01-11 20:00:00 |
| A | 3 | 2021-01-12 22:00:00 |
| B | 1 | 2021-02-01 23:00:00 |
| B | 1 | 2021-02-02 15:00:00 |
| B | 1 | 2021-02-03 06:00:00 |
| B | 1 | 2021-02-04 08:00:00 |
| B | 2 | 2021-02-05 18:00:00 |
| B | 2 | 2021-02-05 19:00:00 |
| B | 2 | 2021-02-06 22:00:00 |
| B | 2 | 2021-02-06 23:00:00 |
| B | 2 | 2021-02-07 04:00:00 |
| B | 2 | 2021-02-08 02:00:00 |
| B | 3 | 2021-02-09 01:00:00 |
| B | 3 | 2021-02-10 03:00:00 |
| B | 3 | 2021-02-11 13:00:00 |
| B | 3 | 2021-02-12 14:00:00 |
+----+-------+---------------------+
I want to get the time difference in hours between consecutive rows within each (ID, SUBID) group, preferably in terms of business hours, where any date that falls on a weekend or a federal holiday is moved to the nearest weekday (preceding or succeeding) with a time of 23:59:59, as below:
+----+-------+---------------------+------------------------------------------------------------------+
| ID | SUBID | date | timediff (hours) with preceding date for each group (ID, SUBID) |
+----+-------+---------------------+------------------------------------------------------------------+
| A | 1 | 2021-01-01 12:00:00 | 0 |
| A | 1 | 2021-01-02 01:00:00 | 13 |
| A | 1 | 2021-01-02 02:00:00 | 1 |
| A | 1 | 2021-01-03 03:00:00 | 1 |
| A | 2 | 2021-01-05 16:00:00 | 0 |
| A | 2 | 2021-01-06 13:00:00 | 21 |
| A | 2 | 2021-01-07 06:00:00 | 17 |
| A | 2 | 2021-01-08 08:00:00 | 2 |
| A | 2 | 2021-01-08 10:00:00 | 2 |
| A | 2 | 2021-01-08 11:00:00 | 1 |
| A | 3 | 2021-01-09 09:00:00 | 0 |
| A | 3 | 2021-01-10 19:00:00 | 36 |
| A | 3 | 2021-01-11 20:00:00 | 1 |
| A | 3 | 2021-01-12 22:00:00 | 1 |
| B | 1 | 2021-02-01 23:00:00 | 0 |
| B | 1 | 2021-02-02 15:00:00 | 16 |
| B | 1 | 2021-02-03 06:00:00 | 15 |
| B | 1 | 2021-02-04 08:00:00 | 26 |
| B | 2 | 2021-02-05 18:00:00 | 0 |
| B | 2 | 2021-02-05 19:00:00 | 1 |
| B | 2 | 2021-02-06 22:00:00 | 27 |
| B | 2 | 2021-02-06 23:00:00 | 1 |
| B | 2 | 2021-02-07 04:00:00 | 5 |
| B | 2 | 2021-02-08 02:00:00 | 22 |
| B | 3 | 2021-02-09 01:00:00 | 0 |
| B | 3 | 2021-02-10 03:00:00 | 26 |
| B | 3 | 2021-02-11 13:00:00 | 11 |
| B | 3 | 2021-02-12 14:00:00 | 1 |
+----+-------+---------------------+------------------------------------------------------------------+
Lastly, I want to calculate the average time, which would be the sum of the time differences per group (ID, SUBID) divided by the total count per group, as below:
+----+-------+------------------------------------------------------------+
| ID | SUBID | Average time (total time diff of group / count per group) |
+----+-------+------------------------------------------------------------+
| A | 1 | 15/4 |
| A | 2 | 43/6 |
| A | 3 | 38/4 |
| B | 1 | 57/4 |
| B | 2 | 56/6 |
| B | 3 | 38/4 |
+----+-------+------------------------------------------------------------+
I'm fairly new to R. I came across lubridate to help me format the dates, and I was able to get the time differences using the code below:
df %>%
  group_by(ID, SUBID) %>%
  mutate(time_diff = difftime(date, lag(date), units = 'min'))
However, I'm having trouble restricting the difference to business time, and also getting the average time as in the last table.

Welcome to SO! Using dplyr and lubridate:
Data used:
library(tidyverse)
library(lubridate)
df <- data.frame(ID = c("A","A","A","A"),
                 SUBID = c(1,1,2,2),
                 Date = lubridate::as_datetime(c("2021-01-01 12:00:00", "2021-01-02 1:00:00",
                                                 "2021-01-01 2:00:00", "2021-01-01 13:00:00")))
ID SUBID Date
1 A 1 2021-01-01 12:00:00
2 A 1 2021-01-02 01:00:00
3 A 2 2021-01-01 02:00:00
4 A 2 2021-01-01 13:00:00
Code:
df %>%
  group_by(ID, SUBID) %>%
  mutate(diff = Date - lag(Date)) %>%
  mutate(diff = ifelse(is.na(diff), 0, diff)) %>%
  summarise(Average = sum(diff) / n())
Output:
ID SUBID Average
<chr> <dbl> <dbl>
1 A 1 6.5
2 A 2 5.5
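As a cross-check (a sketch, not part of the answer's code), the same per-group average can be computed in base R without dplyr: for each (ID, SUBID) group, sum the gaps between consecutive dates in hours and divide by the number of rows.

```r
# Base-R cross-check of the per-group average shown above.
df <- data.frame(
  ID = c("A", "A", "A", "A"),
  SUBID = c(1, 1, 2, 2),
  Date = as.POSIXct(c("2021-01-01 12:00:00", "2021-01-02 01:00:00",
                      "2021-01-01 02:00:00", "2021-01-01 13:00:00"),
                    tz = "UTC")
)

avg_gap <- function(d) {
  gaps <- as.numeric(diff(sort(d)), units = "hours")  # consecutive gaps in hours
  sum(gaps) / length(d)                               # first row contributes 0, as above
}

res <- aggregate(Date ~ ID + SUBID, df, avg_gap)
res  # (A,1) = 6.5, (A,2) = 5.5, matching the dplyr output
```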
Edit: how to handle week-ends
For the week-ends, the simplest solution is to shift the day to the next Monday:
df %>%
  mutate(week_day = wday(Date, label = TRUE, abbr = FALSE)) %>%
  mutate(Date = ifelse(week_day == "samedi", Date + days(2),
                ifelse(week_day == "dimanche", Date + days(1), Date))) %>%
  mutate(Date = as_datetime(Date))
This creates the column week_day with the name of the day. If the day is a "samedi" (Saturday) or a "dimanche" (Sunday), it adds 2 or 1 days to the Date so it becomes a Monday. Then you just need to reorder the dates (df %>% arrange(ID, SUBID, Date)) and rerun the first code.
As my local language is French, you will have to change samedi and dimanche to Saturday and Sunday.
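Alternatively, a locale-independent sketch of the same weekend shift: POSIXlt's wday component is numeric (0 = Sunday, 6 = Saturday), so no translated day names are needed.

```r
# Shift Saturdays by 2 days and Sundays by 1 day, using numeric weekdays.
d <- as.Date(c("2021-01-02", "2021-01-03", "2021-01-04"))  # Sat, Sun, Mon
w <- as.POSIXlt(d)$wday                                    # 0 = Sun, 6 = Sat
d_shifted <- d + ifelse(w == 6, 2, ifelse(w == 0, 1, 0))
d_shifted  # all three land on Monday 2021-01-04
```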
For the holidays, you can apply the same approach: create a time-interval variable which represents the holidays, test for each date whether it falls within this interval, and if so, change the date to the last day of this interval.
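A minimal base-R sketch of that holiday idea (the interval bounds here are made up and should be replaced with the real holiday calendar); dates inside the interval are moved one day past its end so they land on a non-holiday:

```r
# Hypothetical holiday interval; replace with the actual federal holidays.
hol_start <- as.Date("2021-12-24")
hol_end   <- as.Date("2021-12-26")

d <- as.Date(c("2021-12-23", "2021-12-25", "2021-12-27"))
inside <- d >= hol_start & d <= hol_end
d[inside] <- hol_end + 1  # move holiday dates just past the interval
d  # 2021-12-23, 2021-12-27, 2021-12-27
```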

Related

Buy and Hold return around event date in R

I have a question in calculating returns in the following case.
For each ID, if Date=EventDate, I hope to calculate "buy and hold return" from 5 days prior to the event date to 5 days after.
To be more specific with the table below, I want to calculate 11 returns around each event date, where the returns are (9/10-1), (12/10-1), (14/10-1), ~ , (14/10-1), (17/10-1), (16/10-1) for ID = 1 and (57/50-1), (60/50-1), (49/50-1), ~ , (65/50-1), (57/50-1), (55/50-1) for ID = 2. (That is, the price 6 days prior to the event date is the denominator in the return calculation.)
+----+------------+-------+------------+
| ID | Date | Price | EventDate |
+----+------------+-------+------------+
| 1 | 2011-03-06 | 10 | NA |
| 1 | 2011-03-07 | 9 | NA |
| 1 | 2011-03-08 | 12 | NA |
| 1 | 2011-03-09 | 14 | NA |
| 1 | 2011-03-10 | 15 | NA |
| 1 | 2011-03-11 | 17 | NA |
| 1 | 2011-03-12 | 12 | 2011-03-12 |
| 1 | 2011-03-13 | 14 | NA |
| 1 | 2011-03-14 | 17 | NA |
| 1 | 2011-03-15 | 14 | NA |
| 1 | 2011-03-16 | 17 | NA |
| 1 | 2011-03-17 | 16 | NA |
| 1 | 2011-03-18 | 15 | NA |
| 1 | 2011-03-19 | 16 | NA |
| 1 | 2011-03-20 | 17 | NA |
| 1 | 2011-03-21 | 18 | NA |
| 1 | 2011-03-22 | 11 | NA |
| 1 | 2011-03-23 | 15 | NA |
| 1 | 2011-03-24 | 12 | 2011-03-24 |
| 1 | 2011-03-25 | 13 | NA |
| 1 | 2011-03-26 | 15 | NA |
| 2 | 2011-06-11 | 48 | NA |
| 2 | 2011-06-12 | 49 | NA |
| 2 | 2011-06-13 | 50 | NA |
| 2 | 2011-06-14 | 57 | NA |
| 2 | 2011-06-15 | 60 | NA |
| 2 | 2011-06-16 | 49 | NA |
| 2 | 2011-06-17 | 64 | NA |
| 2 | 2011-06-18 | 63 | NA |
| 2 | 2011-06-19 | 67 | 2011-06-19 |
| 2 | 2011-06-20 | 70 | NA |
| 2 | 2011-06-21 | 58 | NA |
| 2 | 2011-06-22 | 65 | NA |
| 2 | 2011-06-23 | 57 | NA |
| 2 | 2011-06-24 | 55 | NA |
| 2 | 2011-06-25 | 57 | NA |
| 2 | 2011-06-26 | 60 | NA |
+----+------------+-------+------------+
Eventually, I hope to make the following table with a new column.
+----+------------+-------+------------+---------------+
| ID | Date | Price | EventDate | BuyHoldReturn |
+----+------------+-------+------------+---------------+
| 1 | 2011-03-06 | 10 | NA | NA |
| 1 | 2011-03-07 | 9 | NA | -0.1 |
| 1 | 2011-03-08 | 12 | NA | 0.2 |
| 1 | 2011-03-09 | 14 | NA | 0.4 |
| 1 | 2011-03-10 | 15 | NA | 0.5 |
| 1 | 2011-03-11 | 17 | NA | 0.7 |
| 1 | 2011-03-12 | 12 | 2011-03-12 | 0.2 |
| 1 | 2011-03-13 | 14 | NA | 0.4 |
| 1 | 2011-03-14 | 17 | NA | 0.7 |
| 1 | 2011-03-15 | 14 | NA | 0.4 |
| 1 | 2011-03-16 | 17 | NA | 0.7 |
| 1 | 2011-03-17 | 16 | NA | 0.6 |
| 1 | 2011-03-18 | 15 | NA | NA |
| 1 | 2011-03-19 | 16 | NA | 0.066666667 |
| 1 | 2011-03-20 | 17 | NA | 0.133333333 |
| 1 | 2011-03-21 | 18 | NA | 0.2 |
| 1 | 2011-03-22 | 11 | NA | -0.266666667 |
| 1 | 2011-03-23 | 15 | NA | 0 |
| 1 | 2011-03-24 | 12 | 2011-03-24 | -0.2 |
| 1 | 2011-03-25 | 13 | NA | -0.133333333 |
| 1 | 2011-03-26 | 15 | NA | 0 |
| 2 | 2011-06-11 | 48 | NA | NA |
| 2 | 2011-06-12 | 49 | NA | NA |
| 2 | 2011-06-13 | 50 | NA | NA |
| 2 | 2011-06-14 | 57 | NA | 0.14 |
| 2 | 2011-06-15 | 60 | NA | 0.2 |
| 2 | 2011-06-16 | 49 | NA | -0.02 |
| 2 | 2011-06-17 | 64 | NA | 0.28 |
| 2 | 2011-06-18 | 63 | NA | 0.26 |
| 2 | 2011-06-19 | 67 | 2011-06-19 | 0.34 |
| 2 | 2011-06-20 | 70 | NA | 0.4 |
| 2 | 2011-06-21 | 58 | NA | 0.16 |
| 2 | 2011-06-22 | 65 | NA | 0.3 |
| 2 | 2011-06-23 | 57 | NA | 0.14 |
| 2 | 2011-06-24 | 55 | NA | 0.1 |
| 2 | 2011-06-25 | 57 | NA | NA |
| 2 | 2011-06-26 | 60 | NA | NA |
+----+------------+-------+------------+---------------+
I have an idea of using the code below, but couldn't figure out how to calculate the 11 buy and hold returns around the event date.
data <- data %>%
  group_by(ID) %>%
  mutate(BuyHoldReturn = ifelse(Date == EventDate, ....
Thanks in advance!
We can try
library(dplyr)
df |>
  group_by(ID) |>
  mutate(x = Price / lag(Price) - 1,
         y = which(Date == EventDate) - 1:n(),
         BuyHoldReturn = case_when(between(y, -5, 5) ~ x, TRUE ~ NA_real_)) |>
  select(-x, -y)
Output
# A tibble: 28 × 5
# Groups: ID [2]
ID Date Price EventDate BuyHoldReturn
<int> <chr> <int> <chr> <dbl>
1 1 2011-03-06 10 NA NA
2 1 2011-03-07 9 NA -0.1
3 1 2011-03-08 12 NA 0.333
4 1 2011-03-09 14 NA 0.167
5 1 2011-03-10 15 NA 0.0714
6 1 2011-03-11 17 NA 0.133
7 1 2011-03-12 12 2011-03-12 -0.294
8 1 2011-03-13 14 NA 0.167
9 1 2011-03-14 17 NA 0.214
10 1 2011-03-15 14 NA -0.176
# … with 18 more rows
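A side note (my reading, not from the answer above): the mutate computes day-over-day returns. If you want the question's literal definition, each of the 11 window prices divided by the price six rows before the event, a base-R sketch for a single event window, using ID 1's first event from the sample data:

```r
# Buy-and-hold returns per the question's definition, for one event:
# each price in [event-5, event+5] divided by the price 6 rows pre-event.
price <- c(10, 9, 12, 14, 15, 17, 12, 14, 17, 14, 17, 16)
event_idx <- 7                       # row where Date == EventDate
window <- (event_idx - 5):(event_idx + 5)
base_price <- price[event_idx - 6]   # denominator: price 6 rows before the event
bhr <- price[window] / base_price - 1
bhr  # -0.1 0.2 0.4 0.5 0.7 0.2 0.4 0.7 0.4 0.7 0.6, matching the expected table
```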

Fill columns with most recent value

I have a dataset like this in R:
Date | ID | Age |
2019-11-22 | 1 | 5 |
2018-12-21 | 1 | 4 |
2018-05-09 | 1 | 4 |
2018-05-01 | 2 | 5 |
2017-10-10 | 2 | 4 |
2017-07-21 | 1 | 3 |
How do I change the Age values of each group of ID to the most recent Age record?
Results should look like this:
Date | ID | Age |
2019-11-22 | 1 | 5 |
2018-12-21 | 1 | 5 |
2018-05-09 | 1 | 5 |
2018-05-01 | 2 | 5 |
2017-10-10 | 2 | 5 |
2017-07-21 | 1 | 5 |
I tried group_by(ID) %>% mutate(Age = max(Date, Age)),
but it seems to give strangely huge numbers for certain cases when I try it on a very large dataset. What could be going wrong?
Try sorting first,
df %>%
  arrange(as.Date(Date)) %>%
  group_by(ID) %>%
  mutate(Age = last(Age))
which gives,
# A tibble: 6 x 3
# Groups: ID [2]
Date ID Age
<fct> <int> <int>
1 2017-07-21 1 5
2 2017-10-10 2 5
3 2018-05-01 2 5
4 2018-05-09 1 5
5 2018-12-21 1 5
6 2019-11-22 1 5
I think the issue is in your mutate function:
Try this:
df %>%
  group_by(ID) %>%
  arrange(as.Date(Date)) %>%
  mutate(Age = max(Age))
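As to the strange huge numbers: a likely cause (my reading, not stated in the question) is that max(Date, Age) coerces the Date to its internal numeric representation, days since 1970-01-01, which dwarfs any Age value:

```r
# A Date's underlying value is its day count since 1970-01-01, so mixing
# dates with small integers inside max() returns day-count-sized numbers.
as.numeric(as.Date("2019-11-22"))  # 18222, far larger than any Age
```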

Calculating difference between dates based on grouping one or more columns

A sample of my dataset is as below:
| id | Date | Buyer |
|:--:|-----------:|----------|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 4 | 5/30/2018 | Chang |
| 4 | 7/4/2018 | Chang |
| 4 | 8/17/2018 | Chang |
| 5 | 5/25/2018 | Chunfei |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
I have two sets of questions with this dataset:
I need to calculate the difference between dates, grouped by 'Buyer' and 'id'. That means Buyer 'Jenny' with id '9' forms one group, Buyer 'Chang' with id '4' another, Buyer 'Chunfei' with id '5' another, and 'Chunfei' with id '8' yet another. So, the output will be:
| id | Date | Buyer_id | Diff |
|:--:|-----------:|----------|------|
| 9 | 11/29/2018 | Jenny | NA |
| 9 | 11/29/2018 | Jenny | 0 |
| 9 | 11/29/2018 | Jenny | 0 |
| 4 | 5/30/2018 | Chang | NA |
| 4 | 7/4/2018 | Chang | 35 |
| 4 | 8/17/2018 | Chang | 44 |
| 5 | 5/25/2018 | Chunfei | NA |
| 5 | 2/13/2019 | Chunfei | 264 |
| 5 | 2/16/2019 | Chunfei | 3 |
| 5 | 2/16/2019 | Chunfei | 0 |
| 5 | 2/23/2019 | Chunfei | 7 |
| 5 | 2/25/2019 | Chunfei | 2 |
| 8 | 2/28/2019 | Chunfei | NA |
| 8 | 2/28/2019 | Chunfei | 0 |
The issue is that I don't understand why the group_by isn't working. The following code subtracts consecutive rows rather than grouping by buyer and id first and then subtracting.
df = data.frame(
  id = c("9","9","9","4","4","4","5","5","5","5","5","5","8","8"),
  Date = c("11/29/2018","11/29/2018","11/29/2018","5/30/2018","7/4/2018",
           "8/17/2018","5/25/2018","2/13/2019","2/16/2019","2/16/2019",
           "2/23/2019","2/25/2019","2/28/2019","2/28/2019"),
  Buyer = c("Jenny","Jenny","Jenny","Chang","Chang","Chang","Chunfei",
            "Chunfei","Chunfei","Chunfei","Chunfei","Chunfei","Chunfei",
            "Chunfei"))
df$id = as.numeric(as.character(df$id))
df$Date = as.Date(df$Date, "%m/%d/%Y")
df$Buyer = as.character(df$Buyer)
df1 = df %>%
  group_by(Buyer, id) %>%
  mutate(diff = as.numeric(difftime(Date, lag(Date), units = 'days')))
After calculating the date differences, I need to filter records whose differences are within 5 days. In the example above, the differences for "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019", "2/23/2019", "2/25/2019" are NA, 264, 3, 0, 7, 2. However, if I simply filter on n < 6, I would lose "2/13/2019" and "2/23/2019". These dates are important to retain in the final output: even though the difference between "2/13/2019" and "5/25/2018" is 264, the difference between "2/16/2019" and "2/13/2019" is 3; similarly, even though the difference between "2/23/2019" and "2/16/2019" is 7, the difference between "2/25/2019" and "2/23/2019" is 2. So I need to retain these dates. How can this be achieved?
We can mask the column 'diff' in the final output and it should look like below:
| id | Date | Buyer_id |
|----|:----------:|---------:|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
We can use diff to subtract dates and select groups where there is at least one difference that is less than or equal to 5 days.
library(dplyr)
df %>%
  group_by(id, Buyer) %>%
  filter(any(diff(Date) <= 5))
# id Date Buyer
# <dbl> <date> <chr>
# 1 9 2018-11-29 Jenny
# 2 9 2018-11-29 Jenny
# 3 9 2018-11-29 Jenny
# 4 5 2018-05-25 Chunfei
# 5 5 2019-02-13 Chunfei
# 6 5 2019-02-16 Chunfei
# 7 5 2019-02-16 Chunfei
# 8 5 2019-02-23 Chunfei
# 9 5 2019-02-25 Chunfei
#10 8 2019-02-28 Chunfei
#11 8 2019-02-28 Chunfei
After re-reading the question, I think you might not be looking to filter entire groups but only keep those rows which have a difference of at most 5 days. We can get the indices which have a diff value of at most 5 and select each one's previous index as well.
df %>%
  group_by(id, Buyer) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  slice({i1 <- which(diff <= 5); unique(c(i1, i1 - 1))}) %>%
  select(-diff)
# id Date Buyer
# <dbl> <date> <chr>
# 1 5 2019-02-16 Chunfei
# 2 5 2019-02-16 Chunfei
# 3 5 2019-02-25 Chunfei
# 4 5 2019-02-13 Chunfei
# 5 5 2019-02-23 Chunfei
# 6 8 2019-02-28 Chunfei
# 7 8 2019-02-28 Chunfei
# 8 9 2018-11-29 Jenny
# 9 9 2018-11-29 Jenny
#10 9 2018-11-29 Jenny
data
df <- structure(list(id = c(9, 9, 9, 4, 4, 4, 5, 5, 5, 5, 5, 5, 8,
8), Date = structure(c(17864, 17864, 17864, 17681, 17716, 17760,
17676, 17940, 17943, 17943, 17950, 17952, 17955, 17955), class = "Date"),
Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang",
"Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei",
"Chunfei", "Chunfei")), row.names = c(NA, -14L), class = "data.frame")

Remove the first row from each group if the second row meets a condition

Here's a sample of my dataset:
df = data.frame(
  id = c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
  Date = c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019",
           "2/13/2019","6/7/2018","6/15/2018","6/20/2018","8/17/2018",
           "8/20/2018","12/25/2018","12/25/2018"),
  Buyer = c("John","John","John","Maria","Maria","Maria","Sandy","Sandy",
            "Sandy","Sandy","Sandy","Paul","Paul"))
I need to calculate the difference between dates which I have already done and the dataset then looks like:
| id | Date | Buyer | diff |
|----|:----------:|------:|------|
| 9 | 11/29/2018 | John | NA |
| 9 | 11/29/2018 | John | 0 |
| 9 | 11/29/2018 | John | 0 |
| 5 | 5/25/2018 | Maria | -188 |
| 5 | 2/13/2019 | Maria | 264 |
| 5 | 2/13/2019 | Maria | 0 |
| 4 | 6/7/2018 | Sandy | -251 |
| 4 | 6/15/2018 | Sandy | 8 |
| 4 | 6/20/2018 | Sandy | 5 |
| 4 | 8/17/2018 | Sandy | 58 |
| 4 | 8/20/2018 | Sandy | 3 |
| 20 | 12/25/2018 | Paul | 127 |
| 20 | 12/25/2018 | Paul | 0 |
Now, if the diff value in the second row of a group is greater than or equal to 5, I need to delete the first row of that group. For example, the diff value 264 is greater than 5 for Buyer 'Maria' with id '5', so I want to delete the first row of that group: Buyer 'Maria', id '5', Date '5/25/2018', diff '-188'.
Below is a sample of my code:
df1 = df %>%
  group_by(Buyer, id) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  filter(!(diff >= 5 & row_number() == 1))
The problem is that the above code selects the first row instead of the second row and I don't know how to specify the row to be 2nd for each group where the diff value should be greater than or equal to 5.
My expected output should look like:
| id | Date | Buyer | diff |
|----|:----------:|------:|------|
| 9 | 11/29/2018 | John | NA |
| 9 | 11/29/2018 | John | 0 |
| 9 | 11/29/2018 | John | 0 |
| 5 | 2/13/2019 | Maria | 264 |
| 5 | 2/13/2019 | Maria | 0 |
| 4 | 6/15/2018 | Sandy | 8 |
| 4 | 6/20/2018 | Sandy | 5 |
| 4 | 8/17/2018 | Sandy | 58 |
| 4 | 8/20/2018 | Sandy | 3 |
| 20 | 12/25/2018 | Paul | 127 |
| 20 | 12/25/2018 | Paul | 0 |
I think you forgot to provide the diff column in df. I created one called diffs so that it doesn't conflict with the function diff().
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y")))) %>%
  filter(
    n() == 1 |          # always keep if only one row in group
    row_number() > 1 |  # always keep all rows after the first
    diffs[2] < 5        # keep 1st row only if 2nd row's diffs < 5
  ) %>%
  ungroup()
# A tibble: 11 x 4
id Date Buyer diffs
<chr> <chr> <chr> <dbl>
1 9 11/29/2018 John NA
2 9 11/29/2018 John 0
3 9 11/29/2018 John 0
4 5 2/13/2019 Maria 264
5 5 2/13/2019 Maria 0
6 4 6/15/2018 Sandy 8
7 4 6/20/2018 Sandy 5
8 4 8/17/2018 Sandy 58
9 4 8/20/2018 Sandy 3
10 20 12/25/2018 Paul NA
11 20 12/25/2018 Paul 0
Data (I added stringsAsFactors = FALSE):
df1 <- data.frame(
  id = c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
  Date = c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019",
           "2/13/2019","6/7/2018","6/15/2018","6/20/2018","8/17/2018",
           "8/20/2018","12/25/2018","12/25/2018"),
  Buyer = c("John","John","John","Maria","Maria","Maria","Sandy","Sandy",
            "Sandy","Sandy","Sandy","Paul","Paul"),
  stringsAsFactors = FALSE)
Maybe I overthought it, but here is one idea,
df1 %>%
  mutate(Date = as.Date(Date, format = '%m/%d/%Y')) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  group_by(id) %>%
  mutate(diff1 = as.integer(diff >= 5) + row_number()) %>%
  filter(diff1 != 1 | lead(diff1) != 3) %>%
  select(-diff1)
which gives,
# A tibble: 11 x 4
# Groups: id [4]
id Date Buyer diff
<fct> <date> <fct> <dbl>
1 9 2018-11-29 John NA
2 9 2018-11-29 John 0
3 9 2018-11-29 John 0
4 5 2019-02-13 Maria 264
5 5 2019-02-13 Maria 0
6 4 2018-06-15 Sandy 8
7 4 2018-06-20 Sandy 5
8 4 2018-08-17 Sandy 58
9 4 2018-08-20 Sandy 3
10 20 2018-12-25 Paul 127
11 20 2018-12-25 Paul 0

Count Wins and Average Home Win Odds in R

I'm trying to create a data frame in R that will allow me to view the average home betting odds for each team along with the number of home wins for each season.
There are 6,840 records in the dataset representing 18 seasons' worth of Premier League football. This means there are 380 match entries for each season.
Let me show you an example. It is a drastically cut down example, but it gives you a good enough idea about what I'm trying to achieve.
Key: FTHG (Full-Time Home Goals), FTAG (Full-Time Away Goals), FTR (Full-Time Result), HWO (Home Win Odds), AHWO (Average Home Win Odds), W (Win Count)
matchData:
Season | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HWO
-----------------------------------------------------------------
1 | 2017/2018 | TeamA | TeamB | 2 | 1 | H | 1.30
2 | 2017/2018 | TeamA | TeamC | 1 | 1 | D | 1.45
3 | 2017/2018 | TeamA | TeamD | 1 | 0 | H | 2.20
4 | 2017/2018 | TeamB | TeamA | 4 | 1 | H | 1.85
5 | 2017/2018 | TeamC | TeamA | 1 | 0 | H | 1.70
6 | 2017/2018 | TeamD | TeamA | 2 | 3 | A | 3.10
7 | 2016/2017 | TeamA | TeamB | 2 | 1 | H | 1.30
8 | 2016/2017 | TeamA | TeamC | 0 | 0 | D | 1.50
9 | 2016/2017 | TeamA | TeamD | 1 | 2 | A | 1.67
10 | 2016/2017 | TeamB | TeamA | 3 | 1 | H | 1.42
11 | 2016/2017 | TeamB | TeamC | 2 | 1 | H | 1.90
12 | 2016/2017 | TeamB | TeamD | 5 | 1 | H | 1.20
13 | 2016/2017 | TeamC | TeamA | 1 | 0 | H | 2.00
14 | 2016/2017 | TeamC | TeamB | 3 | 1 | H | 1.80
I need to summarise the matchData data frame into a new one like this:
homeWinOdds:
Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A
For instance, based on the above, TeamB won three home matches in season 2016/2017 and their average home odds (based on all of their home matches in that season) were 1.51.
In my actual dataset, each of the 20 teams plays exactly 19 home matches in every season, so the home odds of those matches will be averaged.
In summary:
count the number of home wins a team has had in a season
average the home win odds for the whole season (only for the team's home games)
display them as separate records; in the actual dataset there are 20 teams for each season, so there will be 20 records per season.
I appreciate in advance anyone who can help me with this.
library(dplyr)
homeWinOdds <- matchData %>%
  group_by(Season, HomeTeam) %>%
  summarize(W = sum(FTR == "H"),
            AHWO = mean(HWO)) %>%
  ungroup()
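A quick base-R cross-check of the same aggregation on a tiny made-up frame (not the question's data): count "H" results and average the home-win odds per (Season, HomeTeam).

```r
# Hypothetical three-match frame for one home team.
md <- data.frame(
  Season   = rep("2017/2018", 3),
  HomeTeam = rep("TeamA", 3),
  FTR      = c("H", "H", "D"),
  HWO      = c(1.5, 2.0, 2.5)
)

W <- aggregate(FTR ~ Season + HomeTeam, md, function(x) sum(x == "H"))
names(W)[3] <- "W"
AHWO <- aggregate(HWO ~ Season + HomeTeam, md, mean)
names(AHWO)[3] <- "AHWO"
res <- merge(W, AHWO)
res  # W = 2, AHWO = 2
```

Note that grouping only produces rows for teams that actually appear as HomeTeam; if you need a row for every team in every season (like TeamD in the expected output), you would have to complete the missing combinations afterwards.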
