How to remove non-hierarchical data from a table - hierarchy

I have a table like the one below.
The parent and child fields are in a one (parent) to many (child) relation, except for FRANCE.
FRANCE has two parents: EMEA and APAC.
What I need is to keep only one relation for FRANCE (the one with the highest customers count) and move the others into a log table.
Please could you help?
Many thanks in advance.
Alberto
Original table
+-------+--------+--------+-----------------+
| RowID | parent | child | customers count |
+-------+--------+--------+-----------------+
| 1 | EMEA | FRANCE | 5 |
| 2 | EMEA | ITALY | 2 |
| 3 | AMER | USA | 1 |
| 4 | AMER | BRASIL | 5 |
| 5 | APAC | FRANCE | 1 |
| 6 | APAC | JAPAN | 3 |
+-------+--------+--------+-----------------+
The final result should be:
Master data table
+-------+--------+--------+-----------------+
| RowID | parent | child | customers count |
+-------+--------+--------+-----------------+
| 1 | EMEA | FRANCE | 5 |
| 2 | EMEA | ITALY | 2 |
| 3 | AMER | USA | 1 |
| 4 | AMER | BRASIL | 5 |
| 6 | APAC | JAPAN | 3 |
+-------+--------+--------+-----------------+
Log table
+-------+--------+--------+-----------------+
| RowID | parent | child | customers count |
+-------+--------+--------+-----------------+
| 5 | APAC | FRANCE | 1 |
+-------+--------+--------+-----------------+
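One way to approach this (the question does not specify a tool, so this is only a sketch in R with dplyr, assuming the data sits in a data frame named df with columns RowID, parent, child and customers_count, i.e. the customers count column renamed to a syntactic name):

library(dplyr)

# Keep, for each child, the single row with the highest customers_count;
# every other row for that child goes to the log table.
master <- df %>%
  group_by(child) %>%
  slice_max(customers_count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(RowID)

log_tbl <- anti_join(df, master, by = "RowID")

On the sample data this keeps RowID 1 (EMEA/FRANCE, 5 customers) in the master table and sends RowID 5 (APAC/FRANCE, 1 customer) to the log table.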

Related

Buy and Hold return around event date in R

I have a question about calculating returns in the following case.
For each ID, where Date equals EventDate, I want to calculate the "buy and hold return" from 5 days prior to the event date to 5 days after.
To be more specific, with the table below I want to calculate 11 returns around each event date, where the returns are (9/10-1), (12/10-1), (14/10-1), …, (14/10-1), (17/10-1), (16/10-1) for ID = 1 and (57/50-1), (60/50-1), (49/50-1), …, (65/50-1), (57/50-1), (55/50-1) for ID = 2. (That is, the price 6 days prior to the event date is the denominator in the return calculation.)
+----+------------+-------+------------+
| ID | Date | Price | EventDate |
+----+------------+-------+------------+
| 1 | 2011-03-06 | 10 | NA |
| 1 | 2011-03-07 | 9 | NA |
| 1 | 2011-03-08 | 12 | NA |
| 1 | 2011-03-09 | 14 | NA |
| 1 | 2011-03-10 | 15 | NA |
| 1 | 2011-03-11 | 17 | NA |
| 1 | 2011-03-12 | 12 | 2011-03-12 |
| 1 | 2011-03-13 | 14 | NA |
| 1 | 2011-03-14 | 17 | NA |
| 1 | 2011-03-15 | 14 | NA |
| 1 | 2011-03-16 | 17 | NA |
| 1 | 2011-03-17 | 16 | NA |
| 1 | 2011-03-18 | 15 | NA |
| 1 | 2011-03-19 | 16 | NA |
| 1 | 2011-03-20 | 17 | NA |
| 1 | 2011-03-21 | 18 | NA |
| 1 | 2011-03-22 | 11 | NA |
| 1 | 2011-03-23 | 15 | NA |
| 1 | 2011-03-24 | 12 | 2011-03-24 |
| 1 | 2011-03-25 | 13 | NA |
| 1 | 2011-03-26 | 15 | NA |
| 2 | 2011-06-11 | 48 | NA |
| 2 | 2011-06-12 | 49 | NA |
| 2 | 2011-06-13 | 50 | NA |
| 2 | 2011-06-14 | 57 | NA |
| 2 | 2011-06-15 | 60 | NA |
| 2 | 2011-06-16 | 49 | NA |
| 2 | 2011-06-17 | 64 | NA |
| 2 | 2011-06-18 | 63 | NA |
| 2 | 2011-06-19 | 67 | 2011-06-19 |
| 2 | 2011-06-20 | 70 | NA |
| 2 | 2011-06-21 | 58 | NA |
| 2 | 2011-06-22 | 65 | NA |
| 2 | 2011-06-23 | 57 | NA |
| 2 | 2011-06-24 | 55 | NA |
| 2 | 2011-06-25 | 57 | NA |
| 2 | 2011-06-26 | 60 | NA |
+----+------------+-------+------------+
Eventually, I hope to make the following table with a new column.
+----+------------+-------+------------+---------------+
| ID | Date | Price | EventDate | BuyHoldReturn |
+----+------------+-------+------------+---------------+
| 1 | 2011-03-06 | 10 | NA | NA |
| 1 | 2011-03-07 | 9 | NA | -0.1 |
| 1 | 2011-03-08 | 12 | NA | 0.2 |
| 1 | 2011-03-09 | 14 | NA | 0.4 |
| 1 | 2011-03-10 | 15 | NA | 0.5 |
| 1 | 2011-03-11 | 17 | NA | 0.7 |
| 1 | 2011-03-12 | 12 | 2011-03-12 | 0.2 |
| 1 | 2011-03-13 | 14 | NA | 0.4 |
| 1 | 2011-03-14 | 17 | NA | 0.7 |
| 1 | 2011-03-15 | 14 | NA | 0.4 |
| 1 | 2011-03-16 | 17 | NA | 0.7 |
| 1 | 2011-03-17 | 16 | NA | 0.6 |
| 1 | 2011-03-18 | 15 | NA | NA |
| 1 | 2011-03-19 | 16 | NA | 0.066666667 |
| 1 | 2011-03-20 | 17 | NA | 0.133333333 |
| 1 | 2011-03-21 | 18 | NA | 0.2 |
| 1 | 2011-03-22 | 11 | NA | -0.266666667 |
| 1 | 2011-03-23 | 15 | NA | 0 |
| 1 | 2011-03-24 | 12 | 2011-03-24 | -0.2 |
| 1 | 2011-03-25 | 13 | NA | -0.133333333 |
| 1 | 2011-03-26 | 15 | NA | 0 |
| 2 | 2011-06-11 | 48 | NA | NA |
| 2 | 2011-06-12 | 49 | NA | NA |
| 2 | 2011-06-13 | 50 | NA | NA |
| 2 | 2011-06-14 | 57 | NA | 0.14 |
| 2 | 2011-06-15 | 60 | NA | 0.2 |
| 2 | 2011-06-16 | 49 | NA | -0.02 |
| 2 | 2011-06-17 | 64 | NA | 0.28 |
| 2 | 2011-06-18 | 63 | NA | 0.26 |
| 2 | 2011-06-19 | 67 | 2011-06-19 | 0.34 |
| 2 | 2011-06-20 | 70 | NA | 0.4 |
| 2 | 2011-06-21 | 58 | NA | 0.16 |
| 2 | 2011-06-22 | 65 | NA | 0.3 |
| 2 | 2011-06-23 | 57 | NA | 0.14 |
| 2 | 2011-06-24 | 55 | NA | 0.1 |
| 2 | 2011-06-25 | 57 | NA | NA |
| 2 | 2011-06-26 | 60 | NA | NA |
+----+------------+-------+------------+---------------+
I have an idea of using the code below, but couldn't figure out how to calculate the 11 buy and hold returns around the event date.
data <- data %>%
  group_by(ID) %>%
  mutate(BuyHoldReturn = ifelse(Date == EventDate, ....
Thanks in advance!
We can try
library(dplyr)
df |>
  group_by(ID) |>
  mutate(x = Price / lag(Price) - 1,
         y = which(Date == EventDate) - 1:n(),
         BuyHoldReturn = case_when(between(y, -5, 5) ~ x, TRUE ~ NA_real_)) |>
  select(-x, -y)
Output
# A tibble: 28 × 5
# Groups: ID [2]
ID Date Price EventDate BuyHoldReturn
<int> <chr> <int> <chr> <dbl>
1 1 2011-03-06 10 NA NA
2 1 2011-03-07 9 NA -0.1
3 1 2011-03-08 12 NA 0.333
4 1 2011-03-09 14 NA 0.167
5 1 2011-03-10 15 NA 0.0714
6 1 2011-03-11 17 NA 0.133
7 1 2011-03-12 12 2011-03-12 -0.294
8 1 2011-03-13 14 NA 0.167
9 1 2011-03-14 17 NA 0.214
10 1 2011-03-15 14 NA -0.176
# … with 18 more rows
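Note that the snippet above computes day-over-day returns (Price relative to the previous row), which is why its numbers differ from the expected output in the question. A sketch that follows the asker's definition more literally (price divided by the price six rows before the event, minus one), assuming the rows within each ID are consecutive daily observations sorted by Date so that row offsets equal day offsets, could look like this:

library(dplyr)

df <- df %>%
  group_by(ID) %>%
  mutate(BuyHoldReturn = {
    base <- rep(NA_real_, n())                 # baseline price for each row
    for (e in which(Date == EventDate)) {      # row index of each event
      if (e > 6) {
        idx <- (e - 5):(e + 5)                 # 5 rows before to 5 rows after
        idx <- idx[idx >= 1 & idx <= n()]
        base[idx] <- Price[e - 6]              # price 6 rows before the event
      }
    }
    Price / base - 1
  }) %>%
  ungroup()

On the sample data this reproduces the expected BuyHoldReturn column, including the NA rows outside the 11-day windows. If there can be gaps in the trading days, the offsets would need to be computed from the dates rather than from the row positions.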

R: Extracting data from one df to another based on multiple variables matches?

I'm analyzing stock returns and have one data frame with tickers and position weights and another data frame with returns. I have included an example below. I need to extract return data from df 2 into the empty column in df 1, based on the ticker and the date code. This is only an example; there are many more tickers and dates. I have tried methods from previous posts unsuccessfully. I'm new to R. Can someone help? Thanks!
Df 1
+-----------+-----------+--------+
| Date code | Ticker | Return |
+-----------+-----------+--------+
| 1 | Ticker 3 | |
| 1 | Ticker 4 | |
| 1 | Ticker 5 | |
| 2 | Ticker 1 | |
| 2 | Ticker 10 | |
| 2 | Ticker 8 | |
| 3 | Ticker 9 | |
| 3 | Ticker 3 | |
| 3 | Ticker 7 | |
| 4 | Ticker 5 | |
| 4 | Ticker 5 | |
| 4 | Ticker 10 | |
| 5 | Ticker 8 | |
| 5 | Ticker 1 | |
| 5 | Ticker 7 | |
| 6 | Ticker 3 | |
| 6 | Ticker 9 | |
| 6 | Ticker 1 | |
| 7 | Ticker 6 | |
| 7 | Ticker 8 | |
| 7 | Ticker 3 | |
| 8 | Ticker 4 | |
| 8 | Ticker 5 | |
| 8 | Ticker 3 | |
| 9 | Ticker 5 | |
| 9 | Ticker 3 | |
| 9 | Ticker 9 | |
| 10 | Ticker 5 | |
| 10 | Ticker 5 | |
| 10 | Ticker 3 | |
+-----------+-----------+--------+
Df 2
+-----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
| Date code | Ticker 1 | Ticker 2 | Ticker 3 | Ticker 4 | Ticker 5 | Ticker 6 | Ticker 7 | Ticker 8 | Ticker 9 | Ticker 10 |
+-----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
| 1 | 0% | -3% | 3% | 1% | -3% | -1% | 0% | 0% | -3% | 0% |
| 2 | -2% | 1% | -2% | -3% | -1% | -2% | -1% | -2% | -3% | -1% |
| 3 | 2% | -2% | 2% | 1% | -1% | 2% | 0% | 3% | -3% | 1% |
| 4 | 1% | -2% | 2% | -1% | 0% | 0% | -2% | -3% | 3% | 3% |
| 5 | 3% | -2% | 1% | 0% | 0% | -1% | 0% | 3% | 3% | 0% |
| 6 | -3% | -3% | 0% | 2% | 0% | -3% | 0% | 0% | -3% | -2% |
| 7 | -1% | -2% | -2% | -1% | 3% | -3% | -3% | -2% | 2% | -3% |
| 8 | 0% | 1% | 2% | 2% | -2% | -3% | -3% | 3% | 3% | -3% |
| 9 | -2% | 2% | 3% | 2% | 1% | 3% | 0% | 2% | 1% | -3% |
| 10 | 2% | -2% | -2% | 0% | -2% | 1% | 1% | -3% | 3% | 1% |
+-----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
Expected result:
+-----------+-----------+--------+
| Date code | Ticker | Return |
+-----------+-----------+--------+
| 1 | Ticker 3 | 3% |
| 1 | Ticker 4 | 1% |
| 1 | Ticker 5 | -3% |
| 2 | Ticker 1 | -2% |
| 2 | Ticker 10 | -1% |
| 2 | Ticker 8 | -2% |
| 3 | Ticker 9 | -3% |
| 3 | Ticker 3 | 2% |
| 3 | Ticker 7 | 0% |
| 4 | Ticker 5 | 0% |
| 4 | Ticker 2 | -2% |
| 4 | Ticker 10 | 3% |
| 5 | Ticker 8 | 3% |
| 5 | Ticker 1 | 3% |
| 5 | Ticker 7 | 0% |
| 6 | Ticker 3 | 0% |
| 6 | Ticker 9 | -3% |
| 6 | Ticker 1 | -3% |
| 7 | Ticker 6 | -3% |
| 7 | Ticker 8 | -2% |
| 7 | Ticker 3 | -2% |
| 8 | Ticker 4 | 1% |
| 8 | Ticker 5 | -2% |
| 8 | Ticker 3 | 2% |
| 9 | Ticker 5 | 1% |
| 9 | Ticker 3 | 3% |
| 9 | Ticker 9 | 1% |
| 10 | Ticker 5 | -2% |
| 10 | Ticker 9 | 3% |
| 10 | Ticker 3 | -2% |
+-----------+-----------+--------+
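One common pattern for this is to reshape df2 into long format and then join. A minimal sketch, assuming the data frames are named df1 and df2, that the wide column names in df2 exactly match the values in df1's Ticker column, and that the shared key column is literally named `Date code` (hence the backticks):

library(dplyr)
library(tidyr)

# Reshape df2 from wide (one column per ticker) to long (Date code, Ticker, Return)
df2_long <- df2 %>%
  pivot_longer(-`Date code`, names_to = "Ticker", values_to = "Return")

# Join the returns onto df1 by date code and ticker
result <- df1 %>%
  select(-Return) %>%          # drop df1's empty Return column, if it exists
  left_join(df2_long, by = c("Date code", "Ticker"))

If the returns are stored as text such as "3%", they would still need to be converted to numbers afterwards, for example with readr::parse_number(result$Return) / 100.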

Adding and subtracting values from a column between two data frames

I have two data frames, as below:
DF1:
+-----+---------+-----+-----+
| ID | CURRENT | JAN | FEB |
+-----+---------+-----+-----+
| 123 | 2 | 3 | 4 |
| 456 | 1 | 5 | 0 |
+-----+---------+-----+-----+
DF2:
+-----+-----------------+----------+----------+------------+
| ID | CURRENT_2018 | JAN_2018 | FEB_2018 | UNITS_SWAP |
+-----+-----------------+----------+----------+------------+
| 123 | 5 | 6 | 7 | 12 |
| 456 | 4 | 8 | 6 | 6 |
+-----+-----------------+----------+----------+------------+
What I'm trying to do here is subtract the number in UNITS_SWAP from the DF2 columns sequentially (CURRENT_2018, then JAN_2018, then FEB_2018) until UNITS_SWAP reaches zero.
While doing this, add the amount subtracted from each DF2 column to the matching column and row in DF1, so that the totals per ID for the same months remain the same in both. The end result is as follows:
DF1:
+-----+---------+-----+-----+
| ID | CURRENT | JAN | FEB |
+-----+---------+-----+-----+
| 123 | 7 | 9 | 5 |
| 456 | 5 | 7 | 0 |
+-----+---------+-----+-----+
DF2:
+-----+-----------------+----------+----------+
| ID | CURRENT_2018 | JAN_2018 | FEB_2018 |
+-----+-----------------+----------+----------+
| 123 | 0 | 0 | 6 |
| 456 | 0 | 6 | 6 |
+-----+-----------------+----------+----------+
The totals by ID and month before aggregating, for ID 123:
+-------+-----------------------+---------------+---------------+
| ID | CURRENT, CURRENT_2018 | JAN, JAN_2018 | FEB, FEB_2018 |
+-------+-----------------------+---------------+---------------+
| 123 | 2 | 3 | 4 |
| 123 | 5 | 6 | 7 |
| TOTAL | 7 | 9 | 11 |
+-------+-----------------------+---------------+---------------+
These totals should match the totals after aggregating:
+-------+-----------------------+---------------+---------------+
| ID | CURRENT, CURRENT_2018 | JAN, JAN_2018 | FEB, FEB_2018 |
+-------+-----------------------+---------------+---------------+
| 123 | 7 | 9 | 5 |
| 123 | 0 | 0 | 6 |
| TOTAL | 7 | 9 | 11 |
+-------+-----------------------+---------------+---------------+
Similarly for ID 456
Before:
+-------+-----------------------+---------------+---------------+
| ID | CURRENT, CURRENT_2018 | JAN, JAN_2018 | FEB, FEB_2018 |
+-------+-----------------------+---------------+---------------+
| 456 | 1 | 5 | 0 |
| 456 | 4 | 8 | 6 |
| TOTAL | 5 | 13 | 6 |
+-------+-----------------------+---------------+---------------+
After:
+-------+-----------------------+---------------+---------------+
| ID | CURRENT, CURRENT_2018 | JAN, JAN_2018 | FEB, FEB_2018 |
+-------+-----------------------+---------------+---------------+
| 456 | 5 | 7 | 0 |
| 456 | 0 | 6 | 6 |
| TOTAL | 5 | 13 | 6 |
+-------+-----------------------+---------------+---------------+
Script to load the data:
DF1 <- data.frame(ID = c(123, 456),
                  CURRENT = c(2, 1),
                  JAN = c(3, 5),
                  FEB = c(4, 0))
DF2 <- data.frame(ID = c(123, 456),
                  CURRENT_2018 = c(5, 4),   # as shown in the DF2 table above
                  JAN_2018 = c(6, 8),
                  FEB_2018 = c(7, 6),
                  UNITS_SWAP = c(12, 6))
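The verbal description above maps fairly directly onto a loop over the rows of DF2 that walks the month columns in order and moves as many units as are available until UNITS_SWAP is exhausted. A base-R sketch, assuming the processing order CURRENT, then JAN, then FEB, and that every ID in DF2 has exactly one matching row in DF1:

# Month columns in the order they should be drained from DF2 and filled into DF1
months_df1 <- c("CURRENT", "JAN", "FEB")
months_df2 <- c("CURRENT_2018", "JAN_2018", "FEB_2018")

for (i in seq_len(nrow(DF2))) {
  remaining <- DF2$UNITS_SWAP[i]
  j <- match(DF2$ID[i], DF1$ID)             # matching row in DF1
  for (k in seq_along(months_df2)) {
    take <- min(remaining, DF2[i, months_df2[k]])
    DF2[i, months_df2[k]] <- DF2[i, months_df2[k]] - take
    DF1[j, months_df1[k]] <- DF1[j, months_df1[k]] + take
    remaining <- remaining - take
    if (remaining == 0) break
  }
}
DF2$UNITS_SWAP <- NULL                      # the expected DF2 no longer shows this column

Running this on the script above reproduces the DF1 and DF2 shown in the expected result, with the per-ID totals for each month unchanged.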

Calculating difference between dates based on grouping one or more columns

A sample of my dataset is as below:
| id | Date | Buyer |
|:--:|-----------:|----------|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 4 | 5/30/2018 | Chang |
| 4 | 7/4/2018 | Chang |
| 4 | 8/17/2018 | Chang |
| 5 | 5/25/2018 | Chunfei |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
I have two questions about this dataset:
I need to calculate the difference between consecutive dates, grouped by 'Buyer' and 'id'. That means Buyer 'Jenny' with id '9' is one group, Buyer 'Chang' with id '4' is another, Buyer 'Chunfei' with id '5' is another, and Buyer 'Chunfei' with id '8' is yet another. So the output will be:
| id | Date | Buyer_id | Diff |
|:--:|-----------:|----------|------|
| 9 | 11/29/2018 | Jenny | NA |
| 9 | 11/29/2018 | Jenny | 0 |
| 9 | 11/29/2018 | Jenny | 0 |
| 4 | 5/30/2018 | Chang | NA |
| 4 | 7/4/2018 | Chang | 35 |
| 4 | 8/17/2018 | Chang | 44 |
| 5 | 5/25/2018 | Chunfei | NA |
| 5 | 2/13/2019 | Chunfei | 264 |
| 5 | 2/16/2019 | Chunfei | 3 |
| 5 | 2/16/2019 | Chunfei | 0 |
| 5 | 2/23/2019 | Chunfei | 7 |
| 5 | 2/25/2019 | Chunfei | 2 |
| 8 | 2/28/2019 | Chunfei | NA |
| 8 | 2/28/2019 | Chunfei | 0 |
The issue is that I don't understand why the group_by isn't working. The following code subtracts consecutive rows rather than first grouping by the same Buyer and id and then subtracting.
df = data.frame(id = c("9","9","9","4","4","4","5","5","5","5","5","5","8","8"),
                Date = c("11/29/2018","11/29/2018","11/29/2018","5/30/2018","7/4/2018",
                         "8/17/2018","5/25/2018","2/13/2019","2/16/2019","2/16/2019","2/23/2019",
                         "2/25/2019","2/28/2019","2/28/2019"),
                Buyer = c("Jenny","Jenny","Jenny","Chang","Chang","Chang","Chunfei","Chunfei",
                          "Chunfei","Chunfei","Chunfei","Chunfei","Chunfei","Chunfei"))
df$id = as.numeric(as.character(df$id))
df$Date = as.Date(df$Date, "%m/%d/%Y")
df$Buyer = as.character(df$Buyer)

df1 = df %>%
  group_by(Buyer, id) %>%
  mutate(diff = as.numeric(difftime(Date, lag(Date), units = 'days')))
After calculating the date difference, I need to filter the records whose differences between dates are within 5 days. In the example above, the date differences for "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019", "2/23/2019", "2/25/2019" are NA, 264, 3, 0, 7, 2. However, if I simply filter on diff < 6, I lose the rows for "2/13/2019" and "2/23/2019". These dates are important to retain in the final output: even though the difference between "2/13/2019" and "5/25/2018" is 264, the difference between "2/16/2019" and "2/13/2019" is only 3. Similarly, even though the difference between "2/23/2019" and "2/16/2019" is 7, the difference between "2/25/2019" and "2/23/2019" is only 2. So I need to retain these dates. How can this be achieved?
We can drop the 'diff' column in the final output, which should look like this:
| id | Date | Buyer_id |
|----|:----------:|---------:|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
We can use diff to take differences between consecutive Dates and select groups where there is at least one value that is less than or equal to 5 days.
library(dplyr)
df %>%
  group_by(id, Buyer) %>%
  filter(any(diff(Date) <= 5))
# id Date Buyer
# <dbl> <date> <chr>
# 1 9 2018-11-29 Jenny
# 2 9 2018-11-29 Jenny
# 3 9 2018-11-29 Jenny
# 4 5 2018-05-25 Chunfei
# 5 5 2019-02-13 Chunfei
# 6 5 2019-02-16 Chunfei
# 7 5 2019-02-16 Chunfei
# 8 5 2019-02-23 Chunfei
# 9 5 2019-02-25 Chunfei
#10 8 2019-02-28 Chunfei
#11 8 2019-02-28 Chunfei
After re-reading the question, I think you might not be looking to filter entire groups but only those rows which have a difference of at most 5 days. We can get the indices whose diff value is less than or equal to 5 and select each one's previous index as well.
df %>%
  group_by(id, Buyer) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  slice({ i1 <- which(diff <= 5); unique(c(i1, i1 - 1)) }) %>%
  select(-diff)
# id Date Buyer
# <dbl> <date> <chr>
# 1 5 2019-02-16 Chunfei
# 2 5 2019-02-16 Chunfei
# 3 5 2019-02-25 Chunfei
# 4 5 2019-02-13 Chunfei
# 5 5 2019-02-23 Chunfei
# 6 8 2019-02-28 Chunfei
# 7 8 2019-02-28 Chunfei
# 8 9 2018-11-29 Jenny
# 9 9 2018-11-29 Jenny
#10 9 2018-11-29 Jenny
data
df <- structure(list(id = c(9, 9, 9, 4, 4, 4, 5, 5, 5, 5, 5, 5, 8,
8), Date = structure(c(17864, 17864, 17864, 17681, 17716, 17760,
17676, 17940, 17943, 17943, 17950, 17952, 17955, 17955), class = "Date"),
Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang",
"Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei",
"Chunfei", "Chunfei")), row.names = c(NA, -14L), class = "data.frame")

Count Wins and Average Home Win Odds in R

I'm trying to create a data frame in R that will allow me to view the average home betting odds for each team along with the number of home wins for each season.
There are 6,840 records in the dataset representing 18 seasons' worth of Premier League football. This means there are 380 match entries for each season.
Let me show you an example. It is a drastically cut down example, but it gives you a good enough idea about what I'm trying to achieve.
Key: FTHG (Full-Time Home Goals), FTAG (Full-Time Away Goals), FTR (Full-Time Result), HWO (Home Win Odds), AHWO (Average Home Win Odds), W (Win Count)
matchData:
Season | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HWO
-----------------------------------------------------------------
1 | 2017/2018 | TeamA | TeamB | 2 | 1 | H | 1.30
2 | 2017/2018 | TeamA | TeamC | 1 | 1 | D | 1.45
3 | 2017/2018 | TeamA | TeamD | 1 | 0 | H | 2.20
4 | 2017/2018 | TeamB | TeamA | 4 | 1 | H | 1.85
5 | 2017/2018 | TeamC | TeamA | 1 | 0 | H | 1.70
6 | 2017/2018 | TeamD | TeamA | 2 | 3 | A | 3.10
7 | 2016/2017 | TeamA | TeamB | 2 | 1 | H | 1.30
8 | 2016/2017 | TeamA | TeamC | 0 | 0 | D | 1.50
9 | 2016/2017 | TeamA | TeamD | 1 | 2 | A | 1.67
10 | 2016/2017 | TeamB | TeamA | 3 | 1 | H | 1.42
11 | 2016/2017 | TeamB | TeamC | 2 | 1 | H | 1.90
12 | 2016/2017 | TeamB | TeamD | 5 | 1 | H | 1.20
13 | 2016/2017 | TeamC | TeamA | 1 | 0 | H | 2.00
14 | 2016/2017 | TeamC | TeamB | 3 | 1 | H | 1.80
I need to summarise the matchData data frame into a new one like this:
homeWinOdds:
Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A
For instance, based on the above, TeamB won three home matches in season 2016/2017 and their average home odds (based on all of their home matches in that season) were 1.51.
In my actual dataset, each one of the 20 teams will have each played exactly 19 home matches in every season, so the home odds of these matches will be averaged.
In summary:
count the number of home wins a team has had in a season
average the home win odds for the whole season (only for the team's home games)
display as separate records: in the actual dataset there are 20 teams in each season, so there will be 20 records per season.
I appreciate in advance anyone who can help me with this.
library(dplyr)
homeWinOdds <- matchData %>%
  group_by(Season, HomeTeam) %>%
  summarize(W = sum(FTR == "H"),
            AHWO = mean(HWO)) %>%
  ungroup()
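One caveat: the grouped summary only produces rows for Season/HomeTeam combinations that actually appear in matchData, so a team with no home matches in a season (like TeamD in 2016/2017 in the cut-down example) would not get a zero-win row. If such rows are needed, one option (not part of the answer above, and unnecessary if every team really does play 19 home games per season) is tidyr::complete():

library(tidyr)

homeWinOdds <- homeWinOdds %>%
  complete(Season, HomeTeam, fill = list(W = 0))   # AHWO stays NA for the filled rows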
