Comparing Dates Across Multiple Variables - r

I'm attempting to figure out the amount of days in between games and if that has an impact on wins/losses, this is the information I'm starting with:
schedule:
Home
Away
Home_Final
Away_Final
Date
DAL
OAK
30
35
9/1/2015
KC
PHI
21
28
9/2/2015
This is the result I'd like to get:
Home
Away
Home_Final
Away_Final
Date
Home_Rest
Away_Rest
Adv
Adv_Days
Adv_Won
DAL
OAK
30
35
9/1/2015
null
null
null
null
null
KC
PHI
21
28
9/2/2015
null
null
null
null
null
DAL
PHI
28
7
9/9/2015
8
7
1
1
1
OAK
KC
14
21
9/9/2015
8
7
1
1
0
'Home_Rest' = The home teams amount of days between their games
'Away Rest' = The away teams amount of days between their games
'Adv' = True/False that there was an advantage on one side
'Adv_Days' = The amount of advantage in days
'Adv_Won' = The side with the advantage won
Here is what I've tried, I was able to get it to count how many days were between games for one team, but when I bring all the other ones in I can't wrap my head around how to do that.
library(tidyverse)
library(lubridate)
team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif = interval(lag(ymd(team_post$date)), ymd(team_post$date))
team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))

You can extend this to all teams using a grouped mutate. See docs for group_by() here.
Something like
schedule %>%
group_by(vars_to_group_by) %>%
mutate(new_var = expr_to_calculate_new_var)
In future, it would be helpful if you included code to recreate a minimal dataset for your example.

The problem is that before you can calculate differences between dates, you must put your dataframe in a friendlier format. Because the Date applies to both teams, that is, one item applies to two columns in the dataframe, which makes it difficult to give it a uniform treatment.
We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.
schedule <- tidyr::tribble(
~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
"DAL", "OAK", 30, 35, "9/1/2015",
"KC", "PHI", 21, 28, "9/2/2015",
"DAL", "PHI", 28, 7, "9/9/2015",
"OAK", "KC", 14, 21, "9/9/2015"
)
schedule <- schedule %>% mutate(id = row_number())
> schedule
# A tibble: 4 x 6
Home Away Home_Final Away_Final Date id
<chr> <chr> <dbl> <dbl> <chr> <int>
1 DAL OAK 30 35 9/1/2015 1
2 KC PHI 21 28 9/2/2015 2
3 DAL PHI 28 7 9/9/2015 3
4 OAK KC 14 21 9/9/2015 4
Now we'll place your dataframe in a more 'relational' format.
schedule_relational <-
rbind(
schedule %>%
transmute(
id,
Team = Home,
Role = "Home",
Final = Home_Final,
Date
),
schedule %>%
transmute(
id,
Team = Away,
Role = "Away",
Final = Away_Final,
Date
)
)
> schedule_relational
# A tibble: 8 x 5
id Team Role Final Date
<int> <chr> <chr> <dbl> <chr>
1 1 DAL Home 30 9/1/2015
2 2 KC Home 21 9/2/2015
3 3 DAL Home 28 9/9/2015
4 4 OAK Home 14 9/9/2015
5 1 OAK Away 35 9/1/2015
6 2 PHI Away 28 9/2/2015
7 3 PHI Away 7 9/9/2015
8 4 KC Away 21 9/9/2015
How about that!
Now it becomes easy to calculate the difference between dates of games for each team:
schedule_relational <-
schedule_relational %>%
group_by(Team) %>%
arrange(Date) %>%
mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
ungroup()
> schedule_relational
# A tibble: 8 x 6
id Team Role Final Date Rest
<int> <chr> <chr> <dbl> <chr> <drtn>
1 1 DAL Home 30 9/1/2015 NA days
2 1 OAK Away 35 9/1/2015 NA days
3 2 KC Home 21 9/2/2015 NA days
4 2 PHI Away 28 9/2/2015 NA days
5 3 DAL Home 28 9/9/2015 8 days
6 4 OAK Home 14 9/9/2015 8 days
7 3 PHI Away 7 9/9/2015 7 days
8 4 KC Away 21 9/9/2015 7 days
Observe that the appropriate function to convert dates in character format is mdy(), because your dates are in month/day/year format.
We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.
result <-
schedule_relational %>%
pivot_wider(
names_from = Role,
values_from = c(Team, Final, Rest),
names_glue = "{Role}_{.value}"
)
> result
# A tibble: 4 x 8
id Date Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
<int> <chr> <chr> <chr> <dbl> <dbl> <drtn> <drtn>
1 1 9/1/2015 DAL OAK 30 35 NA days NA days
2 2 9/2/2015 KC PHI 21 28 NA days NA days
3 3 9/9/2015 DAL PHI 28 7 8 days 7 days
4 4 9/9/2015 OAK KC 14 21 8 days 7 days
We'll adjust column names and ordering, and make the final calculations now.
result_final <-
result %>%
transmute(
Home = Home_Team,
Away = Away_Team,
Home_Final,
Away_Final,
Date,
Home_Rest,
Away_Rest,
Adv = as.integer(Home_Rest != Away_Rest),
Adv_Days = abs(Home_Rest != Away_Rest),
Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final | Away_Rest > Home_Rest & Away_Final > Home_Final)
)
> result_final
# A tibble: 4 x 10
Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
<chr> <chr> <dbl> <dbl> <chr> <drtn> <drtn> <int> <int> <int>
1 DAL OAK 30 35 9/1/2015 NA days NA days NA NA NA
2 KC PHI 21 28 9/2/2015 NA days NA days NA NA NA
3 DAL PHI 28 7 9/9/2015 8 days 7 days 1 1 1
4 OAK KC 14 21 9/9/2015 8 days 7 days 1 1 0
It would be interesting if instead of reducing Adv and Adv_Won to yes/no (discrete) values, you'd track the number of days of rest and difference in score. Therefore you could correlate the results also in terms of magnitude.
I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.
There may be more convoluted solutions, but this is very clear to read and understand.

Related

How to move every other row to a new column in R

I have a data frame that is supposed to show the winners of a tournament and their opponents. Currently the loser is in every other row. So, row 1 is the winner, row 2 is the loser, row 3 is the winner, row 4 is the loser, and so on.
I want the winner and their opponent to be next to each other so that it's easier to see who competed against who. The tricky part is keeping the gym, names, and competitor number for each person together in the same row.
How do I move every other row to a new column so that the winner and their opponent are in the same row?
y = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/y.csv')
# FAILED ATTEMPT
library(data.table)
z=dcast(setDT(y)[, grp := gl(.N, 2, .N)], grp ~ rowid(grp),
value.var = setdiff(names(y), 'grp'))[, grp := NULL][]
Note that both photos are different data sets
What my df currently looks like:
Similar to what I want it to look like:
Using dplyr you could do:
library(dplyr)
read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/y.csv') %>%
group_by(fight, date) %>%
summarise(division = first(division),
competitor_1 = first(competitor),
name_1 = first(name),
competitor_2 = last(competitor),
name_2 = last(name))
#> `summarise()` has grouped output by 'fight'. You can override using the
#> `.groups` argument.
#> # A tibble: 61 x 7
#> # Groups: fight [26]
#> fight date division competitor_1 name_1 compe~1 name_2
#> <chr> <chr> <chr> <int> <chr> <int> <chr>
#> 1 BYE BYE Master 2 1 Rafael M~ 1 Rafae~
#> 2 FIGHT 19 Thu 09/01 at 12:14 PM Master 2 2 Piter Fr~ 63 Alan ~
#> 3 FIGHT 20 Thu 09/01 at 01:01 PM Master 2 16 Marques ~ 55 Diego~
#> 4 FIGHT 20 Thu 09/01 at 12:13 PM Master 2 28 Kenned D~ 44 Verge~
#> 5 FIGHT 22 Thu 09/01 at 12:27 PM Master 2 4 Marcus V~ 52 Kian ~
#> 6 FIGHT 23 Thu 09/01 at 12:33 PM Master 2 30 Adam Col~ 46 Steph~
#> 7 FIGHT 23 Thu 09/01 at 12:54 PM Master 2 31 Namrod B~ 47 Stefa~
#> 8 FIGHT 23 Thu 09/01 at 12:58 PM Master 2 13 David Ch~ 53 Joshu~
#> 9 FIGHT 24 Thu 09/01 at 01:08 PM Master 2 3 Sandro G~ 56 Carlo~
#> 10 FIGHT 24 Thu 09/01 at 12:35 PM Master 2 8 Rafael R~ 60 Andre~
#> # ... with 51 more rows, and abbreviated variable name 1: competitor_2
Created on 2022-09-16 with reprex v2.0.2
There are some problems with your dataset, e.g. for "FIGHT 22" there are four entries (from your description I expected two entries).
division gender belt weight fight date competitor name gym
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 4 Marcus V. C. Antelante Ares BJJ
2 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 62 Andrew E. Ganthier Renzo Gracie Academy
3 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 11 Jimmy Dang Khoa Tat CheckMat
4 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 52 Kian Takumi Kadota Brasa CTA
The same problem exists for fights 26 and 35. Assuming these are corrected, and assuming odd rows contain winners and even rows contain losers, the following code should work (using tidyverse):
y %>%
mutate(outcome = if_else(row_number() %% 2 == 1, "winner", "loser")) %>%
pivot_wider(names_from = outcome, values_from = c(competitor, name, gym))
This will get you close to what you want. It adds the winner column. Odd number index is winner and even number index is loser. I removed the BYE week rows for aesthetics. Then we group by the date and fight and keep the desired data from the combined rows and expand the summarised columns to the winner loser information.
library(dplyr)
y %>%
mutate(
winner = ifelse((y$X %% 2) == 0,'loser','winner')) %>%
filter(date != 'BYE') %>%
group_by(date, fight) %>%
summarise(division = first(division),
belt = first(belt),
weight = first(weight),
gender = first(gender),
winner.rank = first(competitor),
winner = first(name),
winner.gym = first(gym),
opp.rank= last(competitor),
opponent = last(name),
opponent.gym = last(gym))

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset and I would like to join data_occ to the income_range data based on what band the income falls inside. Where there are more than two observations(incomes) that are within the range I would like to take the lower income.
I was attempting to use data.table but was having trouble. I was would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for easy of understanding. You can omit it and just chain the two command chains together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7

Average using dplyr if row is equal to previous row in another column

I have the following code:
#load req'd libraries
library(plyr)
library(dplyr)
library(fitzRoy)
#get the raw data from the fitzRoy package (season selection use :)
player <- fetch_player_stats(season = 2020:2021, source = "fryzigg")
#select the req'd cols
player <- player %>%
select(venue_name, match_date, match_round, player_id, player_first_name, player_last_name,
kicks, handballs, disposals)
#change the match_date to date format
player$match_date <- as.Date(player$match_date, format = "%Y-%m-%d")
#add a column for the year (season)
player$season <- format(as.Date(player$match_date, format="%Y-%m-%d"),"%Y")
#change format for match_round
player$match_round <- as.numeric(player$match_round)
#add opponent
player2$opponent <- ifelse(player2$player_team == player2$match_home_team, player2$match_away_team,
ifelse(player2$player_team == player2$match_away_team, player2$match_home_team, player2$match_away_team))
#sort
player <- player %>%
arrange(player_id, season, match_round)
head(player)
This gives me a data frame like so:
# A tibble: 6 x 10
venue_name match_date match_round player_id player_first_name player_last_name kicks handballs disposals season
<chr> <date> <dbl> <int> <chr> <chr> <int> <int> <int> <chr>
1 GIANTS Stadium 2020-03-21 1 11170 Gary Ablett 16 8 24 2020
2 GMHBA Stadium 2020-06-12 2 11170 Gary Ablett 9 12 21 2020
3 GMHBA Stadium 2020-06-20 3 11170 Gary Ablett 8 6 14 2020
4 MCG 2020-06-28 4 11170 Gary Ablett 3 8 11 2020
5 GMHBA Stadium 2020-07-04 5 11170 Gary Ablett 6 8 14 2020
6 SCG 2020-07-09 6 11170 Gary Ablett 11 3 14 2020
I am trying to add several new columns:
A season average of disposals by player that is cumulative based on each round. So for example,
using the table above, the new column would look like:
| season_average_disposals
| 24
| 22.5
| 20
| 17.5
| 16.8
| 16.3
When the season changes, say from 2020 to 2021, this would reset and the first entry would be the total disposal for round 1 that season.
Similar to the above, a season average of disposals by player by venue that is cumulative based on each round.
Similar to the above, a season average of disposals by player by venue and opponent that is cumulative based on each round.
A career average that is cumulative based on season and round. So this would not reset when the season changes, it would just keep calculating.
I tried using this:
player <- player %>%
transform(season_average_disposals = ifelse(lag(season) == season, mean(disposals), disposals))
But this does not give me the required results.
For 1)
library(dplyr)
player %>%
group_split(season, player_id) %>%
purrr::map_dfr(~.x %>%
mutate(season_average_disposals = cummean(disposals)))

Creating subset of dataset based on multiple condition in r

I want to extract the past 3 weeks' data for each household_id, channel combination. These past 3 weeks will be calculated from mala_fide_week and mala_fide_year and it will be less than that for each household_id and channel combination.
Below is the dataset:
for e.g. Household_id 100 for channel A: the mala_fide_week is 42 and mala_fide_year 2021. So past three records will be less than week 42 of the year 2021. This will be calculated from the week and year columns.
For the Household_id 100 and channel B combination, there are only two records much less than mala_fide_week and mala_fide_year.
For Household_id 101 and channel C, there are two years involved in 2019 and 2020.
The final dataset will be as below
Household_id 102 is not considered as week and year is greater than mala_fide_week and mala_fide_year.
I am trying multiple options but not getting through. Any help is much appreciated!
sample dataset:
data <- data.frame(Household_id =
c(100,100,100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","B","B","C","C","c","C","D","D"),
duration = c(12,34,567,67,34,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,37,38,39,22,23,51,52,1,2,38,39),
year = c(2021,2021,2021,2021,2020,2020,2019,2019,2020,2020,2021,2021))
I think you first need to obtain the absolute number of weeks week + year * 52, then filter accordingly. slice_tail gets the last three rows of each group.
library(dplyr)
data |>
filter(week + 52*year <= mala_fide_week + 52 *mala_fide_year) |>
group_by(Household_id, channel) |>
arrange(year, week, .by_group = TRUE) |>
slice_tail(n = 3)
# A tibble: 8 x 7
# Groups: Household_id, channel [3]
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 A 34 42 2021 37 2021
2 100 A 567 42 2021 38 2021
3 100 A 67 42 2021 39 2021
4 100 B 34 42 2021 22 2020
5 100 B 67 42 2021 23 2020
6 101 C 23 5 2020 52 2019
7 101 C 56 5 2020 1 2020
8 101 C 89 5 2020 2 2020

how to sum conditional functions to grouped rows in R

I so have the following data frame
customerid
payment_month
payment_date
bill_month
charges
1
January
22
January
30
1
February
15
February
21
1
March
2
March
33
1
May
4
April
43
1
May
4
May
23
1
June
13
June
32
2
January
12
January
45
2
February
15
February
56
2
March
2
March
67
2
April
4
April
65
2
May
4
May
54
2
June
13
June
68
3
January
25
January
45
3
February
26
February
56
3
March
30
March
67
3
April
1
April
65
3
June
1
May
54
3
June
1
June
68
(the id data is much larger) I want to calculate payment efficiency using the following function,
efficiency = (amount paid not late / total bill amount)*100
not late is paying no later than the 21st day of the bill's month. (paying January's bill on the 22nd of January is considered as late)
I want to calculate the efficiency of each customer with the expected output of
customerid
effectivity
1
59.90
2
100
3
37.46
I have tried using the following code to calculate for one id and it works. but I want to apply and assign it to the entire group id and summarize it into 1 column (effectivity) and 1 row per ID. I have tried using group by, aggregate and ifelse functions but nothing works. What should I do?
df1 <- filter(df, (payment_month!=bill_month & id==1) | (payment_month==bill_month & payment_date > 21 & id==1) )
df2 <-filter(df, id==1001)
x <- sum(df1$charges)
x <- sum(df2$charges)
100-(x/y)*100
An option using dplyr
library(dplyr)
df %>%
group_by(customerid) %>%
summarise(
effectivity = sum(
charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
.groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
df %>%
group_by(customerid) %>%
mutate(totalperid = sum(charges)) %>%
mutate(pay_month_number = match(payment_month , month.name),
bill_month_number = match(bill_month , month.name)) %>%
mutate(nolate = ifelse(pay_month_number > bill_month_number, TRUE, FALSE)) %>%
summarise(efficiency = case_when(nolate = TRUE ~ (charges/totalperid)*100))

Resources