Creating subset of dataset based on multiple condition in r - r

I want to extract the past 3 weeks' data for each household_id, channel combination. These past 3 weeks will be calculated from mala_fide_week and mala_fide_year and it will be less than that for each household_id and channel combination.
Below is the dataset:
for e.g. Household_id 100 for channel A: the mala_fide_week is 42 and mala_fide_year 2021. So past three records will be less than week 42 of the year 2021. This will be calculated from the week and year columns.
For the Household_id 100 and channel B combination, there are only two records much less than mala_fide_week and mala_fide_year.
For Household_id 101 and channel C, there are two years involved in 2019 and 2020.
The final dataset will be as below
Household_id 102 is not considered as week and year is greater than mala_fide_week and mala_fide_year.
I am trying multiple options but not getting through. Any help is much appreciated!
sample dataset:
data <- data.frame(Household_id =
c(100,100,100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","B","B","C","C","c","C","D","D"),
duration = c(12,34,567,67,34,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,37,38,39,22,23,51,52,1,2,38,39),
year = c(2021,2021,2021,2021,2020,2020,2019,2019,2020,2020,2021,2021))

I think you first need to obtain the absolute number of weeks week + year * 52, then filter accordingly. slice_tail gets the last three rows of each group.
library(dplyr)
data |>
filter(week + 52*year <= mala_fide_week + 52 *mala_fide_year) |>
group_by(Household_id, channel) |>
arrange(year, week, .by_group = TRUE) |>
slice_tail(n = 3)
# A tibble: 8 x 7
# Groups: Household_id, channel [3]
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 A 34 42 2021 37 2021
2 100 A 567 42 2021 38 2021
3 100 A 67 42 2021 39 2021
4 100 B 34 42 2021 22 2020
5 100 B 67 42 2021 23 2020
6 101 C 23 5 2020 52 2019
7 101 C 56 5 2020 1 2020
8 101 C 89 5 2020 2 2020

Related

Comparing Dates Across Multiple Variables

I'm attempting to figure out the amount of days in between games and if that has an impact on wins/losses, this is the information I'm starting with:
schedule:
Home
Away
Home_Final
Away_Final
Date
DAL
OAK
30
35
9/1/2015
KC
PHI
21
28
9/2/2015
This is the result I'd like to get:
Home
Away
Home_Final
Away_Final
Date
Home_Rest
Away_Rest
Adv
Adv_Days
Adv_Won
DAL
OAK
30
35
9/1/2015
null
null
null
null
null
KC
PHI
21
28
9/2/2015
null
null
null
null
null
DAL
PHI
28
7
9/9/2015
8
7
1
1
1
OAK
KC
14
21
9/9/2015
8
7
1
1
0
'Home_Rest' = The home teams amount of days between their games
'Away Rest' = The away teams amount of days between their games
'Adv' = True/False that there was an advantage on one side
'Adv_Days' = The amount of advantage in days
'Adv_Won' = The side with the advantage won
Here is what I've tried, I was able to get it to count how many days were between games for one team, but when I bring all the other ones in I can't wrap my head around how to do that.
library(tidyverse)
library(lubridate)
team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif = interval(lag(ymd(team_post$date)), ymd(team_post$date))
team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))
You can extend this to all teams using a grouped mutate. See docs for group_by() here.
Something like
schedule %>%
group_by(vars_to_group_by) %>%
mutate(new_var = expr_to_calculate_new_var)
In future, it would be helpful if you included code to recreate a minimal dataset for your example.
The problem is that before you can calculate differences between dates, you must put your dataframe in a friendlier format. Because the Date applies to both teams, that is, one item applies to two columns in the dataframe, which makes it difficult to give it a uniform treatment.
We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.
schedule <- tidyr::tribble(
~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
"DAL", "OAK", 30, 35, "9/1/2015",
"KC", "PHI", 21, 28, "9/2/2015",
"DAL", "PHI", 28, 7, "9/9/2015",
"OAK", "KC", 14, 21, "9/9/2015"
)
schedule <- schedule %>% mutate(id = row_number())
> schedule
# A tibble: 4 x 6
Home Away Home_Final Away_Final Date id
<chr> <chr> <dbl> <dbl> <chr> <int>
1 DAL OAK 30 35 9/1/2015 1
2 KC PHI 21 28 9/2/2015 2
3 DAL PHI 28 7 9/9/2015 3
4 OAK KC 14 21 9/9/2015 4
Now we'll place your dataframe in a more 'relational' format.
schedule_relational <-
rbind(
schedule %>%
transmute(
id,
Team = Home,
Role = "Home",
Final = Home_Final,
Date
),
schedule %>%
transmute(
id,
Team = Away,
Role = "Away",
Final = Away_Final,
Date
)
)
> schedule_relational
# A tibble: 8 x 5
id Team Role Final Date
<int> <chr> <chr> <dbl> <chr>
1 1 DAL Home 30 9/1/2015
2 2 KC Home 21 9/2/2015
3 3 DAL Home 28 9/9/2015
4 4 OAK Home 14 9/9/2015
5 1 OAK Away 35 9/1/2015
6 2 PHI Away 28 9/2/2015
7 3 PHI Away 7 9/9/2015
8 4 KC Away 21 9/9/2015
How about that!
Now it becomes easy to calculate the difference between dates of games for each team:
schedule_relational <-
schedule_relational %>%
group_by(Team) %>%
arrange(Date) %>%
mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
ungroup()
> schedule_relational
# A tibble: 8 x 6
id Team Role Final Date Rest
<int> <chr> <chr> <dbl> <chr> <drtn>
1 1 DAL Home 30 9/1/2015 NA days
2 1 OAK Away 35 9/1/2015 NA days
3 2 KC Home 21 9/2/2015 NA days
4 2 PHI Away 28 9/2/2015 NA days
5 3 DAL Home 28 9/9/2015 8 days
6 4 OAK Home 14 9/9/2015 8 days
7 3 PHI Away 7 9/9/2015 7 days
8 4 KC Away 21 9/9/2015 7 days
Observe that the appropriate function to convert dates in character format is mdy(), because your dates are in month/day/year format.
We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.
result <-
schedule_relational %>%
pivot_wider(
names_from = Role,
values_from = c(Team, Final, Rest),
names_glue = "{Role}_{.value}"
)
> result
# A tibble: 4 x 8
id Date Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
<int> <chr> <chr> <chr> <dbl> <dbl> <drtn> <drtn>
1 1 9/1/2015 DAL OAK 30 35 NA days NA days
2 2 9/2/2015 KC PHI 21 28 NA days NA days
3 3 9/9/2015 DAL PHI 28 7 8 days 7 days
4 4 9/9/2015 OAK KC 14 21 8 days 7 days
We'll adjust column names and ordering, and make the final calculations now.
result_final <-
result %>%
transmute(
Home = Home_Team,
Away = Away_Team,
Home_Final,
Away_Final,
Date,
Home_Rest,
Away_Rest,
Adv = as.integer(Home_Rest != Away_Rest),
Adv_Days = abs(Home_Rest != Away_Rest),
Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final | Away_Rest > Home_Rest & Away_Final > Home_Final)
)
> result_final
# A tibble: 4 x 10
Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
<chr> <chr> <dbl> <dbl> <chr> <drtn> <drtn> <int> <int> <int>
1 DAL OAK 30 35 9/1/2015 NA days NA days NA NA NA
2 KC PHI 21 28 9/2/2015 NA days NA days NA NA NA
3 DAL PHI 28 7 9/9/2015 8 days 7 days 1 1 1
4 OAK KC 14 21 9/9/2015 8 days 7 days 1 1 0
It would be interesting if instead of reducing Adv and Adv_Won to yes/no (discrete) values, you'd track the number of days of rest and difference in score. Therefore you could correlate the results also in terms of magnitude.
I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.
There may be more convoluted solutions, but this is very clear to read and understand.

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset and I would like to join data_occ to the income_range data based on what band the income falls inside. Where there are more than two observations(incomes) that are within the range I would like to take the lower income.
I was attempting to use data.table but was having trouble. I was would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for easy of understanding. You can omit it and just chain the two command chains together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7

how to sum conditional functions to grouped rows in R

I so have the following data frame
customerid
payment_month
payment_date
bill_month
charges
1
January
22
January
30
1
February
15
February
21
1
March
2
March
33
1
May
4
April
43
1
May
4
May
23
1
June
13
June
32
2
January
12
January
45
2
February
15
February
56
2
March
2
March
67
2
April
4
April
65
2
May
4
May
54
2
June
13
June
68
3
January
25
January
45
3
February
26
February
56
3
March
30
March
67
3
April
1
April
65
3
June
1
May
54
3
June
1
June
68
(the id data is much larger) I want to calculate payment efficiency using the following function,
efficiency = (amount paid not late / total bill amount)*100
not late is paying no later than the 21st day of the bill's month. (paying January's bill on the 22nd of January is considered as late)
I want to calculate the efficiency of each customer with the expected output of
customerid
effectivity
1
59.90
2
100
3
37.46
I have tried using the following code to calculate for one id and it works. but I want to apply and assign it to the entire group id and summarize it into 1 column (effectivity) and 1 row per ID. I have tried using group by, aggregate and ifelse functions but nothing works. What should I do?
df1 <- filter(df, (payment_month!=bill_month & id==1) | (payment_month==bill_month & payment_date > 21 & id==1) )
df2 <-filter(df, id==1001)
x <- sum(df1$charges)
x <- sum(df2$charges)
100-(x/y)*100
An option using dplyr
library(dplyr)
df %>%
group_by(customerid) %>%
summarise(
effectivity = sum(
charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
.groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
df %>%
group_by(customerid) %>%
mutate(totalperid = sum(charges)) %>%
mutate(pay_month_number = match(payment_month , month.name),
bill_month_number = match(bill_month , month.name)) %>%
mutate(nolate = ifelse(pay_month_number > bill_month_number, TRUE, FALSE)) %>%
summarise(efficiency = case_when(nolate = TRUE ~ (charges/totalperid)*100))

Create a new columns in R

I am carrying out an analysis on some Italian regions. I have a dataset similar to the following:
mydata <- data.frame(date= c(2020,2021,2020,2021,2020,2021),
Region= c('Sicilia','Sicilia','Sardegna','Sardegna','Campania','Campania'),
Number=c(20,30,50,70,90,69) )
Now I have to create two new columns. The first (called 'Total population') containing a fixed number for each region (for example each row with Sicily will have a "Total Population" = 250). The second column instead contains the % ratio between the value of 'Number' column and the corresponding value of 'Total Population' (for example for Sicily the value will be 20/250 and so on).
I hope I explained myself well, Thank you very much
Like thsi perhaps:
mydata %<>% group_by( Region ) %>%
mutate(
`Total Population` = sum(Number),
`Ratio of Total` = sprintf( "%.1f%%",100 * Number / sum(Number)) )
mydata is now:
> mydata
# A tibble: 6 x 5
# Groups: Region [3]
date Region Number `Total Population` `Ratio of Total`
<dbl> <chr> <dbl> <dbl> <chr>
1 2020 Sicilia 20 50 40.0%
2 2021 Sicilia 30 50 60.0%
3 2020 Sardegna 50 120 41.7%
4 2021 Sardegna 70 120 58.3%
5 2020 Campania 90 159 56.6%
6 2021 Campania 69 159 43.4%

Percentile for multiple groups of values in R

I'm using R to do my data analysis.
I'm looking for the code to achieve the below mentioned output.
I need a single piece of code to do this as I have over 500 groups & 24 months in my actual data. The below sample has only 2 groups & 2 months.
This is a sample of my data.
Date Group Value
1-Jan-16 A 10
2-Jan-16 A 12
3-Jan-16 A 17
4-Jan-16 A 20
5-Jan-16 A 12
5-Jan-16 B 56
1-Jan-16 B 78
15-Jan-16 B 97
20-Jan-16 B 77
21-Jan-16 B 86
2-Feb-16 A 91
2-Feb-16 A 44
3-Feb-16 A 93
4-Feb-16 A 87
5-Feb-16 A 52
5-Feb-16 B 68
1-Feb-16 B 45
15-Feb-16 B 100
20-Feb-16 B 81
21-Feb-16 B 74
And this is the output I'm looking for.
Month Year Group Minimum Value 5th Percentile 10th Percentile 50th Percentile 90th Percentile Max Value
Jan 2016 A
Jan 2016 B
Feb 2016 A
Feb 2016 B
considering dft as your input, you can try:
library(dplyr)
dft %>%
mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
mutate(mon = month(Date),
yr = year(Date)) %>%
group_by(mon,yr,Group) %>%
mutate(minimum = min(Value),
maximum = max(Value),
q95 = quantile(Value, 0.95)) %>%
select(minimum, maximum, q95) %>%
unique()
which gives:
mon yr Group minimum maximum q95
<int> <int> <chr> <int> <int> <dbl>
1 1 2016 A 10 20 19.4
2 1 2016 B 56 97 94.8
3 2 2016 A 44 93 92.6
4 2 2016 B 45 100 96.2
and add more variables as per your need.

Resources