Grouping rows on multiple conditions - r

I have a follow-up question to my previous question about grouping rows on multiple conditions (Previous question).
I was wondering how I can group observations that fall within 31 days of the first date. More importantly, once the 31 days have passed, the next date within the same group becomes the 'new' first date of that group. Furthermore, the grouping should also stop after each 'purchase', and the next observation after the purchase becomes the 'new' first date of that group.
Let me illustrate it with an example:
example <- structure(
list(
userID = c(1,1,1,1,1,1,2,2,2,2),
date = structure(
c(
18168, #2019-09-29
18189, #2019-10-20
18197, #2019-10-28
18205, #2019-11-05
18205, #2019-11-05
18217, #2019-11-17
18239, #2019-12-09
18270, #2020-01-09
18271, #2020-01-10
18275 #2020-01-14
),
class = "Date"
),
purchase = c(0,0, 0, 0, 0, 1, 0, 0, 1, 0)
),
row.names = c(NA, 10L),
class = "data.frame"
)
Desired outcome:
Outcome <- data.frame(
userID = c(1,1,2,2,2),
date.start = c("2019-09-29", "2019-11-05", "2019-12-09", "2020-01-10", "2020-01-14"),
date.end = c("2019-10-28", "2019-11-17", "2020-01-09", "2020-01-10", "2020-01-14"),
purchase = c(0, 1, 0, 1, 0)
)
Thanks in advance! :)

As in my answer to the linked question, I again suggest an accumulate strategy here:
library(tidyverse)
example
#> userID date purchase
#> 1 1 2019-09-29 0
#> 2 1 2019-10-20 0
#> 3 1 2019-10-28 0
#> 4 1 2019-11-05 0
#> 5 1 2019-11-05 0
#> 6 1 2019-11-17 1
#> 7 2 2019-12-09 0
#> 8 2 2020-01-09 0
#> 9 2 2020-01-10 1
#> 10 2 2020-01-14 0
example %>% group_by(userID) %>%
group_by(grp = unlist(accumulate2(date, purchase[-n()], ~ if(as.numeric(..2 - ..1) < 31 & ..3 != 1) ..1 else ..2)),
grp = with(rle(grp), rep(seq_along(lengths), lengths)), .add = T) %>%
summarise(start.date = first(date),
last.date = last(date), .groups = 'drop')
#> # A tibble: 5 x 4
#> userID grp start.date last.date
#> <dbl> <int> <date> <date>
#> 1 1 1 2019-09-29 2019-10-28
#> 2 1 2 2019-11-05 2019-11-17
#> 3 2 3 2019-12-09 2019-12-09
#> 4 2 4 2020-01-09 2020-01-10
#> 5 2 5 2020-01-14 2020-01-14
Created on 2021-06-13 by the reprex package (v2.0.0)

We could also use the following solution:
library(dplyr)
library(data.table)
example %>%
group_by(grp = cumsum(ifelse(lag(purchase, default = 0) == 1, 1, 0))) %>%
mutate(grp2 = cumsum(as.numeric(date - lag(date, default = first(date)))) > 30) %>%
ungroup() %>%
mutate(grp2 = data.table::rleid(grp2)) %>%
group_by(userID, grp, grp2) %>%
summarise(first = first(date), last = last(date), .groups = "drop") %>%
select(-grp)
# A tibble: 5 x 4
userID grp2 first last
<dbl> <int> <date> <date>
1 1 1 2019-09-29 2019-10-28
2 1 2 2019-11-05 2019-11-17
3 2 3 2019-12-09 2019-12-09
4 2 4 2020-01-09 2020-01-10
5 2 5 2020-01-14 2020-01-14

Because each time period depends on where the previous one ended (given a date, you can only tell whether it is the start, middle, or end of a period after inspecting every prior record), I cannot see any better way of doing this than using a for loop.
Something like the following:
# create output column
example = example %>% mutate(grouping = NA)
# setup tracking variables
current_date = as.Date('1900-01-01')
current_id = -1
prev_purchase = 0
current_group = 0
for(ii in 1:nrow(example)){
# reset on new identity OR on purchase OR on 31 days elapsed
if(example$userID[ii] != current_id # new identity
|| prev_purchase == 1 # just had a purchase
|| example$date[ii] - current_date > 31){ # more than 31 days elapsed
current_date = example$date[ii]
current_id = example$userID[ii]
prev_purchase = example$purchase[ii]
current_group = current_group + 1
example$grouping[ii] = current_group
}
# otherwise step forwards
else {
prev_purchase = example$purchase[ii]
example$grouping[ii] = current_group
}
}
One advantage of this approach is that you can pause after the for loop and check whether the groupings are as expected. The groups can then be collapsed to the requested output using:
output = example %>%
group_by(userID, grouping) %>%
summarise(date.start = min(date),
date.end = max(date),
purchase = max(purchase)) %>%
select(-grouping)
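As a cross-check, the same reset rules can be wrapped in a small base R function. This is only a sketch: the name assign_groups and the window argument are made up for illustration, and it assumes the rows are already sorted by userID and date, as in the example.

```r
# Hypothetical helper (not from any package): start a new group on a new
# user, after a purchase, or once more than `window` days have elapsed
# since the first date of the current group.
assign_groups <- function(userID, date, purchase, window = 31) {
  n <- length(userID)
  grouping <- integer(n)
  current_group <- 0L
  start_date <- date[1]
  for (i in seq_len(n)) {
    new_group <- i == 1L ||
      userID[i] != userID[i - 1L] ||               # new user
      purchase[i - 1L] == 1 ||                     # previous row was a purchase
      as.numeric(date[i] - start_date) > window    # window has elapsed
    if (new_group) {
      current_group <- current_group + 1L
      start_date <- date[i]
    }
    grouping[i] <- current_group
  }
  grouping
}

# Same data as the example in the question
userID   <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
date     <- as.Date(c(18168, 18189, 18197, 18205, 18205,
                      18217, 18239, 18270, 18271, 18275),
                    origin = "1970-01-01")
purchase <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0)

grouping <- assign_groups(userID, date, purchase)
print(grouping)
# 1 1 1 2 2 2 3 3 4 5
```

Keeping the rules in one function makes it easy to try a different window length without touching the loop body.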

Related

remove rows with overlaped dates and keep longest time interval in R using dplyr or sqldf

I need to remove rows with overlapped dates and keep the x value which is maximum among the overlapped dates. Here is a data frame:
data.frame(time_left = c("2011-08-05",
"2011-07-25",
"2017-08-20",
"2017-08-20",
"2017-10-09",
"2019-06-01"),
time_right= c("2011-09-14",
"2011-09-01",
"2017-09-12",
"2017-09-26",
"2017-10-15",
"2019-11-05"),
x = c(114,20,10,1,5,100) ) -> df
so my input is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
3 2017-08-20 2017-09-12 10
4 2017-08-20 2017-09-26 1
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
and my desired output is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
4 2017-08-20 2017-09-26 10
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
I appreciate any help.
@Maël brought this issue to my attention on the ivs issue page: https://github.com/DavisVaughan/ivs/issues/20.
I think this can be very elegantly and efficiently solved with ivs, but it is a bit hard to come up with the solution, so I'll probably add a helper to do this more easily.
This solution works with "recursive" containers too, i.e. where range A contains range B, but then range C also contains range A, so you really only want to list range C. I've described this in more detail with examples here https://github.com/DavisVaughan/ivs/issues/20#issuecomment-1234479783.
library(ivs)
library(dplyr)
library(vctrs)
df <- tibble(
time_left = as.Date(c(
"2011-08-05", "2011-07-25", "2017-08-20",
"2017-08-20", "2017-10-09", "2019-06-01"
)),
time_right = as.Date(c(
"2011-09-14", "2011-09-01", "2017-09-12",
"2017-09-26", "2017-10-15", "2019-11-05"
)),
x = c(114, 20, 10, 1, 5, 100)
)
df <- df %>%
mutate(range = iv(time_left, time_right), .keep = "unused")
df
#> # A tibble: 6 × 2
#> x range
#> <dbl> <iv<date>>
#> 1 114 [2011-08-05, 2011-09-14)
#> 2 20 [2011-07-25, 2011-09-01)
#> 3 10 [2017-08-20, 2017-09-12)
#> 4 1 [2017-08-20, 2017-09-26)
#> 5 5 [2017-10-09, 2017-10-15)
#> 6 100 [2019-06-01, 2019-11-05)
iv_locate_max_containment <- function(x) {
# Find all locations where the range "contains" any other range
# (including itself)
locs <- iv_locate_overlaps(x, x, type = "contains")
# Find the "top" ranges, i.e. the containers that aren't contained
# by any other containers
top <- !vec_duplicate_detect(locs$haystack)
top <- vec_slice(locs$haystack, top)
top <- vec_in(locs$needles, top)
locs <- vec_slice(locs, top)
locs
}
# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#> needles haystack
#> 1 1 1
#> 2 2 2
#> 3 4 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
iv_align(df$range, df$x, locations = locs) %>%
rename(range = needles) %>%
group_by(range) %>%
summarise(x = max(haystack))
#> # A tibble: 5 × 2
#> range x
#> <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01) 20
#> 2 [2011-08-05, 2011-09-14) 114
#> 3 [2017-08-20, 2017-09-26) 10
#> 4 [2017-10-09, 2017-10-15) 5
#> 5 [2019-06-01, 2019-11-05) 100
Created on 2022-09-01 with reprex v2.0.2
This may look a little verbose, but it could also be a solution:
First we identify those observations that are potentially overlapped.
Then we group the similar ones.
In each group we choose the minimum time_left and maximum time_right and x.
library(tidyverse)
library(lubridate) # ymd(), interval() and %within%; not attached by tidyverse before 2.0.0
df %>%
mutate(across(starts_with('time'), ymd),
intv = interval(time_left, time_right),
id = row_number()) %>%
mutate(id2 = map2(intv, id, ~ if (any(.x %within% intv[intv != .x])) {
id[which(.x %within% intv[intv != .x]) + 1]
} else {
.y
})) %>%
group_by(id2) %>%
summarise(time_left = min(time_left),
across(c(time_right, x), max)) %>%
select(!(id2))
# A tibble: 4 × 3
time_left time_right x
<date> <date> <dbl>
1 2011-08-05 2011-09-14 114
2 2017-08-20 2017-09-26 10
3 2017-10-09 2017-10-15 5
4 2019-06-01 2019-11-05 100
I combined Anoushiravan's solution with the approach from "How do I determine in R if a date interval overlaps another date interval for the same individual in a data frame?", and I think it is working now.
df %>%
mutate(id = row_number(), days = as.numeric(as.Date(time_right) - as.Date(time_left)) ) %>%
mutate(Int = interval(time_left, time_right),
within = map(seq_along(Int), function(x){
y = setdiff(seq_along(Int), x)
if(any(id[which((Int[x] %within% Int[y]))+1])){
return(id[days == max(days[which((Int[x] %within% Int[y]))+1])])
}else{ return(0)}
})
) %>%
mutate(within = ifelse(within > 0 , within, id)) %>%
group_by(within) %>%
summarise(time_left = min(time_left), time_right = max(time_right), x = max(x)) %>%
select(!within)
But it still has some bugs: for the following df, this code will not work unless I change the order of the records.
df = data.frame(time_left = c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"),
time_right = c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"),
x = c(10,100,200,20))
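One way to sidestep the order dependence noted above is to compare every pair of intervals directly. The following base R sketch is an assumption about what "overlapped" means here, namely that an interval is dropped only when it lies entirely inside another one (which is what the desired output in the question suggests). The helper name keep_containers is hypothetical.

```r
# Hypothetical helper: keep only intervals that are not strictly contained
# in another interval; each kept interval takes the largest x among the
# intervals it covers. Uses an n x n comparison, so it suits modest n.
keep_containers <- function(df) {
  idx <- seq_len(nrow(df))
  # contains[i, j] is TRUE when interval i covers interval j (including i == j)
  contains <- outer(idx, idx, function(i, j) {
    df$time_left[i] <= df$time_left[j] & df$time_right[i] >= df$time_right[j]
  })
  # interval j is strictly inside i when i covers j but j does not cover i
  strictly_inside <- contains & !t(contains)
  top <- which(!apply(strictly_inside, 2, any))
  out <- df[top, ]
  out$x <- vapply(top, function(i) max(df$x[contains[i, ]]), numeric(1))
  unique(out)  # identical intervals collapse to one row
}

df1 <- data.frame(
  time_left  = as.Date(c("2011-08-05", "2011-07-25", "2017-08-20",
                         "2017-08-20", "2017-10-09", "2019-06-01")),
  time_right = as.Date(c("2011-09-14", "2011-09-01", "2017-09-12",
                         "2017-09-26", "2017-10-15", "2019-11-05")),
  x = c(114, 20, 10, 1, 5, 100)
)
res1 <- keep_containers(df1)
print(res1)

# The order-sensitive example from above
df2 <- data.frame(
  time_left  = as.Date(c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26")),
  time_right = as.Date(c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31")),
  x = c(10, 100, 200, 20)
)
res2 <- keep_containers(df2)
print(res2)
```

Because every pair is compared, the result does not depend on the order of the rows.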

How to translate excel MINIFS function to dplyr code

Two columns and the formulas are replicated below:
MinSell=IF(F2="SELL",0,MINIFS(C:C,B:B,B2,F:F,"SELL")-C2)
MaxSell=IF(F2="SELL",0,MAXIFS(C:C,B:B,B2,F:F,"SELL")-C2)
Column F includes transactionstatus
Column C includes Tradedate
Column B includes AccountNo
I have a df containing hundreds of columns and millions of rows. Below is a small snippet of the df containing details of just one account
AccountNo<-c(11223344,11223344,11223344,11223344)
transactionstatus<-c("BUY","BUY","SELL","SELL")
Tradedate<-c("2020-01-17","2020-01-16","2020-01-13","2020-01-12")
df<-as.data.frame(cbind(AccountNo,transactionstatus,Tradedate))
Expected output
MinSell = c(-5, -4, 0, 0)
MaxSell = c(-4, -3, 0, 0)
You can create a variable containing the mindate and maxdate, then mutate columns with your condition.
Setup
library(dplyr)
# Tradedate must have Date class
df <- tibble(
AccountNo = c(11223344,11223344,11223344,11223344),
transactionstatus = c("BUY", "BUY", "SELL", "SELL"),
Tradedate = as.Date(c("2020-01-17", "2020-01-16", "2020-01-13", "2020-01-12")))
df
# A tibble: 4 x 3
AccountNo transactionstatus Tradedate
<dbl> <chr> <date>
1 11223344 BUY 2020-01-17
2 11223344 BUY 2020-01-16
3 11223344 SELL 2020-01-13
4 11223344 SELL 2020-01-12
Solution
# final df
binded <- tibble()
for (account in unique(df$AccountNo)) {
df_fltrd <- filter(df, AccountNo == account)
mindate <- min(df_fltrd$Tradedate[df_fltrd$transactionstatus == "SELL"])
maxdate <- max(df_fltrd$Tradedate[df_fltrd$transactionstatus == "SELL"])
solution <- df_fltrd %>%
mutate(minsell = if_else(transactionstatus == "SELL", 0, as.numeric(mindate-Tradedate)),
maxsell = if_else(transactionstatus == "SELL", 0, as.numeric(maxdate-Tradedate)))
binded <- bind_rows(binded, solution)
}
binded
# A tibble: 4 x 5
AccountNo transactionstatus Tradedate minsell maxsell
<dbl> <chr> <date> <dbl> <dbl>
1 11223344 BUY 2020-01-17 -5 -4
2 11223344 BUY 2020-01-16 -4 -3
3 11223344 SELL 2020-01-13 0 0
4 11223344 SELL 2020-01-12 0 0
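The per-account loop can probably be replaced by a grouped mutate, which tends to scale better to many accounts. The following is a sketch under the same assumptions as the setup above (Tradedate has class Date); the intermediate columns min_sell_date and max_sell_date are names introduced only for illustration. Note that an account with no SELL rows would produce a warning and infinite values here.

```r
library(dplyr)

df <- tibble(
  AccountNo = c(11223344, 11223344, 11223344, 11223344),
  transactionstatus = c("BUY", "BUY", "SELL", "SELL"),
  Tradedate = as.Date(c("2020-01-17", "2020-01-16", "2020-01-13", "2020-01-12"))
)

result <- df %>%
  group_by(AccountNo) %>%
  mutate(
    # earliest and latest SELL date within each account (MINIFS / MAXIFS)
    min_sell_date = min(Tradedate[transactionstatus == "SELL"]),
    max_sell_date = max(Tradedate[transactionstatus == "SELL"]),
    minsell = if_else(transactionstatus == "SELL", 0,
                      as.numeric(min_sell_date - Tradedate)),
    maxsell = if_else(transactionstatus == "SELL", 0,
                      as.numeric(max_sell_date - Tradedate))
  ) %>%
  ungroup() %>%
  select(-min_sell_date, -max_sell_date)

print(result)
```

The grouping variable plays the role of the B:B = B2 criterion in the Excel formula, and the logical subset plays the role of F:F = "SELL".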

How to create a new variable in different group with different condition in r with dplyr

I want to add a new variable in my data frame in different groups with different conditions.
My data looks like this:
test <- data.frame(country =rep( letters[1:5], each = 10),
time = seq(from = as.Date('2020-01-01'), to = as.Date('2020-02-19'), by = 'day')) %>% mutate(time = as.Date(time))
lockdown_time <- data.frame(country = letters[1:4],
start_time = c('2020-01-06', '2020-01-16', '2020-01-26', '2020-02-05'),
end_time = c('2020-01-08','2020-01-18','2020-01-28','2020-02-07'))
I would use country == 'a' as an example:
# use country a as an example
test_a <- test %>% filter(country == 'a')
start_time_a <- lockdown_time[1,2] %>% as.Date()
end_time_a <- lockdown_time[1,3] %>% as.Date()
test_a %>% mutate(lockdown = case_when(between(time, start_time_a, end_time_a) ~ 1, T ~ 0))
I know how to add the new variable lockdown in every country one by one, but I wonder if there is an efficient way to do this.
Note that there is no country == 'e' in lockdown_time dataframe, so the created lockdown variable in country == 'e' should all be NA.
You can use >= and <= to identify if a date falls in specified range.
library(dplyr)
test %>%
left_join(lockdown_time, by = "country") %>%
mutate(start_time = as.Date(start_time), end_time = as.Date(end_time),
lockdown = + (time >= start_time & time <= end_time)) %>%
select(-ends_with("_time"))
or use between() with rowwise()
test %>%
left_join(lockdown_time, by = "country") %>%
mutate(start_time = as.Date(start_time), end_time = as.Date(end_time)) %>%
rowwise() %>%
mutate(lockdown = + between(time, start_time, end_time)) %>%
select(-ends_with("_time")) %>%
ungroup()
Output
# A tibble: 50 x 3
country time lockdown
<chr> <date> <int>
1 a 2020-01-01 0
2 a 2020-01-02 0
3 a 2020-01-03 0
4 a 2020-01-04 0
5 a 2020-01-05 0
6 a 2020-01-06 1
7 a 2020-01-07 1
8 a 2020-01-08 1
9 a 2020-01-09 0
10 a 2020-01-10 0
11 b 2020-01-11 0
12 b 2020-01-12 0
13 b 2020-01-13 0
14 b 2020-01-14 0
15 b 2020-01-15 0
16 b 2020-01-16 1
17 b 2020-01-17 1
18 b 2020-01-18 1
19 b 2020-01-19 0
20 b 2020-01-20 0
...
46 e 2020-02-15 NA
47 e 2020-02-16 NA
48 e 2020-02-17 NA
49 e 2020-02-18 NA
50 e 2020-02-19 NA
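For comparison, the same lookup can be sketched in base R with match() instead of a join. This is only an illustration built on the data from the question; the intermediate names idx, start and end are made up here. Countries missing from lockdown_time get NA automatically, because comparisons with NA dates return NA.

```r
test <- data.frame(
  country = rep(letters[1:5], each = 10),
  time = seq(from = as.Date("2020-01-01"), to = as.Date("2020-02-19"), by = "day")
)
lockdown_time <- data.frame(
  country = letters[1:4],
  start_time = c("2020-01-06", "2020-01-16", "2020-01-26", "2020-02-05"),
  end_time = c("2020-01-08", "2020-01-18", "2020-01-28", "2020-02-07")
)

# Row of lockdown_time that belongs to each row of test (NA for country "e")
idx   <- match(test$country, lockdown_time$country)
start <- as.Date(lockdown_time$start_time)[idx]
end   <- as.Date(lockdown_time$end_time)[idx]

# 1 inside the lockdown window, 0 outside, NA when no window is known
test$lockdown <- as.integer(test$time >= start & test$time <= end)

print(table(test$country, test$lockdown, useNA = "ifany"))
```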
You need a left_join. I'm also using the lubridate package to make it easy to test whether a date falls between two others.
library(tidyverse)
library(lubridate)
test <- data.frame(
country =rep( letters[1:5], each = 10),
time = seq(from = as.Date('2020-01-01'), to = as.Date('2020-02-19'), by = 'day'),
stringsAsFactors = F
) %>%
mutate(time = lubridate::as_date(time))
lockdown_time <- data.frame(
country = letters[1:4],
start_time = c('2020-01-06', '2020-01-16', '2020-01-26', '2020-02-05'),
end_time = c('2020-01-08','2020-01-18','2020-01-28','2020-02-07'),
stringsAsFactors = F
) %>%
mutate(
start_time = as_date(start_time),
end_time = as_date(end_time))
test %>%
left_join(lockdown_time) %>%
mutate(lockdown = as.integer(time %within% interval(start_time, end_time)))

R - dataframe with sequence from another dataframe

I have a dataframe like this:
source_data <-
data.frame(
id = c(seq(1,3)),
start = c(as.Date("2020-04-04"), as.Date("2020-04-02"), as.Date("2020-04-03")),
end = c(as.Date("2020-04-08"), as.Date("2020-04-05"), as.Date("2020-04-05"))
)
I want to create a date sequence for each id, i.e. create each day between the start and end dates, and put it into another dataframe. The result should look like this:
result <-
data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
date = c(
as.Date("2020-04-04"),
as.Date("2020-04-05"),
as.Date("2020-04-06"),
as.Date("2020-04-07"),
as.Date("2020-04-08"),
as.Date("2020-04-02"),
as.Date("2020-04-03"),
as.Date("2020-04-04"),
as.Date("2020-04-05"),
as.Date("2020-04-03"),
as.Date("2020-04-04"),
as.Date("2020-04-05")
)
)
I started with this date sequence, but how to join my source_data dataframe there?
solution <-
data.frame(
date = seq(min(source_data$start), max(source_data$end), by = 1)
)
We can use map2 to create the sequence between each corresponding 'start', 'end' dates and then unnest the list column
library(dplyr)
library(purrr)
library(tidyr)
source_data %>%
transmute(id, date = map2(start, end, seq, by = '1 day')) %>%
unnest(c(date))
# A tibble: 12 x 2
# id date
# <int> <date>
# 1 1 2020-04-04
# 2 1 2020-04-05
# 3 1 2020-04-06
# 4 1 2020-04-07
# 5 1 2020-04-08
# 6 2 2020-04-02
# 7 2 2020-04-03
# 8 2 2020-04-04
# 9 2 2020-04-05
#10 3 2020-04-03
#11 3 2020-04-04
#12 3 2020-04-05
Or using data.table
library(data.table)
setDT(source_data)[, .(date = seq(start, end, by = '1 day')), by = id]
Additional option with base R
lst1 <- Map(seq, source_data$start, source_data$end, MoreArgs = list(by = '1 day'))
data.frame(id = rep(source_data$id, lengths(lst1)), date = do.call(c, lst1))
Another base R solution
result <- do.call(rbind,
c(make.row.names = FALSE,
lapply(split(source_data,source_data$id),
function(v) with(v,data.frame(id = id, date = seq(start,end,by = 1))))))
which yields
> result
id date
1 1 2020-04-04
2 1 2020-04-05
3 1 2020-04-06
4 1 2020-04-07
5 1 2020-04-08
6 2 2020-04-02
7 2 2020-04-03
8 2 2020-04-04
9 2 2020-04-05
10 3 2020-04-03
11 3 2020-04-04
12 3 2020-04-05
Additional option
library(dplyr)
library(tidyr) # unnest() comes from tidyr
source_data %>%
rowwise() %>%
mutate(out = list(seq.Date(start, end, "day"))) %>%
unnest(out) %>%
select(-c(start, end))

creating a matrix with the sum of payment depending on date1 (row) and date2 (column) in R

I have a database with two dates (sold date and pay date). I would like to create an N x M matrix with the sum of the values depending on the dates, with the sold date in the rows and the pay date in the columns.
The code to create the example data is here:
#creating base
sold_date <- as.Date(c("01-01-2019", "01-01-2019", "01-02-2019", "01-02-2019", "01-03-2019", "01-01-2019"), "%d-%m-%Y")
pay_date <- as.Date(c("01-01-2019", "01-01-2019", "01-03-2019", "01-02-2019", "01-03-2019", "01-02-2019"), "%d-%m-%Y")
value <- c(10, 3, 5, 10, 15, 20)
base <- data.frame(sold_date, pay_date, value)
how can I do this?
best regards
A data.table approach where pivoting and aggregating can be done in a single step:
library(data.table) # needed for the unqualified setDT() call
data.table::dcast(
setDT(base), sold_date ~ pay_date,
value.var = 'value',
fun.aggregate = sum
)
library(tidyverse)
base %>%
group_by(sold_date, pay_date) %>% # get unique pairs of dates
summarise(value = sum(value)) %>% # and get sum of values
ungroup() %>% # forget the grouping
spread(pay_date, value, fill = 0) # reshape dataset
# # A tibble: 3 x 5
# sold_date `2019-01-01` `2019-02-01` `2019-03-01` `2019-04-01`
# <date> <dbl> <dbl> <dbl> <dbl>
# 1 2019-01-01 10 20 0 0
# 2 2019-02-01 0 0 5 10
# 3 2019-03-01 0 0 15 0
library(tidyverse)
base %>%
group_by(sold_date, pay_date) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = pay_date, values_from = value, values_fill = list(value = 0))
# A tibble: 3 x 5
# Groups: sold_date [3]
sold_date `2019-01-01` `2019-02-01` `2019-03-01` `2019-04-01`
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 10 20 0 0
2 2019-02-01 0 0 5 10
3 2019-03-01 0 0 15 0
Your example data does not have multiple entries with the same sold_date - pay_date combination, so I fixed that for you:
sold_date <- as.Date(c("01-01-2019", "01-01-2019", "01-02-2019", "01-03-2019", "01-01-2019"), "%d-%m-%Y")
pay_date <- as.Date(c("01-03-2019", "01-03-2019", "01-04-2019", "01-03-2019", "01-02-2019"), "%d-%m-%Y")
value <- c(10, 5, 10, 15, 20)
base <- data.frame(sold_date, pay_date, value)
Then we can use dplyr syntax to group by 'sold_date' and 'pay_date' and sum the variable 'value' for each group:
base %>%
group_by(sold_date, pay_date) %>%
summarise(Total = sum(value))
If you want the data in the form shown in your original question, we can use tidyr::pivot_wider:
base %>%
group_by(sold_date, pay_date) %>%
summarise(Total = sum(value)) %>%
pivot_wider(names_from = pay_date,
values_from = Total)
Here is a base R solution using reshape + aggregate
dfout <- reshape(aggregate(value ~ sold_date + pay_date,df,sum),
direction = "wide",
idvar = "sold_date",
timevar = "pay_date")
such that
> dfout
sold_date value.2019-01-01 value.2019-02-01 value.2019-03-01
1 2019-01-01 13 20 NA
3 2019-02-01 NA 10 5
5 2019-03-01 NA NA 15
If you want to fill NA with 0, append dfout[is.na(dfout)] <- 0 to the end of the code above, such that
> dfout
sold_date value.2019-01-01 value.2019-02-01 value.2019-03-01
1 2019-01-01 13 20 0
3 2019-02-01 0 10 5
5 2019-03-01 0 0 15
DATA
df <- structure(list(sold_date = structure(c(17897, 17897, 17928, 17928,
17956, 17897), class = "Date"), pay_date = structure(c(17897,
17897, 17956, 17928, 17956, 17928), class = "Date"), value = c(10,
3, 5, 10, 15, 20)), class = "data.frame", row.names = c(NA, -6L
))
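Another base R sketch, using the data from the original question: xtabs() sums a variable over every combination of two factors and fills missing combinations with 0, so it yields the sold date by pay date table in one call. This is offered as an alternative illustration rather than as part of the answer above.

```r
base <- data.frame(
  sold_date = as.Date(c("2019-01-01", "2019-01-01", "2019-02-01",
                        "2019-02-01", "2019-03-01", "2019-01-01")),
  pay_date  = as.Date(c("2019-01-01", "2019-01-01", "2019-03-01",
                        "2019-02-01", "2019-03-01", "2019-02-01")),
  value = c(10, 3, 5, 10, 15, 20)
)

# Rows are sold dates, columns are pay dates, cells are summed values
m <- xtabs(value ~ sold_date + pay_date, data = base)
print(m)
```

The result is a contingency table whose row and column names are the formatted dates, so cells can be looked up by name, e.g. m["2019-01-01", "2019-02-01"].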
