Count instances of value within overlapping dates? (R)

I have a dataframe that includes start_date and end_date for a given unit_id along with the unit's group.
in_df <- data.frame(unit_id = c(1, 2, 3),
                    start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                    end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                    group = c("pass", "fail", "pass"))
For each unit_id, I need to calculate the proportion of all units that pass within that unit's duration, i.e. between its start_date and end_date.
Taking unit_id = 1 as an example, I need to find all units whose start_date and/or end_date falls within the dates for unit 1, i.e. start_date = 2019-01-01 and end_date = 2019-02-06. Given my in_df, this returns two units, 1 and 2. One passes and one fails, so the pass proportion is 0.5. desired_df shows the output I expect for this example.
desired_df <- data.frame(unit_id = c(1, 2, 3),
                         start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                         end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                         group = c("pass", "fail", "pass"),
                         pass_prop = c(0.5, 0.5, 1))
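To make the rule concrete, the check for a single unit can be sketched in base R (treating "within the dates" as any overlap between the closed date ranges, i.e. each range starts on or before the other ends), here for unit 1:
ref <- in_df[in_df$unit_id == 1, ]
hits <- in_df$start_date <= ref$end_date & in_df$end_date >= ref$start_date
mean(in_df$group[hits] == "pass")
#> [1] 0.5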
What I've tried
There are a lot of existing posts related to identifying overlapping dates. I've tried to work through some to see if I can figure this out but haven't been successful.
The following is the closest that I've gotten. It does what I want on my toy example but not on the real data (additional example data below).
library(dplyr)
library(ivs)
in_df <- data.frame(unit_id = c(1, 2, 3),
                    start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                    end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                    group = c("pass", "fail", "pass"))
desired_df <- data.frame(unit_id = c(1, 2, 3),
                         start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                         end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                         group = c("pass", "fail", "pass"),
                         pass_prop = c(0.5, 0.5, 1))
in_df <- in_df %>%
  mutate(
    start_dt = as.Date(start_date),
    end_dt = as.Date(end_date)
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )
in_df$row_n <- 1:nrow(in_df)
in_df <- in_df %>%
  group_by(group) %>%
  mutate(groupDate = iv_identify_group(range)) %>%
  group_by(groupDate, .add = TRUE)
groupCount <- in_df %>% group_by(groupDate) %>% dplyr::summarize(totalCount=n())
durationCount <- in_df %>% group_by(groupDate,group) %>% dplyr::summarize(groupCount=n())
durationCount <- dplyr::inner_join(groupCount,durationCount, by = "groupDate")
durationCount$pass_prop <- durationCount$groupCount/durationCount$totalCount
durationCount <- filter(durationCount, group == "pass")
desired_df <- dplyr::full_join(in_df,durationCount, by = "groupDate")
desired_df
The above displays exactly what I need under pass_prop. The problem with this is that iv_identify_group extends the groupDate too far when additional dates overlap as shown below.
Take unit = 1 as an example again. If I add another row to in_df that overlaps with unit = 2 and unit = 3, then the groupDate gets extended to include the ranges for units 1, 2, and 4. This happens because unit 1 overlaps with 2 and 2 overlaps with 4. I want it to stop at the overlap with unit 2, since the range of unit 1 does not overlap with unit 4. The data below reproduces this undesired output.
in_df <- data.frame(unit_id = c(1, 2, 3, 4),
                    start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12", "2019-02-20")),
                    end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30", "2020-01-30")),
                    group = c("pass", "fail", "pass", "pass"))
# execute same code as above

Perhaps this?
library(dplyr)
in_df %>%
  fuzzyjoin::fuzzy_left_join(
    in_df, by = c("start_date" = "end_date", "end_date" = "start_date"),
    match_fun = list(`<=`, `>=`)) %>%
  group_by(unit_id = unit_id.x, start_date = start_date.x,
           end_date = end_date.x, group = group.x) %>%
  summarize(pass_prop = sum(group.y == "pass") / n(), .groups = "drop")
Result
unit_id start_date end_date group pass_prop
<dbl> <date> <date> <chr> <dbl>
1 1 2019-01-01 2019-02-06 pass 0.5
2 2 2019-02-05 2019-02-28 fail 0.5
3 3 2020-01-12 2020-01-30 pass 1
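If you would rather avoid the fuzzyjoin dependency, the same pass_prop values can be reproduced with a plain rowwise() sketch that checks every row against the whole frame (quadratic, so only sensible for smallish data; assumes closed intervals as above):
library(dplyr)
in_df %>%
  rowwise() %>%
  mutate(pass_prop = {
    hit <- in_df$start_date <= end_date & in_df$end_date >= start_date
    mean(in_df$group[hit] == "pass")
  }) %>%
  ungroup()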

I think ivs can help you, but you might be looking for iv_locate_overlaps() here:
library(ivs)
library(tidyverse)
# Starting with the more complex example with the 4th row
in_df <- tibble(unit_id = c(1, 2, 3, 4),
                start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12", "2019-02-20")),
                end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30", "2020-01-30")),
                group = c("pass", "fail", "pass", "pass"))
in_df <- in_df %>%
  mutate(range = iv(start_date, end_date), .keep = "unused")
in_df
#> # A tibble: 4 × 3
#> unit_id group range
#> <dbl> <chr> <iv<date>>
#> 1 1 pass [2019-01-01, 2019-02-06)
#> 2 2 fail [2019-02-05, 2019-02-28)
#> 3 3 pass [2020-01-12, 2020-01-30)
#> 4 4 pass [2019-02-20, 2020-01-30)
# "find all units that have `start_date` and/or `end_date` within the dates for unit i"
# So you are looking for "any" kind of overlap.
# `iv_locate_overlaps()` does: "For each `needle`, find every location in `haystack`
# where that `needle` has ANY overlap at all"
locs <- iv_locate_overlaps(
  needles = in_df$range,
  haystack = in_df$range,
  type = "any"
)
# Note `needle` 1 overlaps `haystack` locations 1 and 2 (which is what you said
# you want for unit 1)
locs
#> needles haystack
#> 1 1 1
#> 2 1 2
#> 3 2 1
#> 4 2 2
#> 5 2 4
#> 6 3 3
#> 7 3 4
#> 8 4 2
#> 9 4 3
#> 10 4 4
# Slice `in_df` appropriately, keeping relevant columns needed to answer the question
needles <- in_df[locs$needles, c("unit_id", "range")]
haystack <- in_df[locs$haystack, c("group", "range")]
haystack <- rename(haystack, overlaps = range)
expanded_df <- bind_cols(needles, haystack)
expanded_df
#> # A tibble: 10 × 4
#> unit_id range group overlaps
#> <dbl> <iv<date>> <chr> <iv<date>>
#> 1 1 [2019-01-01, 2019-02-06) pass [2019-01-01, 2019-02-06)
#> 2 1 [2019-01-01, 2019-02-06) fail [2019-02-05, 2019-02-28)
#> 3 2 [2019-02-05, 2019-02-28) pass [2019-01-01, 2019-02-06)
#> 4 2 [2019-02-05, 2019-02-28) fail [2019-02-05, 2019-02-28)
#> 5 2 [2019-02-05, 2019-02-28) pass [2019-02-20, 2020-01-30)
#> 6 3 [2020-01-12, 2020-01-30) pass [2020-01-12, 2020-01-30)
#> 7 3 [2020-01-12, 2020-01-30) pass [2019-02-20, 2020-01-30)
#> 8 4 [2019-02-20, 2020-01-30) fail [2019-02-05, 2019-02-28)
#> 9 4 [2019-02-20, 2020-01-30) pass [2020-01-12, 2020-01-30)
#> 10 4 [2019-02-20, 2020-01-30) pass [2019-02-20, 2020-01-30)
# Compute the pass proportion per unit
expanded_df %>%
  group_by(unit_id) %>%
  summarise(pass_prop = sum(group == "pass") / length(group))
#> # A tibble: 4 × 2
#> unit_id pass_prop
#> <dbl> <dbl>
#> 1 1 0.5
#> 2 2 0.667
#> 3 3 1
#> 4 4 0.667
Created on 2022-07-19 by the reprex package (v2.0.1)
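As a follow-up, and assuming your installed version of ivs also exports iv_count_overlaps() (it sits alongside iv_locate_overlaps(); if yours doesn't have it, stick with the locate/align approach above), the proportions can be sketched without materialising the expanded rows, by counting overlaps against all units and against the passing units only:
in_df %>%
  mutate(
    n_overlap = iv_count_overlaps(range, range, type = "any"),
    n_pass    = iv_count_overlaps(range, range[group == "pass"], type = "any"),
    pass_prop = n_pass / n_overlap
  )
This uses the in_df that already has the range column from the code above.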

Related

remove rows with overlapped dates and keep longest time interval in R using dplyr or sqldf

I need to remove rows with overlapping dates and keep the x value which is the maximum among the overlapping dates. Here is a data frame:
data.frame(time_left = c("2011-08-05",
                         "2011-07-25",
                         "2017-08-20",
                         "2017-08-20",
                         "2017-10-09",
                         "2019-06-01"),
           time_right = c("2011-09-14",
                          "2011-09-01",
                          "2017-09-12",
                          "2017-09-26",
                          "2017-10-15",
                          "2019-11-05"),
           x = c(114, 20, 10, 1, 5, 100)) -> df
so my input is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
3 2017-08-20 2017-09-12 10
4 2017-08-20 2017-09-26 1
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
and my desired output is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
4 2017-08-20 2017-09-26 10
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
I appreciate any help.
@Maël brought this issue to my attention over on the ivs issue page https://github.com/DavisVaughan/ivs/issues/20.
I think this can be very elegantly and efficiently solved with ivs, but it is a bit hard to come up with the solution, so I'll probably add a helper to do this more easily.
This solution works with "recursive" containers too, i.e. where range A contains range B, but then range C also contains range A, so you really only want to list range C. I've described this in more detail with examples here https://github.com/DavisVaughan/ivs/issues/20#issuecomment-1234479783.
library(ivs)
library(dplyr)
library(vctrs)
df <- tibble(
  time_left = as.Date(c(
    "2011-08-05", "2011-07-25", "2017-08-20",
    "2017-08-20", "2017-10-09", "2019-06-01"
  )),
  time_right = as.Date(c(
    "2011-09-14", "2011-09-01", "2017-09-12",
    "2017-09-26", "2017-10-15", "2019-11-05"
  )),
  x = c(114, 20, 10, 1, 5, 100)
)
df <- df %>%
  mutate(range = iv(time_left, time_right), .keep = "unused")
df
#> # A tibble: 6 × 2
#> x range
#> <dbl> <iv<date>>
#> 1 114 [2011-08-05, 2011-09-14)
#> 2 20 [2011-07-25, 2011-09-01)
#> 3 10 [2017-08-20, 2017-09-12)
#> 4 1 [2017-08-20, 2017-09-26)
#> 5 5 [2017-10-09, 2017-10-15)
#> 6 100 [2019-06-01, 2019-11-05)
iv_locate_max_containment <- function(x) {
  # Find all locations where the range "contains" any other range
  # (including itself)
  locs <- iv_locate_overlaps(x, x, type = "contains")
  # Find the "top" ranges, i.e. the containers that aren't contained
  # by any other containers
  top <- !vec_duplicate_detect(locs$haystack)
  top <- vec_slice(locs$haystack, top)
  top <- vec_in(locs$needles, top)
  locs <- vec_slice(locs, top)
  locs
}
# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#> needles haystack
#> 1 1 1
#> 2 2 2
#> 3 4 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
iv_align(df$range, df$x, locations = locs) %>%
  rename(range = needles) %>%
  group_by(range) %>%
  summarise(x = max(haystack))
#> # A tibble: 5 × 2
#> range x
#> <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01) 20
#> 2 [2011-08-05, 2011-09-14) 114
#> 3 [2017-08-20, 2017-09-26) 10
#> 4 [2017-10-09, 2017-10-15) 5
#> 5 [2019-06-01, 2019-11-05) 100
Created on 2022-09-01 with reprex v2.0.2
This may sound a little verbose; however, it could also be a solution:
First we identify the observations that potentially overlap.
Then we group the similar ones.
In each group we choose the minimum time_left and the maximum time_right and x.
library(tidyverse)
library(lubridate) # for ymd(), interval() and %within% (core tidyverse only since tidyverse 2.0)
df %>%
  mutate(across(starts_with('time'), ymd),
         intv = interval(time_left, time_right),
         id = row_number()) %>%
  mutate(id2 = map2(intv, id, ~ if (any(.x %within% intv[intv != .x])) {
    id[which(.x %within% intv[intv != .x]) + 1]
  } else {
    .y
  })) %>%
  group_by(id2) %>%
  summarise(time_left = min(time_left),
            across(c(time_right, x), max)) %>%
  select(!(id2))
# A tibble: 4 × 3
time_left time_right x
<date> <date> <dbl>
1 2011-08-05 2011-09-14 114
2 2017-08-20 2017-09-26 10
3 2017-10-09 2017-10-15 5
4 2019-06-01 2019-11-05 100
I combined Anoushiravan's solution with this
How do I determine in R if a date interval overlaps another date interval for the same individual in a data frame?
and I think it is working now.
df %>%
  mutate(id = row_number(), days = as.numeric(as.Date(time_right) - as.Date(time_left))) %>%
  mutate(Int = interval(time_left, time_right),
         within = map(seq_along(Int), function(x) {
           y = setdiff(seq_along(Int), x)
           if (any(id[which((Int[x] %within% Int[y])) + 1])) {
             return(id[days == max(days[which((Int[x] %within% Int[y])) + 1])])
           } else {
             return(0)
           }
         })
  ) %>%
  mutate(within = ifelse(within > 0, within, id)) %>%
  group_by(within) %>%
  summarise(time_left = min(time_left), time_right = max(time_right), x = max(x)) %>%
  select(!within)
But it still has some bugs. For the following df, this code will not work unless I change the order of the records.
df = data.frame(time_left = c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"),
                time_right = c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"),
                x = c(10, 100, 200, 20))
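For what it's worth, the ivs-based iv_locate_max_containment() approach above does not depend on row order. A minimal sketch on this data (df2 is just a copy so the original df isn't clobbered); it should collapse everything to the single containing interval with x = 200 no matter how the rows are shuffled:
library(ivs)
library(dplyr)
df2 <- tibble(time_left = as.Date(c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26")),
              time_right = as.Date(c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31")),
              x = c(10, 100, 200, 20)) %>%
  slice(sample(n())) %>% # shuffle: row order should not matter
  mutate(range = iv(time_left, time_right), .keep = "unused")
locs <- iv_locate_max_containment(df2$range) # helper defined in the ivs answer above
iv_align(df2$range, df2$x, locations = locs) %>%
  rename(range = needles) %>%
  group_by(range) %>%
  summarise(x = max(haystack))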

Count number of observations per distinct group inside summarise with dplyr (n_distinct equivalent?)

Is there a function that counts the number of observations within unique groups and not the number of distinct groups as n_distinct() does?
I'm summarising data with dplyr and group_by(), and I'm trying to calculate means of numbers of observations per a different grouping variable.
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = c(sample(1:10, 8)))
df %>%
  group_by(id.2) %>%
  summarise(n.mean = mean(n_distinct(id)),
            v.mean = mean(v))
# A tibble: 2 × 3
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 3 5
2 2 2 4.5
What I instead need:
id.2 n.mean v.mean
1 1 5
2 2 4.5
because for
id.2==1 n.mean is the mean of 1 observation for A, 2 for B, 1 observation for C,
> mean(1,2,1)
[1] 1
id.2==2 n.mean is the mean of 2 observations for A, 0 for B, 2 for C,
mean(2,0,2)
[1] 2
I tried grouping by group_by(id, id.2) first to count the observations and then passing those counts on when grouping by only id.2 in a subsequent step, but that didn't work (though I probably just don't know how to implement this with dplyr, as I'm not very experienced with tidyverse solutions).
You are not using mean correctly. mean(1, 2, 1) ignores all but the first argument and therefore will return 1 no matter what other numbers are in the second and third positions. For id.2 == 1, you'd want mean(c(1, 2, 1)), which returns 1.333.
We can use table to quickly calculate the frequencies of id within each grouping of id.2, and then take the mean of those. We can compute v.mean in the same step.
library(tidyverse)
df %>%
  group_by(id.2) %>%
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 2 6
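If you prefer to stay within dplyr verbs, the same behaviour (empty groups ignored) can be sketched with count() instead of table(); shown for n.mean only:
df %>%
  count(id.2, id, name = "n_obs") %>% # rows per id within each id.2; empty combinations are dropped
  group_by(id.2) %>%
  summarise(n.mean = mean(n_obs))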
Your example notes that id.2 == 2 does not have any values for id == B. It is not clear whether your desired solution counts this as a zero-length category, or simply ignores it. The solution above ignores it. The following includes it as a zero-length category by first complete-ing the input data (note new row #7, which has NA data):
df_complete <- complete(df, id.2, id)
id.2 id v
<chr> <chr> <int>
1 1 A 9
2 1 B 1
3 1 B 2
4 1 C 5
5 2 A 4
6 2 A 7
7 2 B NA
8 2 C 3
9 2 C 10
We can convert id to factor data, which will force table to preserve its unique levels even in groupings of zero length:
df_complete %>%
  group_by(id.2) %>%
  mutate(id = factor(id)) %>%
  filter(!is.na(v)) %>%
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v, na.rm = T)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Or an alternate recipe that does not rely on table:
df_complete %>%
  group_by(id.2, id) %>%
  summarize(
    n_rows = sum(!is.na(v)),
    id_mean = mean(v)
  ) %>%
  group_by(id.2) %>%
  summarize(
    n.mean = mean(n_rows),
    v.mean = weighted.mean(id_mean, n_rows, na.rm = T)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Note that when providing randomized example data, you should use set.seed to control the randomization and ensure reproducibility. Here is what I used:
set.seed(0)
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = c(sample(1:10, 8)))

Sum of elements in a forward looking rolling window by month

I have the following data.frame with columns: Id, Month, have
library(dplyr)
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-2018 1.000000000000000 1.234567901220000
1 01-Feb-2018 0.200000000000000 0.234567901233000
1 01-Mar-2018 0.030000000000000 0.034567901234400
1 01-Apr-2018 0.004000000000000 0.004567901234550
1 01-May-2018 0.000500000000000 0.000567901234566
1 01-Jun-2018 0.000060000000000 0.000067901234566
1 01-Jul-2018 0.000007000000000 0.000007901234566
1 01-Aug-2018 0.000000800000000 0.000000901234566
1 01-Sep-2018 0.000000090000000 0.000000101234566
1 01-Oct-2018 0.000000010000000 0.000000011234566
1 01-Nov-2018 0.000000001100000 0.000000001234566
1 01-Dec-2018 0.000000000120000 0.000000000134566
1 01-Jan-2019 0.000000000013000 0.000000000014566
1 01-Feb-2019 0.000000000001400 0.000000000001566
1 01-Mar-2019 0.000000000000150 0.000000000000166
1 01-Apr-2019 0.000000000000016 0.000000000000016
2 01-Jan-2018 1337.00 1338.00
2 01-Feb-2018 1.00 1.00
3 01-Jan-2018 5.000000000000000000 5.000000000000000
') %>% mutate(Month = as.Date(Month, format = '%d-%b-%Y'))
I would like to programmatically calculate the sum of elements in a 12-month forward-looking rolling window by Month, grouped by Id, as demonstrated in the column want. If the rolling observation window is shorter than 12 months, the missing elements should be ignored.
For bonus points, the solution would ideally also allow for missing months, such as in:
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-18 1.000000000000000 1.200000000000000
1 01-Dec-18 0.200000000000000 0.230000000000000
1 01-Jan-19 0.030000000000000 0.030000000000000
') %>% mutate(Month = as.Date(Month, format = '%d-%b-%y'))
I have tried different solutions, e.g. rollapplyr() from the zoo package and some functions in the runner package, but they don't seem to give me what I need.
You can use zoo's rollapply with partial = TRUE:
library(dplyr)
dt %>%
  group_by(Id) %>%
  tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
  mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
                                 align = 'left', partial = TRUE)) -> result
result
If you have data for every month for each Id, like in the example shared, you can remove the complete step.
I suggest using the runner package in this case. The runner() function lets you calculate rolling windows with full control over time: k is the window length, lag is the lag of the window, and in idx you specify the index column that the window depends on.
library(runner)
dt %>%
  group_by(Id) %>%
  mutate(want2 = runner(
    .,
    f = function(x) sum(x$have),
    k = 12,    # or "12 months"
    lag = -11, # or "-11 months"
    idx = Month)
  )
# # A tibble: 19 x 5
# # Groups: Id [3]
# Id Month have want want2
# <int> <date> <dbl> <dbl> <dbl>
# 1 1 2018-01-01 1.00e+ 0 1.23e+ 0 1.00e+ 0
# 2 1 2018-02-01 2.00e- 1 2.35e- 1 2.00e- 1
# 3 1 2018-03-01 3.00e- 2 3.46e- 2 3.00e- 2
# 4 1 2018-04-01 4.00e- 3 4.57e- 3 4.00e- 3
# 5 1 2018-05-01 5.00e- 4 5.68e- 4 5.00e- 4
# 6 1 2018-06-01 6.00e- 5 6.79e- 5 6.00e- 5
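For the bonus case with missing months, here is an index-aware sketch that avoids filling in months entirely. It is quadratic within each group and assumes lubridate is available for the month arithmetic:
library(dplyr)
library(lubridate)
dt %>%
  group_by(Id) %>%
  mutate(want2 = sapply(Month, function(m)
    sum(have[Month >= m & Month < m %m+% months(12)]))) %>% # forward 12-month window from each row
  ungroup()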

replace a value in one column with a value from a second column on condition of a value from a third column from different rows

I have a data frame:
df1 <- data.frame(Object = c("Klaus", "Klaus", "Peter", "Peter", "Daniel", "Daniel"),
                  PointA = as.numeric(c("7", NA, "17", NA, NA, NA)),
                  PointB = as.numeric(c("18", "22", NA, NA, "17", NA)),
                  measure = c("1", "2", "1", "2", "1", "2"))
And I want this:
df2 <- data.frame(Object = c("Klaus", "Klaus", "Peter", "Peter", "Daniel", "Daniel"),
                  PointA = as.numeric(c("7", "18", "17", NA, NA, "17")),
                  PointB = as.numeric(c("18", "22", NA, NA, "17", NA)),
                  measure = c("1", "2", "1", "2", "1", "2"))
Which is: if there is no value for an Object for PointA for measure == 2, I want it replaced with PointB from measure == 1 of the same Object.
First thing that comes to mind is:
library(dplyr)
df$PointA <- coalesce(df$PointA, df$PointB)
But AFAIK there is no way to make this conditional.
Then I thought maybe something like:
df$PointA[is.na(df$PointA)] <- df$PointB
But this does not differentiate for the measure.
So I thought about:
df$PointA <- ifelse(df$measure == 2 & is.na(df$PointA), df$PointB, df$PointA)
But that does not take into account that I need the corresponding value from measure == 1.
Now I am at a loss. I am out of ideas for how to approach this. Help?
Edit: I already got two very good solutions, but both rely on the order in the data frame. I tried, but obviously my example was too simple. I am looking for something that also works under the following condition:
df1 <- df1[sample(nrow(df1)), ]
One possible option is using row_number() from dplyr. In case you need to sort your dataframe first, you can insert an arrange statement.
library(dplyr)
df1 %>%
  arrange(Object, measure) %>%
  group_by(Object) %>%
  mutate(PointA = if_else(measure == 2 & is.na(PointA), PointB[row_number() - 1], PointA))
# A tibble: 6 x 4
# Groups: Object [3]
# Object PointA PointB measure
# <chr> <dbl> <dbl> <chr>
# 1 Daniel NA 17 1
# 2 Daniel 17 NA 2
# 3 Klaus 7 18 1
# 4 Klaus 18 22 2
# 5 Peter 17 NA 1
# 6 Peter NA NA 2
You could use coalesce() + lag() as shown below:
library(tidyverse)
df1 %>%
  arrange(Object, measure) %>%
  group_by(Object) %>%
  mutate(PointA = coalesce(PointA, lag(PointB)))
# A tibble: 6 x 4
# Groups:   Object [3]
  Object PointA PointB measure
  <chr>   <dbl>  <dbl> <chr>
1 Daniel     NA     17 1
2 Daniel     17     NA 2
3 Klaus       7     18 1
4 Klaus      18     22 2
5 Peter      17     NA 1
6 Peter      NA     NA 2
This could be condensed, but it should be relatively clear and doesn't rely on the row order at all. Beware if you have multiple rows for the same Object/Measure pair - the self-join will have multiple matches and you'll end up with a lot more rows than you started with.
library(dplyr)
df_fill = df1 %>%
  filter(measure == 1) %>%
  select(Object, fill_in = PointB) %>%
  mutate(needs_fill = 1L)
result = df1 %>%
  mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
  left_join(df_fill) %>%
  mutate(PointA = coalesce(PointA, fill_in)) %>%
  select(-fill_in, -needs_fill)
result
# Object PointA PointB measure
# 1 Klaus 7 18 1
# 2 Klaus 18 22 2
# 3 Peter 17 NA 1
# 4 Peter NA NA 2
# 5 Daniel NA 17 1
# 6 Daniel 17 NA 2
Same as above but without saving the intermediate object:
result = df1 %>%
  mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
  left_join(
    df1 %>%
      filter(measure == 1) %>%
      select(Object, fill_in = PointB) %>%
      mutate(needs_fill = 1L)
  ) %>%
  mutate(PointA = coalesce(PointA, fill_in)) %>%
  select(-fill_in, -needs_fill)
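Another order-independent sketch, assuming each Object has exactly one measure == 1 row: index PointB at the measure == 1 position within each group and let if_else() recycle the length-one result:
library(dplyr)
df1 %>%
  group_by(Object) %>%
  mutate(PointA = if_else(measure == 2 & is.na(PointA),
                          PointB[measure == 1], # the group's measure-1 PointB
                          PointA)) %>%
  ungroup()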

dplyr: calculate the days for product replenishment

I am working on a dataset in which I need to calculate how long it takes a retail store to replenish some products after a shortage. Here is a quick view of the dataset in its simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When Net_Available_Qty becomes negative, it means there is a shortage. When it turns back to 0 or a positive qty, it means the supply has recovered. What I need to calculate is the number of days between when we first see a shortage and when it is recovered. In this case, the 1st shortage took 2 days to recover and the second shortage took 3 days to recover.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution, but this produces diffDate, which holds the date difference at the point where a negative value turns positive/zero.
sample_df %>%
  mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
         sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
         change = ifelse(sign != sign_lag, 1, 0), # check if there's a change
         sequence = sequence(rle(as.character(sign))$lengths)) %>%
  group_by(sequence) %>%
  mutate(diffDate = as.numeric(difftime(Date, lag(Date, 1))),
         diffDate = ifelse(Net_Available_Qty < 0, NA,
                           ifelse((sign == 'pos' | sign == 'zero') & sequence == 1, diffDate, NA))) %>%
  ungroup() %>%
  select(Date, Product, Net_Available_Qty, diffDate)
@Schilker had a great idea using rle(). I am building on his answer and offering a slightly shorter version, including the use of cumsum().
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
  mutate(
    diffDate = c(1, diff(as.Date(Date))),
    sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
    group = cumsum(c(TRUE, diff(sequence)) != 1L)
  ) %>%
  group_by(group) %>%
  mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups: group [4]
#> Date Product Net_Available_Qty diffDate sequence group n_days
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 2019-1-1 Product A -2 1 1 0 2
#> 2 2019-1-2 Product A -2 1 2 0 2
#> 3 2019-1-3 Product A 10 1 1 1 2
#> 4 2019-1-4 Product A 8 1 2 1 2
#> 5 2019-1-5 Product A -5 1 1 2 3
#> 6 2019-1-6 Product A -6 1 2 2 3
#> 7 2019-1-7 Product A -7 1 3 2 3
#> 8 2019-1-8 Product A 0 1 1 3 1
Created on 2020-02-23 by the reprex package (v0.3.0)
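A more direct variant of the same rle/cumsum idea, answering the question as days per shortage episode; a sketch that assumes one row per consecutive day, as in sample_df:
library(dplyr)
sample_df %>%
  mutate(Date = as.Date(Date),
         shortage = Net_Available_Qty < 0,
         episode = cumsum(shortage != lag(shortage, default = first(shortage)))) %>%
  filter(shortage) %>%
  group_by(episode) %>%
  summarise(first_shortage = min(Date),
            recovered = max(Date) + 1, # assumes the next daily row is the recovery day
            days_to_recover = as.integer(recovered - first_shortage))
#> two shortage episodes: 2 days and 3 days, matching the expectation above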
