Summarize consecutive failures with dplyr and rle - r

I'm trying to build a churn model that includes the maximum consecutive number of UX failures for each customer and having trouble. Here's my simplified data and desired output:
library(dplyr)
df <- data.frame(customerId = c(1,2,2,3,3,3), date = c('2015-01-01','2015-02-01','2015-02-02', '2015-03-01','2015-03-02','2015-03-03'),isFailure = c(0,0,1,0,1,1))
> df
customerId date isFailure
1 1 2015-01-01 0
2 2 2015-02-01 0
3 2 2015-02-02 1
4 3 2015-03-01 0
5 3 2015-03-02 1
6 3 2015-03-03 1
desired results:
> desired.df
customerId maxConsecutiveFailures
1 1 0
2 2 1
3 3 2
I'm flailing quite a bit and searching through other rle questions isn't helping me yet - this is what I was "expecting" a solution to resemble:
df %>%
group_by(customerId) %>%
summarise(maxConsecutiveFailures =
max(rle(isFailure[isFailure == 1])$lengths))

Group by 'customerId' and use do() to run rle() on the 'isFailure' column. Extract the run lengths whose value is TRUE (lengths[values]), and build the 'Max' column with an if/else condition that returns 0 for customers that have no 1 value at all.
df %>%
group_by(customerId) %>%
do({tmp <- with(rle(.$isFailure==1), lengths[values])
data.frame(customerId= .$customerId, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
# customerId Max
#1 1 0
#2 2 1
#3 3 2
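The same rle idea also fits in a plain summarise() call if the run-length logic is moved into a small helper. This is only a sketch: max_run is a hypothetical helper name (not part of dplyr), and it assumes isFailure contains no NA values.

```r
library(dplyr)

df <- data.frame(customerId = c(1, 2, 2, 3, 3, 3),
                 isFailure  = c(0, 0, 1, 0, 1, 1))

# Hypothetical helper: length of the longest run of 1s, or 0 if there is none.
max_run <- function(x) {
  r <- rle(x == 1)
  # r$lengths[r$values] keeps only the runs of TRUE; prepending 0 covers "no runs"
  max(c(0L, r$lengths[r$values]))
}

res <- df %>%
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = max_run(isFailure))
```

Prepending 0 inside max() avoids the separate if/else and the warning that max() gives on an empty vector.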

Here is my try, only using standard dplyr functions:
df %>%
# grouping key(s):
group_by(customerId) %>%
# check if there is any value change
# if yes, a new sequence id is generated through cumsum
mutate(last_one = lag(isFailure, 1, default = 100),
not_eq = last_one != isFailure,
seq = cumsum(not_eq)) %>%
# the following is just to find the largest sequence
count(customerId, isFailure, seq) %>%
group_by(customerId, isFailure) %>%
summarise(max_consecutive_event = max(n))
Output:
# A tibble: 5 x 3
# Groups: customerId [3]
customerId isFailure max_consecutive_event
<dbl> <dbl> <int>
1 1 0 1
2 2 0 1
3 2 1 1
4 3 0 1
5 3 1 2
A final filter on the isFailure value yields the wanted result (customers with zero failures need to be added back, though).
The script works for any values in the isFailure column and counts the longest run of consecutive rows that share the same value.
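One possible way to finish this off, filtering on failures and keeping zero-failure customers in a single step, is sketched below. It assumes -1 never occurs in isFailure (it is only used as the lag default), and the names runs and res are made up for the example.

```r
library(dplyr)

df <- data.frame(customerId = c(1, 2, 2, 3, 3, 3),
                 isFailure  = c(0, 0, 1, 0, 1, 1))

runs <- df %>%
  group_by(customerId) %>%
  # a new run id starts whenever the value differs from the previous row
  mutate(seq = cumsum(isFailure != lag(isFailure, default = -1))) %>%
  ungroup() %>%
  # one row per run, with its length in column n
  count(customerId, isFailure, seq)

res <- runs %>%
  group_by(customerId) %>%
  # keep only the failure runs; prepending 0 covers customers without any
  summarise(maxConsecutiveFailures = max(c(0L, n[isFailure == 1])))
```

Because the subsetting happens inside summarise() rather than in a filter(), customers without failures keep their row and get 0.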

Related

How can I find the column index of the first non-zero value in a row with R dplyr?

I'm working in R. I have a dataset of COVID case totals that looks like this:
Facility Day_1 Day_2 Day_3
A 0 0 1
B 1 2 5
C 0 2 6
D 0 0 0
I would like to use mutate() to create a new column, first_case, that has the column index of the first non-zero element in each row -- or "NA" if there is no non-zero element. I thought about using where(), but couldn't quite figure out how to get a column index instead of a row index.
Any help is much appreciated!
We can use max.col to get the first column in which the value is non-zero in each row.
library(dplyr)
df %>%
mutate(first_case = {
tmp <- select(., starts_with('Day'))
ifelse(rowSums(tmp) == 0, NA, max.col(tmp != 0, ties.method = 'first'))
})
# Facility Day_1 Day_2 Day_3 first_case
#1 A 0 0 1 3
#2 B 1 2 5 1
#3 C 0 2 6 2
#4 D 0 0 0 NA
first_case holds the column number among the 'Day' columns only. If you need the column number within the whole data frame, add 1 to the output above.
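If the 'Day' columns do not start right after the first column, a fixed offset is fragile. A rough base R sketch of looking up the absolute position instead is shown here; the data frame is rebuilt from the question, and day_cols, hit, rel and first_case_abs are names made up for the example.

```r
df <- data.frame(Facility = c("A", "B", "C", "D"),
                 Day_1 = c(0, 1, 0, 0),
                 Day_2 = c(0, 2, 2, 0),
                 Day_3 = c(1, 5, 6, 0))

# positions of the Day_* columns within df
day_cols <- grep("^Day", names(df))

# logical matrix: TRUE where a case count is non-zero
hit <- df[day_cols] != 0

# index relative to the Day_* columns, NA for rows without any non-zero value
rel <- unname(ifelse(rowSums(hit) == 0, NA,
                     max.col(hit, ties.method = "first")))

df$first_case     <- rel            # position among the Day_* columns
df$first_case_abs <- day_cols[rel]  # position within the whole data frame
```

Indexing day_cols by the relative position translates it to the data frame position without assuming where the 'Day' block starts.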
This is probably unnecessarily complex, because the data is not in a long ('tidy') format that dplyr etc expect.
library(dplyr)
library(tidyr)
# reshape the Day_* columns into one row per Facility and day
datlong <- dat %>%
pivot_longer(cols=starts_with("Day"), names_to = c("day"), names_pattern="_(\\d+)")
## A tibble: 12 x 3
# Facility day value
# <chr> <chr> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 B 1 1
# 5 B 2 2
# 6 B 3 5
# 7 C 1 0
# 8 C 2 2
# 9 C 3 6
#10 D 1 0
#11 D 2 0
#12 D 3 0
It's then simple to get the first/second/third/[n]th day above whatever value, as well as to calculate minimums, maximums, means, weekly averages, rolling averages, whatever, because you are now dealing with a plain old vector of values rather than a list of values across multiple columns.
datlong %>%
group_by(Facility) %>%
filter(value > 0, .preserve=TRUE) %>%
summarise(first_day = first(day))
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 4 x 2
# Facility first_day
# <chr> <chr>
#1 A 3
#2 B 1
#3 C 2
#4 D <NA>
An alternative using plain vector indexing, which is less dplyr-like:
datlong %>%
group_by(Facility) %>%
summarise(first_day = day[value > 0][1])
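For completeness, here is a self-contained sketch of the long-format route that also stores day as an integer rather than a character string. The dat data frame is rebuilt from the question, and names_prefix / names_transform are used in place of names_pattern purely as an illustration.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(Facility = c("A", "B", "C", "D"),
                  Day_1 = c(0, 1, 0, 0),
                  Day_2 = c(0, 2, 2, 0),
                  Day_3 = c(1, 5, 6, 0))

first_days <- dat %>%
  # strip the "Day_" prefix and convert the remainder to an integer day number
  pivot_longer(cols = starts_with("Day"),
               names_to = "day",
               names_prefix = "Day_",
               names_transform = list(day = as.integer)) %>%
  group_by(Facility) %>%
  # first day with a non-zero value; indexing an empty vector with [1] gives NA
  summarise(first_day = day[value > 0][1])
```

An integer day sorts and compares numerically, which matters once there are ten or more days.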

R: Slicing a grouped data frame conditional on a column

I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = c(rep(c("A", "B", "C"), each = 3)),
condition = rep(c(0,1,1), each = 3),
index = c(1:3,1:3,2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
group_by(group) %>%
mutate(rank = order(index)) %>%
filter(case_when(condition == 0 ~ TRUE,
condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.
You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
group_by(group) %>%
filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
An option with slice
library(dplyr)
df %>%
group_by(group) %>%
slice(unique(c(which(condition == 0), which.min(index))))
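Since the question asks about slice_min() specifically, one way to use it is to split the data by condition and recombine. This is a sketch rather than a claim that it is faster; kept is a name made up for the example, and with_ties = FALSE is an assumption that only one row per group is wanted.

```r
library(dplyr)

df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4))

kept <- bind_rows(
  # rows where the condition does not apply are kept as they are
  df %>% filter(condition == 0),
  # rows where it applies are reduced to the lowest index per group
  df %>%
    filter(condition == 1) %>%
    group_by(group) %>%
    slice_min(index, n = 1, with_ties = FALSE) %>%
    ungroup()
) %>%
  arrange(group, index)
```

The single filter() call above is shorter; the split version mainly helps when the two branches need different handling.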

How to count number of days between events in a dataset

I'm trying to restructure my data to recode a variable ('Event') so that I can determine the number of days between events. Essentially, I want to be able to count the number of days that pass between events. Importantly, I only want to start the 'count' between events after the first event has occurred for each person. Here is a sample dataframe:
Day = c(1:8,1:8)
Event = c(0,0,1,NA,0,0,1,0,0,1,NA,NA,0,1,0,1)
Person = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
sample <- data.frame(Person,Day,Event);sample
I would like it to end up like this:
NewEvent = c(NA,NA,0,1,2,3,0,1,NA,0,1,2,3,0,1,0)
sample2 <- data.frame(Person,Day,NewEvent);sample2
I'm new to R, unfamiliar with loops or if statements, and I could not find a thread which already answered this type of issue, so any help would be greatly appreciated. Thank you!
One approach is to group on Person and number the distinct occurrences of events with cumsum(Event == 1), treating NA as no event. Then group on both Person and EventNum to count the days passed since each distinct event. The solution looks like this:
library(dplyr)
sample %>% group_by(Person) %>%
mutate(EventNum = cumsum(!is.na(Event) & Event == 1)) %>%
group_by(Person, EventNum) %>%
mutate(NewEvent = ifelse(EventNum ==0, NA, row_number() - 1)) %>%
ungroup() %>%
select(Person, Day, NewEvent) %>%
as.data.frame()
# Person Day NewEvent
# 1 1 1 NA
# 2 1 2 NA
# 3 1 3 0
# 4 1 4 1
# 5 1 5 2
# 6 1 6 3
# 7 1 7 0
# 8 1 8 1
# 9 2 1 NA
# 10 2 2 0
# 11 2 3 1
# 12 2 4 2
# 13 2 5 3
# 14 2 6 0
# 15 2 7 1
# 16 2 8 0
Note: If the data is not sorted by Day, add arrange(Day) to the code above.
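To see what the cumsum grouping does, it can help to run it on one person's vector in base R. A minimal sketch using Person 1 from the question; grp and days are names made up for the example.

```r
Event <- c(0, 0, 1, NA, 0, 0, 1, 0)

# running count of events seen so far; NA is treated as "no event"
grp <- cumsum(!is.na(Event) & Event == 1)

# position within each event block, starting at 0 on the event day itself
days <- ave(seq_along(Event), grp, FUN = function(i) seq_along(i) - 1L)

# rows before the first event have no count yet
days[grp == 0] <- NA
```

Here grp is 0 0 1 1 1 1 2 2, so the counter restarts on days 3 and 7, and the first two days become NA.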

How to add row to other matrix

I have following panel data:
firmid date return
1 1 1
1 2 1
1 3 1
2 2 2
2 3 2
3 1 2
3 3 2
I want to transform this long format to wide but only for date 1 to look like this
firmid return in date=1
1 1
3 2
I appreciate any advice!
df <- read.table(header = T, text = "firmid date return
1 1 1
1 2 1
1 3 1
2 2 2
2 3 2
3 1 2
3 3 2")
Base R solution:
df <- df[df$date == 1, ]
df$date <- NULL
df
firmid return
1 1 1
6 3 2
data.table solution:
library(data.table)
setDT(df)
df <- df[date == 1, ]
df[, date := NULL]
firmid return
1: 1 1
2: 3 2
You can use dplyr to achieve it too:
library(dplyr)
df2 <- df %>%
filter(date == 1) %>%
select(-date)
# firmid return
#1 1 1
#2 3 2
A different dplyr solution that allows you to have multiple values of return within firmid:
df %>%
filter(date == 1) %>%
group_by(firmid, return) %>%
summarise()

R: calculate time difference between specific events

I have the following dataset:
df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,0,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))
library(lubridate)
df$order_date = as_date(df$order_date)
user_id complete_order order_date
1 1 2015-01-28
1 0 2015-01-31
1 0 2015-02-08
1 1 2015-02-23
2 1 2015-01-25
2 0 2015-01-28
2 0 2015-02-06
2 1 2015-02-21
I'm trying to calculate the difference in days between only completed orders for each user. The desirable outcome would look like this:
user_id complete_order order_date complete_order_time_diff
<fctr> <fctr> <date> <time>
1 1 2015-01-28 NA days
1 0 2015-01-31 3 days
1 0 2015-02-08 11 days
1 1 2015-02-23 26 days
2 1 2015-01-25 NA days
2 0 2015-01-28 3 days
2 0 2015-02-06 12 days
2 1 2015-02-21 27 days
when I try this solution:
library(dplyr)
df %>%
group_by(user_id) %>%
mutate(complete_order_time_diff = order_date[complete_order==1]-lag(order_date[complete_order==1]))
it returns the error:
Error: incompatible size (3), expecting 4 (the group size) or 1
Any help with this will be great, thank you!
Try this:
library(dplyr)
df %>% group_by(user_id, complete_order) %>%
mutate(c1 = order_date - lag(order_date)) %>%
group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% ungroup %>%
mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>%
select(-c(c1, c2))
Update for multiple cancelled orders:
df %>% mutate(c3=cumsum( complete_order != "0")) %>% group_by(user_id, complete_order) %>%
mutate(c1 = order_date - lag(order_date)) %>%
group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>%
mutate(c2=as.numeric(c2)) %>% group_by(user_id, c3) %>%
mutate(c2=cumsum(ifelse(complete_order==1, 0, c2))) %>% ungroup %>%
mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>%
select(-c(c1, c2, c3))
Logic:
c3 is an id that increments by 1 every time there is a completed order (i.e. complete_order is not 0).
c1 calculates the day difference within each user_id and complete_order group (but for non-complete orders the result is wrong).
c2 fixes this inconsistency of c1 for non-complete orders.
Hope this clears things up.
I would suggest working with combinations of group_by() and mutate(cumsum()) to better understand the results of having more than one grouping variable.
It seems that you're looking for the distance of each order from the last completed one. Having a binary vector, x, c(NA, cummax(x * seq_along(x))[-length(x)]) gives the indices of the last "1" seen before each element. Then, subtracting each element of "order_date" from the "order_date" at that respective index gives the desired output. E.g.
set.seed(1453); x = sample(0:1, 10, TRUE)
set.seed(1821); y = sample(5, 10, TRUE)
cbind(x, y,
last_x = c(NA, cummax(x * seq_along(x))[-length(x)]),
y_diff = y - y[c(NA, cummax(x * seq_along(x))[-length(x)])])
# x y last_x y_diff
# [1,] 1 3 NA NA
# [2,] 0 3 1 0
# [3,] 1 5 1 2
# [4,] 0 1 3 -4
# [5,] 0 3 3 -2
# [6,] 1 5 3 0
# [7,] 1 1 6 -4
# [8,] 0 3 7 2
# [9,] 0 4 7 3
#[10,] 1 5 7 4
On your data, first format df for convenience:
df$order_date = as.Date(df$order_date)
df$complete_order = df$complete_order == "1" # lose the 'factor'
And, then, either apply the above approach after a group_by:
library(dplyr)
df %>% group_by(user_id) %>%
mutate(time_diff = order_date -
order_date[c(NA, cummax(complete_order * seq_along(complete_order))[-length(complete_order)])])
Alternatively, try operations that avoid grouping (assuming "user_id" is ordered), after accounting for the indices where "user_id" changes:
# save variables to vectors and keep a "logical" of when "id" changes
id = df$user_id
id_change = c(TRUE, id[-1] != id[-length(id)])
compl = df$complete_order
dord = df$order_date
# accounting for changes in "id", locate last completed order
i = c(NA, cummax((compl | id_change) * seq_along(compl))[-length(compl)])
is.na(i) = id_change
dord - dord[i]
#Time differences in days
#[1] NA 3 11 26 NA 3 12 27
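The cummax indexing can also be wrapped in a small function, which makes it easier to reuse inside a grouped mutate(). This is a sketch only: days_since_last_complete is a hypothetical name, and the guard for "no completed order seen yet" is an addition that the code above does not need on the sample data.

```r
# Hypothetical helper: time since the most recent earlier completed order.
# dates: Date vector sorted ascending; complete: logical vector of equal length.
days_since_last_complete <- function(dates, complete) {
  idx <- seq_along(complete)
  # index of the last completed order strictly before each element
  last <- c(NA, cummax(complete * idx)[-length(idx)])
  # 0 means no completed order has been seen yet, so there is nothing to compare
  last[which(last == 0)] <- NA
  dates - dates[last]
}

dates    <- as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23"))
complete <- c(TRUE, FALSE, FALSE, TRUE)

out <- days_since_last_complete(dates, complete)
```

For user 1 this gives NA, 3, 11 and 26 days, matching the desired output in the question.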
I think you can use filter() in place of the subsetting with order_date[complete_order == 1], and make sure order_date (and the other variables) have the correct data types by adding stringsAsFactors = F to data.frame():
df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,1,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')),
stringsAsFactors = F)
df$order_date <- lubridate::ymd(df$order_date)
df %>%
group_by(user_id) %>%
filter(complete_order == 1) %>%
mutate(complete_order_time_diff = order_date - lag(order_date))
This returns the time since the previous complete order (and NA if there is none):
user_id complete_order order_date complete_order_time_diff
<chr> <chr> <date> <time>
1 1 1 2015-01-28 NA days
2 1 1 2015-01-31 3 days
3 1 1 2015-02-23 23 days
4 2 1 2015-01-25 NA days
5 2 1 2015-01-28 3 days
6 2 1 2015-02-21 24 days

Resources