Trying to identify the number of continuous purchases by ID using R

I have a data set:
customerId <- c(101, 101, 101, 102, 102, 102, 104, 104, 106, 109, 109, 109)
Purchasedate <- c("2020-06-19", "2020-06-20", "2020-06-21", "2020-06-24",
                  "2020-06-27", "2020-06-28", "2020-06-20", "2020-06-21",
                  "2020-06-24", "2020-06-10", "2020-06-14", "2020-06-16")
df <- data.frame(customerId, Purchasedate)
I am trying to produce the following output:
101 3
104 2
since customers 101 and 104 are the only ones whose purchase dates are consecutive. In other words, I am trying to find, using R, the customer IDs that made purchases on consecutive days, and for how many days.

One option is to compute the difference between consecutive dates within each customerId using group_by(), filter to those IDs where the difference is always 1 day, and summarise() to count the number of rows/dates:
library(dplyr)
df %>%
  group_by(customerId) %>%
  mutate(diffDays = c(1, diff(Purchasedate))) %>%  # gap in days to the previous purchase; first row seeded with 1
  filter(n_distinct(diffDays) == 1 & n() > 1) %>%  # a single distinct gap (the seed, 1 day) and more than one purchase
  summarise(continuousDays = n())
Output
customerId continuousDays
<dbl> <int>
1 101 3
2 104 2
Data
df <- structure(list(customerId = c(101, 101, 101, 102, 102, 102, 104,
104, 106, 109, 109, 109), Purchasedate = structure(c(18432, 18433,
18434, 18437, 18440, 18441, 18433, 18434, 18437, 18423, 18427,
18429), class = "Date")), row.names = c(NA, -12L), class = "data.frame")
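Note that the Data block above stores Purchasedate as a Date, which the pipeline relies on. If you start from the question's character vector instead, convert it first so that diff() returns day gaps, e.g.:
df$Purchasedate <- as.Date(df$Purchasedate)  # character -> Date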

A similar question was answered here: https://stackoverflow.com/a/53713204/12400385
Adapting it slightly to your case:
library(dplyr)
library(lubridate)
# Convert Purchasedate to a date column
df <- df %>%
  mutate(Purchasedate = ymd(Purchasedate))
# Create custom function
max_consec <- function(x) {
  y <- c(unclass(diff(x)))  # c() and unclass() turn the difftime into a plain numeric vector for rle()
  r <- rle(y)
  with(r, max(lengths[values == 1]) + 1)  # longest run of 1-day gaps, plus one for the starting day
}
# Apply function to each customer
df %>%
  group_by(customerId) %>%
  summarize(max.consecutive = max_consec(Purchasedate))
#-------
# A tibble: 5 x 2
customerId max.consecutive
<dbl> <dbl>
1 101 3
2 102 2
3 104 2
4 106 -Inf
5 109 -Inf
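The -Inf values (together with warnings from max() on an empty vector) appear for customers with no consecutive purchase dates at all. If that is undesirable, here is a small variant of the function, not from the linked answer, that returns NA in that case instead:
max_consec_safe <- function(x) {
  y <- c(unclass(diff(x)))          # day gaps as a plain numeric vector
  r <- rle(y)
  runs <- r$lengths[r$values == 1]  # lengths of runs of 1-day gaps
  if (length(runs) == 0) NA_real_ else max(runs) + 1
}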


How to create a variable based on a time period grouped on id number

I have a dataframe
idnr <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,6,6,6,6,6,7,7,7)
labvalue <- c(100, 80, 75, 70, 50, 60, 55, 200, 180, 165, 160, 150, 170, 175,
              300, 280, 260, 250, 255, 400, 380, 360, 350, 355, 500, 480, 460)
labdate <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07",
"2022-01-08", "2022-01-09", "2022-01-10", "2022-01-11", "2022-01-12", "2022-01-13", "2022-01-14",
"2022-01-15", "2022-01-16", "2022-01-17", "2022-01-18", "2022-01-19", "2022-01-20", "2022-01-21",
"2022-01-22", "2022-01-23", "2022-01-24", "2022-01-25", "2022-01-26", "2022-01-27"))
data <- data.frame(idnr, labvalue, labdate)
I would like to create a variable for each idnr indicating whether that idnr has had a drop in lab value of 40 or more within 2 days. To clarify: if an idnr has a lab value of 200, I want to check whether any lab value taken after that date, but within 48 hours of it, is 160 or less.
Preferably I would like this to work if the dates had time stamps as well. I understand that I probably need a for loop, but I can't get it to work.
You can write a helper function that checks each row for drops within 2 days, then apply it to the dates and values with purrr::map2_lgl(), grouped by idnr.
library(dplyr)
library(purrr)
has_drop <- function(cur_date, cur_value, all_dates, all_values) {
  days_diff <- as.numeric(all_dates - cur_date, units = "days")  # signed gap in days from the current date
  vals_2day <- all_values[between(days_diff, 0, 2)]              # values on, or within 2 days after, the current date
  any(vals_2day - cur_value <= -40)                              # any drop of 40 or more?
}
data %>%
  group_by(idnr) %>%
  summarize(
    drop = any(map2_lgl(
      labdate,
      labvalue,
      \(d, v) has_drop(d, v, labdate, labvalue)
    ))
  )
# A tibble: 5 × 2
idnr drop
<dbl> <lgl>
1 1 FALSE
2 2 FALSE
3 5 TRUE
4 6 TRUE
5 7 TRUE
To get the dates of values with drops within 2 days, use filter() instead of summarize():
data %>%
  group_by(idnr) %>%
  filter(map2_lgl(
    labdate,
    labvalue,
    \(d, v) has_drop(d, v, labdate, labvalue)
  )) %>%
  ungroup()
# A tibble: 3 × 3
idnr labvalue labdate
<dbl> <dbl> <date>
1 5 300 2022-01-15
2 6 400 2022-01-20
3 7 500 2022-01-25
The same code should work for POSIXct timestamps.
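For example, with made-up POSIXct timestamps (hypothetical values, not from the question), the difftime arithmetic in has_drop() yields fractional days, so the 48-hour window still behaves as intended:
ts <- as.POSIXct(c("2022-01-01 08:00:00", "2022-01-02 20:00:00"), tz = "UTC")
as.numeric(ts[2] - ts[1], units = "days")
# [1] 1.5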

Trying to filter for two observations per condition

I have the following dataframe
student_id <- c(1,1,1,2,2,2)
test_score <- c(100, 90, 80, 100, 70, 90)
test_type <- c("English", "English", "English", "Spanish", "Spanish", "Spanish")
time_period <- c(1, 0, 1, 0, 1, 0)
df <- data.frame(student_id, test_score, test_type, time_period)
I am trying to filter my observations so that each student_id has a Spanish test and an English test. I have tried the following:
df <- df %>%
  group_by(student_id, test_type) %>%
  dplyr::filter(row_number() == 1)
But this seems to only return values from the English test. Is there a way to return single observations from each student_id for English and Spanish tests?
Your example data does not contain any student who has taken both test types, i.e. filtering for only those who have done both English and Spanish would leave you with an empty data frame. However, let's suppose the following is your data:
df <- data.frame(student_id = c(1, 1, 2, 2, 3, 3),
                 test_score = c(100, 90, 80, 100, 70, 90),
                 test_type = c("English", "English", "English", "Spanish", "Spanish", "Spanish"),
                 time_period = c(1, 0, 1, 0, 1, 0))
Here, student 2 has done both, and we wish to filter for all students who have done both types of exams. One approach is to look at each student and count the number of distinct exam types; if that count is larger than 1, we have found the relevant rows (this also covers students who have taken tests in three or more languages).
df %>%
  group_by(student_id) %>%
  mutate(n_dist = n_distinct(test_type)) %>%
  filter(n_dist > 1) %>%
  select(-n_dist)
# A tibble: 2 x 4
# Groups: student_id [1]
student_id test_score test_type time_period
<dbl> <dbl> <fct> <dbl>
1 2 80 English 1
2 2 100 Spanish 0
This gives you all rows for student 2.
Having said that, it is a bit unclear what you wish to achieve, but if all you want is the first row per student_id and test_type combination, then your code does work. Another option is to use slice(), as in:
df %>% group_by(student_id, test_type) %>% slice(1)
# A tibble: 4 x 4
# Groups: student_id, test_type [4]
student_id test_score test_type time_period
<dbl> <dbl> <fct> <dbl>
1 1 100 English 1
2 2 80 English 1
3 2 100 Spanish 0
4 3 70 Spanish 1
Note I am using df as defined in my answer above.
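If the goal is specifically one English and one Spanish test per student, rather than any two distinct types, a stricter filter along these lines should also work (a sketch using the same df):
df %>%
  group_by(student_id) %>%
  filter(all(c("English", "Spanish") %in% test_type)) %>%
  ungroup()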

R: count distinct IDs where selected column is non-null

I have the following dataframe:
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)
I would like to count the number of distinct user_id values for each column, excluding the NA values. The output I am hoping for is:
user_id event_id plan_id treatment_id system
1 4 4 3 3 2
I tried to leverage mutate_all, but that was unsuccessful because my data frame is too large. In other functions, I've used the following two lines of code to get the non-null count and the distinct count for each column:
colSums(!is.na(df))
apply(df, 2, function(x) length(unique(x)))
Optimally, I would like to combine the two with an ifelse to minimize the mutations, as this will ultimately go into a function applied, along with a number of other summary statistics, to a list of data frames.
I have also tried a brute-force method: make the values 1 if not NA and 0 otherwise, then copy the id into each column where there is a 1, after which the distinct-count line above gives the output. However, I get the wrong values when copying the id into the other columns, and the number of steps seems suboptimal. See code:
binary <- cbind(df$user_id, !is.na(df[, 2:length(df)]))
copied <- binary %>% replace(. > 0, binary[, 1])
I'd greatly appreciate your help.
1: Base
sapply(df, function(x) {
  length(unique(df$user_id[!is.na(x)]))
})
# user_id event_id plan_id treatment_id system
# 4 4 3 3 2
2: Base
aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
# ind user_id
#1 user_id 4
#2 event_id 4
#3 plan_id 3
#4 treatment_id 3
#5 system 2
3: tidyverse
library(dplyr)
library(tidyr)

df %>%
  mutate(key = user_id) %>%
  pivot_longer(!key) %>%
  filter(!is.na(value)) %>%
  group_by(name) %>%
  summarise(value = n_distinct(key)) %>%
  pivot_wider()
## A tibble: 1 x 5
# event_id plan_id system treatment_id user_id
# <int> <int> <int> <int> <int>
#1 4 3 2 3 4
Thanks @dcarlson, I had misunderstood the question:
apply(df, 2, function(x) length(unique(df[!is.na(x), 1])))
A data.table option with uniqueN:
library(data.table)
setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
user_id event_id plan_id treatment_id system
1: 4 4 3 3 2
Using dplyr you can use summarise() with across():
library(dplyr)
df %>% summarise(across(.fns = ~n_distinct(user_id[!is.na(.x)])))
# user_id event_id plan_id treatment_id system
#1 4 4 3 3 2
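Since the counts will ultimately be applied to a list of data frames, one way to package the base approach is a helper like the following, a sketch assuming the ID sits in the first column and df_list is a hypothetical list of such data frames:
count_distinct_ids <- function(d, id_col = 1) {
  sapply(d, function(x) length(unique(d[[id_col]][!is.na(x)])))
}
# lapply(df_list, count_distinct_ids)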

Column comparison with varying IDs

I have the following variables
id1 = c(1,1,1,2,3,4,4,5)
id2 = c(1,1,2,2,3,3,4)
digit1 = c(243, 888, 343, 276, 493, 024, 305, 093)
digit2 = c(343, 756, 947, 089, 390, 930, 024)
df1 = data.frame(id1, digit1)
df2 = data.frame(id2, digit2)
I'm looking for a way to see how many matches digit1 has with digit2 for the same ID. The number of rows per ID can vary within each data frame, and also between the two data frames.
I don't want it to count an extra correct or incorrect when id1 has more rows than id2 for a given ID. For example, when comparing the first three digits in df1 with only the first two digits in df2, the resulting vector should count that as 1 correct, 1 incorrect, and 1 NA. I'm trying to merge both data frames and add a new column for the outcomes of the matching.
After aligning the columns in df1 and df2 and merging into a new data frame, I want to add a vector like (0, NA, 1, 0, 0, 1, NA, NA) to the new data frame.
The actual data I'll be using has thousands of rows in each data frame.
One way to do this would be to join the two data frames and then check which ids have digit1 equal to digit2; here's what I mean:
# Data from the question (before the edit)
id1 = c(1,1,2,3,4,4,5,6,7)
id2 = c(1,2,2,3,3,4,5)
digit1 = c(243, 343, 276, 493, 024, 305, 093, 393, 208)
digit2 = c(343, 947, 089, 390, 930, 024, 093)
df1 = data.frame(id1, digit1)
df2 = data.frame(id2, digit2)
library(dplyr)
df1 %>%
  full_join(df2, by = c('id1' = 'id2')) %>%
  mutate(match = (digit1 == digit2)) %>%
  group_by(id1) %>%
  summarise(match = sum(match))
you get:
# id1 match
# <dbl> <int>
# 1 1 1
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 1
# 6 6 NA
# 7 7 NA
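If you also need the row-level vector described in the question (comparing the k-th digit within each id and padding with NA where one side has fewer rows), here is a sketch that aligns rows by their within-id position, assuming order of appearance is the intended pairing:
df1 %>%
  group_by(id1) %>%
  mutate(rn = row_number()) %>%   # within-id position in df1
  ungroup() %>%
  full_join(
    df2 %>% group_by(id2) %>% mutate(rn = row_number()) %>% ungroup(),
    by = c("id1" = "id2", "rn")
  ) %>%
  mutate(match = as.integer(digit1 == digit2))  # 1 = match, 0 = mismatch, NA = no counterpart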

Add rows based on missing dates within a group

I am trying to add rows to a data frame based on the minimum and maximum data within each group. Suppose this is my original data frame:
df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
                                 "2018-02-01", "2017-12-01", "2018-02-01")),
                Group = c(1, 1, 2, 2, 2, 3, 3),
                Value = c(100, 200, 150, 125, 200, 150, 175))
Notice that Group 1 has 2 consecutive dates, group 2 has 3 consecutive dates, and group 3 is missing the date in the middle (2018-01-01). I'd like to be able to complete the data frame by adding rows for missing dates. But the thing is I only want to add additional dates based on dates that are missing between the minimum and maximum date within each group. So if I were to complete this data frame it would look like this:
df_complete = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
                                          "2018-02-01", "2017-12-01", "2018-01-01", "2018-02-01")),
                         Group = c(1, 1, 2, 2, 2, 3, 3, 3),
                         Value = c(100, 200, 150, 125, 200, 150, NA, 175))
Only one row was added because Group 3 was missing one date. There was no date added for Group 1 because it had all the dates between its minimum (2017-12-01) and maximum date (2018-01-01).
You can use tidyr::complete with dplyr to find a solution. The interval between consecutive dates appears to be one month. The approach is as below:
library(dplyr)
library(tidyr)
df %>%
  group_by(Group) %>%
  complete(Group, Date = seq.Date(min(Date), max(Date), by = "month"))
# A tibble: 8 x 3
# Groups: Group [3]
# Group Date Value
# <dbl> <date> <dbl>
# 1 1.00 2017-12-01 100
# 2 1.00 2018-01-01 200
# 3 2.00 2017-12-01 150
# 4 2.00 2018-01-01 125
# 5 2.00 2018-02-01 200
# 6 3.00 2017-12-01 150
# 7 3.00 2018-01-01 NA
# 8 3.00 2018-02-01 175
@MKR's approach of using tidyr::complete with dplyr is good, but it will fail if the group column is not numeric. The groups will then be typecast as factors, and the complete() operation will result in a tibble with a row for every factor/time combination in each group.
complete() does not need the group variable as its first argument, so the solution is:
library(dplyr)
library(tidyr)
df %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))
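To illustrate with a hypothetical character-valued group column, the corrected call still expands each group only over its own date range:
df_chr <- transform(df, Group = c("A", "A", "B", "B", "B", "C", "C"))
df_chr %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))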
