Rounding month end data using weekly split in R

Input table:
Date Qty
2017-01-01 234
2017-01-08 123
2017-01-15 445
2017-01-22 113
2017-01-29 674
2018-02-05 120
2018-02-12 921
2018-02-19 732
2018-02-26 634
2018-03-05 711
Expected table:
Date Qty
2017-01-01 234
2017-01-08 123
2017-01-15 445
2017-01-22 113
2017-01-29 708.28 #674+(120/7 * 2)
2018-02-05 85.71 #(120/7 * 5)
2018-02-12 921
2018-02-19 732
2018-02-26 837.14 #634+(711/7 * 2)
2018-03-05 507.85 #(711/7 * 5)
In the expected output table above, the quantity belonging to the first date of the new month is split between that date and the last date of the previous month, using weekly proportions.
Eg:
2018-02-26 had a quantity of 634 and 2018-03-05 had 711.
The quantity 711 is divided by 7 (the number of days in a week): 711/7 = 101.571. February has 28 days, so 2 days of that week still fall in February (the row's date is 2018-02-26). Those 2 shares of 101.571 are added to the existing quantity of 2018-02-26, making it 634+(101.571*2) => 634+203.14 => 837.14 (as you can observe in the expected table). The same 2 shares are deducted from 2018-03-05, which keeps the 5 shares falling in the first week of the new month: 711/7 * 5 => 507.85 (as you can observe in the expected table).
How should I generalise this situation in R?
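For reference, a minimal construction of the input table used in the answer below (a sketch; it assumes Date is parsed as class Date, which the lubridate helpers in the answer rely on):
library(tibble)
df <- tibble(
  Date = as.Date(c("2017-01-01", "2017-01-08", "2017-01-15", "2017-01-22",
                   "2017-01-29", "2018-02-05", "2018-02-12", "2018-02-19",
                   "2018-02-26", "2018-03-05")),
  Qty  = c(234, 123, 445, 113, 674, 120, 921, 732, 634, 711)
)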

Does this answer:
> library(dplyr)
> library(lubridate)
> first_day_of_month_wday <- function(dx) {
+ day(dx) <- 1
+ wday(dx)
+ }
> # week-of-month index: values greater than 4 flag dates in a month's final week
> fil <- ceiling((day(df$Date) + first_day_of_month_wday(df$Date) - 1) / 7)
>
> df %>% mutate(Qty1 = case_when(fil > 4 ~ Qty + (days_in_month(df$Date) - day(Date)) * lead(Qty)/7, TRUE ~ Qty)) %>%
+ mutate(Qty1 = case_when(lag(fil) > 4 ~ Qty/7 * day(Date), TRUE ~ Qty1)) %>% select(-Qty) %>% rename(Qty = Qty1)
# A tibble: 10 x 2
Date Qty
<date> <dbl>
1 2017-01-01 234
2 2017-01-08 123
3 2017-01-15 445
4 2017-01-22 113
5 2017-01-29 708.
6 2018-02-05 85.7
7 2018-02-12 921
8 2018-02-19 732
9 2018-02-26 837.
10 2018-03-05 508.
>
PS: the first_day_of_month_wday function is taken from "R: How to get the Week number of the month".

Related

Recursive function over a vector based on the length of it

I have several periods separated by 6 months, with starts in "in_talls_temp_6" and ends in "f_talls_temp_6":
in_talls_temp_6 <- seq.Date(from=i_preImp_preref, to=f_postImp, by="6 months")
f_talls_temp_6 <- in_talls_temp_6 + months(6) - days(1)
I have data like this:
name <- paste0("time_point", seq(1:13))
a <- round(runif(length(name), 200, 500), 0)
data <- data.frame(name, a)
name a
1 time_point1 361
2 time_point2 444
3 time_point3 221
4 time_point4 434
5 time_point5 400
6 time_point6 438
7 time_point7 411
8 time_point8 367
9 time_point9 409
10 time_point10 337
11 time_point11 481
12 time_point12 201
13 time_point13 417
And I want to assign to each "time_point" x their initial and ending dates of the period.
Until now I have done it the silly way:
data %>%
mutate( i.date.time.point = case_when (name == "time_point1" ~ in_talls_temp_6[1],
name == "time_point2" ~ in_talls_temp_6[2],
name == "time_point3" ~ in_talls_temp_6[3],
name == "time_point4" ~ in_talls_temp_6[4],
name == "time_point5" ~ in_talls_temp_6[5],
name == "time_point6" ~ in_talls_temp_6[6],
name == "time_point7" ~ in_talls_temp_6[7],
name == "time_point8" ~ in_talls_temp_6[8],
name == "time_point9" ~ in_talls_temp_6[9],
name == "time_point10" ~ in_talls_temp_6[10],
name == "time_point11" ~ in_talls_temp_6[11],
name == "time_point12" ~ in_talls_temp_6[12],
name == "time_point13" ~ in_talls_temp_6[13]) ) %>%
mutate( f.date.time.point = case_when (name == "time_point1" ~ f_talls_temp_6[1],
name == "time_point2" ~ f_talls_temp_6[2],
name == "time_point3" ~ f_talls_temp_6[3],
name == "time_point4" ~ f_talls_temp_6[4],
name == "time_point5" ~ f_talls_temp_6[5],
name == "time_point6" ~ f_talls_temp_6[6],
name == "time_point7" ~ f_talls_temp_6[7],
name == "time_point8" ~ f_talls_temp_6[8],
name == "time_point9" ~ f_talls_temp_6[9],
name == "time_point10" ~ f_talls_temp_6[10],
name == "time_point11" ~ f_talls_temp_6[11],
name == "time_point12" ~ f_talls_temp_6[12],
name == "time_point13" ~ f_talls_temp_6[13])
)
Getting this:
name a i.date.time.point f.date.time.point
1 time_point1 361 2014-07-01 2014-12-31
2 time_point2 444 2015-01-01 2015-06-30
3 time_point3 221 2015-07-01 2015-12-31
4 time_point4 434 2016-01-01 2016-06-30
5 time_point5 400 2016-07-01 2016-12-31
6 time_point6 438 2017-01-01 2017-06-30
7 time_point7 411 2017-07-01 2017-12-31
8 time_point8 367 2018-01-01 2018-06-30
9 time_point9 409 2018-07-01 2018-12-31
10 time_point10 337 2019-01-01 2019-06-30
11 time_point11 481 2019-07-01 2019-12-31
12 time_point12 201 2020-01-01 2020-06-30
13 time_point13 417 2020-07-01 2020-12-31
I think that there is a better way but I'm not capable of finding it. I'm stuck here because I want the project to get bigger, and now I want to do the same with:
in_talls_temp_3 <- seq.Date(from=i_preImp_preref, to=f_postImp, by="3 months")
f_talls_temp_3 <- in_talls_temp_3 + months(3) - days(1)
More time_points. And this will probably keep growing in the future...
I have thought about a recursive function (is this the proper name for it?) like this (just an idea):
repeat_v <- function(x){
n <- length(x)
for (y in 1:n) {
return(x[[y]]) # note: return() exits on the first iteration, so only x[[1]] is ever returned
}
}
I don't know if a for loop is the right way to do it (would apply be better?). I also have doubts about the idea itself and don't know if it's good for the job, or if I will regret it later because it will be time consuming...
Any ideas?
Any thoughts will be appreciated! ^^
We can just use standard R [ subsetting:
n = readr::parse_number(data$name)
data$i.date.time.point = in_talls_temp_6[n]
data$f.date.time.point = f_talls_temp_6[n]
# name a i.date.time.point f.date.time.point
# 1 time_point1 267 2014-07-01 2014-12-31
# 2 time_point2 208 2015-01-01 2015-06-30
# 3 time_point3 332 2015-07-01 2015-12-31
# 4 time_point4 325 2016-01-01 2016-06-30
# 5 time_point5 455 2016-07-01 2016-12-31
# 6 time_point6 345 2017-01-01 2017-06-30
# 7 time_point7 425 2017-07-01 2017-12-31
# 8 time_point8 212 2018-01-01 2018-06-30
# 9 time_point9 359 2018-07-01 2018-12-31
# 10 time_point10 297 2019-01-01 2019-06-30
# 11 time_point11 230 2019-07-01 2019-12-31
# 12 time_point12 334 2020-01-01 2020-06-30
# 13 time_point13 457 2020-07-01 2020-12-31
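If readr is not available, the same index can be extracted with base R alone (a sketch of an equivalent, not part of the original answer):
n <- as.integer(sub("time_point", "", data$name))
data$i.date.time.point <- in_talls_temp_6[n]
data$f.date.time.point <- f_talls_temp_6[n]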
Just do:
generate_df <- function(months, time_points, min_val=200, max_val=500,
from=i_preImp_preref,
to=f_postImp) {
dates <- seq.Date(from=from, to=to, by=paste0(months, " months"))
data.frame(name = paste0("time_point", 1:time_points),
a = round(runif(time_points, min_val, max_val), 0),
i.date.time.point = dates,
f.date.time.point = dates + months(months) - days(1))
}
The first df would be something like:
generate_df(6, 13, 200, 500, i_preImp_preref, f_postImp)
And the second:
generate_df(3, 13, 200, 500, i_preImp_preref, f_postImp)

Iterating over Dates by Group in R using FOR loops

I'm trying to populate "FinalDate" based on "ExpectedDate" and "ObservedDate".
The rules are: for each group, if observed date is greater than previous expected date and less than the next expected date then final date is equal to observed date, otherwise final date is equal to expected date.
How can I modify the code below to make sure that:
FinalDate is filled in by Group
Iteration numbers don't skip any rows
library(dplyr)
set.seed(2)
dat<-data.frame(Group=sample(LETTERS[1:10], 100, replace=TRUE),
Date=sample(seq(as.Date('2013/01/01'), as.Date('2020/01/01'), by="day"), 100))%>%
mutate(ExpectedDate=Date+sample(10:200, 100, replace=TRUE),
ObservedDate=Date+sample(10:200, 100, replace=TRUE))%>%
group_by(Group)%>%
arrange(Date)%>%
mutate(n=row_number())%>%arrange(Group)%>%ungroup()%>%
as.data.frame()
#generate some missing values in "ObservedDate"
dat[sample(nrow(dat),20), "ObservedDate"]<-NA
dat$FinalDate<-NA
for (i in 1:nrow(dat)){
dat[i, "FinalDate"]<-if_else(!is.na(dat$"ObservedDate")[i] &&
dat[i, "ObservedDate"] > dat[i-1, "ExpectedDate"] &&
dat[i, "ObservedDate"] < dat[i+1, "ExpectedDate"],
dat[i, "ObservedDate"],
dat[i,"ExpectedDate"])
}
dat$FinalDate<-as.Date(dat$FinalDate) # convert numeric to Date format
e.g. in output below:
at i=90, the code looks for previous ExpectedDate within letter I
we want it to look for ExpectedDate only within letter J. If there is no previous expected date for a group and ObservedDate is greater than ExpectedDate but less than the next ExpectedDate then FinalDate should be filled with ExpectedDate.
at i=100, the code generates NA because there is no next observation available
we want this value to be filled in such that for last observation in each group, FinalDate=ObservedDate if ObservedDate is greater than this last ExpectedDate within group, else ExpectedDate.
Group Date ExpectedDate ObservedDate n FinalDate
88 I 2015-09-07 2015-12-05 <NA> 7 2015-12-05
89 I 2018-08-02 2018-11-01 2018-08-13 8 2018-11-01
90 J 2013-07-24 2013-08-30 2013-08-12 1 2013-08-30
91 J 2013-11-22 2014-01-02 2014-04-05 2 2014-04-05
92 J 2014-11-03 2015-03-23 2015-05-10 3 2015-05-10
93 J 2015-08-30 2015-12-09 2016-02-04 4 2016-02-04
94 J 2016-04-18 2016-09-03 <NA> 5 2016-09-03
95 J 2016-10-10 2017-01-29 2017-04-14 6 2017-04-14
96 J 2017-02-14 2017-07-05 <NA> 7 2017-07-05
97 J 2017-04-21 2017-10-01 2017-08-26 8 2017-08-26
98 J 2017-10-01 2018-01-27 2018-02-28 9 2018-02-28
99 J 2018-08-03 2019-01-31 2018-10-20 10 2018-10-20
100 J 2019-04-25 2019-06-23 2019-08-16 11 <NA>
We can let go of the for loop and use group_by, lag and lead from dplyr here:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(ObservedDate > lag(ExpectedDate) &
ObservedDate < lead(ExpectedDate), ObservedDate, ExpectedDate))
We can also do this with data.table::between:
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(data.table::between(ObservedDate,
lag(ExpectedDate), lead(ExpectedDate)), ObservedDate, ExpectedDate))
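Note that lag and lead return NA on the first and last row of each group, so neither snippet on its own fills the edge rows the question asks about. A sketch of one way to handle them with sentinel defaults (the sentinel dates and the !is.na guard are assumptions, not part of the original answer):
dat %>%
  group_by(Group) %>%
  mutate(FinalDate = if_else(
    !is.na(ObservedDate) &
      ObservedDate > lag(ExpectedDate, default = as.Date("1900-01-01")) &
      ObservedDate < lead(ExpectedDate, default = as.Date("9999-12-31")),
    ObservedDate, ExpectedDate)) %>%
  ungroup()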

Efficient way to calculate date-by-date rolling aggregation of the last 30 days

My code executes just fine; however, it takes an enormous amount of time to finish. I would like some help optimizing it, ideally with a way to execute a rolling aggregation on multiple columns.
I've been trying a few other ways, creating a function and vectorizing my data frame with library(data.table), but with no success: I actually get half of what I should get, and I can only do one column at a time.
# Creating functions
fun <- function(x, date, thresh) {
D <- as.matrix(dist(date)) # distance matrix between dates
D <- D <= thresh
D[lower.tri(D)] <- FALSE # don't sum to future
R <- D * x # FALSE is treated as 0
# note: colMeans divides by the total number of rows, not the window size,
# so the out-of-window zeros dilute the mean
colMeans(R, na.rm = TRUE)
}
setDT(df_2)
df_2[, invoiceDate := as.Date(invoiceDate, format = "%m/%d/%Y")]
setkey(df_2, cod_unb, cod_pdv, invoiceDate)
df_2[, volume_total_diario_RT30 := fun(volume_total_diario, invoiceDate, 30), by = list(cod_unb, cod_pdv)]
This is my current code that works fine, but takes too much time (Over 8h to process 30 days)
years <- c(2017:2019)
months <- c(1:12)
days <- c(1:31)
df_final <- df_n[1,c('cod_unb','cod_pdv','cpf_cnpj','idade_pdv_meses','status_telefone','col1','col2','col3','year','month','day')] # placeholder first row, dropped again at the end
for (i in years) {
for (j in months) {
for (k in days) {
if (j == 1){
df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) | (df_n$years == (i-1) & df_n$months == 12 & df_n$days >= k),]
}
if (j != 1){
df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) | (df_n$years == i & df_n$months == (j - 1) & df_n$days >= k),]
}
#Agreggate.
if(nrow(df_temp) >= 1){
df_temp <- aggregate(df_temp[, c('col1','col2','col3')], by = list(df_temp$cod_unb,df_temp$cod_pdv,df_temp$cpf_cnpj,df_temp$idade_pdv_meses,df_temp$status_telefone), FUN = mean)
names(df_temp)[names(df_temp) == "Group.1"] <- "cod_unb"
names(df_temp)[names(df_temp) == "Group.2"] <- "cod_pdv"
names(df_temp)[names(df_temp) == "Group.3"] <- "cpf_cnpj"
names(df_temp)[names(df_temp) == "Group.4"] <- "idade_pdv_meses"
names(df_temp)[names(df_temp) == "Group.5"] <- "status_telefone"
df_temp$years <- i
df_temp$months <- j
df_temp$days <- k
df_final <- rbind(df_final,df_temp)
}
}
}
}
df_final <- df_final[-1,]
Output should be column R30
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
If I understand correctly, the OP has requested to aggregate values over a rolling period of 30 days and to append these aggregates to the original data.
This can be solved efficiently by aggregating in a non-equi join.
Here is an example for one variable using sample data provided by the OP:
library(data.table)
# coerce to data.table, coerce character date to class IDate
setDT(df_n)[, date := as.IDate(date, "%d/%m/%Y")]
# intermediate result for demonstration:
df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower),
mean(volume_total_diario), by = .EACHI]
date date V1
1: 2018-11-03 2018-10-04 0.480000
2: 2018-11-09 2018-10-10 0.635175
3: 2018-11-16 2018-10-17 0.863800
4: 2018-11-24 2018-10-25 0.808200
5: 2018-11-30 2018-10-31 0.766560
6: 2018-12-07 2018-11-07 1.028910
7: 2018-12-15 2018-11-15 1.159260
8: 2018-12-21 2018-11-21 0.991050
9: 2018-12-28 2018-11-28 0.973470
10: 2019-01-04 2018-12-05 0.925470
The intermediate result shows the upper and lower limits of the date range included in the aggregation and the aggregated values for the respective periods. This can be used to add a new column to df_n:
# update df_n by appending new column
setDT(df_n)[, R30_new := df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower),
mean(volume_total_diario), by = .EACHI]$V1]
df_n
cod_unb cod_pdv Years Months Days date volume_total_diario R30 R30_new
1: 111 1005 2018 11 3 2018-11-03 0.48000 NA 0.480000
2: 111 1005 2018 11 9 2018-11-09 0.79035 NA 0.635175
3: 111 1005 2018 11 16 2018-11-16 1.32105 NA 0.863800
4: 111 1005 2018 11 24 2018-11-24 0.64140 NA 0.808200
5: 111 1005 2018 11 30 2018-11-30 0.60000 NA 0.766560
6: 111 1005 2018 12 7 2018-12-07 1.79175 1.02891 1.028910
7: 111 1005 2018 12 15 2018-12-15 1.44210 1.15926 1.159260
8: 111 1005 2018 12 21 2018-12-21 0.48000 0.99105 0.991050
9: 111 1005 2018 12 28 2018-12-28 0.55350 0.97347 0.973470
10: 111 1005 2019 1 4 2019-01-04 0.36000 0.92547 0.925470
The values of R30 and R30_new are identical; R30_new also contains results for the first 5 rows.
Caveat
Additional grouping variables have been ignored for the sake of clarity but can be included easily. Also, the solution can be extended to aggregate multiple value columns.
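For instance, a sketch of how the grouping columns could enter the join (an extrapolation from the answer, not shown in it):
setDT(df_n)[, R30_grp := df_n[.(cod_unb = cod_unb, cod_pdv = cod_pdv, upper = date, lower = date - 30),
  on = .(cod_unb, cod_pdv, date <= upper, date >= lower),
  mean(volume_total_diario), by = .EACHI]$V1]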
Data
library(data.table)
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")
EDIT: Aggregating multiple variables
As the OP has asked for a way to execute a rolling aggregation on multiple columns here is an example.
First, we need to create an additional value column, var2, in the OP's sample dataset:
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")[
, date := as.IDate(date, "%d/%m/%Y")][, var2 := .I][]
df_n
cod_unb cod_pdv Years Months Days date volume_total_diario R30 var2
1: 111 1005 2018 11 3 2018-11-03 0.48000 NA 1
2: 111 1005 2018 11 9 2018-11-09 0.79035 NA 2
3: 111 1005 2018 11 16 2018-11-16 1.32105 NA 3
4: 111 1005 2018 11 24 2018-11-24 0.64140 NA 4
5: 111 1005 2018 11 30 2018-11-30 0.60000 NA 5
6: 111 1005 2018 12 7 2018-12-07 1.79175 1.02891 6
7: 111 1005 2018 12 15 2018-12-15 1.44210 1.15926 7
8: 111 1005 2018 12 21 2018-12-21 0.48000 0.99105 8
9: 111 1005 2018 12 28 2018-12-28 0.55350 0.97347 9
10: 111 1005 2019 1 4 2019-01-04 0.36000 0.92547 10
So, a column var2 has been added (which simply contains the row number).
This is the code to aggregate multiple columns using the same aggregation function:
cols <- c("volume_total_diario", "var2")
setDT(df_n)[, paste0("mean_", cols) :=
df_n[.(upper = date, lower = date - 30),
on = .(date <= upper, date >= lower),
lapply(.SD, mean),
.SDcols = cols, by = .EACHI][
, .SD, .SDcols = cols]][]
df_n
cod_unb cod_pdv Years Months Days date volume_total_diario R30 var2 mean_volume_total_diario mean_var2
1: 111 1005 2018 11 3 2018-11-03 0.48000 NA 1 0.480000 1.0
2: 111 1005 2018 11 9 2018-11-09 0.79035 NA 2 0.635175 1.5
3: 111 1005 2018 11 16 2018-11-16 1.32105 NA 3 0.863800 2.0
4: 111 1005 2018 11 24 2018-11-24 0.64140 NA 4 0.808200 2.5
5: 111 1005 2018 11 30 2018-11-30 0.60000 NA 5 0.766560 3.0
6: 111 1005 2018 12 7 2018-12-07 1.79175 1.02891 6 1.028910 4.0
7: 111 1005 2018 12 15 2018-12-15 1.44210 1.15926 7 1.159260 5.0
8: 111 1005 2018 12 21 2018-12-21 0.48000 0.99105 8 0.991050 6.0
9: 111 1005 2018 12 28 2018-12-28 0.55350 0.97347 9 0.973470 7.0
10: 111 1005 2019 1 4 2019-01-04 0.36000 0.92547 10 0.925470 8.0
Note that the new columns have been named programmatically.

R- create dataset by removing duplicates based on a condition - filter

I have a data frame where for each day, I have several prices.
I would like to modify my data frame with the following code :
newdf <- Data %>%
filter(
if (Data$Date == Data$Echeance) {
Data$Close == lag(Data$Close,1)
} else {
Data$Close == Data$Close
}
)
However, it is not giving me what I want, which is:
create a new data frame where the variable Close takes its normal value, unless the day of Date equals the day of Echeance; in that case, take the following Close value.
I added filter because I wanted to remove the duplicate dates and keep only one row per day where Close satisfies the condition above.
There is no error message, it just doesn't give me the right database.
Here is a glimpse of my data:
Date Echeance Compens. Open Haut Bas Close
1 1998-03-27 00:00:00 1998-09-10 00:00:00 125. 828 828 820 820.197
2 1998-03-27 00:00:00 1998-11-10 00:00:00 128. 847 847 842 842.124
3 1998-03-27 00:00:00 1999-01-11 00:00:00 131. 858 858 858 858.2
4 1998-03-30 00:00:00 1998-09-10 00:00:00 125. 821 821 820 820.38
5 1998-03-30 00:00:00 1998-11-10 00:00:00 129. 843 843 843 843.1
6 1998-03-30 00:00:00 1999-01-11 00:00:00 131. 860 860 860 860.5
Thanks a lot in advance.
Sounds like a use case for ifelse, with dplyr:
library(dplyr)
Data %>%
mutate(Close = ifelse(Date==Echeance, lead(Close,1), Close))
Here an example:
dat %>%
mutate(var_new = ifelse(date1==date2, lead(var,1), var))
# A tibble: 3 x 4
# date1 date2 var var_new
# <date> <date> <int> <int>
# 1 2018-03-27 2018-03-27 10 11
# 2 2018-03-28 2018-01-01 11 11
# 3 2018-03-29 2018-02-01 12 12
The function lead shifts the vector by one position. Also note that I created var_new just to show the difference, but you can mutate var directly.
Data used:
dat <- tibble(date1 = seq(from=as.Date("2018-03-27"), to=as.Date("2018-03-29"), by="day"),
date2 = c(as.Date("2018-03-27"), as.Date("2018-01-01"), as.Date("2018-02-01")),
var = 10:12)
dat
# A tibble: 3 x 3
# date1 date2 var
# <date> <date> <int>
# 1 2018-03-27 2018-03-27 10
# 2 2018-03-28 2018-01-01 11
# 3 2018-03-29 2018-02-01 12
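For the de-duplication part of the question (keeping only one row per day), a sketch built on the mutate above, assuming "one date per day" means the first row per Date (this step is not part of the original answer):
Data %>%
  mutate(Close = ifelse(Date == Echeance, lead(Close, 1), Close)) %>%
  group_by(Date) %>%
  slice(1) %>%
  ungroup()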

How can I implement a rolling function partitioned over another column?

I have some data that looks like
CustomerID InvoiceDate
<fctr> <dttm>
1 13313 2011-01-04 10:00:00
2 18097 2011-01-04 10:22:00
3 16656 2011-01-04 10:23:00
4 16875 2011-01-04 10:37:00
5 13094 2011-01-04 10:37:00
6 17315 2011-01-04 10:38:00
7 16255 2011-01-04 11:30:00
8 14606 2011-01-04 11:34:00
9 13319 2011-01-04 11:40:00
10 16282 2011-01-04 11:42:00
It tells me when a person makes a transaction. I would like to know the time between transactions for each customer, preferably in days. I do this in the following way:
d <- data %>%
arrange(CustomerID,InvoiceDate) %>%
group_by(CustomerID) %>%
mutate(delta.t = InvoiceDate - lag(InvoiceDate), #calculating the difference
delta.day = as.numeric(delta.t, unit = 'days')) %>%
na.omit() %>%
arrange(CustomerID) %>%
inner_join(Ntrans) %>% #Existing data.frame telling me the number of transactions per customer
filter(N>=10) %>% #only want people with more than 10 transactions
select(-N)
However, the result doesn't make sense (see below).
CustomerID InvoiceDate delta.t delta.day
<fctr> <dttm> <time> <dbl>
1 12415 2011-01-10 09:58:00 5686 days 5686
2 12415 2011-02-15 09:52:00 51834 days 51834
3 12415 2011-03-03 10:59:00 23107 days 23107
4 12415 2011-04-01 14:28:00 41969 days 41969
5 12415 2011-05-17 15:42:00 66314 days 66314
6 12415 2011-05-20 14:13:00 4231 days 4231
7 12415 2011-06-15 13:37:00 37404 days 37404
8 12415 2011-07-13 15:30:00 40433 days 40433
9 12415 2011-07-13 15:31:00 1 days 1
10 12415 2011-07-19 10:51:00 8360 days 8360
The differences measured in days are way off. What I want is something close to SQL's rolling window function partitioned over customerID. How can I implement this?
If you just want to convert the difference to days, you can use the lubridate package.
> library('lubridate')
> library('dplyr')
>
> InvoiceDate <- c('2011-01-10 09:58:00', '2011-02-15 09:52:00', '2011-03-03 10:59:00')
> CustomerID <- c(111, 111, 111)
>
> dat <- data.frame('Invo' = InvoiceDate, 'ID' = CustomerID)
>
> dat %>% mutate('Delta' = as_date(Invo) - as_date(lag(Invo)))
Invo ID Delta
1 2011-01-10 09:58:00 111 NA days
2 2011-02-15 09:52:00 111 36 days
3 2011-03-03 10:59:00 111 16 days
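Note that as_date() truncates the timestamps to whole days before differencing. A sketch that keeps time-of-day precision instead (assuming Invo parses cleanly as POSIXct; this variant is not part of the original answer):
dat %>%
  mutate(Invo = as.POSIXct(as.character(Invo), tz = "UTC"),
         Delta = as.numeric(difftime(Invo, lag(Invo), units = "days")))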
