Using BTYD to predict date and amount of customer's next purchase - r

The BTYD package in R looks very useful for predicting future customer behavior based on past transactions.
However, the walk-through only illustrates predicting how many transactions a customer will make in an upcoming period, for example in the next year or month.
Is there a way to use this package to create a prediction for the date on which a customer will purchase, and the expected amount of the purchase?
For example, using the sample data set available in the BTYD package:
cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
elog <- dc.ReadLines(cdnowElog, cust.idx = 2,
date.idx = 3, sales.idx = 5)
# Change to date format
elog$date <- as.Date(elog$date, "%Y%m%d");
# cust date sales
# 1 1 1997-01-01 29.33
# 2 1 1997-01-18 29.73
# 3 1 1997-08-02 14.96
I would want an output that has the customer number, expected next date of purchase, and expected purchase amount.
# cust exp_date exp_sales
# 1 1998-02-23 19.35
# 2 1997-09-12 39.83
# 3 1998-01-05 24.56
Or can this package only predict the expected number of transactions in a time period, and not the date itself or the spend amount? Is there a better approach for what I want to achieve?
I apologize if this question seems very basic, but I couldn't find the answer to this conceptual question in the documentation.


In R, How can I create a new date variable adopting the nearest date value right after an index date variable?

My dataframe in R studio is as follows:
StudyID FITDate.1 ScopeDate.1 ScopeDate.2 ScopeDate.3 ScopeDate.4
1 2014-05-15 2010-06-02 2014-05-28 2014-08-01 2015-10-27
2 2017-11-29 2018-02-27
3 2015-10-04 2016-06-24 2017-01-18
I have a variable "FITDate.1" that indicates the date of the FIT test, and several variables "ScopeDate.x" that indicate the dates of multiple scope tests.
In my research, a person can have only one date for the FIT test, but can have multiple dates for scope tests. Clinically, if a person has a FIT test, he will then be referred to undergo a scope test. However, this person may also receive scope tests for other reasons.
So if the date of a scope test is right after the date of a FIT test, we define the two as highly related.
I want to create a variable "FITrelatedscopedate" that holds the dates of FIT-related scopes. For example, in the row with StudyID==1, the date in "FITDate.1" is 2014-05-15, which falls between ScopeDate.1 (2010-06-02) and ScopeDate.2 (2014-05-28). So the value 2014-05-28 from ScopeDate.2 is what I need: I will use 2014-05-28 as the FIT-related scope date and write it to the new variable "FITrelatedscopedate".
I think I have to use a loop, but I have no experience writing one. Has anyone solved a similar problem, or does anyone know code that does this? Thanks, any help is appreciated.
Here is one approach with tidyverse assuming you start with two long data.frames, one for FIT testing, and the other for endoscopy.
df_fit <- data.frame(
StudyID = 1:3,
FITDate = as.Date(c("2014-05-15", "2017-11-29", "2015-10-04"))
)
StudyID FITDate
1 1 2014-05-15
2 2 2017-11-29
3 3 2015-10-04
df_scope <- data.frame(
StudyID = c(1,1,1,1,2,3,3),
ScopeDate = as.Date(c("2010-06-02", "2014-05-28", "2014-08-01", "2015-10-27", "2018-02-27",
"2016-06-24", "2017-01-18"))
)
StudyID ScopeDate
1 1 2010-06-02
2 1 2014-05-28
3 1 2014-08-01
4 1 2015-10-27
5 2 2018-02-27
6 3 2016-06-24
7 3 2017-01-18
First, you can do a left_join by the StudyID to add the scope dates to the FIT data. Then, you can filter to only keep scope dates after FIT testing. For each StudyID, use slice to retain only the first row (this assumes dates are in chronological order...if not, add arrange(ScopeDate) first in the pipe - let me know if you need help with this).
Then, you can right_join back to df_fit so that those FIT testing dates without endoscopy will have NA for the ScopeDate. The final statement with mutate will calculate the time duration between endoscopy and FIT testing.
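library(dplyr)
# Attach every scope date to the matching FIT record
left_join(
df_fit,
df_scope,
by = "StudyID"
) %>%
# Keep only scopes performed after the FIT test
filter(ScopeDate > FITDate) %>%
group_by(StudyID) %>%
# First remaining row per person (assumes chronological order)
slice(1) %>%
# Bring back FIT records that have no later scope (ScopeDate becomes NA)
right_join(df_fit) %>%
mutate(Duration = ScopeDate - FITDate)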
StudyID FITDate ScopeDate Duration
<dbl> <date> <date> <drtn>
1 1 2014-05-15 2014-05-28 13 days
2 2 2017-11-29 2018-02-27 90 days
3 3 2015-10-04 2016-06-24 264 days
Let me know if this works for you. A data.table approach can be considered if you need something faster and have a very large dataset.
If you need the Duration as a numeric column, you can use as.numeric(ScopeDate - FITDate).
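For readers who prefer to avoid extra packages, the same idea can be sketched in base R. This is only an illustration under the same assumption of two long data frames; the object names (fit, scope, first_after, result) are made up for the example, and rows are sorted explicitly so that chronological order is not assumed.

```r
# Same sample data as above, in long format
fit <- data.frame(
  StudyID = 1:3,
  FITDate = as.Date(c("2014-05-15", "2017-11-29", "2015-10-04"))
)
scope <- data.frame(
  StudyID = c(1, 1, 1, 1, 2, 3, 3),
  ScopeDate = as.Date(c("2010-06-02", "2014-05-28", "2014-08-01", "2015-10-27",
                        "2018-02-27", "2016-06-24", "2017-01-18"))
)

# Pair every FIT record with all scope dates of the same person
merged <- merge(fit, scope, by = "StudyID", all.x = TRUE)

# Keep scopes strictly after the FIT date, then sort so the earliest comes first
after <- merged[!is.na(merged$ScopeDate) & merged$ScopeDate > merged$FITDate, ]
after <- after[order(after$StudyID, after$ScopeDate), ]

# First (earliest) qualifying scope per person
first_after <- after[!duplicated(after$StudyID), c("StudyID", "ScopeDate")]

# Re-attach to the FIT table; people without a later scope get NA
result <- merge(fit, first_after, by = "StudyID", all.x = TRUE)
result$Duration <- as.numeric(result$ScopeDate - result$FITDate)  # days
result
```

Sorting before taking the first row per StudyID plays the role of arrange(ScopeDate) in the pipe above.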

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset describes an id variable that identifies a person and the date when his or her unemployment benefits starts.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is a dummy variable that equals one if someone built up unemployment benefit rights in that year (i.e. if someone worked) and zero otherwise.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
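# Convert to character first: as.Date() on a number treats the second argument as an origin
df1$start_UI<-as.Date(as.character(df1$start_UI), "%Y%m%d")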
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information from the above two datasets. We see that person R005 worked in the years 2010 and 2011. In 2012 this person filed for Unemployment insurance. Thereafter person R005 works again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he got unemployed. Hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014 as he only filed for unemployment benefits in December of that year. Therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits, and that is something I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.
base R
You can use Reduce with accumulate = TRUE.
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
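Or, starting from the original df2 (without the employment_history column added above), use the dplyr::cumany function together with tidyr::pivot_longer:
library(dplyr)
library(tidyr)
df2 %>%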
pivot_longer(-id) %>%
group_by(id) %>%
summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
left_join(df1, .)
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5
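Both approaches above stop counting at the first year without work. As a hedged alternative, here is a base R sketch that uses start_UI explicitly and counts the worked years up to and including the year in which benefits started. It assumes the dummy columns follow the workedYYYY naming pattern from the question; the helper names (m, year_cols, mask) are illustrative only.

```r
df1 <- data.frame(
  id = c("R005", "R006", "R007"),
  start_UI = as.Date(c("2012-06-10", "2013-01-15", "2014-12-21"))
)
df2 <- data.frame(
  id = c("R005", "R006", "R007"),
  worked2010 = c(1, 1, 1), worked2011 = c(1, 1, 1), worked2012 = c(0, 1, 1),
  worked2013 = c(1, 0, 1), worked2014 = c(1, 0, 1)
)

m <- merge(df1, df2, by = "id")

# Recover the calendar year from each dummy column name
year_cols <- grep("^worked", names(m), value = TRUE)
col_years <- as.integer(sub("^worked", "", year_cols))
start_year <- as.integer(format(m$start_UI, "%Y"))

# mask[i, j] is TRUE when column year j is not later than person i's start year
mask <- outer(start_year, col_years, ">=")

# Sum only the dummies that fall inside the mask
m$employment_history <- rowSums(as.matrix(m[year_cols]) * mask)
m[c("id", "start_UI", "employment_history")]
```

On the sample data this gives 2, 3 and 5 for R005, R006 and R007. Unlike the cumulative approach, it would also count work years that come after a gap but before the benefit start.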

XTS:: Help me on the usage & differences between period.apply() & to.period()

I am learning time series analysis with R and came across these two functions. I understand that both return periodic data at the frequency given by the period, and the only difference I can see is the OHLC output option in to.period().
Apart from OHLC, when should each of these functions be used?
to.period and all the to.minutes, to.weekly, to.quarterly are indeed meant for OHLC data.
If you take the function to.period it will take the open from the first day of the period, the close of the last day of the period and the highest high / lowest low of the specified period. These functions work very well together with the quantmod / tidyquant / quantstrat packages. See code example 1.
If you give the to.period non-OHLC data, but a timeseries with 1 data column, you still get a sort of OHLC back. See code example 2.
Now period.apply is more interesting. Here you can supply your own functions to be applied to the data. Especially in combination with endpoints, this can be a powerful function for time series data if you want to aggregate with your own function over different time periods. The index is mostly specified with endpoints, since endpoints lets you create the index you need to move to a coarser time level (from day to week, etc.). See code examples 3 and 4.
Remember to use matrix functions with period.apply if you have more than one column of data, since an xts object is basically a matrix plus an index. See code example 5.
More info on this course.
library(xts)
data(sample_matrix)
# The name of this series was lost from the post; zoo.data is an assumed name
zoo.data <- zoo(rnorm(31)+10,as.Date(13514:13744,origin="1970-01-01"))
# code example 1
to.quarterly(sample_matrix)
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007 Q1 50.03978 51.32342 48.23648 48.97490
2007 Q2 48.94407 50.33781 47.09144 47.76719
# same as to.quarterly
to.period(sample_matrix, period = "quarters")
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007 Q1 50.03978 51.32342 48.23648 48.97490
2007 Q2 48.94407 50.33781 47.09144 47.76719
# code example 2
to.period(zoo.data, period = "quarters")
2007-03-31 9.039875 11.31391 7.451139 10.35057
2007-06-30 10.834614 11.31391 7.451139 11.28427
2007-08-19 11.004465 11.31391 7.451139 11.30360
# code example 3 using base standard deviation in the chosen period
period.apply(zoo.data, endpoints(zoo.data, on = "quarters"), sd)
2007-03-31 2007-06-30 2007-08-19
1.026825 1.052786 1.071758
# code example 4
# self defined function of summing x + x for the period
period.apply(zoo.data, endpoints(zoo.data, on = "quarters"), function(x) sum(x + x) )
2007-03-31 2007-06-30 2007-08-19
1798.7240 1812.4736 993.5729
# code example 5
period.apply(sample_matrix, endpoints(sample_matrix, on = "quarters"), colMeans)
Open High Low Close
2007-03-31 50.15493 50.24838 50.05231 50.14677
2007-06-30 48.47278 48.56691 48.36606 48.45318
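To make the role of endpoints more concrete, here is a small sketch with deterministic data, so the results can be checked by hand. The series name daily and the values 1 to 90 are assumptions made for the illustration; the sketch requires the xts package.

```r
library(xts)

# 90 consecutive days starting 2007-01-01, with values 1..90
dates <- seq(as.Date("2007-01-01"), by = "day", length.out = 90)
daily <- xts(1:90, order.by = dates)

# endpoints() returns the row number of the last observation in each month,
# with a leading 0 so consecutive pairs delimit the periods
ep <- endpoints(daily, on = "months")
ep

# period.apply() runs the function on each block of rows between endpoints
monthly_sum <- period.apply(daily, INDEX = ep, FUN = sum)
monthly_sum
```

January, February and March 2007 have 31, 28 and 31 days, so the endpoints are 0, 31, 59 and 90, and the three sums are 496, 1274 and 2325. Swapping sum for any other function that returns a single value gives a custom aggregation that to.period does not offer.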

Find missing entry at different time period

I am processing a data frame with four columns:
portfolio date stock Value
1 200006 Apple 10
1 200006 Google 20
1 200006 IBM 30
1 200007 Apple 10
Because the amount of data is large, I want a simple way to check which stocks held in portfolio 1 in June 2000 are missing in July 2000; here both Google and IBM are missing. The return value would be c("IBM","GOOGLE"). I will use the list of stocks not listed in July 2000 to look up their values in June 2000 and rebalance the portfolio in July 2000. So in this case, I hope to get c("IBM","GOOGLE") and then their values (20,30) to make a further adjustment to Apple's value of 10.
The data types of the four columns are factor, integer, factor and integer for portfolio, date, stock and Value.
Is there any function or package that can deal with this problem?
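You can try this with data.table:
library(data.table)
setDT(df)  # convert the data frame to a data.table by reference
# Get all possible stocks
stocks <- unique(df$stock)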
# Get missing stocks
df[, stocks[!stocks %in% stock], .(portfolio, date)]
# portfolio date V1
# 1: 1 200007 Google
# 2: 1 200007 IBM
# Or vector output (no date or portfolio info)
df[, stocks[!stocks %in% stock], .(portfolio, date)]$V1
# [1] "Google" "IBM"
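The question also asks for the June values of the missing stocks. A minimal base R sketch of that step follows, assuming the four-row sample from the question and comparing just two hard-coded months; the names prev, curr, missing_stocks and missing_values are illustrative.

```r
df <- data.frame(
  portfolio = c(1, 1, 1, 1),
  date = c(200006L, 200006L, 200006L, 200007L),
  stock = c("Apple", "Google", "IBM", "Apple"),
  Value = c(10, 20, 30, 10),
  stringsAsFactors = FALSE
)

# Holdings of portfolio 1 in the earlier and the later month
prev <- df[df$portfolio == 1 & df$date == 200006, ]
curr <- df[df$portfolio == 1 & df$date == 200007, ]

# Stocks present in June but absent in July
missing_stocks <- setdiff(prev$stock, curr$stock)

# Their June values, in the same order as missing_stocks
missing_values <- prev$Value[match(missing_stocks, prev$stock)]

missing_stocks
missing_values
```

If stock is stored as a factor, as in the question, wrapping it in as.character() before setdiff keeps the labels rather than the underlying codes.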

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today: should it be the tenure up until today, or should it be NA?
First, I disagree with the previous answer. For a subscription that is still active today, the tenure should be recorded neither as exactly the tenure up until today nor as NA. What do we know about those subscriptions? We know that they have lasted up until today. In other words, although we don't know the exact value of tenure_in_months for those observations, we know it is longer than the tenure accumulated up to today.
This situation is known as right censoring in survival analysis. See:
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, event=status, type="interval2")
See for more syntax details. A very good summary of computational details can be found:
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as censor date.
NA won't work with the survival object. I think those cases will be omitted. That is not what you want, because these cases contain important information about the survival.
SQL code to get the time till event (use in SELECT part of query)
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
I would use the difference in days for my analysis. It does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates it's right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
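The right-censored encoding described in the last two answers can be sketched directly in R. The data collection date below (2013-10-01) is a hypothetical value chosen for the illustration, as are the names censor_date, stop_date and tenure_days; tenure is measured in days, as one of the answers suggests.

```r
library(survival)

subscriptions <- data.frame(
  id = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

# Hypothetical date on which the data was extracted
censor_date <- as.Date("2013-10-01")

# 1 = cancelled (event observed), 0 = still active (right-censored)
subscriptions$status <- as.integer(!is.na(subscriptions$end_date))

# Active subscriptions are followed up only until the censor date
stop_date <- subscriptions$end_date
stop_date[is.na(stop_date)] <- censor_date
subscriptions$tenure_days <- as.numeric(stop_date - subscriptions$start_date)

# Kaplan-Meier fit on right-censored data
fit <- survfit(Surv(tenure_days, status) ~ 1, data = subscriptions)
summary(fit)
```

With this encoding, id 2 contributes 122 days of follow-up with status 0, so it stays in the risk set for the events at 42 and 85 days instead of being dropped.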
