I'm new to R and have gone through several tutorials (online and paid) but am still struggling with the implementation of my requirement. I would like to build a retention cohort in R. I currently do this in Excel, which takes me 4-5 hours every time I need to build one. Therefore, I'm exploring whether R can help out. It seems it can, but I need some direction.
Below is the output in Excel. If you look at May 2011, I had 31 customers who joined, and you can see their month-wise progression up to the current month.
Output in Excel
This is the input data:
The initial columns are the customer ID, the date of joining (dj), and the date of expiry. Two additional columns convert dj and the expiry date to text. Columns K onwards process the data to check whether the customer is active in a given month or has churned: I use the date of joining in text form to mark an active month and append "C" to it to denote expiry. Later I just count the columns containing the date to get the cohort.
So, how do I implement this in R?
Assuming this is the sample data, I need the cohort from 1 May 2015 to 1 Jan 2016:
customer dj exp
abc 01/05/15 25/6/15
efg 01/05/15 25/7/15
ghd 01/05/15 25/7/15
mkd 01/06/15 25/7/15
kskm 01/06/15 05/8/15
This is what I would like to create from the above data.
Cohort M0 M1 M2 M3 M4
2015-05 3 3 2 0 0
2015-06 2 2 1 1 0
Explanation: M0, M1, M2, ... count months from the date of joining. So 3 people joined us in May 2015 and all of them are active in May (M0). M1 would be June, and all of them are still active in June. We lose 1 customer at the end of June, on the 25th, so we still consider him active in June, but in M2 the count drops from 3 to 2. M3 for the May cohort corresponds to August, by which time we have lost all of those customers.
The same process applies to the 2015-06 cohort: M1 would be July and M3 would be September.
I modified your code as below and it works, thanks! Now I'm trying to figure out a way to make M0 to M(n) dynamic.
library(readxl)
library(zoo)
library(plyr)
# Read in the data
df <- read.csv("~/Desktop/R/data.csv")
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
df$dj <- as.Date(df$dj,"%d/%m/%y")
df$exp <- as.Date(df$exp,"%d/%m/%y")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- ceiling((df$exp-df$dj)/30)
#df$MonthDiff <- 12*(as.yearmon(df$exp+months(1))-df$Cohort)
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0 through 5 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff > 0),
M1 = sum(MonthDiff > 1),
M2 = sum(MonthDiff > 2),
M3 = sum(MonthDiff > 3),
M4 = sum(MonthDiff > 4),
M5 = sum(MonthDiff > 5)
)
df1
df1
Cohort M0 M1 M2 M3 M4 M5
1 May 2015 3 3 2 0 0 0
2 Jun 2015 2 2 1 0 0 0
Try this:
library(readxl)
library(zoo)
library(plyr)
# Read in the data
df <- read_excel("MyFile.xlsx")
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- 12*(as.yearmon(df$exp)-df$Cohort)
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0, 1, 2, 3, and 4 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff >= 0),
M1 = sum(MonthDiff >= 1),
M2 = sum(MonthDiff >= 2),
M3 = sum(MonthDiff >= 3),
M4 = sum(MonthDiff >= 4))
df1
# Cohort M0 M1 M2 M3 M4
# May 2015 3 3 2 0 0
# Jun 2015 2 1 0 0 0
This assumes that when you read the data in from Excel, the dates are formatted as dates. In case they are not, you could use the following:
df$dj <- as.Date(df$dj,"%d/%m/%y")
df$exp <- as.Date(df$exp,"%d/%m/%y")
I have data like this.
library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)
head(mydata)
# datex hourx sales
#1 2021-01-01 0 2800
#2 2021-01-01 1 4100
#3 2021-01-01 2 3800
#4 2021-01-01 3 2500
#5 2021-01-01 4 3500
#6 2021-01-01 5 3800
tail(mydata)
# datex hourx sales
#6547 2021-09-30 18 3900
#6548 2021-09-30 19 3600
#6549 2021-09-30 20 3000
#6550 2021-09-30 21 4700
#6551 2021-09-30 22 4700
#6552 2021-09-30 23 3600
I have a task to do modelling using linear regression, but with tricky data. Assume we have data from January to March, and we need those data to forecast April. Here are the steps:
We use the January and February data as independent variables (X) and the March data as the dependent variable (Y) to build the regression model. Because February has the fewest days (28), we cut the January and March data down to 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
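The hard-coded row indices above follow from the number of hourly rows per month (24 per day):
31 * 24 # January: 744 rows, so its first 28 days are rows 1:672
28 * 24 # February: 672 rows, i.e. rows 745:1416
# March starts at row 1417, so its first 28 days are rows 1417:2088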
Modelling the regression using the lm() function:
mydata_reg <- data.frame(x1 = data_jan$sales,
x2 = data_feb$sales,
y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
After getting the model, we use new data from February and March as the independent variables (X):
mydata_reg_for <- data.frame(x1 = data_feb$sales,
x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
Check the length of the month. April has 30 days but we only have 28 days of forecast data, so we still need 2 more days to complete the forecast. February only has 28 days, so for the first predictor (x1) we extend it with the first two dates of March, "2021-03-01" and "2021-03-02". March has 31 days, so for the second predictor (x2) we just add "2021-03-29" and "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
x2 = data_mar_add$sales)
After that we predict using the model_reg model from before and combine all of the April forecasts:
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))
My question is: how do we make this process run automatically every month using the dplyr package? Every month has a different number of days; I use the February data here because it has the fewest days, and the same condition applies to other months. Many thanks.
If you want to control the number of days after each month (or in each month), you could filter by the date rather than by row numbers.
I'm sure it can be tidied up more than this, but you would just need to change the forecast_date <- as.Date("2021-04-01") to whichever month you want to forecast.
##set the forecast month. This should be straight forward to automate with a list or an increment
forecast_date <- as.Date("2021-04-01") # April
##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forecast_month_length <- days_in_month(forecast_date) #30 days
##get dates for the previous 3 months
month_1_date <- forecast_date %m-% months(3)
month_2_date <- forecast_date %m-% months(2)
month_3_date <- forecast_date %m-% months(1)
##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
days_in_month(month_2_date),
days_in_month(month_3_date))) #28 days
##select the first 28 days (the shortest month) for each of the months used for the variables
data_month_1 <- mydata[mydata$datex %in% month_1_date:(month_1_date + shortest_month - 1),]
data_month_2 <- mydata[mydata$datex %in% month_2_date:(month_2_date + shortest_month - 1),]
data_month_3 <- mydata[mydata$datex %in% month_3_date:(month_3_date + shortest_month - 1),]
##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% month_2_date:(month_2_date + forecast_month_length - 1),]
month_3_forecast_length <- mydata[mydata$datex %in% month_3_date:(month_3_date + forecast_month_length - 1),]
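From here, the modelling steps from the question can be reused with these objects (a sketch only, keeping the question's x1/x2/y convention; train_df, new_df and pred_forecast are just illustrative names):
# fit on the three look-back months, each trimmed to the shortest month
train_df <- data.frame(x1 = data_month_1$sales,
x2 = data_month_2$sales,
y = data_month_3$sales)
model_reg <- lm(y ~ ., data = train_df)
# predict the forecast month from the last two months, extended to its full length
new_df <- data.frame(x1 = month_2_forecast_length$sales,
x2 = month_3_forecast_length$sales)
pred_forecast <- predict(model_reg, newdata = new_df)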
You can simply split the data by month with dplyr's group_split():
library(dplyr)
mydata %>%
group_split(month(datex))
This code will split mydata into a list of data frames, one per month present in the data (nine here, since mydata runs from January to September).
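For example, you can then pull a single month out of that list (a small usage sketch; month_list is just an illustrative name):
month_list <- mydata %>%
group_split(month(datex))
length(month_list) # number of months present in the data
head(month_list[[3]]) # e.g. the March rows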
Given the following small data fragment:
library(tibble)
d1 <- as.Date('April 26, 2001', format='%B %d, %Y')
d2 <- as.Date('April 27, 2001', format='%B %d, %Y')
d3 <- as.Date('April 28, 2001', format='%B %d, %Y')
tibble(DATE=c(d1,d1,d2,d2,d3,d3), Symbol=c("A","B","A","B","A","B"), voladj=c(0.2, 0.3, -0.2, -0.1, 0.3, 0.2))
resulting in
# A tibble: 6 x 3
DATE Symbol voladj
<date> <chr> <dbl>
1 2001-04-26 A 0.2
2 2001-04-26 B 0.3
3 2001-04-27 A -0.2
4 2001-04-27 B -0.1
5 2001-04-28 A 0.3
6 2001-04-28 B 0.2
I am trying to compute a series of correlation/covariance matrices: one at time d2, one at time d3, and so on. Ideally the data are exponentially weighted. What options do I have in R? To make things a bit more spicy, a Symbol C may at some point show up, too. I was thinking of computing the outer product (a rank-1 matrix) at times t1, t2, t3, and then using a simple moving mean.
A potential output could be the following:
DATE cov
<date>
1 2001-04-26 M1
2 2001-04-27 M2
3 2001-04-28 M3
where M_i are matrices (or frames), such as
M_1 = A B
A 1.0 c1
B c1 1.0
etc.
Obviously this gets more interesting once more symbols are involved.
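To make the outer-product idea concrete, here is a minimal sketch of what I have in mind (lambda, rets and ewma_cov are just illustrative names, and voladj is treated as a zero-mean return):
library(tibble)
library(tidyr)
rets <- tibble(DATE=c(d1,d1,d2,d2,d3,d3), Symbol=c("A","B","A","B","A","B"), voladj=c(0.2, 0.3, -0.2, -0.1, 0.3, 0.2))
# one row per date, one column per symbol; a later Symbol C would simply add a column
# (with NAs before it exists, which would need handling, e.g. replacing by 0)
wide <- pivot_wider(rets, names_from = Symbol, values_from = voladj)
X <- as.matrix(wide[, -1])
lambda <- 0.94 # illustrative decay factor
S <- matrix(0, ncol(X), ncol(X), dimnames = list(colnames(X), colnames(X)))
ewma_cov <- vector("list", nrow(X))
for (t in seq_len(nrow(X))) {
  x <- X[t, ]
  S <- lambda * S + (1 - lambda) * tcrossprod(x) # rank-1 outer-product update
  ewma_cov[[t]] <- S
}
names(ewma_cov) <- as.character(wide$DATE)
ewma_cov[["2001-04-28"]]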
Updated answer, given comments
Here is an approach using quantmod to retrieve 5 stocks for three weeks from Yahoo Finance. We combine the Close variable from the xts objects into a data frame, generate week identifiers with lubridate::week(), split() it by week, and calculate covariance matrices for each week using lapply().
library(quantmod)
from.dat <- as.Date("12/03/19",format="%m/%d/%y")
to.dat <- as.Date("12/24/19",format="%m/%d/%y")
theSymbols <- c("AAPL","AXP","BA","CAT","CSCO")
getSymbols(theSymbols,from=from.dat,to=to.dat,src="yahoo")
#combine to single data frame
combinedData <- data.frame(date = as.Date(rownames(as.data.frame(AAPL))),
AAPL$AAPL.Close,
AXP$AXP.Close,
BA$BA.Close,
CAT$CAT.Close,
CSCO$CSCO.Close)
colnames(combinedData) <- c("date","AAPL","AXP","BA","CAT","CSCO")
# split by week
library(lubridate)
combinedData$week <- week(combinedData$date)
symbolsByWeek <- split(combinedData,as.factor(combinedData$week))
covariances <- lapply(symbolsByWeek,function(x){
cov(x[,-c(1,7)])
})
covariances[[1]]
...and the output:
> covariances[[1]]
AAPL AXP BA CAT CSCO
AAPL 19.4962156 7.0959976 3.9093027 5.4158116 -0.66194433
AXP 7.0959976 3.0026695 2.0175793 2.2569625 -0.18793832
BA 3.9093027 2.0175793 10.4511473 1.8555752 0.55619975
CAT 5.4158116 2.2569625 1.8555752 1.8335361 -0.11141911
CSCO -0.6619443 -0.1879383 0.5561997 -0.1114191 0.07287982
>
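If you want correlation matrices rather than covariance matrices, base R's cov2cor() will convert each one:
correlations <- lapply(covariances, cov2cor)
correlations[[1]]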
Original answer
Here is an approach using quantmod to retrieve Dow 30 data for four days from Yahoo Finance, lapply() and do.call() with rbind() to massage it into a single data frame, and split() to split by day to produce daily covariance matrices.
library(quantmod)
from.dat <- as.Date("12/02/19",format="%m/%d/%y")
to.dat <- as.Date("12/06/19",format="%m/%d/%y")
theSymbols <- c("AAPL","AXP","BA","CAT","CSCO","CVX","XOM","GS","HD","IBM",
"INTC","JNJ","KO","JPM","MCD","MMM","MRK","MSFT","NKE","PFE","PG",
"TRV","UNH","UTX","VZ","V","WBA","WMT","DIS","DOW")
getSymbols(theSymbols,from=from.dat,to=to.dat,src="yahoo")
# since quantmod::getSymbols() writes named xts objects, need to use
# get() with the symbol names to access each data frame
# e.g. head(get(theSymbols[[1]]))
# convert to list
symbolData <- lapply(theSymbols,function(x){
y <- as.data.frame(get(x))
colnames(y) <- c("open","high","low","close","volume","adjusted")
# add date and symbol name to output data frames
y$date <- rownames(y)
y$symbol <- x
y
})
#combine to single data frame
combinedData <- do.call(rbind,symbolData)
# split by day
symbolsByDay <- split(combinedData,as.factor(combinedData$date))
covariances <- lapply(symbolsByDay,function(x){
cov(x[,1:6]) # only use first 6 columns
})
# print first covariance matrix
covariances[1]
...and the output:
> covariances[1]
$`2019-12-02`
open high low close volume adjusted
open 5956.289 5962.359 5811.514 5818.225 -9.274871e+07 5809.939
high 5962.359 5968.557 5817.580 5824.272 -9.314473e+07 5816.005
low 5811.514 5817.580 5671.809 5678.470 -9.188418e+07 5670.276
close 5818.225 5824.272 5678.470 5685.467 -9.155485e+07 5677.246
volume -92748711.735 -93144729.578 -91884178.312 -91554853.356 4.365841e+13 -90986549.261
adjusted 5809.939 5816.005 5670.276 5677.246 -9.098655e+07 5669.171
>
Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A, Row 1 will be split into Data set B, Rows 1, 2 and 3 for June, July and August separately; the sale will be pro rata based on the number of days in each month, and all other columns stay the same.
Data set A, Row 2 will pick up what was left over from Row 1 (which ends on 5/09/2017) and form a complete September.
Is there an efficient way to do this? The actual data is a CSV file of roughly 100K x 15, which will be split into a new data set of approximately 300K x 15 for monthly analysis.
Some key characteristics of the sample question data:
The start day of the first quarterly sales record is the day the customer joins, so it could be any day;
All sales are quarterly, with quarters of 90, 91, or 92 days, but it is also possible to have incomplete quarterly sales data when a customer leaves during the quarter.
Sample Question:
Customer.ID Country Type Sale Start..Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commercial 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
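To spell out the pro-rata arithmetic behind the expected answer for customer 1 (both quarters happen to work out to exactly one unit of sale per day):
# quarter 1: 91 sale units over 91 days (7 Jun - 5 Sep)
91 / 91 * c(Jun = 24, Jul = 31, Aug = 31, Sep = 5)
# quarter 2: 92 sale units over 92 days (6 Sep - 6 Dec)
92 / 92 * c(Sep = 25, Oct = 31, Nov = 30, Dec = 6)
# the two September pieces (5 + 25) combine into the single September row of 30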
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
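A quick optional check: because the split is pro rata within each quarter, the grand total of sales should be unchanged (up to floating-point error):
all.equal(sum(df_quarterly$sale), sum(df_monthly$sale))
head(df_monthly)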
It's not pretty, as it consists of multiple operations and uses several functions and loops:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I have managed to get the answer with a small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and everything runs smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the codes below:
library(tidyr)
#read data and convert Start and Finish to Date type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
I have two dataframes.
I would like to compute the average of the sp variable over the previous 5 days, where the reference date comes from a second dataframe.
For example, the mean for 1997-05-05 would be taken over the days 1997-05-01 to 1997-05-05, and the average between 1997-05-27 and 1997-05-31 would be calculated over only the days that have values (in this case 3).
Here are the variables:
sp <- c(7,9,9,4,2,5,2,9,NA,14,NA,NA,NA,NA,NA,14,25,NA,11,10,12,NA,9,NA,6,8,6,1,NA,7,NA)
Date <- c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10",
"1997-05-11","1997-05-12","1997-05-13","1997-05-14","1997-05-15",
"1997-05-16","1997-05-17","1997-05-18","1997-05-19","1997-05-20",
"1997-05-21","1997-05-22","1997-05-23","1997-05-24","1997-05-25",
"1997-05-26","1997-05-27","1997-05-28","1997-05-29","1997-05-30",
"1997-05-31")
data1 <- data.frame(sp, Date)
DateX <- c("1997-05-05","1997-05-15","1997-05-31")
data2 <- data.frame(DateX)
How can I best do this? Help would be much appreciated.
Here is my expected result (in the second dataframe, data2):
DateX spMean
1 1997-05-05 6.2
2 1997-05-15 NA
3 1997-05-31 4.6
I have made a few type changes to your initial code. Give the below a shot... I use lapply() to run a quick function against the data1 object, using the dates from your second object.
sp <- c(7,9,9,4,2,5,2,9,NA,14,NA,NA,NA,NA,NA,14,25,NA,11,10,12,NA,9,NA,6,8,6,1,NA,7,NA)
Date <- as.Date(c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10",
"1997-05-11","1997-05-12","1997-05-13","1997-05-14","1997-05-15",
"1997-05-16","1997-05-17","1997-05-18","1997-05-19","1997-05-20",
"1997-05-21","1997-05-22","1997-05-23","1997-05-24","1997-05-25",
"1997-05-26","1997-05-27","1997-05-28","1997-05-29","1997-05-30",
"1997-05-31"))
data1 <- data.frame(sp, Date)
DateX <- as.Date(c("1997-05-05","1997-05-15","1997-05-31"))
data2 <- data.frame(DateX)
#Add column for mean, NA values return NA
data2$spMean_na <- lapply(DateX,
function(m) mean(data1$sp[data1$Date >= m - 5 & data1$Date <= m]))
#Add column for mean, remove NA values
data2$spMean_na_omit <- lapply(DateX,
function(m) mean(data1$sp[data1$Date >= m - 5 & data1$Date <= m],
na.rm = TRUE))
> data2
DateX spMean_na spMean_na_omit
1 1997-05-05 6.2 6.2
2 1997-05-15 NA 14
3 1997-05-31 NA 5.5
I think you might need to change your expected result. Row 29 has an NA for the sp value and is within 5 days of 1997-05-31. So it should return an NA per your requirements as I understand them.
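For instance, listing the rows that the 1997-05-31 window covers shows that NA directly:
# rows in the window m - 5 to m for m = 1997-05-31; 1997-05-29 has sp = NA
data1[data1$Date >= as.Date("1997-05-26") & data1$Date <= as.Date("1997-05-31"), ]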
The code below calculates cohort-wise retention rates. The cohort is the month of joining, so the code calculates, for the customers who joined in (say) May 2015, how many are still active month on month. The final output is stored in the data frame df1 (shown below).
I need help creating dynamic column names, which are currently hard-coded in the ddply() call. M0 (the month of joining), M1 (1 month from joining), M2 (2 months from joining), up to M(n), should be generated dynamically. The required n can be calculated by subtracting the earliest joining date from the farthest expiry date.
Unfortunately, I'm not able to calculate the M0 to M(n) range dynamically.
Here is my code, which works but is not optimal because I have hard-coded M0 to M3 as variables in the ddply() call. So if my input data contains a customer with a subscription longer than the hard-coded columns cover, my code will fail.
The input to the code is the following dummy data.
customer dj exp
abc 01/05/15 25/06/15
efg 01/05/15 25/07/15
ghd 01/05/15 25/07/15
mkd 01/06/15 25/07/15
kskm 01/06/15 05/08/15
Reproducible code.
library(zoo)
library(plyr)
customer<-c("abc","efg","ghd","mkd","kskm")
dj<-c("2015-05-01", "2015-05-01", "2015-05-01","2015-06-01","2015-06-01")
exp<-c("2015-06-25", "2015-07-25", "2015-07-25","2015-07-01","2015-08-05")
df <- data.frame(customer,dj,exp)
# dates above are already in ISO format, so no format string is needed
df$dj <- as.Date(df$dj)
df$exp <- as.Date(df$exp)
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- ceiling((df$exp-df$dj)/30)
#df$MonthDiff <- 12*(as.yearmon(df$exp+months(1))-df$Cohort)
range <- as.integer(ceiling((max(df$exp)-min(df$dj))/30))
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0, 1, 2, and 3 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff > 0),
M1 = sum(MonthDiff > 1),
M2 = sum(MonthDiff > 2),
M3 = sum(MonthDiff > 3)
)
df1
df1
Cohort M0 M1 M2 M3
1 May 2015 3 3 2 0
2 Jun 2015 2 2 1 0
The above is the working output. The ask is to make columns M0 to M3 dynamic.
Try inserting this after creating range:
for(i in 0:range) df <- within(df,assign(paste0("M",i),MonthDiff>i))
df1 <- ddply(df,.(Cohort),function(x) colSums(x[,paste0("M",0:range)]))
df1
# Cohort M0 M1 M2 M3
# 1 May 2015 3 3 2 0
# 2 Jun 2015 2 1 1 0