Churn Rate in R - r

I have a data set named df_a running in millions. I want to calculate the churn rate and group into months.
On the sample data I ran the code to prepare my data.
The logic is to find the minimum month(acquired month)
find the last month based on the records
find the difference in months and group the difference in months
The code below
df_a<-data.table(df)
df_a[,"min_date" := min(yw), by=c("CUSTOMER_DIMENSION_ID")]
df_a[,"max_date" := max(yw), by=c("CUSTOMER_DIMENSION_ID")]
df_a$min_date_m<-anydate(df_a$min_date)
df_a$max_date_m<-anydate(df_a$max_date)
df_a$diff_days <- df_a$max_date_m - df_a$min_date_m
df_a$difference <- as.numeric(df_a$diff_days) /(365.25/12)
df_a$Month_Bucket<-ifelse((df_a$difference>=0 & df_a$difference<3),"3",
ifelse((df_a$difference>=3 & df_a$difference<6),"3-6",
ifelse((df_a$difference>=6 & df_a$difference<9),"6-9",
ifelse((df_a$difference>=9 & df_a$difference<12),"9-12",
ifelse((df_a$difference>=12 & df_a$difference<24),"12-24",
"24+")))))
data_a <- df_a[c(1,1:nrow(df_a)),]
setDT(data_a)
xxx<-(cohorts <-dcast(unique(data_a)[,cohort:=min(yw),by=CUSTOMER_DIMENSION_ID],cohort~Month_Bucket))
I am getting the output in the following format
Month 3
2020-08 92876
2020-07 144873
However the output is not correct
What I want is
Month no of unique customers acquired 0-3 3-6 6-9
2019-08 85749
2019-07 128060
The output basically is summing up the customers across months and assigning a bucket. However if I acquire 85749 customers in 2019-08 i will have lets say 25k customers in 0-3 25k again in 3-6 months

One here could do :
data_unique <- unique(data_a)
ccc <- ( cohorts <- dcast( data_unique[ ,
cohort := min(yw),
by=CUSTOMER_DIMENSION_ID],
cohort ~ Month_Bucket,
value.var = "CUSTOMER_DIMENSION_ID",
function(x) { length(unique(x) } ) )
)

Related

data.table aggregation based on multiple criteria

I am trying to calculate how many pid within a set fid's have a yob smaller than person's yob. The second question is about unique pid. Updating the question based on efforts #langtang and my own reflections:
#Libraries:
library(data.table)
library(tictoc)
#Make it replicable:
set.seed(1)
#Define parameters of the simulation:
pid<-1:1000
fid<-1:5
time_periods<-1:12
yob<-sample(seq(1900,2010),length(pid),replace = TRUE)
#Obtain in how many firms a given pid works in a givem month:
nr_firms_pid_time<-sample(1:length(fid),length(pid),replace = TRUE)
#This means:
#First pid: works in first firm;
#Second pid: works in first four firms;
#Third pid: works in first firm;
#Fourth pid: works in two firms.
#Aux functions:
function_rep<-function(x){
rep(1:12,x)
}
function_seq<-function(x){
1:x
}
#Create panel
data_panel<-data.table(pid = rep(pid,nr_firms_pid_time*length(time_periods)))
data_panel[,yearmonth:=do.call(c,sapply(nr_firms_pid_time,function_rep))]
data_panel[,fid:=rep(do.call(c,sapply(nr_firms_pid_time,function_seq)),each = 12)]
#Merge in yob:
data_yob<-data.table(pid = pid,yob = yob)
data_panel<-merge(data_panel,data_yob,by = c("pid"),all.x = TRUE)
#Remove not needed stuff:
rm(pid)
rm(fid)
rm(time_periods)
rm(yob)
rm(data_yob)
#Solution 1 (terribly slow):
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(id,yrmonth) {
#First obtain firms in which a worker works in a given month:
id_firms<-data_panel[pid==id&yearmonth==yrmonth,fid]
#Then extract data at a given month:
data_func<-data_panel[(fid %in% id_firms)&(yearmonth==yrmonth)]
#Then extract his dob:
dob_to_use<-unique(data_func[pid==id,yob])
sum(data_func[pid!=id]$yob<dob_to_use)
}
older_coworkers_unique = function(id,yrmonth) {
#First obtain firms in which a worker works in a given month:
id_firms<-data_panel[pid==id&yearmonth==yrmonth,fid]
#Then extract data at a given month:
data_func<-data_panel[(fid %in% id_firms)&(yearmonth==yrmonth)]
#Then extract his dob:
dob_to_use<-unique(data_func[pid==id,yob])
#Get UNIQUE number of coworkers:
sum(unique(data_func[pid!=id],by = c("pid"))$yob<dob_to_use)
}
#Works but is terrible slow:
tic()
sol_1<-data_panel[, .(older_coworkers(.BY$pid,.BY$yearmonth)),by = c("pid","yearmonth")]
toc()
#Solution 2 (better but do not like it, what if I want unique older coworkers)
function_older<-function(x){
noc<-lapply(
1:length(x),
function(i){
sum(x[-i]<x[i])
}
)
unlist(noc)
}
#This is fast but I cannot get unique number:
tic()
sol_2<-data_panel[,.(pid,function_older(yob)),by = c("fid","yearmonth")][,sum(V2),by = c("pid","yearmonth")][order(pid,yearmonth)]
toc()
#Everything works:
identical(sol_1,sol_2)
The question is how to implement older_coworkers_unique in a very fast manner. Any suggestions would be greatly appreciated.
Update, based on OP's new reproducible dataset
If you want a one-liner to reproduce sol_2 above, you can do this:
data_panel[data_panel, on=.(yearmonth, fid, yob<yob )][, .N, by=.(i.pid, yearmonth)]
Explanation:
The above is using a non-equi join, which can be a helpful approach when using data.table. I am joining data_panel on itself, requiring that yearmonth and fid be equal, but that year of birth (left side of join) is less than year of birth (right side of join). This will return a data.table where firms and yearmonth matches, but where every older coworker (pid) is matched to their younger coworkers (i.pid). We can thus count the rows (.N) by each younger coworker (i.pid) and yearmonth. This produces the same as sol_1 and sol_2 above. You commented that you would like to find the unique coworkers, and so the second approach below does that, by using len(unique(pid)) as below, in Option 2.
The same non-equi join approach can be used to get unique older coworkers, like this:
data_panel[data_panel, on=.(yearmonth, fid, yob<yob )] %>%
.[, .(older_coworkers = length(unique(pid))), by=.(i.pid, yearmonth)]
Previous Response, based on OP's original very small example dataset
I'm not sure exactly what you want the output to look like. However in your example data, I first drop the duplicate row (because I couldn't understand why it was there (see my comment above)), and then I apply a function that counts that number of older coworkers for each pid/fid/ym.
# make your example data unique
data=unique(data)
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(birth,firm,yrmonth,id) {
data[dob<birth & fid==firm & ym==yrmonth & pid!=id,.N]
}
# apply the function to the data
data[, .(num_older_coworkers = older_coworkers(dob,.BY$fid, .BY$ym, .BY$pid)), by=.(pid,fid,ym)]
Output:
pid fid ym num_older_coworkers
1: 1 1 200801 1
2: 1 2 200802 0
3: 2 1 200801 0
4: 3 2 200801 0
Person 1 at Firm 1 has one older coworker in the month of 2008-01 -- that is, Person 2 at Firm 1 in 2008-01.
Person 1 at Firm 2 (born in 1950) would also have an older coworker, namely, Person 3 at Firm 2 (born in 1930), but the result shows 0, because Person 1 at Firm 2 ym (i.e. 2008-01) does not match with that potential older coworker's ym (i.e. 2008-02).

r data.table filter based on count of rows satisfying a condition

I am learning data.table and got confused at one place. Need help to understand how the below can be achieved. The data I am having, I need to filter out those brands which have sales of 0 in the 1st period OR do not have sales > 0 in atleast 14 periods. I have tried and I think I have achieved the 1st part....however not able to get how I can get the second part of filtering those brands which do not have sales > 0 in atleast 14 periods.
Below is my sample data and code that I have written. Please suggest how I can I achieve the second part?
library(data.table)
#### set the seed value
set.seed(9901)
#### create the sample variables for creating the data
group <- sample(1:7,1200,replace = T)
brn <- sample(1:10,1200,replace = T)
period <- rep(101:116,75)
sales <- sample(0:50,1200,replace = T)
#### create the data.table
df1 <- data.table(cbind(group,brn,period,sales))
#### taking the minimum value by group x brand x period
df1_min <- df1[,.(min1 = min(sales,na.rm = T)),by = c('group','brn','period')][order(group,brn,period)]
#### creating the filter
df1_min$fil1 <- ifelse(df1_min$period == 101 & df1_min$min1 == 0,1,0)
Thank you !!
Assuming that the first restriction applies on the dataset wide minimum period (101), implying that brn/group pairs starting with a 0-sales period greater than 101 are still included.
# 1. brn/group pairs with sales of 0 in the 1st period.
brngroup_zerosales101 = df1[sales == 0 & period == min(period), .(brn, group)]
# 2a. Identify brn/group pairs with <14 positive sale periods
df1[, posSale := ifelse(sales > 0, 1, 0)] # Was the period sale positive?
# 2b. For each brn/group pair, sum posSale and filter posSale < 14
brngroup_sub14 = df1[, .(GroupBrnPosSales = sum(posSale)), by = .(brn, group)][GroupBrnPosSales < 14, .(brn, group)]
# 3. Join the two restrictions
restr = rbindlist(list(brngroup_zerosales101, brngroup_sub14))
df1[, ID := paste(brn, group)] # Create a brn-group ID
restr[, ID := paste(brn, group)] # See above
filtered = df1[!(ID %in% restr[,ID]),]

Summation over nearby observations

I have a large data.frame which includes the price of goods and the quantity that are sold with each price. I like to find the total quantity of goods that is sold with a price similar (within a range) to price of each row. For example for the jth observation (row) I like to find the sum of quantity of goods that are sold with price lower than Price_j+50 and higher than Price_j-50, and similarly for other observations.
I can run a for loop over observations and filter the data for each observation's price.
df<-data.frame(Price = runif(100)*100 , Q = runif(100)*1000)
SumQ = data.frame()
for (i in c(1:nrow(df))){
df_filterd <- df %>% filter(Price < Price[i]+50 & Price > Price[i]-50)%>% summarize(sumQ = sum(Q))
SumQ<-rbind(SumQ, df_filterd$sumQ)
}
Is there a more efficient way to do this? I have a large dataset and it takes a lot of time to run the for loop over all observations.
You want to avoid looping and binding the results - this will be very slow. Instead, try:
with(df, sapply(Price, function(x) sum(Q[Price < x+50 & Price > x-50])))
Or with dplyr and purrr you could do
df %>% mutate(sumQ = map_dbl(Price,
~sum(Q[Price < .+50 & Price > .-50])))
Price Q sumQ
1 5.2272345 284.433416 28356.80
2 17.7292069 454.122990 35459.90
3 9.7329295 509.266254 29989.69
4 68.1042808 131.169813 41230.23
5 38.5612268 938.653962 45227.63
6 44.5808938 774.296761 47758.30
...

R counts events between time range if meet specific word

I have a difficult question I was hoping you friendly folks can help me with.
I have a dataset (df1) that includes variable company id, day, day-3,day+3 ,
Each id represents a company ticker symbol.
dataset (news) includes variable company id, day,newstitle,weight
I want to create two variables event1 and event2 in dataset(df1)
event1 =it counts events of newstile from dataset(news) between the time range[df1$range-3,df2$range+3] with news$newstitle contains "order" for each df$id
event2 = it counts events of newstile between the time range[df1$range-3,df2$range+3] with news$newstitle contains "dividend" for each df#id
I have rewrite my question for better understanding. I really appreciate your help.
Best.
#this would be desired result with new vars event1,event2
out <- read.table(text="
id,date,date_bef3,date_aft3,event1,event2
1605,1992/12/15,1992/12/12,1992/12/18,0,0
1705,1992/12/30,1992/12/27,1993/1/2,1,1
3412,1992/12/31,1992/12/28,1993/1/3,0,0
9921,1993/1/7,1993/1/4,1993/1/10,0,0
2314,1993/1/18,1993/1/15,1993/1/21,1,0",
header=T,sep=",")
#this is index data
df1<-out[,-c(5,6)]
#this is simulated news source data
news<- read.table(text="
id,date,newstitle,weight
2543,1992/12/30,new order,1
1705,1992/12/29,dividend payment,1
1705,1993/1/1,new order,1
9921,1993/1/1,new product,1
2314,1993/1/16,new order,1",
header=T,sep=",")
it is like the sql code like that:
proc sql;select *
sum (case when news$newstitle="order" ) as event1,
sum (case when news$newstitle="dividend" ) as event2,
from df1,new
where news$date between df$date_bf3 and df$date_af3 ,
news$id=df$id
my initial procs:
1: inner merge df1 and news by "id"
2: set dummy event1=1 if
newdate between(date_bef3,date_aft3) and
newstitle contain "order"
set dummy event2=1 if
newdate between(date_bef3,date_aft3) and
newstitle contain "dividend"
3: collapse (sum)event1 event2 by(id,date)
setDT(df1,key="id")
setDT(news,key="id")
%inner merge
df<-df1[news,]
%set dummy event1, something wrong here
df[newstitle=="order",event1:=as.numeric(between(newsdate,date_bef3,date_aft3))]
Based on your comment below, I believe that what you want is:
# Make dates actual dates instead of factors
library(lubridate)
df1$date_bef3 <- ymd(df1$date_bef3)
df1$date_aft3 <- ymd(df1$date_aft3)
news$date <- ymd(news$date)
event1 = sum(news$newstitle[news$id == df1$id & df1$date_bef3 <= news$date & df1$date_aft3 >= news$date]=="new order")
event2 = sum(news$newstitle[news$id == df1$id & df1$date_bef3 <= news$date & df1$date_aft3 >= news$date]=="dividend payment")

In R, iterating over two datasets and getting back results without looping

I have two data sets, Transaction_long, and Transaction_short. Transaction_long has many quotes of policy and price with a purchase point (denoted by true) in the dataset. Transaction_short has only entries of the purchase points.
My objective is to add a column in the Transaction_short dataset called Policy_Change_Frequency. For every customer in the short dataset, iterate over the rows for that customer in the long dataset and calculate how many time the policy changed.
To find the policy change I can use sum(diff(Transaction_Long$policy)!=0) but not sure how to iterate over these two data sets and get results
Details:
Customer_Name : name of customer
Customer_ID: Customer Identifier number
Purchase: Boolean variable (Yes-1,No-0)
Policy: Categorical (takes values 1-5)
Price : Price quoted
Data set1-Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2:Transaction_Short
Customer_Name , Customer_ID,Purchased,Policy, Price
Joe,101,1,3,501
Mary,103,1,2,627
Need to add a Policy Change Frequency column in the Transaction Short Dataset, so my final Transcation short Dataset will look like
Final Dataset should look like this
Customer_Name , Customer_ID,Purchased, Policy, Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2
Consider a calculated column for policy change which tags changes from previous row within each customer with one. Then, aggregates the ones for a count. Merge is used due to two aggregations needed (final row for each customer and PolicyChanged count):
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long),
function(i)
if (i > 1) {
ifelse(Transaction_Long$Policy[i-1]==
Transaction_Long$Policy[i], 0,
ifelse(Transaction_Long$Customer_ID[i-1] !=
Transaction_Long$Customer_ID[i], 0, 1))
} else { 0 }
)
Transaction_Final <- merge(aggregate(.~ Customer_ID + Customer_Name,
Transaction_Long[,c(1:5)], FUN = tail, n = 1),
aggregate(.~ Customer_ID + Customer_Name,
Transaction_Long[,c(1:2,6)], FUN = sum),
by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
# Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
#1 101 Joe 1 3 501 3
#2 103 Mary 1 2 627 2
#Parfait. Thank you for the solution. i solved this using the sqldf package in R
for (i in 1:nrow(Transaction_short)){
sql <- sprintf("SELECT policy from Transaction_long where customer_ID = %s",ML_Train_short$customer_ID[i])
df<- sqldf(sql)
NF <- sum(df$policy[-1]!= df$policy[-length(df$policy)])
ML_Train_short$Policy_Change_Freq[i] <- NF
}
Since i have about 500K rows in the long dataset and about 100K in the short dataset..this is taking a while..is there any other solution that does not require loops? Thank you

Resources