I have a database of cash register transactions. The records are split by Products in a Basket:
Date Hour Cust Prod Basket Spend
1| 20160416 8 C1 P1 B2 10
2| 20160416 8 C1 P2 B2 20
3| 20160115 15 C1 P3 B1 30
4| 20160115 15 C1 P2 B1 50
5| 20161023 11 C1 P4 B3 60
I would like to see:
DaysSinceLastVisit Cust Basket Spend
NULL C1 B1 80
92 C1 B2 30
190 C1 B3 60
AND
AvgDaysBetweenVisits Cust AvgSpent
141 C1 56.67
I can't figure out how to perform aggregate functions on dates during a GROUP BY. All the other posts on SO seem to have two columns for start/end dates [1] [2] [3].
Here's what I have tried so far:
SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # DIFF/DIFFERENCE not a function
Also note that I'm running this in R with sqldf, which uses SQLite syntax. Even so, I'd prefer a pure SQLite solution.
Try this-
df <- data.frame("Date"=c("20160416","20160416","20160115","20160115","20161023"),
"Hour"=c(8,8,15,15,11), "Cust"=c("C1","C1","C1","C1","C1"),
"Prod"=c("P1","P2","P3","P2","P4"), "Basket"=c("B2","B2","B1","B1","B3"),
"Spend"=c(10,20,30,50,60))
df$Date <- as.Date(df$Date, format = "%Y%m%d")
# Aggregate the data first
df2 <- aggregate(Spend ~ Date + Cust + Basket, data = df, FUN = sum)
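# Now get days since last visit (NA for the first visit, which has no predecessor)
df2 <- df2[order(df2$Cust, df2$Date), ]
df2$Date <- c(NA, diff(df2$Date, 1))
# And finally, average per customer, ignoring the NA of the first visit
df3 <- aggregate(cbind(Date, Spend) ~ Cust, data = df2, FUN = mean,
na.rm = TRUE, na.action = na.pass)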
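If the data held several customers, the gaps would have to be computed within each customer rather than over the whole column. A minimal base-R sketch, assuming basket-level data with the same column names as above; customer C2 and its values are made up purely for illustration, and DaysSince, avg_days and avg_spend are names of my own:

```r
# Basket-level data; customer C2 is hypothetical, added only to show grouping
visits <- data.frame(
  Date   = as.Date(c("2016-01-15", "2016-04-16", "2016-10-23",
                     "2016-02-01", "2016-02-11")),
  Cust   = c("C1", "C1", "C1", "C2", "C2"),
  Basket = c("B1", "B2", "B3", "B4", "B5"),
  Spend  = c(80, 30, 60, 5, 15),
  stringsAsFactors = FALSE
)

# Sort so that each customer's visits are in date order
visits <- visits[order(visits$Cust, visits$Date), ]

# Days since the previous visit, computed within each customer;
# the first visit of a customer has no predecessor, hence NA
visits$DaysSince <- ave(as.numeric(visits$Date), visits$Cust,
                        FUN = function(d) c(NA, diff(d)))

# Per-customer averages, ignoring the NA of the first visit
avg_days  <- tapply(visits$DaysSince, visits$Cust, mean, na.rm = TRUE)
avg_spend <- tapply(visits$Spend, visits$Cust, mean)
```

ave() keeps the result aligned with the rows of visits, so the per-basket gaps and the per-customer averages come from the same column.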
Here days_since_last_visit is computed relative to today's date and time, as that is more practical. If you instead take the difference between the 1st and 2nd visits and between the 2nd and 3rd, you get 92 and 190, which matches your data. That part is best handled in a cursor; it can be done in a query too, but it will be a bit more complex.
select round( julianday('now') - min ( julianday (substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7) ) ) ,2 ) days_since_last_visit,
date, cust, basket, sum(spend) total_spend
from customer
group by cust, basket, date
Average number of days between each visit and today, per customer:
select round(avg( julian_days) ,2) average_days , cust, round(avg(total_spend) ,2) average_spent
from
( select julianday('now') - min ( julianday (substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7) ) ) julian_days, date,
cust, basket, sum(spend) total_spend
from customer
group by cust, basket, date )
group by cust
Create and insert scripts, for reference only:
create table customer ( date text , hour integer, cust text, prod text, basket text, spend integer );
insert into customer ( date, hour, cust, prod, basket, spend ) values ( '20161023', 11, 'C1', 'P4', 'B3', 60);
This uses SQLite via sqldf as in the question.
We first define three tables (only for the duration of the SQL statement) in the with clause:
aa is table a with an additional julian date column suitable for differencing
tab_days is a table that uses aa for defining the differenced days via an appropriately aggregated join
tab_sum_spend is a table that contains the Spend sums
Finally we join the last two and sort appropriately.
library(sqldf)
# see note at end for a in reproducible form
t1 <- sqldf("
WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
substr(Date, 5, 2) || '-' ||
substr(Date, 7, 2)) juldate,
*
FROM a),
tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
FROM aa a1
LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
GROUP BY a1.Cust, a1.Date, a1.Basket),
tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
FROM aa
GROUP BY Cust, Date, Basket)
SELECT Days, Cust, Basket, Spend
FROM tab_days
JOIN tab_sum_spend USING(Cust, Date, Basket)
ORDER BY Cust, Date, Basket
")
t1
## Days Cust Basket Spend
## 1 <NA> C1 B1 80
## 2 92.0 C1 B2 30
## 3 190.0 C1 B3 60
and for the second question:
sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1 GROUP BY Cust")
## AvgDays Cust AvgSpend
## 1 141 C1 56.66667
Note: The data.frame a in reproducible form is:
Lines <- "Date Hour Cust Prod Basket Spend
1 20160416 8 C1 P1 B2 10
2 20160416 8 C1 P2 B2 20
3 20160115 15 C1 P3 B1 30
4 20160115 15 C1 P2 B1 50
5 20161023 11 C1 P4 B3 60"
a <- read.table(text = Lines, as.is = TRUE)
I have a data.table with dates in it (as factor variables). I am getting the lag values from these. How can I tell R to get the lag values only for the observations with the second-to-last ("semi-last") date? In this case that would be start == "01.01.2015".
example data:
library(data.table)
ID <- rep("A5", 15)
product <- rep(c("prod1","prod2","prod3", "prod55", "prod4", "prod9", "prod83"),3)
start <- c(rep("01.01.2016", 3), rep("01.01.2015", 3), rep("01.01.2014",3),
rep("01.01.2013",3), rep("01.01.2012",3))
prodID <- c(3,1,2,3,1,2,3,1,2,3,2,1,3,1,2)
mydata <- cbind(ID, product[1:15], start, prodID)
mydata <- as.data.table(mydata)
mydata[, (nameCols) := shift(.SD, 3, fill = "NA", "lead"), .SDcols= c("start", "V2"), by = "prodID"]
For now I have used this to get to my results:
mydata[start == "01.01.2015"]
The problem is that the semi-last date is not always the same date. I will be repeating this procedure many times and I want to avoid having to specify it by hand. Any ideas?
Convert the start column to a date object, then sort the unique dates to select the semi-last one.
library(data.table)
mydata[, start := as.IDate(start, '%d.%m.%Y')]
mydata[start == sort(unique(start), decreasing = TRUE)[2]]
# ID V2 start prodID
#1: A5 prod55 2015-01-01 3
#2: A5 prod4 2015-01-01 1
#3: A5 prod9 2015-01-01 2
Below is an example of the problem I have. Specifically, a person ID can have multiple entries on consecutive dates, and I want to group those consecutive entries into 'episodes'.
I have a table like this:
ID DATE
A 11/16/2017
A 11/17/2017
A 11/18/2017
A 11/18/2017
B 11/12/2017
B 11/13/2017
B 11/14/2017
C 10/31/2017
C 10/31/2017
A 11/22/2017
A 11/22/2017
A 11/23/2017
And I would like the result table built from this sample to look like this:
ID StartDATE EndDATE
A 11/16/2017 11/18/2017
B 11/12/2017 11/14/2017
C 10/31/2017 10/31/2017
A 11/22/2017 11/23/2017
You could do the trick with a row number, something like this:
select ID, min(DATE), max(DATE) from
(
select *, datediff(day, RN, DATE) GRP
from (
select *, row_number () over (partition by ID order by DATE asc) as RN
from (
select distinct ID, DATE from Table1
) X
) Y
) Z
group by ID, GRP
This subtracts the row number (treated as a number of days) from the date. As long as that difference stays the same, the dates are consecutive, so grouping by it yields one row per run.
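The same row-number idea can be sketched in base R on the sample data. This is an illustration of the technique rather than a translation of the query; ep, key, runs and episodes are names of my own:

```r
# Sample data from the question
ep <- data.frame(
  ID   = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "A", "A", "A"),
  DATE = as.Date(c("11/16/2017", "11/17/2017", "11/18/2017", "11/18/2017",
                   "11/12/2017", "11/13/2017", "11/14/2017", "10/31/2017",
                   "10/31/2017", "11/22/2017", "11/22/2017", "11/23/2017"),
                 format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# Drop duplicate (ID, DATE) rows and sort by ID, then date
ep <- unique(ep)
ep <- ep[order(ep$ID, ep$DATE), ]

# Row number within each ID (1, 2, 3, ...)
ep$RN <- ave(as.numeric(ep$DATE), ep$ID, FUN = seq_along)

# Date minus row number is constant within a run of consecutive days
ep$key <- paste(ep$ID, as.numeric(ep$DATE) - ep$RN)

# One row per run: first and last date of the run
runs <- split(ep, ep$key)
episodes <- do.call(rbind, lapply(runs, function(r) {
  data.frame(ID = r$ID[1], StartDATE = min(r$DATE), EndDATE = max(r$DATE))
}))
```

Removing duplicates first matters: a repeated date would otherwise advance the row number without advancing the date and split a run in two.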
Here's a long-winded solution. First I detect runs of consecutive rows with the same ID and store the run number in the grp variable. Then I split the data by grp, apply the range function to the dates of each subset, and make it pretty by creating a data.frame that holds the ID and both dates. In the end, the do.call function just glues everything together.
xy <- read.table(text = "ID DATE
A 11/16/2017
A 11/17/2017
A 11/18/2017
A 11/18/2017
B 11/12/2017
B 11/13/2017
B 11/14/2017
C 10/31/2017
C 10/31/2017
A 11/22/2017
A 11/22/2017
A 11/23/2017", header = TRUE)
xy$DATE <- as.Date(xy$DATE, format = "%m/%d/%Y")
xy$grp <- cumsum(c(1, diff(as.numeric(as.factor(xy$ID))) != 0))
split.by.id <- split(xy, f = xy$grp)
run.by.id <- lapply(split.by.id, FUN = function(x) {
rng <- range(x$DATE)
data.frame(ID = unique(x$ID), StartDate = rng[1], EndDate = rng[2])
})
range.by.id <- do.call(rbind, run.by.id)
range.by.id
ID StartDate EndDate
1 A 2017-11-16 2017-11-18
2 B 2017-11-12 2017-11-14
3 C 2017-10-31 2017-10-31
4 A 2017-11-22 2017-11-23
I'm new to R and have gone through several tutorials, both free online ones and paid ones, but I'm still struggling to implement my requirement. I would like to build a retention cohort in R. I currently do this in Excel, which takes me 4-5 hours every time, so I'm exploring whether R can help. It seems so, but I need some direction.
Below is the output in Excel. For example, in May 2011 31 customers joined, and the row shows their month-by-month progression up to the current month.
[Image: output in Excel]
This is the input variable
The initial columns are the customer ID, the date of joining (dj), and the date of expiry (de). Two additional columns convert dj and de to text. Columns K onwards process the data to check whether the customer is active in a given month or has churned: I use the date of joining in text form to indicate active, and append "C" to it to denote expiry. Later I just count the columns with the date to get the cohort.
So, how do I implement this in R?
Assuming this is the sample data and I need the cohort from 1 May 2015 to 1 Jan 2016
customer dj exp
abc 01/05/15 25/6/15
efg 01/05/15 25/7/15
ghd 01/05/15 25/7/15
mkd 01/06/15 25/7/15
kskm 01/06/15 05/8/15
This is what I would like to create from the above data.
Cohort M0 M1 M2 M3 M4
2015-05 3 3 2 0 0
2015-06 2 2 1 0 0
Explanation: M0 is the month of joining. Three people joined in May 2015 and all of them are active in May. M1 is June, and all of them are active in June. We lose one customer on 25 June, so he still counts as active in June, but in M2 (July) the count drops from 3 to 2. M3 for the May cohort corresponds to August, by which time we have lost all of these customers.
The 2015-06 cohort works the same way: M1 is July and M3 is September.
I modified your code as below and it works, thanks! I'm now trying to figure out a way to make M0 to M(n) dynamic.
library(readxl)
library(zoo)
library(plyr)
# Read in the data
df <- read.csv("~/Desktop/R/data.csv")
df$dj <- as.Date(df$dj,"%d/%m/%y")
df$exp <- as.Date(df$exp,"%d/%m/%y")
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- ceiling((df$exp-df$dj)/30)
#df$MonthDiff <- 12*(as.yearmon(df$exp+months(1))-df$Cohort)
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0, 1, 2, 3, and 4 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff > 0),
M1 = sum(MonthDiff > 1),
M2 = sum(MonthDiff > 2),
M3 = sum(MonthDiff > 3),
M4 = sum(MonthDiff > 4),
M5 = sum(MonthDiff > 5)
)
df1
Cohort M0 M1 M2 M3 M4 M5
1 May 2015 3 3 2 0 0 0
2 Jun 2015 2 2 1 0 0 0
Try this:
library(readxl)
library(zoo)
library(plyr)
# Read in the data
df <- read_excel("MyFile.xlsx")
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- 12*(as.yearmon(df$exp)-df$Cohort)
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0, 1, 2, 3, and 4 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff >= 0),
M1 = sum(MonthDiff >= 1),
M2 = sum(MonthDiff >= 2),
M3 = sum(MonthDiff >= 3),
M4 = sum(MonthDiff >= 4))
df1
# Cohort M0 M1 M2 M3 M4
# May 2015 3 3 2 0 0
# Jun 2015 2 1 0 0 0
This assumes that when you read the data in from Excel, the dates are formatted as dates. In case they are not you could use the following:
df$dj <- as.Date(df$dj,"%d/%m/%y")
df$exp <- as.Date(df$exp,"%d/%m/%y")
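One caveat with the MonthDiff line above: as far as I know, zoo stores yearmon values as fractional years (year + (month - 1)/12), so 12 times a difference is a floating-point number that may land just below a whole month, and a comparison such as MonthDiff >= 1 can then miss a customer. A hedged sketch of an integer-based alternative in base R; month_diff is a hypothetical helper of my own, not part of zoo or plyr:

```r
# Whole calendar months from `from` to `to`, using integer arithmetic only.
# month_diff is a hypothetical helper name, not part of zoo or plyr.
month_diff <- function(from, to) {
  f <- as.POSIXlt(from)
  t <- as.POSIXlt(to)
  12L * (t$year - f$year) + (t$mon - f$mon)
}

# Sample data from the question
dj  <- as.Date(c("01/05/15", "01/05/15", "01/05/15", "01/06/15", "01/06/15"),
               "%d/%m/%y")
exp <- as.Date(c("25/06/15", "25/07/15", "25/07/15", "25/07/15", "05/08/15"),
               "%d/%m/%y")

md <- month_diff(dj, exp)   # 1 2 2 1 2

# Customers still active at month offset 1, per comparison on exact integers
active_m1 <- sum(md >= 1)
```

Wrapping the original expression in round() would be another way to get whole numbers while keeping the yearmon approach.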
The above code calculates cohort-wise retention. The cohort is the month of joining, so for the customers who joined in, say, May 2015 the code calculates how many are still active month on month. The final output is stored in data frame df1 (shown below).
I need help creating dynamic column names, which are currently hard-coded in the ddply function. M0 means the month of joining, M1 the 1st month after joining, M2 the 2nd month after joining, and so on up to M(n); these should be generated rather than typed out. n can be calculated as the span between the earliest joining date and the latest expiry date.
Unfortunately, I'm not able to calculate the M0 to M(n) range dynamically.
Here is my code dump, which works but is not optimal because I have hard-coded M0 to M3 in the ddply function. So if my input data has a customer with a subscription period longer than the hard-coded range, my code will fail.
The input to the code is the following dummy data.
customer dj exp
abc 01/05/15 25/06/15
efg 01/05/15 25/07/15
ghd 01/05/15 25/07/15
mkd 01/06/15 25/07/15
kskm 01/06/15 05/08/15
Reproducible code.
library(zoo)
library(plyr)
customer<-c("abc","efg","ghd","mkd","kskm")
dj<-c("2015-05-01", "2015-05-01", "2015-05-01","2015-06-01","2015-06-01")
exp<-c("2015-06-25", "2015-07-25", "2015-07-25","2015-07-01","2015-08-05")
df <- data.frame(customer,dj,exp)
# The dates above are already in ISO format (YYYY-MM-DD)
df$dj <- as.Date(df$dj)
df$exp <- as.Date(df$exp)
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- ceiling((df$exp-df$dj)/30)
#df$MonthDiff <- 12*(as.yearmon(df$exp+months(1))-df$Cohort)
range<-as.integer(ceiling((max(df$exp)-min(df$dj)))/30)
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0, 1, 2, and 3 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff > 0),
M1 = sum(MonthDiff > 1),
M2 = sum(MonthDiff > 2),
M3 = sum(MonthDiff > 3)
)
df1
Cohort M0 M1 M2 M3
1 May 2015 3 3 2 0
2 Jun 2015 2 2 1 0
The above is the working output. The ask is to make columns M0 to M3 dynamic.
Try inserting this after creating range:
for(i in 0:range) df <- within(df,assign(paste0("M",i),MonthDiff>i))
df1 <- ddply(df,.(Cohort),function(x) colSums(x[,paste0("M",0:range)]))
df1
# Cohort M0 M1 M2 M3
# 1 May 2015 3 3 2 0
# 2 Jun 2015 2 1 1 0
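A base-R alternative that avoids the loop can be sketched as follows. It assumes MonthDiff is a plain numeric vector; the values below are the ones the ceiling(days / 30) rule gives for the reproducible data in the question, and cohort, active, counts and n_max are names of my own:

```r
# Cohort label and month difference per customer
cohort     <- c("2015-05", "2015-05", "2015-05", "2015-06", "2015-06")
month_diff <- c(2, 3, 3, 1, 3)   # ceiling(days / 30) per customer

# Number of month columns, derived from the data instead of hard-coded
n_max <- max(month_diff)

# Logical matrix: one row per customer, one column per month offset 0..n_max
active <- outer(month_diff, 0:n_max, ">")
colnames(active) <- paste0("M", 0:n_max)

# Sum the TRUEs within each cohort (adding 0L turns the logicals into integers)
counts <- rowsum(active + 0L, group = cohort)
```

Because the column names are built with paste0(), the table widens automatically when a customer with a longer subscription appears.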
Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one to many relationship but, here's the twist, once the records are matched by keys they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: dataframe B with outcomes merged in from A. Notice observations 2 and 4 were dropped because although they matched IDs on records in A they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in SQLite on the character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
EDIT:
A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.
EDIT:
Simplified/improved final sqldf statement.
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
The output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does that make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.
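A minimal sketch of that merge-then-subset approach on the sample data, with inclusive bounds. to_sec is a hypothetical helper of my own that converts HH:MM:SS strings to seconds since midnight; m, keep and res are likewise illustrative names:

```r
A <- data.frame(
  ID        = c("01", "02", "01", "02"),
  StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
  Duration  = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
  Outcome   = c("Normal", "Weird", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  OBS  = 1:4,
  ID   = c("01", "01", "02", "01"),
  Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM:SS" -> seconds since midnight
to_sec <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  as.numeric(parts[, 1]) * 3600 + as.numeric(parts[, 2]) * 60 +
    as.numeric(parts[, 3])
}

# One-to-many join on ID, then keep rows whose Time falls inside the interval
m     <- merge(B, A, by = "ID")
start <- to_sec(m$StartTime)
keep  <- to_sec(m$Time) >= start & to_sec(m$Time) <= start + to_sec(m$Duration)
res   <- m[keep, c("OBS", "ID", "Time", "Outcome")]
res   <- res[order(res$OBS), ]
```

The merge produces every B row paired with every A row of the same ID, so on large inputs the intermediate table can grow quickly; the SQL join with a where clause shown earlier expresses the same filter.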