Calculating the average difference in purchase dates by customer id - r

Was wondering how I would use R to calculate the below.
Assuming a CSV with the following purchase data:
| Customer ID | Purchase Date |
| 1 | 01/01/2017 |
| 2 | 01/01/2017 |
| 3 | 01/01/2017 |
| 4 | 01/01/2017 |
| 1 | 02/01/2017 |
| 2 | 03/01/2017 |
| 2 | 07/01/2017 |
I want to figure out the average time between repurchases by customer.
The math would be like the one below:
| Customer ID | AVG repurchase |
| 1 | 30 days | = (02/01 - 01/01) / 1 order
| 2 | 90 days | = ( (03/01 - 01/01) + (07/01 - 03/01) ) / 2 orders
| 3 | n/a |
| 4 | n/a |
The output would be the total average across customers -- so: 60 days = (30 avg for customer1 + 90 avg for customer2) / 2 customers.

I've assumed you have read your CSV into a dataframe named df and I've renamed your variables using snake case, since having variables with a space in the name can be inconvenient, leading many to use either snake case or camel case variable naming conventions.
Here is a base R solution:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
[1] 60.75
You may notice that we get 60.75 rather than 60 as you expected. This is because there are 31 days between customer 1's purchases (31 days in January until February 1), and similarly for customer 2's purchases -- there are not always 30 days in a month.
Explanation
by(df$purchase_date, df$customer_id, diff)
The by() function applies another function to data by groupings. Here, we are applying diff() to df$purchase_date by the unique values of df$customer_id. By itself, this would result in the following output:
df$customer_id: 1
Time difference of 31 days
-----------------------------------------------------------
df$customer_id: 2
Time differences in days
[1] 59 122
We then use
sapply(by(df$purchase_date, df$customer_id, diff), mean)
to apply mean() to the elements of the previous result. This gives us each customer's average time to repurchase:
1 2 3 4
31.0 90.5 NaN NaN
(we see customers 3 and 4 never repurchased). Finally, we need to average these average repurchase times, which means we need to also deal with those NaN values, so we use:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
which averages the previous results, ignoring missing values (which, in R, include NaN values).
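The one-liner only works if purchase_date is already a Date and each customer's rows are in chronological order, and neither is guaranteed straight after reading a CSV. Here is a minimal sketch of the whole pipeline, assuming the dates are month/day/year strings. It uses split() with vapply() in place of by(), which is my substitution rather than the code above:

```r
# Sample data from the question; column names are the snake-case
# renames assumed in this answer.
df <- data.frame(
  customer_id   = c(1, 2, 3, 4, 1, 2, 2),
  purchase_date = c("01/01/2017", "01/01/2017", "01/01/2017", "01/01/2017",
                    "02/01/2017", "03/01/2017", "07/01/2017")
)

# Assumption: the strings are month/day/year.
df$purchase_date <- as.Date(df$purchase_date, format = "%m/%d/%Y")

# diff() depends on row order, so sort within customer first.
df <- df[order(df$customer_id, df$purchase_date), ]

# Gaps (in days) between consecutive purchases, one vector per customer.
gaps <- lapply(split(df$purchase_date, df$customer_id),
               function(d) as.numeric(diff(d)))

# Mean gap per customer; customers with a single purchase give NaN.
per_customer <- vapply(gaps, mean, numeric(1))

# Average of the per-customer averages, skipping the NaN entries.
overall <- mean(per_customer, na.rm = TRUE)
overall
# [1] 60.75
```

Sorting before diff() matters: if a later purchase appears above an earlier one in the file, the gap comes out negative and silently skews the average.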

Here's another solution with dplyr + lubridate:
library(dplyr)
library(lubridate)
df %>%
mutate(Purchase_Date = mdy(Purchase_Date)) %>%
group_by(Customer_ID) %>%
summarize(AVG_Repurchase = sum(difftime(Purchase_Date,
lag(Purchase_Date), units = "days"),
na.rm=TRUE)/(n()-1))
or with data.table:
library(data.table)
setDT(df)[, Purchase_Date := mdy(Purchase_Date)]
df[, .(AVG_Repurchase = sum(difftime(Purchase_Date,
shift(Purchase_Date), units = "days"),
na.rm=TRUE)/(.N-1)), by = "Customer_ID"]
Result (dplyr):
# A tibble: 4 x 2
Customer_ID AVG_Repurchase
<dbl> <time>
1 1 31.0 days
2 2 90.5 days
3 3 NaN days
4 4 NaN days
Result (data.table):
Customer_ID AVG_Repurchase
1: 1 31.0 days
2: 2 90.5 days
3: 3 NaN days
4: 4 NaN days
Note:
I first converted Purchase_Date from a month/day/year string to a Date with mdy(), then grouped by Customer_ID. Finally, for each Customer_ID, I calculated the mean of the differences in days between Purchase_Date and its lag.
Data:
df = structure(list(Customer_ID = c(1, 2, 3, 4, 1, 2, 2), Purchase_Date = c(" 01/01/2017",
" 01/01/2017", " 01/01/2017", " 01/01/2017", " 02/01/2017", " 03/01/2017",
" 07/01/2017")), .Names = c("Customer_ID", "Purchase_Date"), class = "data.frame", row.names = c(NA,
-7L))
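Both pipelines stop at the per-customer averages, while the question asks for a single figure across customers. One way to finish is sketched below with dplyr only. Here as.Date() stands in for mdy() and diff() for the lag()/shift() arithmetic; both are my substitutions, not part of the answer above:

```r
library(dplyr)

# Same data as above, including the leading spaces in the date strings.
df <- data.frame(
  Customer_ID   = c(1, 2, 3, 4, 1, 2, 2),
  Purchase_Date = c(" 01/01/2017", " 01/01/2017", " 01/01/2017", " 01/01/2017",
                    " 02/01/2017", " 03/01/2017", " 07/01/2017")
)

per_customer <- df %>%
  # Assumption: month/day/year strings; trimws() drops the stray spaces.
  mutate(Purchase_Date = as.Date(trimws(Purchase_Date), format = "%m/%d/%Y")) %>%
  arrange(Customer_ID, Purchase_Date) %>%
  group_by(Customer_ID) %>%
  # Mean gap in days; single-purchase customers yield NaN.
  summarize(AVG_Repurchase = mean(as.numeric(diff(Purchase_Date))))

# Average of the per-customer averages, ignoring customers who never repurchased.
overall <- mean(per_customer$AVG_Repurchase, na.rm = TRUE)
overall
# [1] 60.75
```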

Related

Deleting rows based on a calculated criteria in R

I conducted an analysis for some M&A-Deals. My current output looks like this:
Deal-Nr | Event-Date | Target-Nation | CAR | SIC
----------------------------------------------------
1 | 01-01-1999 | Italy | 5.1% | 201
2 | 02-01-1999 | Germany | 2.3% | 202
3 | 06-01-1999 | Spain | 1.5% | 201
4 | 10-09-1999 | Germany | 0.3% | 201
5 | 15-09-1999 | UK | 1.1% | 201
6 | 25-10-2000 | Spain | 0.8% | 201
However, for my final analysis I want to exclude all deals within the same SIC-Code, which do not have at least 180 trading days between them. So in this case, I would want to exclude my deal 3 from the analysis (as they have the same SIC-code and do not have 180 days between them). Then the code should continue and check the next deal within that SIC-Code industry and remove (<180 days) or keep it (>180 days). This should be done for all the different SIC codes in my analysis.
As I'm rather new in R, I'm reaching out for help. Thank you so much for your support.
Edit:
As indicated below I provide some further information. I'm interested in the deals that are in the same SIC-Code and >180 days apart. This would mean in the table to remove row (3) and row (5). If one deal is more than 180 days apart the subsequent dates should be checked.
First, your Event.Date column needs to be a real date, not a string. I'm inferring day-month-year (values such as 15-09-1999 cannot be month-first). From there, we need to group by SIC and calculate the difference in dates.
base R
dat$Event.Date <- as.Date(dat$Event.Date, format = "%d-%m-%Y")
keep <- ave(as.numeric(dat$Event.Date), dat$SIC, FUN = function(z) c(TRUE, diff(z) >= 180)) > 0
dat[keep,]
# Deal.Nr Event.Date Target.Nation CAR SIC
# 1 1 1999-01-01 Italy 5.1% 201
# 2 2 1999-01-02 Germany 2.3% 202
# 4 4 1999-09-10 Germany 0.3% 201
# 6 6 2000-10-25 Spain 0.8% 201
dplyr
library(dplyr)
dat %>%
  mutate(Event.Date = as.Date(Event.Date, format = "%d-%m-%Y")) %>%
  group_by(SIC) %>%
  filter(c(TRUE, diff(Event.Date) >= 180)) %>%
  ungroup()
# # A tibble: 4 x 5
# Deal.Nr Event.Date Target.Nation CAR SIC
# <int> <date> <chr> <chr> <int>
# 1 1 1999-01-01 Italy 5.1% 201
# 2 2 1999-01-02 Germany 2.3% 202
# 3 4 1999-09-10 Germany 0.3% 201
# 4 6 2000-10-25 Spain 0.8% 201
data.table
library(data.table)
as.data.table(dat)[, Event.Date := as.Date(Event.Date, format = "%d-%m-%Y")
  ][, .SD[c(TRUE, diff(Event.Date) >= 180),], by = .(SIC)]
# SIC Deal.Nr Event.Date Target.Nation CAR
# 1: 201 1 1999-01-01 Italy 5.1%
# 2: 201 4 1999-09-10 Germany 0.3%
# 3: 201 6 2000-10-25 Spain 0.8%
# 4: 202 2 1999-01-02 Germany 2.3%
Data
dat <- structure(list(Deal.Nr = 1:6, Event.Date = c("01-01-1999", "02-01-1999", "06-01-1999", "10-09-1999", "15-09-1999", "25-10-2000"), Target.Nation = c("Italy", "Germany", "Spain", "Germany", "UK", "Spain"), CAR = c("5.1%", "2.3%", "1.5%", "0.3%", "1.1%", "0.8%"), SIC = c(201L, 202L, 201L, 201L, 201L, 201L)), row.names = c(NA, -6L), class = "data.frame")
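All three variants compare each deal with the row immediately before it, whether or not that row was itself kept. The question can also be read as "compare with the last deal that was kept", and the two readings differ on other data: with deals on days 0, 100 and 200, the consecutive-difference rule drops both later deals, while the last-kept rule keeps the third. Below is a rough sketch of the last-kept reading. The helper name keep_spaced is hypothetical, and it counts calendar days, not trading days:

```r
# Hypothetical helper: walk through the dates in chronological order and
# keep a deal only if it is at least min_gap days after the last KEPT deal.
keep_spaced <- function(days, min_gap = 180) {
  keep <- logical(length(days))
  last_kept <- -Inf                     # so the first deal is always kept
  for (i in order(days)) {
    if (days[i] - last_kept >= min_gap) {
      keep[i] <- TRUE
      last_kept <- days[i]
    }
  }
  keep
}

dat <- data.frame(
  Deal.Nr    = 1:6,
  Event.Date = as.Date(c("01-01-1999", "02-01-1999", "06-01-1999",
                         "10-09-1999", "15-09-1999", "25-10-2000"),
                       format = "%d-%m-%Y"),
  SIC        = c(201L, 202L, 201L, 201L, 201L, 201L)
)

# ave() applies the helper within each SIC group; the logical result is
# stored as 0/1 in a numeric vector, hence the "> 0".
keep <- ave(as.numeric(dat$Event.Date), dat$SIC, FUN = keep_spaced) > 0
dat$Deal.Nr[keep]
# [1] 1 2 4 6
```

On the sample data both readings give the same four deals, so the difference only shows up on longer runs of closely spaced deals.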

Rank df by values and sum by unique variables [duplicate]

I have a large dataset containing the names of hospitals, the hospital groups and then the number of presenting patients by month. I'm trying to use dplyr to create a summary that contains the total number of presenting patients each month, aggregated by hospital group. The data frame looks like this:
Hospital | Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
---------------------------------------------------------------
Hosp 1 | Group A | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 2 | Group A | 6 | 3 | 8 | 2 | .....
---------------------------------------------------------------
Hosp 3 | Group B | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 4 | Group B | 3 | 7 | 2 | 1 | .....
---------------------------------------------------------------
I'm trying to create a new dataframe that looks like this:
Hospital_group |Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
----------------------------------------------------------
Group A | 11 | 8 | 14 | 6 | .....
----------------------------------------------------------
Group B | 8 | 12 | 8 | 5 | .....
----------------------------------------------------------
I'm trying to use dplyr to summarise the data but am a little stuck (am very new at this as you might have guessed). I've managed to filter out the first column (hospital name) and group_by the hospital group but am not sure how to get a cumulative sum total for each month and year (there is a large number of date columns so I'm hoping there is a quick and easy way to do this).
Sorry about posting such a basic question - any help or advice would be greatly appreciated.
Greg
Use summarize_all:
Example:
library(dplyr)
df <- tibble(name=c("a","b", "a","b"), colA = c(1,2,3,4), colB=c(5,6,7,8))
df
# A tibble: 4 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 1 5
2 b 2 6
3 a 3 7
4 b 4 8
df %>% group_by(name) %>% summarize_all(sum)
Result:
# A tibble: 2 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 4 12
2 b 6 14
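Edit: In your case, your data frame contains one column that you do not want to aggregate (the Hospital name). You might have to either deselect the hospital name column first, or use summarize_at(vars(-Hospital), sum) instead of summarize_all. (Older examples wrap the function as funs(sum); funs() is deprecated in recent versions of dplyr, and passing sum directly works.)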
We can do this using base R.
We split the data frame by Hospital_group and then sum each piece column-wise.
do.call(rbind, lapply(split(df[-c(1, 2)], df$Hospital_group), colSums))
# Jan_03 Feb_03 Mar_03 Apr_03
#Group_A 11 8 14 6
#Group_B 8 12 8 5
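In dplyr 1.0.0 and later, summarize_all() and summarize_at() are superseded by across(). Here is a small sketch of the same aggregation in that style, assuming a toy data frame shaped like the one in the question (the values are taken from its table):

```r
library(dplyr)

# Toy data shaped like the question's table; check.names = FALSE keeps
# the month columns named "Jan 03", "Feb 03", ...
df <- data.frame(
  Hospital       = c("Hosp 1", "Hosp 2", "Hosp 3", "Hosp 4"),
  Hospital_group = c("Group A", "Group A", "Group B", "Group B"),
  `Jan 03` = c(5, 6, 5, 3),
  `Feb 03` = c(5, 3, 5, 7),
  `Mar 03` = c(6, 8, 6, 2),
  `Apr 03` = c(4, 2, 4, 1),
  check.names = FALSE
)

# Sum every column except Hospital within each group; the grouping
# column is left out of across() automatically.
totals <- df %>%
  group_by(Hospital_group) %>%
  summarise(across(-Hospital, sum))
totals
```

Because the selection is "everything except Hospital", the call stays the same no matter how many month columns the real data has.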

Making Pivot table with Multiple Columns and Aggregating by Unique Occurences

I'm having a tough time wrapping my head around this or finding a guideline online.
I have membership data. I want to be able to see how many members last until a particular month before dropping their membership. I can see which month they joined, and I can see how long they've been active by looking at their transaction no (it increases by 1 each month). So if I track transaction no's for each month, I can get a waterfall of how many people joined that month and what the drop-off was.
The kicker is that sometimes there are multiple transactions within a month by the same member, and I would like to count that member only once.
Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2
Aggregating for distinct members with months as columns, the result would look something like this:
Transaction# | Jan | Feb | March
1 | 3 | 1 | 2
2 | 2 | 1 | 1
3 | 1 | 1 | 0
4 | 1 | 0 | 0
Any tips or pointers in the correct direction would be very helpful. Should I be using reshape2 or a similar package? Hopefully I did not butcher the explanation or the formatting, please feel free to ask any questions.
Thank you!
Below is a reproducible example that uses the tidyverse functions dplyr::n_distinct and tidyr::spread.
I have first represented your data as a tibble (or you could use a data frame equally well).
Next we group by Transactionno and JoinedMonth before counting distinct Names. To get it in the table format you requested, we use tidyr::spread. If you want the resulting columns in month order, make sure JoinedMonth is a factor with its levels in that order; otherwise the columns come out alphabetically, as in the output below.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
x <- tribble(
~Name , ~JoinedMonth, ~Transactionno,
"Adam" , "Jan" , 1,
"Adam" , "Jan" , 2,
"Adam" , "Jan" , 2,
"Ben" , "Jan" , 1,
"Ben" , "Jan" , 2,
"Ben" , "Jan" , 3,
"Ben" , "Jan" , 4,
"Cathy", "Jan" , 1,
"Donna", "Feb" , 1,
"Donna", "Feb" , 2,
"Donna", "Feb" , 3,
"Evan" , "Mar" , 1,
"Evan" , "Mar" , 1,
"Frank" , "Mar" , 1,
"Frank" , "Mar" , 2
)
x %>%
group_by(Transactionno, JoinedMonth) %>%
summarise(ct = n_distinct(Name)) %>%
tidyr::spread(JoinedMonth, ct, fill = 0)
#> # A tibble: 4 x 4
#> # Groups: Transactionno [4]
#> Transactionno Feb Jan Mar
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1. 1. 3. 2.
#> 2 2. 1. 2. 1.
#> 3 3. 1. 1. 0.
#> 4 4. 0. 1. 0.
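spread() still works but is superseded in current tidyr by pivot_wider(). Below is a sketch of the same count in that style, with JoinedMonth turned into a factor so the columns come out as Jan, Feb, Mar. The factor conversion and the use of distinct() plus count() are my choices, not part of the answer above:

```r
library(dplyr)
library(tidyr)

x <- data.frame(
  Name = c("Adam", "Adam", "Adam", "Ben", "Ben", "Ben", "Ben", "Cathy",
           "Donna", "Donna", "Donna", "Evan", "Evan", "Frank", "Frank"),
  JoinedMonth   = c(rep("Jan", 8), rep("Feb", 3), rep("Mar", 4)),
  Transactionno = c(1, 2, 2, 1, 2, 3, 4, 1, 1, 2, 3, 1, 1, 1, 2)
)

# Factor levels in calendar order, so count() sorts Jan before Feb.
x$JoinedMonth <- factor(x$JoinedMonth, levels = c("Jan", "Feb", "Mar"))

wide <- x %>%
  distinct() %>%                        # one row per member/month/transaction
  count(Transactionno, JoinedMonth) %>% # members per cell
  pivot_wider(names_from = JoinedMonth, values_from = n,
              values_fill = list(n = 0L))
wide
```

pivot_wider() orders the new columns by first appearance, which is why sorting through the factor levels is enough to get calendar order here.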
1) xtabs This one-liner uses base R and the input DF shown reproducibly in the Note below. Note that we assume that Joined.Month is a factor with levels Jan, Feb, Mar to ensure that the output is sorted in that order (rather than alphabetically).
xtabs(~ Transaction.no + Joined.Month, unique(DF))
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
2) table Another base R approach.
with(unique(DF), table(Transaction.no, Joined.Month))
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
2a) This would also work and is shorter but not quite as clear:
table(unique(DF)[3:2])
3) tapply This also uses only base R:
u <- unique(DF)
tapply(u[[1]], u[3:2], length, default = 0)
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
Note
DF in reproducible form is assumed to be:
Lines <- "Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2"
DF <- read.table(text = Lines, header = TRUE, sep = "|",
strip.white = TRUE, as.is = TRUE)
DF$Joined.Month <- factor(DF$Joined.Month, lev = month.abb[1:3])

R - Performing a CountIF for a multiple rows data frame

I've googled lots of examples about how to perform a CountIF in R, however I still didn't find the solution for what I want.
I basically have 2 dataframes:
df1: customer_id | date_of_export - here, we have only 1 date of export per customer
df2: customer_id | date_of_delivery - here, a customer can have different delivery dates (which means, same customer will appear more than once in the list)
And I need to count, for each customer_id in df1, how many deliveries they got after the export date. So, I need to count if df1$customer_id = df2$customer_id AND df1$date_of_export <= df2$date_of_delivery
To understand better:
customer_id | date_of_export
1 | 2018-01-12
2 | 2018-01-12
3 | 2018-01-12
customer_id | date_of_delivery
1 | 2018-01-10
1 | 2018-01-17
2 | 2018-01-13
2 | 2018-01-20
3 | 2018-01-04
My output should be:
customer_id | date_of_export | deliveries_after_export
1 | 2018-01-12 | 1 (one delivery after the export date)
2 | 2018-01-12 | 2 (two deliveries after the export date)
3 | 2018-01-12 | 0 (no delivery after the export date)
Doesn't seem that complicated but I didn't find a good approach to do that. I've been struggling for 2 days and nothing accomplished.
I hope I made myself clear here. Thank you!
I would suggest merging the two data.frames together and then it's a simple sum():
library(data.table)
df3 <- merge(df1, df2)
setDT(df3)[, .(deliveries_after_export = sum(date_of_delivery > date_of_export)), by = .(customer_id, date_of_export)]
# customer_id date_of_export deliveries_after_export
#1: 1 2018-01-12 1
#2: 2 2018-01-12 2
#3: 3 2018-01-12 0
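Two details are worth checking against your real data. merge() does an inner join by default, so a customer in df1 with no rows at all in df2 disappears from the result instead of showing 0. The comparison above is also strict (>), whereas the question states <=. Here is a base R sketch that keeps every customer and counts same-day deliveries; customer 4 is a hypothetical addition to show the no-deliveries case:

```r
df1 <- data.frame(
  customer_id    = c(1, 2, 3, 4),   # customer 4 is hypothetical: no deliveries
  date_of_export = as.Date("2018-01-12")
)
df2 <- data.frame(
  customer_id      = c(1, 1, 2, 2, 3),
  date_of_delivery = as.Date(c("2018-01-10", "2018-01-17", "2018-01-13",
                               "2018-01-20", "2018-01-04"))
)

# For each row of df1, count the matching deliveries on or after the export date.
df1$deliveries_after_export <- vapply(
  seq_len(nrow(df1)),
  function(i) sum(df2$customer_id == df1$customer_id[i] &
                  df2$date_of_delivery >= df1$date_of_export[i]),
  integer(1)
)
df1$deliveries_after_export
# [1] 1 2 0 0
```

This scans df2 once per customer, so it is fine for small tables; for large ones, merge(df1, df2, all.x = TRUE) followed by a grouped sum scales better.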

Merge and append rows within a dataframe in R

I have read many of the threads and do not think my question has been asked before. I have a data.frame in R related to advertisements shown to customers, as shown below. I have many customers and 8 different products, so this is just a sample.
mydf <- data.frame(Cust = c(1, 1), age = c(24, 24),
state = c("NJ", "NJ"), Product = c(1, 1), cost = c(400, 410),
Time = c(35, 23), Purchased = c("N", "Y"))
mydf
# Cust age state Product cost Time Purchased
# 1 1 24 NJ 1 400 35 N
# 2 1 24 NJ 1 410 23 Y
And I want to transform it to look as such ...
Cust | age | state | Product | cost.1 | time.1 | purch.1 | cost.2 | time.2 | purch.2
1 | 24 | NJ | 1 | 400 | 35 | N | 410 | 23 | Y
How can I do this? There are a few static variables for each customer such as age, state and a few others... and then there are the details associated with each offer that was presented to a given customer, the product # in the offer, the cost, the time, and if they purchased it... I want to get all of this onto 1 line for each customer to perform analysis.
It is worth noting that the number of products maxes out at 7, but for some customers it ranges from 1 to 7.
I have no sample code to really show. I have tried using the aggregate function, but I do not want to aggregate, or do any SUMs. I just want to do some joins. Research suggests the cbind, and tapply functions may be useful.
Thank you for your help. I am very new to R.
You are essentially asking to do a "long" to "wide" reshape of your data.
It looks to me like you're using "Cust", "age", "state", and "Product" as your ID variables. You don't have an actual "time" variable though ("time" as in the sequential count of records by the IDs mentioned above). However, such a variable is easy to create:
mydf$timevar <- with(mydf,
ave(rep(1, nrow(mydf)),
Cust, age, state, Product, FUN = seq_along))
mydf
# Cust age state Product cost Time Purchased timevar
# 1 1 24 NJ 1 400 35 N 1
# 2 1 24 NJ 1 410 23 Y 2
From there, this is pretty straightforward with the reshape function in base R.
reshape(mydf, direction = "wide",
idvar=c("Cust", "age", "state", "Product"),
timevar = "timevar")
# Cust age state Product cost.1 Time.1 Purchased.1 cost.2 Time.2 Purchased.2
# 1 1 24 NJ 1 400 35 N 410 23 Y
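The question mentions that customers see anywhere from 1 to 7 offers. reshape() copes with that by filling the missing slots with NA. Here is a small sketch, assuming a hypothetical second customer with a single offer, and treating Product as part of each offer rather than as an ID (a different choice from the one above):

```r
mydf <- data.frame(
  Cust      = c(1, 1, 2),
  age       = c(24, 24, 31),
  state     = c("NJ", "NJ", "NY"),
  Product   = c(1, 1, 3),
  cost      = c(400, 410, 250),
  Time      = c(35, 23, 12),
  Purchased = c("N", "Y", "N")
)  # the third row is a made-up customer with only one offer

# Offer counter within each customer: 1, 2, ... in row order.
mydf$timevar <- ave(seq_along(mydf$Cust), mydf$Cust, FUN = seq_along)

# Only the static columns are IDs; Product, cost, Time and Purchased
# are spread into .1, .2, ... columns.
wide <- reshape(mydf, direction = "wide",
                idvar = c("Cust", "age", "state"),
                timevar = "timevar")
wide
```

Customer 2 ends up with NA in Product.2, cost.2, Time.2 and Purchased.2, which is usually what you want before further analysis.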
