Making a Pivot Table with Multiple Columns and Aggregating by Unique Occurrences - R

I'm having a tough time wrapping my head around this or finding a guideline online.
I have membership data. I want to be able to see how many members last in a particular month before dropping their membership. I can see which month they joined and I can see how long they've been active by looking at their transaction no (it increases by 1 each month). So if I track transaction no's for each month, I can get a waterfall of how many people joined that month and what the drop-off was.
The kicker is that sometimes there are multiple transactions within a month by the same member, but I would only like to count that member once.
Name  | Joined Month | Transaction no
Adam  | Jan          | 1
Adam  | Jan          | 2
Adam  | Jan          | 2
Ben   | Jan          | 1
Ben   | Jan          | 2
Ben   | Jan          | 3
Ben   | Jan          | 4
Cathy | Jan          | 1
Donna | Feb          | 1
Donna | Feb          | 2
Donna | Feb          | 3
Evan  | Mar          | 1
Evan  | Mar          | 1
Frank | Mar          | 1
Frank | Mar          | 2
Aggregating for distinct members with months as columns, the result would look something like this:
Transaction# | Jan | Feb | March
1 | 3 | 1 | 2
2 | 2 | 1 | 1
3 | 1 | 1 | 0
4 | 1 | 0 | 0
Any tips or pointers in the right direction would be very helpful. Should I be using reshape2 or a similar package? Hopefully I did not butcher the explanation or the formatting; please feel free to ask any questions.
Thank you!

Below is a reproducible example that uses the tidyverse functions dplyr::n_distinct and tidyr::spread.
I have first represented your data as a tibble (or you could use a data frame equally well).
Next we group by Transactionno and JoinedMonth before counting distinct Names. To get it into the table format you request, we use tidyr::spread. If you want the resulting columns in month order, make sure JoinedMonth is a factor whose levels are in month order (a sketch of this follows the output below).
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
x <- tribble(
  ~Name  , ~JoinedMonth, ~Transactionno,
  "Adam" , "Jan"       , 1,
  "Adam" , "Jan"       , 2,
  "Adam" , "Jan"       , 2,
  "Ben"  , "Jan"       , 1,
  "Ben"  , "Jan"       , 2,
  "Ben"  , "Jan"       , 3,
  "Ben"  , "Jan"       , 4,
  "Cathy", "Jan"       , 1,
  "Donna", "Feb"       , 1,
  "Donna", "Feb"       , 2,
  "Donna", "Feb"       , 3,
  "Evan" , "Mar"       , 1,
  "Evan" , "Mar"       , 1,
  "Frank", "Mar"       , 1,
  "Frank", "Mar"       , 2
)
x %>%
  group_by(Transactionno, JoinedMonth) %>%
  summarise(ct = n_distinct(Name)) %>%
  tidyr::spread(JoinedMonth, ct, fill = 0)
#> # A tibble: 4 x 4
#> # Groups: Transactionno [4]
#> Transactionno Feb Jan Mar
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1. 1. 3. 2.
#> 2 2. 1. 2. 1.
#> 3 3. 1. 1. 0.
#> 4 4. 0. 1. 0.
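Note that the columns above came out in alphabetical order (Feb, Jan, Mar). As mentioned, making JoinedMonth a factor whose levels are in month order fixes this; a minimal sketch, assuming the same x tibble as above:
# Month-ordered factor levels make spread() emit Jan, Feb, Mar
x %>%
  mutate(JoinedMonth = factor(JoinedMonth, levels = month.abb)) %>%
  group_by(Transactionno, JoinedMonth) %>%
  summarise(ct = n_distinct(Name)) %>%
  tidyr::spread(JoinedMonth, ct, fill = 0)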

1) xtabs This one-liner uses base R and the input DF shown reproducibly in the Note below. Note that we assume that Joined.Month is a factor with levels Jan, Feb, Mar to ensure that the output is sorted in that order (rather than alphabetically).
xtabs(~ Transaction.no + Joined.Month, unique(DF))
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
2) table Another base R approach.
with(unique(DF), table(Transaction.no, Joined.Month))
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
2a) This would also work and is shorter but not quite as clear:
table(unique(DF)[3:2])
3) tapply This also uses only base R:
u <- unique(DF)
tapply(u[[1]], u[3:2], length, default = 0)
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
Note
DF in reproducible form is assumed to be:
Lines <- "Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2"
DF <- read.table(text = Lines, header = TRUE, sep = "|",
                 strip.white = TRUE, as.is = TRUE)
DF$Joined.Month <- factor(DF$Joined.Month, lev = month.abb[1:3])
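Since reshape2 was mentioned in the question, the same table can also be produced with reshape2::dcast after de-duplicating the rows; a minimal sketch, assuming the DF defined in the Note above:
library(reshape2)

# Count the (now unique) rows per Transaction.no / Joined.Month cell;
# empty cells default to length of an empty vector, i.e. 0.
dcast(unique(DF), Transaction.no ~ Joined.Month, fun.aggregate = length,
      value.var = "Name")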

Related

Converting raw data into long format

I am importing data that is neither long nor wide:
clear
input str1 id purchased sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
My goal is to get the data in the following long format, reflecting the count in each year:
Identifier Year Inventory
A 2016 0
A 2017 1
A 2018 1
A 2019 1
B 2016 0
B 2017 0
B 2018 0
B 2019 0
C 2016 1
C 2017 1
C 2018 2
C 2019 1
D 2016 0
D 2017 0
D 2018 2
D 2019 1
My initial approach would be to first transform it into a wide format, that is, having only one row per identifier and adding columns for the years 2016-2019, and then converting that into the desired long format. However, this seems inefficient.
Is there any shorter and more efficient method to do this, as I have a much larger dataset?
This needs several small tricks. The most crucial are reshape long and fillin.
The inventory is essentially a running sum of purchases minus sales.
clear
input str1 Identifier Purchased Sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
generate long id = _n
rename (Purchased Sold) year=
reshape long year, i(id) j(Event) string
drop id
fillin Identifier year
drop _fillin
drop if missing(year)
bysort Identifier (year Event) : generate inventory = sum((Event == "Purchased") - (Event == "Sold"))
drop Event
bysort Identifier year : keep if _n == _N
list, sepby(Identifier)
+----------------------------+
| Identi~r year invent~y |
|----------------------------|
1. | A 2016 0 |
2. | A 2017 1 |
3. | A 2018 1 |
4. | A 2019 1 |
|----------------------------|
5. | B 2016 0 |
6. | B 2017 0 |
7. | B 2018 0 |
8. | B 2019 0 |
|----------------------------|
9. | C 2016 1 |
10. | C 2017 1 |
11. | C 2018 2 |
12. | C 2019 1 |
|----------------------------|
13. | D 2016 0 |
14. | D 2017 0 |
15. | D 2018 2 |
16. | D 2019 1 |
+----------------------------+

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
library(dplyr)  # needed for %>%, group_by() and mutate()

df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                          "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount = c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"),
                 stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain the records where, for each Buyer and id, the sum of Amount over consecutive rows is > 5000 and the difference between those consecutive rows is <= 5 days. So, for example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', a gap of 5 days, and since the sum of these two amounts > 5000, the output keeps these records. The same Buyer 'Sandy' with id '4' also has transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', with gaps of 3 days each, but the output does not keep these records because their sum is < 5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date <- as.Date(df$Date, '%m/%d/%Y')
df$Amount <- as.numeric(df$Amount)
Now here I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy will have ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds each row's Amount to the previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to be 0 when there is no previous row. The last step just enforces your conditions:
df %>%
  group_by(id) %>%
  arrange(Date) %>%
  mutate(rank = dense_rank(Date)) %>%
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  filter(diffs <= 5 & ConsecutiveSum >= 5000 | ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in the tidyverse:
First create a grouping variable (new_id) and use the original id together with new_id to sum Amount within each run of rows. Then we can filter on the criterion that the summed Amount > 5000, and use that result with filter plus a join, or with semi_join, to subset the original data.
ids is a dataset that finds the total Amount by id and new_id and keeps only the combinations where that total (dollar) > 5000. This gives you the id and new_id values that meet your criteria.
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
library(tidyverse)
df1 <- df %>% mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
                     tf1 = (id != lag(id, default = "0")),
                     tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
  group_by(id, new_id) %>%
  summarise(dollar = sum(Amount)) %>%
  ungroup() %>%
  filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)
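semi_join with no by argument joins on every column the two data frames share and prints a "Joining, by = ..." message; spelling the keys out makes the intent explicit. A small usage note, assuming the df1 and ids objects built above:
# Keep only the rows of df1 whose (id, new_id) combination appears in ids
df1 %>% semi_join(ids, by = c("id", "new_id"))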

Rank df by values and sum by unique variables [duplicate]

I have a large dataset containing the names of hospitals, the hospital groups and then the number of presenting patients by month. I'm trying to use dplyr to create a summary that contains the total number of presenting patients each month, aggregated by hospital group. The data frame looks like this:
Hospital | Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
---------------------------------------------------------------------
Hosp 1   | Group A        | 5      | 5      | 6      | 4      | .....
Hosp 2   | Group A        | 6      | 3      | 8      | 2      | .....
Hosp 3   | Group B        | 5      | 5      | 6      | 4      | .....
Hosp 4   | Group B        | 3      | 7      | 2      | 1      | .....
I'm trying to create a new dataframe that looks like this:
Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
-----------------------------------------------------------
Group A        | 11     | 8      | 14     | 6      | .....
Group B        | 8      | 12     | 8      | 5      | .....
I'm trying to use dplyr to summarise the data but am a little stuck (I'm very new at this, as you might have guessed). I've managed to filter out the first column (hospital name) and group_by the hospital group, but am not sure how to get the sum total for each month and year (there is a large number of date columns, so I'm hoping there is a quick and easy way to do this).
Sorry about posting such a basic question - any help or advice would be greatly appreciated.
Greg
Use summarize_all:
Example:
df <- tibble(name=c("a","b", "a","b"), colA = c(1,2,3,4), colB=c(5,6,7,8))
df
# A tibble: 4 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 1 5
2 b 2 6
3 a 3 7
4 b 4 8
df %>% group_by(name) %>% summarize_all(sum)
Result:
# A tibble: 2 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 4 12
2 b 6 14
Edit: In your case, your data frame contains one column that you do not want to aggregate (the Hospital name). You might have to either deselect the hospital name column first, or use summarize_at(vars(-Hospital), funs(sum)) instead of summarize_all.
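In more recent dplyr (1.0.0 and later), summarize_all()/summarize_at() are superseded by across(). A minimal sketch under that assumption, using the toy df above and a placeholder name hospital_df for your hospital data:
# Toy example: sum every non-grouping column
df %>%
  group_by(name) %>%
  summarize(across(everything(), sum))

# Hospital case (hospital_df is a hypothetical name for your data frame):
# summing only the numeric month columns skips the Hospital name column.
hospital_df %>%
  group_by(Hospital_group) %>%
  summarize(across(where(is.numeric), sum))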
We can do this using base R
We split the dataframe by Hospital_group and then sum it column-wise.
do.call(rbind, lapply(split(df[-c(1, 2)], df$Hospital_group), colSums))
# Jan_03 Feb_03 Mar_03 Apr_03
#Group_A 11 8 14 6
#Group_B 8 12 8 5

Calculating the average difference in purchase dates by customer id

Was wondering how I would use R to calculate the below.
Assuming a CSV with the following purchase data:
| Customer ID | Purchase Date |
| 1 | 01/01/2017 |
| 2 | 01/01/2017 |
| 3 | 01/01/2017 |
| 4 | 01/01/2017 |
| 1 | 02/01/2017 |
| 2 | 03/01/2017 |
| 2 | 07/01/2017 |
I want to figure out the average time between repurchases by customer.
The math would be like the one below:
| Customer ID | AVG repurchase |
| 1 | 30 days | = (02/01 - 01/01) / 1 order
| 2 | 90 days | = ( (03/01 - 01/01) + (07/01 - 03/01) ) / 2 orders
| 3 | n/a |
| 4 | n/a |
The output would be the total average across customers -- so: 60 days = (30 avg for customer1 + 90 avg for customer2) / 2 customers.
I've assumed you have read your CSV into a dataframe named df and I've renamed your variables using snake case, since having variables with a space in the name can be inconvenient, leading many to use either snake case or camel case variable naming conventions.
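For concreteness, a minimal setup sketch under those assumptions; the file name purchases.csv is hypothetical, and the dates are parsed as mm/dd/yyyy:
# Read the purchase data and prepare it for the solutions below
df <- read.csv("purchases.csv", check.names = FALSE)  # hypothetical file name
names(df) <- c("customer_id", "purchase_date")
df$purchase_date <- as.Date(df$purchase_date, format = "%m/%d/%Y")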
Here is a base R solution:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
[1] 60.75
You may notice that we get 60.75 rather than 60 as you expected. This is because there are 31 days between customer 1's purchases (31 days in January until February 1), and similarly for customer 2's purchases -- there are not always 30 days in a month.
Explanation
by(df$purchase_date, df$customer_id, diff)
The by() function applies another function to data by groupings. Here, we are applying diff() to df$purchase_date by the unique values of df$customer_id. By itself, this would result in the following output:
df$customer_id: 1
Time difference of 31 days
-----------------------------------------------------------
df$customer_id: 2
Time differences in days
[1] 59 122
We then use
sapply(by(df$purchase_date, df$customer_id, diff), mean)
to apply mean() to the elements of the previous result. This gives us each customer's average time to repurchase:
1 2 3 4
31.0 90.5 NaN NaN
(we see customers 3 and 4 never repurchased). Finally, we need to average these average repurchase times, which means we need to also deal with those NaN values, so we use:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
which will average the previous results, ignoring missing values (which, in R include NaN values).
Here's another solution with dplyr + lubridate:
library(dplyr)
library(lubridate)
df %>%
  mutate(Purchase_Date = mdy(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  summarize(AVG_Repurchase = sum(difftime(Purchase_Date,
                                          lag(Purchase_Date), units = "days"),
                                 na.rm = TRUE) / (n() - 1))
or with data.table:
library(data.table)
setDT(df)[, Purchase_Date := mdy(Purchase_Date)]
df[, .(AVG_Repurchase = sum(difftime(Purchase_Date,
                                     shift(Purchase_Date), units = "days"),
                            na.rm = TRUE) / (.N - 1)), by = "Customer_ID"]
Result:
# A tibble: 4 x 2
Customer_ID AVG_Repurchase
<dbl> <time>
1 1 31.0 days
2 2 90.5 days
3 3 NaN days
4 4 NaN days
Customer_ID AVG_Repurchase
1: 1 31.0 days
2: 2 90.5 days
3: 3 NaN days
4: 4 NaN days
Note:
I first converted Purchase_Date (which is in mm/dd/yyyy format) to Date class, then grouped by Customer_ID. Finally, for each Customer_ID, I calculated the mean of the day differences between each Purchase_Date and its lag.
Data:
df = structure(list(Customer_ID = c(1, 2, 3, 4, 1, 2, 2), Purchase_Date = c(" 01/01/2017",
" 01/01/2017", " 01/01/2017", " 01/01/2017", " 02/01/2017", " 03/01/2017",
" 07/01/2017")), .Names = c("Customer_ID", "Purchase_Date"), class = "data.frame", row.names = c(NA,
-7L))

How to use Row_Number based on Number of Days?

How to group and rank records based on 7 days.
Call 1 - 06-Jun-14 16.39.14 Rank 1
Call 7 - 10-Jun-14 14.28.40 Rank 7
After 7 days, whenever the next call date occurs,
I need to watch the next 7 days and rank accordingly.
Call 1 - 27-Jun-14 11.44.35 Rank 1
Call 4 - 03-Jul-14 14.23.39 Rank 4
CALL_DATE ROW_NUMBER
06-Jun-14 16.39.14 1
06-Jun-14 17.29.27 2
07-Jun-14 09.13.18 3
07-Jun-14 14.45.52 4
08-Jun-14 13.05.44 5
08-Jun-14 13.14.49 6
10-Jun-14 14.28.40 7
27-Jun-14 11.44.35 1
27-Jun-14 11.46.27 2
27-Jun-14 12.00.21 3
03-Jul-14 14.23.39 4
You can calculate the day number within the range by using the first_value() analytic function and getting the difference; then divide that by seven to get the week number (within the data); and then use that to calculate the row_number() of each date within its calculated week number.
select call_date,
       row_number() over (partition by week_num order by call_date) as row_num
from (
  select call_date,
         ceil((trunc(call_date)
               - trunc(first_value(call_date) over (order by call_date))
               + 1) / 7) as week_num
  from t42
)
order by call_date;
Which gives:
| CALL_DATE | ROW_NUM |
|-----------------------------|---------|
| June, 06 2014 16:39:14+0000 | 1 |
| June, 06 2014 17:29:27+0000 | 2 |
| June, 07 2014 09:13:18+0000 | 3 |
| June, 07 2014 14:45:52+0000 | 4 |
| June, 08 2014 13:05:44+0000 | 5 |
| June, 08 2014 13:14:49+0000 | 6 |
| June, 10 2014 14:28:40+0000 | 7 |
| June, 27 2014 11:44:35+0000 | 1 |
| June, 27 2014 11:46:27+0000 | 2 |
| June, 27 2014 12:00:21+0000 | 3 |
| July, 03 2014 14:23:39+0000 | 4 |
SQL Fiddle showing some of the intermediate steps and the final result.
