I am importing data that is neither long nor wide:
clear
input str1 id purchased sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
My goal is to get the data in the following long format, reflecting the count in each year:
Identifier Year Inventory
A 2016 0
A 2017 1
A 2018 1
A 2019 1
B 2016 0
B 2017 0
B 2018 0
B 2019 0
C 2016 1
C 2017 1
C 2018 2
C 2019 1
D 2016 0
D 2017 0
D 2018 2
D 2019 1
My initial approach would be to transform the data into a wide format first, that is, one row per identifier with a column for each year from 2016 to 2019, and then convert that into the desired long format. However, this seems inefficient.
Is there any shorter and more efficient method to do this, as I have a much larger dataset?
This needs several small tricks. The most crucial are reshape long and fillin.
The inventory is essentially a running sum of purchases minus sales.
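For intuition, here is that running sum written out in R (my own illustration, not part of the Stata solution), using firm C's events from the example: each purchase contributes +1, each sale -1, and the cumulative sum is the inventory after each event.

# Firm C's events from the example: purchases in 2016 and 2018, a sale in 2019.
events <- data.frame(year  = c(2016, 2018, 2019),
                     event = c("Purchased", "Purchased", "Sold"))
events$inventory <- cumsum((events$event == "Purchased") - (events$event == "Sold"))
events
#   year     event inventory
# 1 2016 Purchased         1
# 2 2018 Purchased         2
# 3 2019      Sold         1

The Stata implementation follows: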
clear
input str1 Identifier Purchased Sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
* row identifier so that reshape has an i() key
generate long id = _n

* stack the Purchased and Sold dates into one variable year, tagged by Event
rename (Purchased Sold) year=
reshape long year, i(id) j(Event) string
drop id

* create every Identifier-year combination, drop fillin's marker variable,
* and drop rows whose event year was missing in the raw data
fillin Identifier year
drop _fillin
drop if missing(year)

* inventory is a running sum: +1 for each purchase, -1 for each sale
bysort Identifier (year Event) : generate inventory = sum((Event == "Purchased") - (Event == "Sold"))
drop Event

* keep the last row per Identifier and year, i.e. the year-end inventory
bysort Identifier year : keep if _n == _N

list, sepby(Identifier)
+----------------------------+
| Identi~r year invent~y |
|----------------------------|
1. | A 2016 0 |
2. | A 2017 1 |
3. | A 2018 1 |
4. | A 2019 1 |
|----------------------------|
5. | B 2016 0 |
6. | B 2017 0 |
7. | B 2018 0 |
8. | B 2019 0 |
|----------------------------|
9. | C 2016 1 |
10. | C 2017 1 |
11. | C 2018 2 |
12. | C 2019 1 |
|----------------------------|
13. | D 2016 0 |
14. | D 2017 0 |
15. | D 2018 2 |
16. | D 2019 1 |
+----------------------------+
This is a follow-up question to the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
library(dplyr)  # for %>%, group_by() and mutate()

df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                        "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diffs | Amount |
|----|:----------:|------:|-------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | NA | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | NA | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | NA | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain, within each Buyer and id, runs of consecutive rows where the gap between rows is <= 5 days and the amounts sum to more than 5000. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', 5 days apart; since their sum exceeds 5000, both records should appear in the output. The same Buyer 'Sandy' with id '4' has three more transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', each within 3 days of the previous one, but these are excluded because their sum is below 5000.
The final output would look like:
| id | Date | Buyer | diffs | Amount |
|----|:---------:|------:|-------|--------|
| 4 | 6/15/2018 | Sandy | NA | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
library(dplyr)

df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                        "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date <- as.Date(df$Date, '%m/%d/%Y')  # %Y for four-digit years ('%y' would misparse these dates)
df$Amount <- as.numeric(df$Amount)
Here I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy will have ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds each row's Amount to the previous row's Amount (lag() gives you the previous row). The ifelse() statement forces ConsecutiveSum to 0 when there is no previous row. The last step just enforces your conditions:
df %>%
  group_by(id) %>%
  arrange(Date) %>%
  mutate(rank = dense_rank(Date)) %>%
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  filter((diffs <= 5 & ConsecutiveSum >= 5000) |
         (ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000))
# id    Date       Buyer Amount diffs  rank ConsecutiveSum
# <chr> <date>     <chr>  <dbl> <dbl> <int>          <dbl>
# 1 4   2018-06-15 Sandy   1849    NA     1              0
# 2 4   2018-06-20 Sandy   4193     5     2           6042
I would use a combination of techniques available in the tidyverse:
First create a grouping variable (new_id) so that the original id and new_id together identify each run of transactions that are close in time. Then we can sum Amount within those id/new_id groups and filter on the criterion that the sum exceeds 5000, and finally use a semi_join to keep the matching rows of the original data.
ids below is a dataset that computes the total Amount (dollar) for each id and new_id and keeps only the combinations where dollar > 5000; these are the id/new_id pairs that meet your criteria.
library(tidyverse)

df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                        "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
df1 <- df %>% mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
                     tf1 = (id != lag(id, default = "")),  # id is character, so the default must be too
                     tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)  # start a new run whenever either flag is TRUE
df1
   id    Date       Buyer Amount diffs tf1   tf2   new_id
   <chr> <date>     <chr>  <dbl> <dbl> <lgl> <lgl>  <int>
 1 9     2018-11-29 John     959    NA TRUE  TRUE       1
 2 9     2018-11-29 John    1158     0 FALSE FALSE      1
 3 9     2018-11-29 John     596     0 FALSE FALSE      1
 4 5     2019-02-13 Maria    922    NA TRUE  TRUE       2
 5 5     2019-02-13 Maria    922     0 FALSE FALSE      2
 6 4     2018-06-15 Sandy   1849    NA TRUE  TRUE       3
 7 4     2018-06-20 Sandy   4193     5 FALSE FALSE      3
 8 4     2018-08-17 Sandy   4256    58 FALSE TRUE       4
 9 4     2018-08-20 Sandy     65     3 FALSE FALSE      4
10 4     2018-08-23 Sandy    100     3 FALSE FALSE      4
11 20    2018-12-25 Paul     313    NA TRUE  TRUE       5
12 20    2018-12-25 Paul      99     0 FALSE FALSE      5
ids <- df1 %>%
  group_by(id, new_id) %>%
  summarise(dollar = sum(Amount)) %>%
  ungroup() %>%
  filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)
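semi_join() here joins on all columns the two frames share; spelling the keys out is equivalent and avoids the "Joining, by = ..." message:

df1 %>% semi_join(ids, by = c("id", "new_id"))  # id and new_id are the shared columns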
I have a large dataset containing the names of hospitals, the hospital groups and then the number of presenting patients by month. I'm trying to use dplyr to create a summary that contains the total number of presenting patients each month, aggregated by hospital group. The data frame looks like this:
| Hospital | Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | ..... |
|----------|----------------|--------|--------|--------|--------|-------|
| Hosp 1 | Group A | 5 | 5 | 6 | 4 | ..... |
| Hosp 2 | Group A | 6 | 3 | 8 | 2 | ..... |
| Hosp 3 | Group B | 5 | 5 | 6 | 4 | ..... |
| Hosp 4 | Group B | 3 | 7 | 2 | 1 | ..... |
I'm trying to create a new dataframe that looks like this:
| Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | ..... |
|----------------|--------|--------|--------|--------|-------|
| Group A | 11 | 8 | 14 | 6 | ..... |
| Group B | 8 | 12 | 8 | 5 | ..... |
I'm trying to use dplyr to summarise the data but am a little stuck (I'm very new at this, as you might have guessed). I've managed to filter out the first column (the hospital name) and group_by the hospital group, but am not sure how to get a summed total for each month-and-year column (there is a large number of date columns, so I'm hoping there is a quick and easy way to do this).
Sorry about posting such a basic question - any help or advice would be greatly appreciated.
Greg
Use summarize_all:
Example:
library(dplyr)

df <- tibble(name=c("a","b", "a","b"), colA = c(1,2,3,4), colB=c(5,6,7,8))
df
# A tibble: 4 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 1 5
2 b 2 6
3 a 3 7
4 b 4 8
df %>% group_by(name) %>% summarize_all(sum)
Result:
# A tibble: 2 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 4 12
2 b 6 14
Edit: In your case, your data frame contains one column that you do not want to aggregate (the hospital name). You might have to either deselect the hospital name column first, or use summarize_at(vars(-Hospital), funs(sum)) instead of summarize_all.
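For instance, a minimal sketch of that summarize_at() variant on data shaped like yours (the data frame and column names here are assumptions, and I pass sum directly rather than the older funs(sum) spelling):

library(dplyr)

hospitals <- tibble(Hospital = c("Hosp 1", "Hosp 2", "Hosp 3", "Hosp 4"),
                    Hospital_group = c("Group A", "Group A", "Group B", "Group B"),
                    Jan_03 = c(5, 6, 5, 3),
                    Feb_03 = c(5, 3, 5, 7))

hospitals %>%
  group_by(Hospital_group) %>%
  summarize_at(vars(-Hospital), sum)  # sums every non-grouping column per group
# A tibble: 2 x 3
#   Hospital_group Jan_03 Feb_03
#   <chr>           <dbl>  <dbl>
# 1 Group A            11      8
# 2 Group B             8     12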
We can do this using base R: split the dataframe by Hospital_group, then sum each piece column-wise.
# drop the two identifier columns, split the rows by group, and column-sum each piece
do.call(rbind, lapply(split(df[-c(1, 2)], df$Hospital_group), colSums))
# Jan_03 Feb_03 Mar_03 Apr_03
#Group_A 11 8 14 6
#Group_B 8 12 8 5
I was wondering how I would use R to calculate the following.
Assuming a CSV with the following purchase data:
| Customer ID | Purchase Date |
|-------------|---------------|
| 1 | 01/01/2017 |
| 2 | 01/01/2017 |
| 3 | 01/01/2017 |
| 4 | 01/01/2017 |
| 1 | 02/01/2017 |
| 2 | 03/01/2017 |
| 2 | 07/01/2017 |
I want to figure out the average time between repurchases by customer.
The math would be like the one below:
| Customer ID | AVG repurchase |
|-------------|----------------|
| 1 | 30 days | = (02/01 - 01/01) / 1 order
| 2 | 90 days | = ( (03/01 - 01/01) + (07/01 - 03/01) ) / 2 orders
| 3 | n/a |
| 4 | n/a |
The output would be the total average across customers -- so: 60 days = (30 avg for customer1 + 90 avg for customer2) / 2 customers.
I've assumed you have read your CSV into a dataframe named df, and I've renamed your variables using snake case, since variable names containing spaces are inconvenient to work with; most people use either snake case or camel case naming conventions instead.
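That assumed setup might look like the following sketch (the file name is hypothetical; the key step is parsing purchase_date as a Date so that diff() returns differences in days):

# Hypothetical file name. Parsing purchase_date as a Date is what lets
# diff() below return differences in days.
df <- read.csv("purchases.csv", stringsAsFactors = FALSE)
names(df) <- c("customer_id", "purchase_date")  # snake_case renames
df$purchase_date <- as.Date(df$purchase_date, format = "%m/%d/%Y")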
Here is a base R solution:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
[1] 60.75
You may notice that we get 60.75 rather than 60 as you expected. This is because January 1 to February 1 is 31 days (January has 31 days), so customer 1's average is 31, and customer 2's average likewise works out to 90.5 -- there are not always 30 days in a month.
Explanation
by(df$purchase_date, df$customer_id, diff)
The by() function applies another function to data by groupings. Here, we are applying diff() to df$purchase_date by the unique values of df$customer_id. By itself, this would result in the following output:
df$customer_id: 1
Time difference of 31 days
-----------------------------------------------------------
df$customer_id: 2
Time differences in days
[1] 59 122
We then use
sapply(by(df$purchase_date, df$customer_id, diff), mean)
to apply mean() to the elements of the previous result. This gives us each customer's average time to repurchase:
1 2 3 4
31.0 90.5 NaN NaN
(we see customers 3 and 4 never repurchased). Finally, we need to average these average repurchase times, which means we need to also deal with those NaN values, so we use:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
which will average the previous results, ignoring missing values (which, in R, include NaN values).
Here's another solution with dplyr + lubridate:
library(dplyr)
library(lubridate)
df %>%
  mutate(Purchase_Date = mdy(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  summarize(AVG_Repurchase = sum(difftime(Purchase_Date, lag(Purchase_Date),
                                          units = "days"), na.rm = TRUE) / (n() - 1))
or with data.table:
library(data.table)
setDT(df)[, Purchase_Date := mdy(Purchase_Date)]
df[, .(AVG_Repurchase = sum(difftime(Purchase_Date, shift(Purchase_Date),
                                     units = "days"), na.rm = TRUE) / (.N - 1)),
   by = "Customer_ID"]
Result:
# A tibble: 4 x 2
Customer_ID AVG_Repurchase
<dbl> <time>
1 1 31.0 days
2 2 90.5 days
3 3 NaN days
4 4 NaN days
Customer_ID AVG_Repurchase
1: 1 31.0 days
2: 2 90.5 days
3: 3 NaN days
4: 4 NaN days
Note:
I first parsed Purchase_Date from its mm/dd/yyyy strings into Date values, then grouped by Customer_ID. Finally, for each Customer_ID, I calculated the mean of the day differences between each Purchase_Date and its lag.
Data:
df = structure(list(Customer_ID = c(1, 2, 3, 4, 1, 2, 2), Purchase_Date = c(" 01/01/2017",
" 01/01/2017", " 01/01/2017", " 01/01/2017", " 02/01/2017", " 03/01/2017",
" 07/01/2017")), .Names = c("Customer_ID", "Purchase_Date"), class = "data.frame", row.names = c(NA,
-7L))
How do I group and rank records based on 7-day windows?
Call 1 - 06-Jun-14 16.39.14 Rank 1
Call 7 - 10-Jun-14 14.28.40 Rank 7
After 7 days, whenever the next call date occurs,
I need to watch the next 7 days and rank accordingly.
Call 1 - 27-Jun-14 11.44.35 Rank 1
Call 4 - 03-Jul-14 14.23.39 Rank 4
CALL_DATE ROW_NUMBER
06-Jun-14 16.39.14 1
06-Jun-14 17.29.27 2
07-Jun-14 09.13.18 3
07-Jun-14 14.45.52 4
08-Jun-14 13.05.44 5
08-Jun-14 13.14.49 6
10-Jun-14 14.28.40 7
27-Jun-14 11.44.35 1
27-Jun-14 11.46.27 2
27-Jun-14 12.00.21 3
03-Jul-14 14.23.39 4
You can calculate the day number within the range by using the first_value() analytic function and taking each date's difference from the first call date; then divide that by seven to get the week number (within the data); and then use that to calculate the row_number() of each date within its calculated week number.
select call_date,
       row_number() over (partition by week_num order by call_date) as row_num
from (
  select call_date,
         -- days since the earliest call (1-based), bucketed into 7-day weeks
         ceil((trunc(call_date)
               - trunc(first_value(call_date) over (order by call_date))
               + 1) / 7) as week_num
  from t42
)
order by call_date;
Which gives:
| CALL_DATE | ROW_NUM |
|-----------------------------|---------|
| June, 06 2014 16:39:14+0000 | 1 |
| June, 06 2014 17:29:27+0000 | 2 |
| June, 07 2014 09:13:18+0000 | 3 |
| June, 07 2014 14:45:52+0000 | 4 |
| June, 08 2014 13:05:44+0000 | 5 |
| June, 08 2014 13:14:49+0000 | 6 |
| June, 10 2014 14:28:40+0000 | 7 |
| June, 27 2014 11:44:35+0000 | 1 |
| June, 27 2014 11:46:27+0000 | 2 |
| June, 27 2014 12:00:21+0000 | 3 |
| July, 03 2014 14:23:39+0000 | 4 |
SQL Fiddle showing some of the intermediate steps and the final result.
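For intuition outside SQL, the same fixed 7-day bucketing can be sketched in R (my own illustration, not part of the original answer): week_num counts 7-day blocks measured from the first call's date, and row_num is the position within each block.

# R sketch of the query's logic: bucket calls into 7-day blocks from the
# first call's date, then number the calls within each block.
calls <- as.POSIXct(c("2014-06-06 16:39:14", "2014-06-06 17:29:27",
                      "2014-06-07 09:13:18", "2014-06-07 14:45:52",
                      "2014-06-08 13:05:44", "2014-06-08 13:14:49",
                      "2014-06-10 14:28:40", "2014-06-27 11:44:35",
                      "2014-06-27 11:46:27", "2014-06-27 12:00:21",
                      "2014-07-03 14:23:39"), tz = "UTC")
day_num  <- as.numeric(as.Date(calls) - min(as.Date(calls))) + 1  # 1-based day offset
week_num <- ceiling(day_num / 7)                                  # 7-day bucket
row_num  <- ave(day_num, week_num, FUN = seq_along)               # rank within bucket
data.frame(call_date = calls, week_num, row_num)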