Parse values based on groups in R

I have a very large dataset and a sample of that looks something like the one below:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | | 1/1/2000 | 9/24/2018 |
| 25 | | 5/3/1968 | 6/3/2000 |
| 25 | | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | | 9/12/2014 | 11/26/2019 |
I need to parse the names from Name column based on their Id such that the output table looks like:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | Mark | 1/1/2000 | 9/24/2018 |
| 25 | Anthony | 5/3/1968 | 6/3/2000 |
| 25 | Anthony | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | Anthony | 9/12/2014 | 11/26/2019 |
How can I achieve an output as shown above? I went through the substitute and parse functions, but was unable to understand how they apply to this problem.
My dataset would be:
df=data.frame(Id=c("10","10","25","25","25","25"),Name=c("Mark","","","","Anthony",""),
Start_Date=c("4/2/1999", "1/1/2000","5/3/1968","6/6/2009","2/20/2010","9/12/2014"),
End_Date=c("7/5/2018","9/24/2018","6/3/2000","4/23/2010","7/21/2016","11/26/2019"))

We can change the blanks ("") to NA and use fill within each Id to replace the NA elements with the previous non-NA element, then fill upward to cover any leading blanks:
library(dplyr)
library(tidyr)
df1 %>%
mutate(Name = na_if(Name, "")) %>%
group_by(Id) %>%
fill(Name, .direction = "down") %>%
fill(Name, .direction = "up")
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
In tidyr 1.0.0 and later ('0.8.3.9000' was the devel version when this was written), this can be done in a single fill statement, as .direction = "downup" is also an option
df1 %>%
mutate(Name = na_if(Name, "")) %>%
group_by(Id) %>%
fill(Name, .direction = "downup")
Or another option is to group by 'Id', and mutate the 'Name' as the first non-blank element
df1 %>%
group_by(Id) %>%
mutate(Name = first(Name[Name!=""]))
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
data
df1 <- structure(list(Id = c("10", "10", "25", "25", "25", "25"), Name = c("Mark",
"", "", "", "Anthony", ""), Start_Date = c("4/2/1999", "1/1/2000",
"5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"), End_Date = c("7/5/2018",
"9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"
)), class = "data.frame", row.names = c(NA, -6L))
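As a hedged illustration of the fill approach, the sketch below uses a small hypothetical data frame (demo, not part of the question) in which one Id has its name on the second row and another Id has no name at all. It assumes tidyr 1.0.0 or later for .direction = "downup".

```r
library(dplyr)
library(tidyr)

# Hypothetical input: Id 10 has its name on the second row,
# Id 30 has no name on any row.
demo <- data.frame(
  Id   = c("10", "10", "30", "30"),
  Name = c("", "Mark", "", ""),
  stringsAsFactors = FALSE
)

filled <- demo %>%
  mutate(Name = na_if(Name, "")) %>%     # blanks become NA
  group_by(Id) %>%                       # fill never crosses Id boundaries
  fill(Name, .direction = "downup") %>%  # fill down, then up
  ungroup()

filled$Name
```

A group without any name stays NA rather than borrowing a name from a neighbouring Id, which is why the group_by step matters.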

Using DF defined reproducibly in the Note at the end, replace each zero-length element of Name with NA and then use na.omit to get the unique non-NA to use to fill. We have assumed that there is only one non-NA per Id which is the case in the question. If not we could replace na.omit with function(x) unique(na.omit(x)) assuming that the non-NAs are all the same within Id. No packages are used.
transform(DF, Name = ave(replace(Name, !nzchar(Name), NA), Id, FUN = na.omit))
giving:
Id Name Start_Date End_Date
1 10 Mark 4/2/1999 7/5/2018
2 10 Mark 1/1/2000 9/24/2018
3 25 Anthony 5/3/1968 6/3/2000
4 25 Anthony 6/6/2009 4/23/2010
5 25 Anthony 2/20/2010 7/21/2016
6 25 Anthony 9/12/2014 11/26/2019
na.strings
We can simplify this slightly if we make sure that the zero-length elements of Name are NA in the first place. We replace the read.table line in the Note with the first line below. Then it is just a matter of applying na.omit within each Id.
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|",
strip.white = TRUE, na.strings = "")
transform(DF, Name = ave(Name, Id, FUN = na.omit))
Note
The input in reproducible form:
Lines <- "
Id | Name | Start_Date | End_Date
10 | Mark | 4/2/1999 | 7/5/2018
10 | | 1/1/2000 | 9/24/2018
25 | | 5/3/1968 | 6/3/2000
25 | | 6/6/2009 | 4/23/2010
25 | Anthony | 2/20/2010 | 7/21/2016
25 | | 9/12/2014 | 11/26/2019"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|", strip.white = TRUE)
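The unique(na.omit(x)) variant mentioned above can be sketched as follows. The data frame DF2 is hypothetical: it repeats the same name twice within one Id, which is the case where plain na.omit would return more than one value per group.

```r
# Hypothetical data: Id 25 carries the same name on two rows.
DF2 <- data.frame(
  Id   = c("10", "10", "25", "25", "25"),
  Name = c("Mark", "", "Anthony", "", "Anthony"),
  stringsAsFactors = FALSE
)

# Turn zero-length names into NA, then spread the single distinct
# non-NA name over every row of the same Id.
blank_to_na <- replace(DF2$Name, !nzchar(DF2$Name), NA)
DF2$Name <- ave(blank_to_na, DF2$Id,
                FUN = function(x) unique(na.omit(x)))
DF2
```

This still assumes that all non-NA names within an Id are identical; with conflicting names the function would return several values per group.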

Related

R Studio: Match first n characters between two columns, and fill in value from another column

I have a dataframe "city_table" that looks like this:
+---+---------------------+
| | city |
+---+---------------------+
| 1 | Chicago-2234dxsw |
+---+---------------------+
| 2 | Chicago,IL |
+---+---------------------+
| 3 | Chicago |
+---+---------------------+
| 4 | Chicago - 124421xsd |
+---+---------------------+
| 5 | Chicago_2133xx |
+---+---------------------+
| 6 | Atlanta- 1234xx |
+---+---------------------+
| 7 | Atlanta, GA |
+---+---------------------+
| 8 | Atlanta - 123456T |
+---+---------------------+
I have another city code lookup table "city_lookup" that looks like this:
+---+--------------+-----------+
| | city_name | city_code |
+---+--------------+-----------+
| 1 | Chicago, IL | 001 |
+---+--------------+-----------+
| 2 | Atlanta, GA | 002 |
+---+--------------+-----------+
As you can see, the city names in "city" are messy and formatted inconsistently, whereas the city names in "city_lookup" follow a unified format (City, STATE).
I would like a final table that, by matching the first n characters (let's say n = 7) of city_table$city against city_lookup$city_name, returns the proper city code, something like this:
+---+---------------------+-----------+
| | city_name | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw | 001 |
+---+---------------------+-----------+
| 2 | Chicago,IL | 001 |
+---+---------------------+-----------+
| 3 | Chicago | 001 |
+---+---------------------+-----------+
| 4 | Chicago - 124421xsd | 001 |
+---+---------------------+-----------+
| 5 | Chicago_2133xx | 001 |
+---+---------------------+-----------+
| 6 | Atlanta- 1234xx | 002 |
+---+---------------------+-----------+
| 7 | Atlanta, GA | 002 |
+---+---------------------+-----------+
| 8 | Atlanta - 123456T | 002 |
+---+---------------------+-----------+
I am doing this in R, preferably using tidyverse/dplyr. Thanks so much for your help!
Even better, as long as the characters after the full city names are always non-letters, you can match the entire city name as so:
library(dplyr)  # also provides tibble() and %>%
city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))
# Keep only the leading run of letters as the join key
city_table %>%
  mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
  left_join(city_lookup %>%
              mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name)),
            by = "city_clean") %>%
  select(-city_clean, -city_name)
city city_code
<chr> <chr>
1 Chicago-2234dxsw 001
2 Chicago,IL 001
3 Atlanta - 123456T 002
We can create columns with substring (as the OP asked in the question) and then do a regex_left_join
library(dplyr)
library(fuzzyjoin)
city_table %>%
mutate(city_sub = substring(city, 1, 7)) %>%
regex_left_join(city_lookup %>%
mutate(city_sub = substring(city_name, 1, 7)),
by = 'city_sub') %>%
select(city_name = city, city_code)
-output
# city_name city_code
#1 Chicago-2234dxsw 001
#2 Chicago,IL 001
#3 Chicago 001
#4 Chicago - 124421xsd 001
#5 Chicago_2133xx 001
#6 Atlanta- 1234xx 002
#7 Atlanta, GA 002
#8 Atlanta - 123456T 002
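The same first-n-characters match can also be sketched in base R with substr and match, without a join package. This is only a minimal sketch on a reduced copy of the data; it assumes n is small enough that every prefix in the lookup table is unique.

```r
city_table <- data.frame(
  city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago", "Atlanta- 1234xx"),
  stringsAsFactors = FALSE
)
city_lookup <- data.frame(
  city_name = c("Chicago, IL", "Atlanta, GA"),
  city_code = c("001", "002"),
  stringsAsFactors = FALSE
)

n <- 7  # number of leading characters to compare

# Position of each city's prefix among the lookup prefixes (NA if absent)
idx <- match(substr(city_table$city, 1, n),
             substr(city_lookup$city_name, 1, n))
city_table$city_code <- city_lookup$city_code[idx]
city_table
```

Cities whose prefix is not found in the lookup table get NA as their code, which makes unmatched rows easy to spot.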
data
city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago",
"Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA",
"Atlanta - 123456T")), class = "data.frame", row.names = c(NA,
-8L))
city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"),
city_code = c("001",
"002")), class = "data.frame", row.names = c(NA, -2L))

Weekly Weight Based on a category using dplyr in R

I have the following data and am looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
It can be done with a group_by operation i.e. grouped by 'Year', 'MainCat', divide the 'Qty' by the sum of 'Qty' to create the 'Final' column
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
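For a self-contained check of the base R variant, the sketch below rebuilds the question's table under the assumed name wk and verifies that the shares add up to 1 within every Year and MainCat combination.

```r
wk <- data.frame(
  Year    = rep(c(2017L, 2018L), each = 4),
  Week    = rep(1:2, times = 4),
  MainCat = rep(rep(c("Edible", "Flowers"), each = 2), times = 2),
  Qty     = c(69, 12, 88, 47, 90, 35, 78, 85),
  stringsAsFactors = FALSE
)

# Share of each week's Qty within its Year/MainCat group
wk$Final <- with(wk, ave(Qty, Year, MainCat, FUN = prop.table))

# Sanity check: the shares in every group should sum to 1
group_sums <- tapply(wk$Final, list(wk$Year, wk$MainCat), sum)
group_sums
```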

Split a string by a delimiter into an infinite number of columns

I have a database export of user ids and dates logged in.
structure(list(User.Id = c(2542573L, 2571394L, 2770912L, 2683246L,
2832110L, 2773277L), Days.Played = c("", "2020-01-15,2020-01-16,2020-01-21,2020-01-22",
"2020-06-29", "2020-04-19,2020-04-24,2020-04-29", "2020-09-04",
"2020-06-23"), row.names = c(NA,
6L), class = "data.frame")
|---------------------|------------------|
| id | logged_in |
|---------------------|------------------|
| a | 2019-11-21, |
| | 2019-11-22, |
| | 2019-11-23,|
| | 2019-11-24,|
| | 2019-11-25 |
|---------------------|------------------|
| b | |
|---------------------|------------------|
| c | 2019-11-21, |
| | 2019-11-22, |
|---------------------|------------------|
What I am trying to do is split the date column by "," so each date is in its own column.
I want it to look like the table below, with a loggedin.[a:zz] column stretching as wide as the longest string in the database. This could go to 1000 or more.
|---------------------|------------------|------------------|
| id | logged_in.a | loggedin.b |
|---------------------|------------------|------------------|
| a | 2019-11-21, | 2019-11-22 |
| | | |
| | | |
| | | |
| | | |
|---------------------|------------------|------------------|
| b | | |
|---------------------|------------------|------------------|
| c | 2019-11-21, | |
| | | 2019-11-22, |
|---------------------|------------------|------------------|
I then plan on gathering the dataset into a tall file. The code I used is below but I have to define the col names. My issue is I don't know how many there will be.
require(tidyr)     # for %>%
require(reshape2)  # colsplit() comes from reshape2, not tidyr
test %>% transform(.,Days.Played=colsplit(Days.Played, pattern=",", names=c('a','b')))
Does anyone know how to get around this issue or have any suggestions?
I think this is what you are looking for:
library(tidyr)
df %>% separate_rows(Days.Played, sep = ",")
#> # A tibble: 11 x 2
#> User.Id Days.Played
#> <int> <chr>
#> 1 2542573 ""
#> 2 2571394 "2020-01-15"
#> 3 2571394 "2020-01-16"
#> 4 2571394 "2020-01-21"
#> 5 2571394 "2020-01-22"
#> 6 2770912 "2020-06-29"
#> 7 2683246 "2020-04-19"
#> 8 2683246 "2020-04-24"
#> 9 2683246 "2020-04-29"
#> 10 2832110 "2020-09-04"
#> 11 2773277 "2020-06-23"
where df is:
df <- structure(list(User.Id = c(2542573L, 2571394L, 2770912L, 2683246L, 2832110L, 2773277L),
Days.Played = c("", "2020-01-15,2020-01-16,2020-01-21,2020-01-22", "2020-06-29", "2020-04-19,2020-04-24,2020-04-29", "2020-09-04", "2020-06-23")),
row.names = c(NA, 6L), class = "data.frame")
You can also try:
library(tidyverse)
#Data
df <- data.frame(id=c('a','b','c'),
logged_in=c('2019-11-21,2019-11-22,2019-11-23,2019-11-24,2019-11-25','','2019-11-21,2019-11-22,'),stringsAsFactors = F)
#Code
newdf <- df %>%
pivot_longer(-c(id)) %>%
separate_rows(value,sep=',') %>%
group_by(id) %>%
mutate(Var=paste0('logged.in.',row_number())) %>%
select(-name) %>%
pivot_wider(names_from = Var,values_from=value,values_fill='')
Output:
# A tibble: 3 x 6
# Groups: id [3]
id logged.in.1 logged.in.2 logged.in.3 logged.in.4 logged.in.5
<chr> <chr> <chr> <chr> <chr> <chr>
1 a "2019-11-21" "2019-11-22" "2019-11-23" "2019-11-24" "2019-11-25"
2 b "" "" "" "" ""
3 c "2019-11-21" "2019-11-22" "" "" ""
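In base R, we can use strsplit with stack. Users whose Days.Played is empty produce no rows, because strsplit returns a zero-length vector for them: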
out <- stack(setNames(strsplit(df$Days.Played, ","), df$User.Id))[2:1]
colnames(out) <- names(df)
-output
out
# User.Id Days.Played
#1 2571394 2020-01-15
#2 2571394 2020-01-16
#3 2571394 2020-01-21
#4 2571394 2020-01-22
#5 2770912 2020-06-29
#6 2683246 2020-04-19
#7 2683246 2020-04-24
#8 2683246 2020-04-29
#9 2832110 2020-09-04
#10 2773277 2020-06-23
data
df <- structure(list(User.Id = c(2542573L, 2571394L, 2770912L,
2683246L, 2832110L, 2773277L),
Days.Played = c("", "2020-01-15,2020-01-16,2020-01-21,2020-01-22",
"2020-06-29",
"2020-04-19,2020-04-24,2020-04-29", "2020-09-04", "2020-06-23")),
row.names = c(NA, 6L), class = "data.frame")
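If the wide layout from the question is really needed, one possible base R sketch is to split every string, pad the pieces to the longest length, and bind them into columns. The names logins, width and the Days.Played.1, Days.Played.2, ... column naming are assumptions made for illustration.

```r
logins <- data.frame(
  User.Id = c(2542573L, 2571394L, 2770912L),
  Days.Played = c("", "2020-01-15,2020-01-16,2020-01-21,2020-01-22",
                  "2020-06-29"),
  stringsAsFactors = FALSE
)

parts <- strsplit(logins$Days.Played, ",", fixed = TRUE)
width <- max(lengths(parts))  # number of columns is discovered, not hard-coded

# Pad every split result with NA so that all rows have the same length
padded <- lapply(parts, function(p) c(p, rep(NA_character_, width - length(p))))
wide <- do.call(rbind, padded)
colnames(wide) <- paste0("Days.Played.", seq_len(width))

out <- data.frame(User.Id = logins$User.Id, wide, stringsAsFactors = FALSE)
out
```

Since the stated goal is a tall file afterwards, going straight to the long form with separate_rows or stack avoids building the wide table at all.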

Split a row into columns with conditions in R

I've a dataframe as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-08 |
| 2 | 12 | 2018-05 |
| 3 | 45 | 2019-03 |
| 3 | 33 | 2018-03 |
| 1 | 5 | 2018-08 |
| 2 | 98 | 2019-05 |
| 4 | 67 | 2019-10 |
| 4 | 34 | 2018-10 |
| 1 | 55 | 2018-07 |
| 2 | 76 | 2019-08 |
| 2 | 56 | 2018-12 |
+----+-------+---------+
What I'm trying to do here is split the value and date into value1 and value2 and date1 and date2, based on the current year (the year of the system date) and the year before.
The condition is that if the year-month combination in DATE of the main table matches that of the current system date, then last year's date should not be considered.
Also, disregard all values and dates that appear before the year of the system date.
The resulting output would be as under.
In the result, IDs 1, 2 and 3 had corresponding values for the same month in this year and last year, so we split them into 2 different columns.
We didn't consider last year's result for ID 4, as its month this year matches the year-month combination of the system date,
and we also disregard all the values from last year that don't have a corresponding month match this year (ID 1 for 2018-07 and ID 2 for 2018-12 in this example).
+----+---------+---------+--------+--------+
| ID | DATE1 | DATE2 | VALUE1 | VALUE2 |
+----+---------+---------+--------+--------+
| 1 | 2019-08 | 2018-08 | 10 | 5 |
| 2 | 2019-05 | 2018-05 | 98 | 12 |
| 3 | 2019-03 | 2018-03 | 45 | 33 |
| 4 | 2019-10 | NA | 67 | NA |
| 2 | 2019-08 | NA | 76 | NA |
+----+---------+---------+--------+--------+
I think you could get everything in the right format first:
df <- data.frame(ID = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
DATE = c("2019-08", "2018-05", "2019-03","2018-03",
"2018-08","2019-05", "2019-10", "2018-10",
"2018-07", "2019-08", "2018-12"))
library(tidyverse)
df <- df %>% mutate(
year = str_split_fixed(DATE, "-", 2)[,1],
month = str_split_fixed(DATE, "-", 2)[,2]) %>%
pivot_wider(
names_from = year,
values_from = c(VALUE, DATE))
Then, you could filter and remove those values that you do not need according to your logic. I may not fully understand your system time here, but just assume it is the string "2019-10". It could be something like this:
df %>%
filter(!is.na(VALUE_2019)) %>%
mutate(
VALUE_2018 = ifelse(DATE_2019 == "2019-10", NA, VALUE_2018),
DATE_2018 = ifelse(DATE_2019 == "2019-10", NA, as.character(DATE_2018)))
# A tibble: 5 x 6
ID month VALUE_2019 VALUE_2018 DATE_2019 DATE_2018
<dbl> <chr> <dbl> <dbl> <fct> <chr>
1 1 08 10 5 2019-08 2018-08
2 2 05 98 12 2019-05 2018-05
3 3 03 45 33 2019-03 2018-03
4 4 10 67 NA 2019-10 NA
5 2 08 76 NA 2019-08 NA
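To avoid hard-coding "2019-10" and the 2018/2019 column names, the year and year-month can be derived from a date. The base R sketch below is one possible reading of the rules, not the answerer's code: the variable today is a fixed stand-in for Sys.Date() so the result is reproducible, and obs, this_year and last_year are names chosen for illustration.

```r
obs <- data.frame(
  ID    = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
  VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
  DATE  = c("2019-08", "2018-05", "2019-03", "2018-03", "2018-08", "2019-05",
            "2019-10", "2018-10", "2018-07", "2019-08", "2018-12"),
  stringsAsFactors = FALSE
)

today <- as.Date("2019-10-15")  # assumption: replace with Sys.Date() in real use
current_ym   <- format(today, "%Y-%m")
current_year <- as.integer(format(today, "%Y"))

obs$year  <- as.integer(substr(obs$DATE, 1, 4))
obs$month <- substr(obs$DATE, 6, 7)

cols <- c("ID", "month", "DATE", "VALUE")
this_year <- obs[obs$year == current_year, cols]
last_year <- obs[obs$year == current_year - 1L, cols]

# Keep every row of the current year; attach last year's row for the
# same ID and month when there is one.
out <- merge(this_year, last_year, by = c("ID", "month"),
             all.x = TRUE, suffixes = c("1", "2"))

# Rows in the current year-month do not get last year's values.
is_current <- out$DATE1 == current_ym
out$DATE2[is_current]  <- NA
out$VALUE2[is_current] <- NA

out <- out[order(out$ID, out$month),
           c("ID", "DATE1", "DATE2", "VALUE1", "VALUE2")]
out
```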

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
For each buyer and id, I need to keep the records where consecutive rows are at most 5 days apart and their amounts sum to more than 5000. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', 5 days apart, and since the two amounts sum to more than 5000, the output should contain these records. The same buyer has further transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', each 3 days apart, but the output should not contain them because their sum is below 5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%Y')
df$Amount<-as.numeric(df$Amount)
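The case of the year specifier matters for dates like these. As a small sketch: %Y reads a four-digit year, while %y reads only two digits and ignores the rest of the string.

```r
raw <- "11/29/2018"

four_digit <- as.Date(raw, format = "%m/%d/%Y")  # four-digit year
two_digit  <- as.Date(raw, format = "%m/%d/%y")  # reads only "20" as the year

format(four_digit)  # "2018-11-29"
format(two_digit)   # "2020-11-29"
```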
Here I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy gets ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds the Amount of each row to its previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to 0 if the previous row's Amount doesn't exist. The next step is just enforcing your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id), then use the original id and new_id together to sum Amount within each group. We can then filter on the criterion that the summed Amount is greater than 5000, and use a join or semi_join to keep only the matching rows.
ids is a dataset that holds the total Amount by id and new_id, filtered for dollar > 5000. This gives you the id and new_id combinations that meet your criteria.
library(tidyverse)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
# tf1 flags the first row of each Buyer/id group; tf2 flags a gap of more than 5 days
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = "")),  # id is character, so the default is too
tf2 = (is.na(diffs) | diffs > 5))
# A new run starts wherever either flag is TRUE
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA TRUE TRUE 1
2 9 2018-11-29 John 1158 0 FALSE FALSE 1
3 9 2018-11-29 John 596 0 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids, by = c("id", "new_id"))
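The run-based idea (start a new run at the first row of an id or after a gap of more than 5 days, then total each run) can also be sketched in base R. The sketch uses a reduced copy of the data, and the names tx, gap, run and run_total are illustrative assumptions. It totals a whole run of closely spaced transactions rather than only pairs of adjacent rows.

```r
tx <- data.frame(
  id     = c("4", "4", "4", "4", "4", "20", "20"),
  Date   = c("6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018", "8/23/2018",
             "12/25/2018", "12/25/2018"),
  Amount = c(1849, 4193, 4256, 65, 100, 313, 99),
  stringsAsFactors = FALSE
)
tx$Date <- as.Date(tx$Date, format = "%m/%d/%Y")
tx <- tx[order(tx$id, tx$Date), ]

# Days since the previous transaction of the same id (NA on the first row)
gap <- ave(as.numeric(tx$Date), tx$id, FUN = function(d) c(NA, diff(d)))

# A new run starts on the first row of an id or after a gap of more than 5 days
tx$run <- cumsum(is.na(gap) | gap > 5)
tx$run_total <- ave(tx$Amount, tx$run, FUN = sum)

kept <- tx[tx$run_total > 5000, ]
kept
```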

Resources