Conditionally rolling up dates in R

Hi, I am trying to work out a way to conditionally roll up dates in R.
Suppose I have the table below and I want to roll dates up using one of the Flag variables. A flag can be either 1 or 2, and it dictates which subsequent rows can be linked to the current row.
library(dplyr)
DateStart <- c("2018-01-01", "2018-01-04", "2018-01-05", "2018-01-09", "2018-01-12", "2018-01-20")
DateEnd <- c("2018-01-05", "2018-01-09", "2018-01-12", "2018-01-15", "2018-01-20", "2018-01-21")
IndexRecord <- c(1, NA, NA, NA, NA, NA)
Flag1 <- c(1,1,1,1,1,1)
Flag2 <- c(2,1,1,1,1,1)
Flag3 <- c(1,1,2,1,2,1)
df1 <- data.frame(DateStart = as.Date(DateStart),
                  DateEnd = as.Date(DateEnd),
                  IndexRecord = IndexRecord,
                  Flag1 = Flag1,
                  Flag2 = Flag2,
                  Flag3 = Flag3) %>%
  arrange(DateStart)
df1
| | DateStart | DateEnd | IndexRecord | Flag1 | Flag2 | Flag3 |
|---|------------|------------|-------------|-------|-------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 | 2 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 | 1 | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 | 1 | 2 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 | 1 | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 | 1 | 2 |
| 6 | 2018-01-20 | 2018-01-21 | NA | 1 | 1 | 1 |
A Flag with value of 1 for the current period means that for a subsequent row to have a valid link, the subsequent row must have DateStart occurring before DateEnd of the current row. Using Flag1 as the column of interest, the result would look like:
| | DateStart | DateEnd | IndexRecord | Flag1 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 |
A Flag with value of 2 for the current period means that for a subsequent row to have a valid link, the subsequent row must have DateStart occurring on the DateEnd of the current row. Using Flag2 as the column of interest, the result would look like:
| | DateStart | DateEnd | IndexRecord | Flag2 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 2 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 |
One of the more complex cases could occur with patterns such as seen in Flag3 with the desired results:
| | DateStart | DateEnd | IndexRecord | Flag3 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 2 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 2 |
| 6 | 2018-01-20 | 2018-01-21 | NA | 1 |
Cheers,
J
Edits:
Since this was perhaps not clear let me clarify step by step.
Flag1.
Row 1 ends on 2018-01-05 and its Flag1 is 1.
This means that for a subsequent row to be linked to this episode, that row's DateStart must occur before the DateEnd of Row 1. Row 2 satisfies this condition, since 2018-01-04 occurs before 2018-01-05, and is therefore a valid link.
Looking at the remaining rows, each one starts before the previous row ends, except Row 6. Since Flag1 of Row 5 is 1 and Row 6 starts on (not before) the DateEnd of Row 5, Row 6 cannot be linked, which is why the table stops at Row 5.
Total elapsed time is from 2018-01-01 to 2018-01-20.
Flag2.
Row 1 has Flag2 equal to 2, which means that only a subsequent row with a DateStart of 2018-01-05 can be linked to it. Therefore Row 2 is dropped. Moving down, Row 3 has a DateStart of 2018-01-05 and can therefore be linked to Row 1.
The remaining rows follow the same pattern as for Flag1, since Flag1 and Flag2 are identical from this point onwards.
As with Flag1, total elapsed time is from 2018-01-01 to 2018-01-20.
The elapsed time is the same as for Flag1; only the path taken differs.
Flag3.
Flag3 for Row 1 and Row 2 are the same as in Flag1 which means at this point Row 1, Row 2, and Row 3 are kept as in the Flag1 example.
Flag3 for Row 3 however is 2. Since the DateEnd of Row 3 is 2018-01-12, only Row 5 can be linked and Row 4 is removed.
Since Row 5 has Flag3 of 2 and a DateEnd of 2018-01-20, Row 6 can also be linked to this set.
The total elapsed time for this set is from 2018-01-01 to 2018-01-21.
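The linking rules walked through above can be sketched as a simple forward scan in base R. This is only one reading of the rules, not a confirmed solution: the helper name roll_up is hypothetical, and it assumes the first row is the index record, the rows are already sorted by DateStart, and a row that fails the test is skipped while the scan continues from the last linked row.

```r
DateStart <- as.Date(c("2018-01-01", "2018-01-04", "2018-01-05",
                       "2018-01-09", "2018-01-12", "2018-01-20"))
DateEnd   <- as.Date(c("2018-01-05", "2018-01-09", "2018-01-12",
                       "2018-01-15", "2018-01-20", "2018-01-21"))
df1 <- data.frame(DateStart, DateEnd,
                  Flag1 = c(1, 1, 1, 1, 1, 1),
                  Flag2 = c(2, 1, 1, 1, 1, 1),
                  Flag3 = c(1, 1, 2, 1, 2, 1))

# Hypothetical helper: keeps the rows that can be chained from row 1.
roll_up <- function(df, flag_col) {
  kept <- 1L       # assumption: row 1 is the index record
  current <- 1L    # last row that was linked
  for (j in seq_len(nrow(df))[-1]) {
    end_current <- df$DateEnd[current]
    linked <- if (df[[flag_col]][current] == 1) {
      df$DateStart[j] < end_current     # flag 1: must start before the end
    } else {
      df$DateStart[j] == end_current    # flag 2: must start on the end date
    }
    if (linked) {
      kept <- c(kept, j)
      current <- j
    }
  }
  df[kept, c("DateStart", "DateEnd", flag_col)]
}

roll_up(df1, "Flag3")
```

The total elapsed time for a set is then the span from the first kept DateStart to the last kept DateEnd.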

Related

R Studio: How to perform separate data wrangling procedures for different values of a variable into a list of individual dataframes?

I have a dataframe that looks like this:
+-----------+------------+--------+------------+
| Geography | Dates | Sales | Avg_Volume |
+-----------+------------+--------+------------+
| A | 2020-01-01 | | |
+-----------+------------+--------+------------+
| A | 2020-01-02 | | |
+-----------+------------+--------+------------+
| A | 2020-01-03 | | |
+-----------+------------+--------+------------+
| A | 2020-01-04 | | |
+-----------+------------+--------+------------+
| A | 2020-01-05 | | |
+-----------+------------+--------+------------+
| B | 2020-01-01 | | |
+-----------+------------+--------+------------+
| B | 2020-01-02 | | |
+-----------+------------+--------+------------+
| B | 2020-01-03 | | |
+-----------+------------+--------+------------+
| B | 2020-01-04 | | |
+-----------+------------+--------+------------+
| B | 2020-01-05 | | |
+-----------+------------+--------+------------+
| C | 2020-01-01 | | |
+-----------+------------+--------+------------+
| C | 2020-01-02 | | |
+-----------+------------+--------+------------+
| C | 2020-01-03 | | |
+-----------+------------+--------+------------+
| C | 2020-01-04 | | |
+-----------+------------+--------+------------+
| C | 2020-01-05 | | |
+-----------+------------+--------+------------+
| D | 2020-01-01 | | |
+-----------+------------+--------+------------+
| D | 2020-01-02 | | |
+-----------+------------+--------+------------+
| D | 2020-01-03 | | |
+-----------+------------+--------+------------+
| D | 2020-01-04 | | |
+-----------+------------+--------+------------+
| D | 2020-01-05 | | |
+-----------+------------+--------+------------+
I would like to have 3 dataframes, one each for cities B, C and D, that look like this (I need A_Sales to always be present):
+------------+----------+---------+--------------+
| Dates | A_Sales | B_Sales | B_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | C_Sales | C_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | D_Sales | D_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
Currently this is what I have:
data_A <- data %>%
  filter(Geography == "A") %>%
  rename("A_Sales" = Sales) %>%
  select(Dates, A_Sales)
data_B <- data %>%
  filter(Geography == 'B') %>%
  rename("B_Sales" = Sales) %>%
  rename("B_Avg_Volume" = Avg_Volume) %>%
  select(Dates, B_Sales, B_Avg_Volume)
data_a_n_b <- data_A %>%
  left_join(data_B, by = 'Dates')
This is very redundant and inefficient, because I would have to change Geography == '...' to "B", "C", "D" and so on every time and re-run. My real data has ~50 cities, so it is unrealistic for me to do this for each city individually.
What is an elegant way to batch this process?
I am imagining the end result to be a list of dataframes for cities B, C, D and so on, with each dataframe named after its city, so that I can easily access each one. For example, calling data_result$C (or something like that) would give me the dataframe for City C. Any other output format is also welcome, as long as accessing an individual dataframe is easy.
Thanks so much for your help!
Using purrr this could be achieved like so:
Split your df by Geography
Loop over the list (except for region "A") and join the dfs to the one for region A
Do some renaming
set.seed(42)
dat <- data.frame(
Geography = rep(LETTERS[1:4], each = 4),
Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
Sales = runif(4 * 4),
Avg_Volume = runif(4 * 4)
)
library(purrr)
library(dplyr)
library(stringr)
dat_list <- dat %>%
split(.$Geography) %>%
map(select, -Geography)
# A's data is the left table, so its columns take the first suffix ("_A");
# the other city's columns take the second suffix ("_B", "_C", ...)
imap(dat_list[setdiff(names(dat_list), "A")], function(x, y) {
  left_join(dat_list[["A"]], x, by = "Dates", suffix = c("_A", paste0("_", y))) %>%
    rename_with(~ str_replace(.x, "(Sales|Avg_Volume)_(.*)", "\\2_\\1"), -Dates) %>%
    select(-A_Avg_Volume)
})
# Result: a named list with elements $B, $C and $D.
# $B has columns Dates, A_Sales, B_Sales, B_Avg_Volume
# $C has columns Dates, A_Sales, C_Sales, C_Avg_Volume
# $D has columns Dates, A_Sales, D_Sales, D_Avg_Volume
Created on 2021-02-05 by the reprex package (v1.0.0)
I took Stefan's setup data frame and added another way to do it. The steps are:
Get a vector of city names (excluding A). The way I wrote it assumes A is first, but you could also use discard() to remove "A" from the city vector.
Use map with filter to get a list of data frames that each contain A and one city from cities, then set_names so that each list element is accessible by its city name.
Take each data frame in the list and pivot_wider, then select everything but Avg_Volume for A.
#Set up a sample data frame
library(dplyr)
set.seed(42)
dat <- tibble(
Geography = rep(LETTERS[1:4], each = 4),
Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
Sales = runif(4 * 4),
Avg_Volume = runif(4 * 4)
)
#Code to wrangle into list of filtered, wide format data frames
library(dplyr)
library(tidyr)
library(purrr)
cities <- unique(dat$Geography)[-1]
dat_list <- map(cities, ~ filter(dat, Geography == "A" | Geography == .x)) %>% set_names(cities)
dat_list_wider <- map(dat_list,
                      ~ pivot_wider(.x, id_cols = "Dates",
                                    names_from = "Geography",
                                    values_from = c("Sales", "Avg_Volume")) %>%
                        select(-Avg_Volume_A))
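For comparison, here is a minimal base R sketch of the same idea (split by city, then join each city to A) that needs no extra packages. It is a swapped-in alternative, not the approach of either answer above; the object names by_city, a_sales and result are made up for the illustration, and the data is the same simulated set.

```r
set.seed(42)
dat <- data.frame(
  Geography = rep(LETTERS[1:4], each = 4),
  Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
  Sales = runif(4 * 4),
  Avg_Volume = runif(4 * 4)
)

# One data frame per city, without the Geography column
by_city <- split(dat[, c("Dates", "Sales", "Avg_Volume")], dat$Geography)

# A contributes only its Sales column
a_sales <- setNames(by_city[["A"]][, c("Dates", "Sales")], c("Dates", "A_Sales"))

# Build a named list: one merged data frame per non-A city
others <- setdiff(names(by_city), "A")
result <- lapply(setNames(others, others), function(city) {
  city_df <- setNames(by_city[[city]],
                      c("Dates", paste0(city, "_Sales"), paste0(city, "_Avg_Volume")))
  merge(a_sales, city_df, by = "Dates", all.x = TRUE)
})

names(result$C)
```

Each element is then reachable by city name, for example result$C or result[["D"]].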

Swap results of a value from last years value in same month if the month-year combination is not equal to month-year of system

I've a table as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-09 |
| 1 | 12 | 2018-09 |
| 2 | 13 | 2019-10 |
| 2 | 14 | 2018-10 |
| 3 | 67 | 2019-01 |
| 3 | 78 | 2018-01 |
+----+-------+---------+
I want to swap the VALUE column for all IDs where DATE != the year-month of the system date (that is, use last year's value for the same month).
If DATE == the year-month of the system date, just keep this year's value.
The resulting table I need is as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 12 | 2019-09 |
| 2 | 13 | 2019-10 |
| 3 | 78 | 2019-01 |
+----+-------+---------+
As Jon and Maurits noticed, your example is unclear: you give no line with what is a wrong format to you, and you mention "current year" but do not describe the expected output for the next year for instance.
Here is an attempt of code to actually answer your question:
library(dplyr)
x = read.table(text = "
ID VALUE DATE
1 10 2019-09
1 12 2018-09
1 12 2018-09-04
1 12 2018-99
2 13 2019-10
2 14 2018-10
3 67 2019-01
3 78 2018-01
", header=T)
x %>%
mutate(DATE = paste0(DATE, "-01") %>% as.Date("%Y-%m-%d")) %>%
group_by(ID) %>%
filter(DATE==max(DATE, na.rm=T))
I inserted two lines with a "wrong" format (according to me) and treated "current year" as the latest date found in the column for each ID.
These may be wrong assumptions, but I'd need more information to answer better.
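The filter above keeps the latest row per ID with its own value. If the swap described in the question is taken literally, a base R sketch could look like the following. It assumes exactly the question's data (two rows per ID, same month in consecutive years), treats the system year-month as the fixed string "2019-10" for reproducibility, and uses a hypothetical helper named swap_one.

```r
x <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3),
  VALUE = c(10, 12, 13, 14, 67, 78),
  DATE  = c("2019-09", "2018-09", "2019-10", "2018-10", "2019-01", "2018-01"),
  stringsAsFactors = FALSE
)

# Assumption: fixed for the example; format(Sys.Date(), "%Y-%m") in real use
current_ym <- "2019-10"

# Hypothetical helper: keep the latest row of one ID, but take the older
# row's VALUE unless the latest DATE is the current year-month
swap_one <- function(g, current_ym) {
  g <- g[order(g$DATE, decreasing = TRUE), ]  # "YYYY-MM" strings sort chronologically
  latest <- g[1, ]
  if (latest$DATE != current_ym && nrow(g) > 1) {
    latest$VALUE <- g$VALUE[2]
  }
  latest
}

result <- do.call(rbind, lapply(split(x, x$ID), swap_one, current_ym = current_ym))
rownames(result) <- NULL
result
```

With the sample data this reproduces the three rows of the desired table in the question.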

Split a row into columns with conditions in R

I've a dataframe as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-08 |
| 2 | 12 | 2018-05 |
| 3 | 45 | 2019-03 |
| 3 | 33 | 2018-03 |
| 1 | 5 | 2018-08 |
| 2 | 98 | 2019-05 |
| 4 | 67 | 2019-10 |
| 4 | 34 | 2018-10 |
| 1 | 55 | 2018-07 |
| 2 | 76 | 2019-08 |
| 2 | 56 | 2018-12 |
+----+-------+---------+
What I'm trying to do here is to split the value and date into value1 and value2 and date1 and date2, based on the current year (the year of the system date) and the year before.
But the condition is: if the year-month combination in DATE of the main table matches that of the current system date, then do not consider last year's date.
Also disregard all values and dates from before the year of the system date that have no matching month this year.
The resulting output would be as under.
In the result, IDs 1, 2 and 3 had corresponding values for the same month in this year and last year, so we split them into 2 different columns.
We didn't consider last year's result for ID 4, as its month this year matches the year-month combination of the system date,
and we also disregard all values from last year that don't have a corresponding month match this year (ID 1 for 2018-07 and ID 2 for 2018-12 in this example).
+----+---------+---------+--------+--------+
| ID | DATE1 | DATE2 | VALUE1 | VALUE2 |
+----+---------+---------+--------+--------+
| 1 | 2019-08 | 2018-08 | 10 | 5 |
| 2 | 2019-05 | 2018-05 | 98 | 12 |
| 3 | 2019-03 | 2018-03 | 45 | 33 |
| 4 | 2019-10 | NA | 67 | NA |
| 2 | 2019-08 | NA | 76 | NA |
+----+---------+---------+--------+--------+
I think you could get everything in the right format first:
df <- data.frame(ID = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
DATE = c("2019-08", "2018-05", "2019-03","2018-03",
"2018-08","2019-05", "2019-10", "2018-10",
"2018-07", "2019-08", "2018-12"))
library(tidyverse)
df <- df %>% mutate(
year = str_split_fixed(DATE, "-", 2)[,1],
month = str_split_fixed(DATE, "-", 2)[,2]) %>%
pivot_wider(
names_from = year,
values_from = c(VALUE, DATE))
Then, you could filter and remove those values that you do not need according to your logic. I may not fully understand your system time here, but just assume it is the string "2019-10". It could be something like this:
df %>%
filter(!is.na(VALUE_2019)) %>%
mutate(
VALUE_2018 = ifelse(DATE_2019 == "2019-10", NA, VALUE_2018),
DATE_2018 = ifelse(DATE_2019 == "2019-10", NA, as.character(DATE_2018)))
# A tibble: 5 x 6
ID month VALUE_2019 VALUE_2018 DATE_2019 DATE_2018
<dbl> <chr> <dbl> <dbl> <fct> <chr>
1 1 08 10 5 2019-08 2018-08
2 2 05 98 12 2019-05 2018-05
3 3 03 45 33 2019-03 2018-03
4 4 10 67 NA 2019-10 NA
5 2 08 76 NA 2019-08 NA
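The answer above hard-codes "2019-10" and the years 2018 and 2019. As a small sketch, those pieces could be derived from a date instead. The variable names are made up for the illustration, and a fixed date stands in for Sys.Date() so that the result is reproducible.

```r
# Hypothetical fixed date for the example; use Sys.Date() in practice
today <- as.Date("2019-10-15")

current_ym    <- format(today, "%Y-%m")   # year-month string to compare DATE against
current_year  <- format(today, "%Y")
previous_year <- as.character(as.integer(current_year) - 1L)

# Column names that pivot_wider() produces for these two years
value_now  <- paste0("VALUE_", current_year)
value_prev <- paste0("VALUE_", previous_year)
date_now   <- paste0("DATE_", current_year)
date_prev  <- paste0("DATE_", previous_year)

c(current_ym, value_now, value_prev)
```

These strings can then replace the literals in the filter and mutate calls, for example by indexing columns with df[[value_now]].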

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
library(dplyr)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diffs | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | NA | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | NA | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | NA | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
For each buyer and id, I need to retain the records where consecutive rows are at most 5 days apart and the sum of their amounts is greater than 5000. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', 5 days apart, and since the sum of these two amounts is > 5000, the output would have these records. The same Buyer 'Sandy' with id '4' has further transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', 3 days apart each, but the output will not have these records because their sum is < 5000.
The final output would look like:
| id | Date | Buyer | diffs | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | NA | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date (note the four-digit year format %Y) and Amount from character to numeric:
df$Date <- as.Date(df$Date, '%m/%d/%Y')
df$Amount <- as.numeric(df$Amount)
Now I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy is going to have ranks 1 through 5 for the 5 different days on which she has shopped). Then I define a new variable called ConsecutiveSum, which adds the Amount of each row to its previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to 0 if the previous row's Amount doesn't exist. The next step just enforces your conditions:
df %>%
  group_by(id) %>%
  arrange(Date) %>%
  mutate(rank = dense_rank(Date)) %>%
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  filter((diffs <= 5 & ConsecutiveSum >= 5000) | (ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000))
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <date> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 2018-06-15 Sandy 1849 NA 1 0
# 2 4 2018-06-20 Sandy 4193 5 2 6042
I would use a combination of techniques available in the tidyverse:
First create a grouping variable (new_id) that starts a new group whenever the id changes or the gap to the previous row is more than 5 days. Then use the original id and new_id together to sum Amount within each group, and keep the groups whose total is > 5000. Finally, join or semi_join back to the data to keep only the rows in those groups.
ids is a dataset that holds the total Amount (dollar) by id and new_id, filtered to dollar > 5000. This gives you the id and new_id combinations that meet your criteria.
library(tidyverse)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                        "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = FALSE) %>%
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
# id is character, so the lag default must be a character value as well
df1 <- df %>% mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
                     tf1 = (id != lag(id, default = "")),
                     tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
df1
   id    Date       Buyer Amount diffs tf1   tf2   new_id
   <chr> <date>     <chr>  <dbl> <dbl> <lgl> <lgl>  <int>
 1 9     2018-11-29 John     959    NA TRUE  TRUE       1
 2 9     2018-11-29 John    1158     0 FALSE FALSE      1
 3 9     2018-11-29 John     596     0 FALSE FALSE      1
 4 5     2019-02-13 Maria    922    NA TRUE  TRUE       2
 5 5     2019-02-13 Maria    922     0 FALSE FALSE      2
 6 4     2018-06-15 Sandy   1849    NA TRUE  TRUE       3
 7 4     2018-06-20 Sandy   4193     5 FALSE FALSE      3
 8 4     2018-08-17 Sandy   4256    58 FALSE TRUE       4
 9 4     2018-08-20 Sandy     65     3 FALSE FALSE      4
10 4     2018-08-23 Sandy    100     3 FALSE FALSE      4
11 20    2018-12-25 Paul     313    NA TRUE  TRUE       5
12 20    2018-12-25 Paul      99     0 FALSE FALSE      5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids, by = c("id", "new_id"))
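The run-grouping idea can also be sketched in base R with cumsum and ave, which may help to see what new_id does. This is an illustrative variant, not the answer's own code: the column names run_id and run_total are made up, and it assumes the rows are already ordered by date within each Buyer and id.

```r
df <- data.frame(
  id     = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date   = as.Date(c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                     "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018",
                     "12/25/2018","12/25/2018"), format = "%m/%d/%Y"),
  Buyer  = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy",
             "Sandy","Sandy","Paul","Paul"),
  Amount = c(959, 1158, 596, 922, 922, 1849, 4193, 4256, 65, 100, 313, 99),
  stringsAsFactors = FALSE
)

# Days since the previous row within the same Buyer/id (NA on the first row)
key <- paste(df$Buyer, df$id)
gap <- ave(as.numeric(df$Date), key, FUN = function(d) c(NA, diff(d)))

# A new run starts at the first row of a Buyer/id or after a gap of more than 5 days
df$run_id <- cumsum(is.na(gap) | gap > 5)

# Total Amount per run, repeated on every row of that run
df$run_total <- ave(df$Amount, df$run_id, FUN = sum)

result <- df[df$run_total > 5000, ]
result
```

On the sample data only Sandy's run from 2018-06-15 to 2018-06-20 (total 6042) passes the threshold.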

Index by category with sorting by column in R sqldf package

How can I add an index by category in R, sorted by a column, using the sqldf package? I am looking for the equivalent of this SQL:
ROW_NUMBER() over(partition by [Category] order by [Date] desc)
Suppose we have a table:
+----------+-------+------------+
| Category | Value | Date |
+----------+-------+------------+
| apples | 3 | 2018-07-01 |
| apples | 2 | 2018-07-02 |
| apples | 1 | 2018-07-03 |
| bananas | 9 | 2018-07-01 |
| bananas | 8 | 2018-07-02 |
| bananas | 7 | 2018-07-03 |
+----------+-------+------------+
Desired results are:
+----------+-------+------------+-------------------+
| Category | Value | Date | Index by category |
+----------+-------+------------+-------------------+
| apples | 3 | 2018-07-01 | 3 |
| apples | 2 | 2018-07-02 | 2 |
| apples | 1 | 2018-07-03 | 1 |
| bananas | 9 | 2018-07-01 | 3 |
| bananas | 8 | 2018-07-02 | 2 |
| bananas | 7 | 2018-07-03 | 1 |
+----------+-------+------------+-------------------+
Thank you for the hints in the comments on how this can be done in lots of packages other than sqldf: Numbering rows within groups in a data frame
1) PostgreSQL This can be done with the PostgreSQL backend to sqldf:
library(RPostgreSQL)
library(sqldf)
sqldf('select *,
ROW_NUMBER() over (partition by "Category" order by "Date" desc) as seq
from "DF"
order by "Category", "Date" ')
giving:
Category Value Date seq
1 apples 3 2018-07-01 3
2 apples 2 2018-07-02 2
3 apples 1 2018-07-03 1
4 bananas 9 2018-07-01 3
5 bananas 8 2018-07-02 2
6 bananas 7 2018-07-03 1
2) SQLite To do it with the SQLite backend (which is the default backend) we need to revise the SQL statement appropriately. Be sure that RPostgreSQL is NOT loaded before doing this. We have assumed that the data is already sorted by Date within each Category based on the data shown in the question but if that were not the case it would be easy enough to extend the SQL to sort it first.
library(sqldf)
sqldf("select a.*, count(*) seq
from DF a left join DF b on a.Category = b.Category and b.rowid >= a.rowid
group by a.rowid
order by a.Category, a.Date")
Note
The input DF in reproducible form is:
Lines <- "
Category Value Date
apples 3 2018-07-01
apples 2 2018-07-02
apples 1 2018-07-03
bananas 9 2018-07-01
bananas 8 2018-07-02
bananas 7 2018-07-03
"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
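To cross-check the seq column without any database backend, the same numbering can be sketched in base R with ave. This is only a verification aid under the assumption that Date is an ISO "YYYY-MM-DD" string, not a sqldf solution.

```r
Lines <- "
Category Value Date
apples 3 2018-07-01
apples 2 2018-07-02
apples 1 2018-07-03
bananas 9 2018-07-01
bananas 8 2018-07-02
bananas 7 2018-07-03
"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

# Rank dates in descending order within each Category (most recent date gets 1)
DF$seq <- ave(as.numeric(as.Date(DF$Date)), DF$Category,
              FUN = function(d) rank(-d, ties.method = "first"))
DF
```

The resulting seq values should agree with the output of both SQL variants above.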
