I have a dataframe that looks like this:
+-----------+------------+--------+------------+
| Geography | Dates | Sales | Avg_Volume |
+-----------+------------+--------+------------+
| A | 2020-01-01 | | |
+-----------+------------+--------+------------+
| A | 2020-01-02 | | |
+-----------+------------+--------+------------+
| A | 2020-01-03 | | |
+-----------+------------+--------+------------+
| A | 2020-01-04 | | |
+-----------+------------+--------+------------+
| A | 2020-01-05 | | |
+-----------+------------+--------+------------+
| B | 2020-01-01 | | |
+-----------+------------+--------+------------+
| B | 2020-01-02 | | |
+-----------+------------+--------+------------+
| B | 2020-01-03 | | |
+-----------+------------+--------+------------+
| B | 2020-01-04 | | |
+-----------+------------+--------+------------+
| B | 2020-01-05 | | |
+-----------+------------+--------+------------+
| C | 2020-01-01 | | |
+-----------+------------+--------+------------+
| C | 2020-01-02 | | |
+-----------+------------+--------+------------+
| C | 2020-01-03 | | |
+-----------+------------+--------+------------+
| C | 2020-01-04 | | |
+-----------+------------+--------+------------+
| C | 2020-01-05 | | |
+-----------+------------+--------+------------+
| D | 2020-01-01 | | |
+-----------+------------+--------+------------+
| D | 2020-01-02 | | |
+-----------+------------+--------+------------+
| D | 2020-01-03 | | |
+-----------+------------+--------+------------+
| D | 2020-01-04 | | |
+-----------+------------+--------+------------+
| D | 2020-01-05 | | |
+-----------+------------+--------+------------+
I would like to have 3 dataframes, one each for cities B, C and D, that look like this (I need A_Sales to always be present):
+------------+----------+---------+--------------+
| Dates | A_Sales | B_Sales | B_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | C_Sales | C_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | D_Sales | D_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
Currently this is what I have:
data_A <- data %>%
filter(Geography == "A") %>%
rename("A_Sales" = Sales) %>%
select(Dates, A_Sales)
data_B <- data %>%
filter(Geography == 'B') %>%
rename("B_Sales" = Sales)%>%
rename("B_Avg_Volume" = Avg_Volume)%>%
select(Dates, B_Sales, B_Avg_Volume)
data_a_n_b <- data_A %>%
left_join(data_B, by = 'Dates')
This is very redundant and inefficient, because I would have to change Geography == '...' to B, C, D and so on every time and re-run. My real data has ~50 cities, so it is unrealistic for me to repeat this process for each city individually.
What is an elegant way to batch this process?
I imagine the end result being a list of dataframes for cities B, C, D and so on, with each dataframe named after its city. This way I can easily access each individual dataframe. For example, calling data_result$C (or something like that) would give me the dataframe for city C. Any other output format is also welcome, as long as accessing an individual dataframe is easy.
Thanks so much for your help!
Using purrr this could be achieved like so:
Split your df by Geography
Loop over the list (except for region "A") and join the dfs to the one for region A
Do some renaming
set.seed(42)
dat <- data.frame(
Geography = rep(LETTERS[1:4], each = 4),
Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
Sales = runif(4 * 4),
Avg_Volume = runif(4 * 4)
)
library(purrr)
library(dplyr)
library(stringr)
dat_list <- dat %>%
split(.$Geography) %>%
map(select, -Geography)
imap(dat_list[setdiff(names(dat_list), "A")], function(x, y) {
  # The left table is A, so the first suffix ("_A") applies to A's columns
  # and the second suffix ("_<city>") applies to the city's columns.
  left_join(dat_list[["A"]], x, by = "Dates", suffix = c("_A", paste0("_", y))) %>%
    # Turn e.g. Sales_B into B_Sales
    rename_with(~ str_replace(.x, "(Sales|Avg_Volume)_(.*)", "\\2_\\1"), -Dates) %>%
    select(-A_Avg_Volume)
})
# Returns a named list with elements $B, $C and $D. Each element has the columns
# Dates, A_Sales, <city>_Sales, <city>_Avg_Volume (for example B_Sales, B_Avg_Volume).
Created on 2021-02-05 by the reprex package (v1.0.0)
I took Stefan's setup dataframe and added another way to do it. The steps are:
Get a vector of the city names (excluding A). The way I wrote it assumes A is first, but you could also use discard() to remove "A" from the city vector.
Use map with filter to get a list of data frames that each contain A and one city from cities. Use set_names to make sure each list element is accessible by its city name.
Take each data frame in the list and pivot_wider it, then select everything but Avg_Volume for A.
#Set up a sample data frame
library(dplyr)
set.seed(42)
dat <- tibble(
Geography = rep(LETTERS[1:4], each = 4),
Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
Sales = runif(4 * 4),
Avg_Volume = runif(4 * 4)
)
#Code to wrangle into list of filtered, wide format data frames
library(dplyr)
library(tidyr)
library(purrr)
cities <- unique(dat$Geography)[-1]
dat_list <- map(cities, ~ filter(dat, Geography == "A" | Geography == .x)) %>% set_names(cities)
dat_list_wider <- map(dat_list,
~pivot_wider(.x, id_cols = "Dates",
names_from = "Geography",
values_from = c("Sales","Avg_Volume")) %>%
select(-Avg_Volume_A))
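The pivot_wider approach above yields column names such as Sales_A and Sales_B, whereas the question asks for A_Sales and B_Sales. A minimal sketch of one way to get the requested names, assuming a tidyr version that supports the names_glue argument of pivot_wider (the object name result is just illustrative):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same sample data as above
set.seed(42)
dat <- tibble(
  Geography = rep(LETTERS[1:4], each = 4),
  Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
  Sales = runif(4 * 4),
  Avg_Volume = runif(4 * 4)
)

# All cities except the reference city A, named by themselves
cities <- set_names(setdiff(unique(dat$Geography), "A"))

result <- map(cities, function(city) {
  dat %>%
    filter(Geography %in% c("A", city)) %>%
    pivot_wider(
      id_cols = Dates,
      names_from = Geography,
      values_from = c(Sales, Avg_Volume),
      names_glue = "{Geography}_{.value}"  # e.g. A_Sales, C_Avg_Volume
    ) %>%
    select(-A_Avg_Volume)
})

names(result$C)
```

Each city's data frame is then available by name, for example result$C or result[["D"]].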
I've a dataframe as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-08 |
| 2 | 12 | 2018-05 |
| 3 | 45 | 2019-03 |
| 3 | 33 | 2018-03 |
| 1 | 5 | 2018-08 |
| 2 | 98 | 2019-05 |
| 4 | 67 | 2019-10 |
| 4 | 34 | 2018-10 |
| 1 | 55 | 2018-07 |
| 2 | 76 | 2019-08 |
| 2 | 56 | 2018-12 |
+----+-------+---------+
What I'm trying to do is split VALUE and DATE into VALUE1 and VALUE2, and DATE1 and DATE2, based on the current year (the year of the system date) and the year before.
The condition is that if the year-month combination in DATE of the main table matches that of the current system date, then do not consider last year's date.
Also disregard all values with dates that fall before the year of the system date.
The resulting output would be as shown below.
In the result, IDs 1, 2 and 3 had corresponding values for the same month in this year and last year, so we split them into 2 different columns.
We did not consider last year's result for ID 4, as its month this year matches the year-month combination of the system date.
We also disregard all values from last year that don't have a corresponding month match this year (ID 1 for 2018-07 and ID 2 for 2018-12 in this example).
+----+---------+---------+--------+--------+
| ID | DATE1 | DATE2 | VALUE1 | VALUE2 |
+----+---------+---------+--------+--------+
| 1 | 2019-08 | 2018-08 | 10 | 5 |
| 2 | 2019-05 | 2018-05 | 98 | 12 |
| 3 | 2019-03 | 2018-03 | 45 | 33 |
| 4 | 2019-10 | NA | 67 | NA |
| 2 | 2019-08 | NA | 76 | NA |
+----+---------+---------+--------+--------+
I think you could get everything in the right format first:
df <- data.frame(ID = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
DATE = c("2019-08", "2018-05", "2019-03","2018-03",
"2018-08","2019-05", "2019-10", "2018-10",
"2018-07", "2019-08", "2018-12"))
library(tidyverse)
df <- df %>% mutate(
year = str_split_fixed(DATE, "-", 2)[,1],
month = str_split_fixed(DATE, "-", 2)[,2]) %>%
pivot_wider(
names_from = year,
values_from = c(VALUE, DATE))
Then, you could filter and remove those values that you do not need according to your logic. I may not fully understand your system time here, but just assume it is the string "2019-10". It could be something like this:
df %>%
filter(!is.na(VALUE_2019)) %>%
mutate(
VALUE_2018 = ifelse(DATE_2019 == "2019-10", NA, VALUE_2018),
DATE_2018 = ifelse(DATE_2019 == "2019-10", NA, as.character(DATE_2018)))
# A tibble: 5 x 6
ID month VALUE_2019 VALUE_2018 DATE_2019 DATE_2018
<dbl> <chr> <dbl> <dbl> <fct> <chr>
1 1 08 10 5 2019-08 2018-08
2 2 05 98 12 2019-05 2018-05
3 3 03 45 33 2019-03 2018-03
4 4 10 67 NA 2019-10 NA
5 2 08 76 NA 2019-08 NA
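The answer above hard-codes "2019-10" and the year-specific column names. A hedged sketch of how the same logic could be driven by the system date instead: the variable today is an assumption introduced here for reproducibility (in real use it would be Sys.Date()), and the columns are named DATE1/DATE2 and VALUE1/VALUE2 to match the desired output.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
  VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
  DATE = c("2019-08", "2018-05", "2019-03", "2018-03",
           "2018-08", "2019-05", "2019-10", "2018-10",
           "2018-07", "2019-08", "2018-12"),
  stringsAsFactors = FALSE
)

today <- as.Date("2019-10-15")   # assumption: replace with Sys.Date() in real use
cur_year <- as.integer(format(today, "%Y"))
cur_ym <- format(today, "%Y-%m")

res <- df %>%
  mutate(year = as.integer(substr(DATE, 1, 4)),
         month = substr(DATE, 6, 7)) %>%
  # keep only the current year and the year before
  filter(year %in% c(cur_year, cur_year - 1)) %>%
  # slot "1" = current year, slot "2" = previous year
  mutate(slot = if_else(year == cur_year, "1", "2")) %>%
  select(-year) %>%
  pivot_wider(names_from = slot, values_from = c(DATE, VALUE), names_sep = "") %>%
  # drop months that have no entry in the current year
  filter(!is.na(DATE1)) %>%
  # blank out last year's data when the current-year month is the current month
  mutate(VALUE2 = if_else(DATE1 == cur_ym, NA_real_, VALUE2),
         DATE2 = if_else(DATE1 == cur_ym, NA_character_, DATE2)) %>%
  select(ID, DATE1, DATE2, VALUE1, VALUE2)

res
```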
This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain those records where, for each buyer and id, the sum of the amounts of consecutive rows is >5000 and the difference between the two consecutive rows is <=5 days. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', within a gap of 5 days, and since the sum of these two amounts is >5000, the output would include these records. The same Buyer 'Sandy' with id '4' also has transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', within a gap of 3 days each, but the output will not include these records because the sum of these amounts is <5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%Y')
df$Amount<-as.numeric(df$Amount)
Here I group the dataset by id, arrange it by Date, and create a rank within each id (for example, Sandy gets ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds the Amount of each row to the previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to 0 if there is no previous row. The last step just enforces your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
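To make the lag step concrete, here is a small sketch on Sandy's five amounts alone (the vector names amount, prev and pair_sum are just illustrative). It shows that the pairwise sum by itself is not enough: the pair 4193 + 4256 also exceeds 5000, and it is only excluded because those two rows are 58 days apart, which is why the filter also checks diffs.

```r
library(dplyr)

# Sandy's amounts in date order
amount <- c(1849, 4193, 4256, 65, 100)

# lag() shifts the vector by one position; the first element has no predecessor
prev <- lag(amount)

# Sum of each row with its previous row, 0 where there is no previous row
pair_sum <- ifelse(is.na(prev), 0, amount + prev)
pair_sum
#> [1]    0 6042 8449 4321  165
```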
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id) that starts a new group whenever the id changes or the gap to the previous row is more than 5 days, and use the original id and new_id in combination to sum Amount within each group. Then filter on the criterion that the sum of Amount is > 5000, and use a join or semi_join to keep only the rows that belong to the matching groups.
ids is a dataset that holds the total Amount for each combination of id and new_id, filtered to dollar > 5000. This gives you the id and new_id combinations that meet your criteria.
library(tidyverse)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
# tf1: TRUE on the first row of each id (id is character, so the lag default is a string)
# tf2: TRUE when there is no previous row or the gap to it is more than 5 days
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = "")),
tf2 = (is.na(diffs) | diffs > 5))
# A new group starts whenever either flag is TRUE
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids, by = c("id", "new_id"))
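The same idea can be written as a single pipeline without the intermediate ids table. This is only a sketch under the interpretation used in this answer (rows are chained into a run while each gap is at most 5 days, and a run is kept when its total exceeds 5000); the column names gap and run are introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
           "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018",
           "12/25/2018","12/25/2018"),
  Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy",
            "Sandy","Sandy","Paul","Paul"),
  Amount = c(959,1158,596,922,922,1849,4193,4256,65,100,313,99),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(Buyer, id) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(
    gap = as.numeric(Date - lag(Date)),   # days since the previous row in the group
    run = cumsum(is.na(gap) | gap > 5)    # new run on the first row or after a gap > 5 days
  ) %>%
  group_by(Buyer, id, run) %>%
  filter(sum(Amount) > 5000) %>%          # keep whole runs whose total exceeds 5000
  ungroup()

res
```

On the sample data this keeps only Sandy's two rows from 2018-06-15 and 2018-06-20, whose amounts sum to 6042.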
How can I add an index by category in R, sorted by a column, using the sqldf package? I am looking for the equivalent of this SQL:
ROW_NUMBER() over(partition by [Category] order by [Date] desc)
Suppose we have a table:
+----------+-------+------------+
| Category | Value | Date |
+----------+-------+------------+
| apples | 3 | 2018-07-01 |
| apples | 2 | 2018-07-02 |
| apples | 1 | 2018-07-03 |
| bananas | 9 | 2018-07-01 |
| bananas | 8 | 2018-07-02 |
| bananas | 7 | 2018-07-03 |
+----------+-------+------------+
Desired results are:
+----------+-------+------------+-------------------+
| Category | Value | Date | Index by category |
+----------+-------+------------+-------------------+
| apples | 3 | 2018-07-01 | 3 |
| apples | 2 | 2018-07-02 | 2 |
| apples | 1 | 2018-07-03 | 1 |
| bananas | 9 | 2018-07-01 | 3 |
| bananas | 8 | 2018-07-02 | 2 |
| bananas | 7 | 2018-07-03 | 1 |
+----------+-------+------------+-------------------+
Thank you for the hints in the comments on how this can be done with many packages other than sqldf: Numbering rows within groups in a data frame
1) PostgreSQL This can be done with the PostgreSQL backend to sqldf:
library(RPostgreSQL)
library(sqldf)
sqldf('select *,
ROW_NUMBER() over (partition by "Category" order by "Date" desc) as seq
from "DF"
order by "Category", "Date" ')
giving:
Category Value Date seq
1 apples 3 2018-07-01 3
2 apples 2 2018-07-02 2
3 apples 1 2018-07-03 1
4 bananas 9 2018-07-01 3
5 bananas 8 2018-07-02 2
6 bananas 7 2018-07-03 1
2) SQLite To do it with the SQLite backend (which is the default backend) we need to revise the SQL statement appropriately. Be sure that RPostgreSQL is NOT loaded before doing this. We have assumed that the data is already sorted by Date within each Category based on the data shown in the question but if that were not the case it would be easy enough to extend the SQL to sort it first.
library(sqldf)
sqldf("select a.*, count(*) seq
from DF a left join DF b on a.Category = b.Category and b.rowid >= a.rowid
group by a.rowid
order by a.Category, a.Date")
Note
The input DF in reproducible form is:
Lines <- "
Category Value Date
apples 3 2018-07-01
apples 2 2018-07-02
apples 1 2018-07-03
bananas 9 2018-07-01
bananas 8 2018-07-02
bananas 7 2018-07-03
"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
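As a further option, newer SQLite releases (3.25 and later) support window functions, so the original ROW_NUMBER() query may also run on the default SQLite backend. The sketch below assumes that the installed RSQLite bundles such a SQLite version and that RPostgreSQL is not loaded; if the bundled SQLite is older, the query fails with a syntax error and the join-based version in (2) is the fallback.

```r
library(sqldf)

DF <- data.frame(
  Category = rep(c("apples", "bananas"), each = 3),
  Value = c(3, 2, 1, 9, 8, 7),
  Date = rep(c("2018-07-01", "2018-07-02", "2018-07-03"), 2),
  stringsAsFactors = FALSE
)

# Assumes the SQLite bundled with RSQLite supports window functions (SQLite >= 3.25)
res <- sqldf("select *,
                     row_number() over (partition by Category order by Date desc) as seq
              from DF
              order by Category, Date")
res
```

Unlike the join-based query, this version does not rely on the input already being sorted by Date within each Category.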