I am working on a dataset in which I need to calculate how long it takes for a retail store to replenish products after a shortage. Here is a quick view of the dataset in its simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When Net_Available_Qty becomes negative, it means there is a shortage; when it turns back to 0 or a positive qty, it means the supply has recovered. What I need to calculate is the number of days between when we first see a shortage and when it is recovered. In this case, the first shortage took 2 days to recover and the second took 3 days.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution, but this produces diffDate, which holds the date difference on the row where a negative run turns positive/zero.
sample_df %>%
  mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
         sign_lag = lag(sign, default = sign[1]), # get previous value (first row falls back to itself)
         change = ifelse(sign != sign_lag, 1, 0), # check if there's a change
         sequence = sequence(rle(as.character(sign))$lengths)) %>%
  group_by(sequence) %>%
  mutate(diffDate = as.numeric(difftime(Date, lag(Date, 1))),
         diffDate = ifelse(Net_Available_Qty < 0, NA,
                           ifelse((sign == 'pos' | sign == 'zero') & sequence == 1, diffDate, NA))) %>%
  ungroup() %>%
  select(Date, Product, Net_Available_Qty, diffDate)
Schilker had a great idea using rle. I am building on that answer with a slightly shorter version that uses cumsum:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
  mutate(
    diffDate = c(1, diff(as.Date(Date))),
    sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
    group = cumsum(c(TRUE, diff(sequence)) != 1L)
  ) %>%
  group_by(group) %>%
  mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups:   group [4]
#>   Date     Product   Net_Available_Qty diffDate sequence group n_days
#>   <fct>    <fct>                 <dbl>    <dbl>    <int> <int>  <dbl>
#> 1 2019-1-1 Product A                -2        1        1     0      2
#> 2 2019-1-2 Product A                -2        1        2     0      2
#> 3 2019-1-3 Product A                10        1        1     1      2
#> 4 2019-1-4 Product A                 8        1        2     1      2
#> 5 2019-1-5 Product A                -5        1        1     2      3
#> 6 2019-1-6 Product A                -6        1        2     2      3
#> 7 2019-1-7 Product A                -7        1        3     2      3
#> 8 2019-1-8 Product A                 0        1        1     3      1
Created on 2020-02-23 by the reprex package (v0.3.0)
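For completeness, here is a condensed sketch of the same rle/cumsum idea that returns one summary row per shortage episode instead of a column. This is my own variant, assuming dplyr >= 1.0 and that "recovered" means the first day Net_Available_Qty is back at or above zero:
library(dplyr)

sample_df %>%
  mutate(
    Date = as.Date(Date),
    shortage = Net_Available_Qty < 0,
    run = cumsum(shortage != lag(shortage, default = first(shortage)))  # id of each sign run
  ) %>%
  group_by(run) %>%
  summarise(shortage = first(shortage), start = min(Date), .groups = "drop") %>%
  mutate(days_to_recover = as.numeric(lead(start) - start)) %>%  # next run starts on the recovery day
  filter(shortage)
# run 0 recovers in 2 days and run 2 in 3 days; a shortage that never
# recovers would show NA, since lead() has no next run to point to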
I have data like this:
data <- data.frame(id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                   yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                                     2012052,2012053,2012054,2012071,2012073,2012074),
                   event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                   a = c(11,12,13,10,11,12,15,14,13,15,19,10,20))
id stands for personal id. yearmonthweek encodes year, month, and week. I want to clean the data by the following rules. First, find the ids that have at least one event; here ids 1 and 2 have events, while ids 3 and 4 do not. Second, for each id with events, pick a random row from among its event rows, and for each id without events, pick a random row from all of its rows. The number of rows in the result should therefore equal the number of ids. My expected output looks like this:
data<-data.frame(id=c(1,2,3,4),
yearmonthweek=c(2012053,2013052,2012052,2012073),
event=c(1,1,0,0),
a=c(12,12,14,10))
Since I use random sampling, the values can be different as above, but there should be 4 rows like this.
Here is an option
library(dplyr)

set.seed(2022)
data %>%
  group_by(id) %>%
  mutate(has_event = any(event == 1)) %>%
  filter(if_else(has_event, event == 1, event == 0)) %>%
  slice_sample(n = 1) %>%
  select(-has_event) %>%
  ungroup()
# # A tibble: 4 × 4
#      id yearmonthweek event     a
#   <dbl>         <dbl> <dbl> <dbl>
# 1     1       2012061     1    13
# 2     2       2013052     1    12
# 3     3       2012053     0    13
# 4     4       2012074     0    20
Explanation: group by id and flag whether the group contains at least one event. If it does, keep only the rows where event == 1; if it does not, the condition event == 0 keeps all of its rows. slice_sample() then draws one row uniformly at random per group.
Here is a dplyr way in two steps.
data <- data.frame(id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                   yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                                     2012052,2012053,2012054,2012071,2012073,2012074),
                   event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                   a = c(11,12,13,10,11,12,15,14,13,15,19,10,20))
suppressPackageStartupMessages(
library(dplyr)
)
bind_rows(
  # ids with at least one event: sample one of their event rows
  data %>%
    filter(event != 0) %>%
    group_by(id) %>%
    sample_n(size = 1),
  # ids with no events at all: sample one of all their rows
  data %>%
    group_by(id) %>%
    mutate(event = !all(event == 0)) %>%
    filter(!event) %>%
    sample_n(size = 1)
)
#> # A tibble: 4 × 4
#> # Groups: id [4]
#> id yearmonthweek event a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2012061 1 13
#> 2 2 2013052 1 12
#> 3 3 2012054 0 15
#> 4 4 2012071 0 19
Created on 2022-10-21 with reprex v2.0.2
With data.table:
library(data.table)
set.seed(1)
setorder(
  unique(
    setorder(
      setDT(data)[
        , idx := .I          # add an index column to re-sort later
      ][
        sample(nrow(data))   # randomize the row order
      ],
      -event                 # sort descending by event
    ),
    by = "id"                # keep the first (unique) row per id
  ),
  idx                        # re-sort into the original order
)[, idx := NULL][]           # remove the index column
#> id yearmonthweek event a
#> 1: 1 2012053 1 12
#> 2: 2 2013052 1 12
#> 3: 3 2012053 0 13
#> 4: 4 2012071 0 19
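For comparison, the per-id pick can also be written more compactly in data.table by sampling inside each group's j expression. This is a rough sketch of my own, not one of the answers above:
library(data.table)
set.seed(1)

setDT(data)[, {
  i <- which(event == 1)  # event rows within this id, if any
  pick <- if (length(i)) i[sample(length(i), 1L)] else sample(.N, 1L)
  .SD[pick]               # one randomly chosen row per id
}, by = id]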
I have the following data.frame with columns Id, Month, have, and want:
library(dplyr)
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-2018 1.000000000000000 1.234567901220000
1 01-Feb-2018 0.200000000000000 0.234567901233000
1 01-Mar-2018 0.030000000000000 0.034567901234400
1 01-Apr-2018 0.004000000000000 0.004567901234550
1 01-May-2018 0.000500000000000 0.000567901234566
1 01-Jun-2018 0.000060000000000 0.000067901234566
1 01-Jul-2018 0.000007000000000 0.000007901234566
1 01-Aug-2018 0.000000800000000 0.000000901234566
1 01-Sep-2018 0.000000090000000 0.000000101234566
1 01-Oct-2018 0.000000010000000 0.000000011234566
1 01-Nov-2018 0.000000001100000 0.000000001234566
1 01-Dec-2018 0.000000000120000 0.000000000134566
1 01-Jan-2019 0.000000000013000 0.000000000014566
1 01-Feb-2019 0.000000000001400 0.000000000001566
1 01-Mar-2019 0.000000000000150 0.000000000000166
1 01-Apr-2019 0.000000000000016 0.000000000000016
2 01-Jan-2018 1337.00 1338.00
2 01-Feb-2018 1.00 1.00
3 01-Jan-2018 5.000000000000000000 5.000000000000000
') %>% mutate(Month = as.Date(Month, format = '%d-%b-%Y'))
I would like to programmatically calculate the sum of elements in a 12-month forward-looking rolling window, by Month and grouped by Id, as demonstrated in the want column. If fewer than 12 months remain in the window, the missing elements should be ignored.
For bonus points, the solution would ideally also allow for missing months, such as in:
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-18 1.000000000000000 1.200000000000000
1 01-Dec-18 0.200000000000000 0.230000000000000
1 01-Jan-19 0.030000000000000 0.030000000000000
') %>% mutate(Month = as.Date(Month, format = '%d-%b-%y')) # note %y: this sample uses two-digit years
I have tried different solutions, e.g. rollapplyr() from the zoo package and some functions in the runner package, but they don't seem to give me what I need.
You can use zoo's rollapply with partial = TRUE:
library(dplyr)
dt %>%
  group_by(Id) %>%
  tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
  mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
                                 align = 'left', partial = TRUE)) -> result
result
If you have data for every month for each Id, as in the shared example, you can drop the complete step.
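To sketch the bonus case with missing months (my own illustration of the same logic): complete() inserts the absent months as NA rows, the na.rm = TRUE sum ignores them, and a final filter drops the filler rows again.
library(dplyr)

# sparse version of the bonus example: months missing in between
dt_sparse <- data.frame(
  Id    = 1,
  Month = as.Date(c("2018-01-01", "2018-12-01", "2019-01-01")),
  have  = c(1, 0.2, 0.03)
)

dt_sparse %>%
  group_by(Id) %>%
  tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
  mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
                                 align = 'left', partial = TRUE)) %>%
  filter(!is.na(have))  # drop the filler rows; result should match want (1.2, 0.23, 0.03)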
I suggest the runner package in this case. The runner() function lets you calculate rolling windows with full control over time: k is the window length, lag is the lag of the window, and idx names the index column on which the window depends.
library(runner)
dt %>%
  group_by(Id) %>%
  mutate(want2 = runner(
    .,
    f = function(x) sum(x$have),
    k = 12,     # or "12 months"
    lag = -11,  # or "-11 months"
    idx = Month
  ))
# # A tibble: 19 x 5
# # Groups: Id [3]
# Id Month have want want2
# <int> <date> <dbl> <dbl> <dbl>
# 1 1 2018-01-01 1.00e+ 0 1.23e+ 0 1.00e+ 0
# 2 1 2018-02-01 2.00e- 1 2.35e- 1 2.00e- 1
# 3 1 2018-03-01 3.00e- 2 3.46e- 2 3.00e- 2
# 4 1 2018-04-01 4.00e- 3 4.57e- 3 4.00e- 3
# 5 1 2018-05-01 5.00e- 4 5.68e- 4 5.00e- 4
# 6 1 2018-06-01 6.00e- 5 6.79e- 5 6.00e- 5
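Because the window is anchored on the idx column rather than on row positions, the same call should also cover the missing-month bonus case without a complete() step. A hedged variant that merely swaps in the period strings mentioned in the comments above:
library(runner)

dt %>%
  group_by(Id) %>%
  mutate(want2 = runner(
    .,
    f = function(x) sum(x$have),
    k = "12 months",     # window length as a calendar period
    lag = "-11 months",  # negative lag pulls in the 11 following months
    idx = Month
  ))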
I am still new to R and learning methods for conducting analysis. I have a df in which I want to count consecutive wins/losses based on column x9, which shows the gain/loss (positive or negative value) for each trade entered. I found some code that helps with assigning a sign, a lagged sign, and a change flag, but I am looking for a counter that counts consecutive wins until a loss occurs and then resets, and likewise counts consecutive losses until a win occurs. In short, I need the counter to reset whenever a run of wins or losses is interrupted. Sample code is below, along with an attached .png to illustrate my thoughts.
library(dplyr)

# Read in df
df = vroom::vroom(file = "analysis.csv")

# Filter df for specific order types
df1 = filter(df, (x3 == "s/l") | (x3 == "t/p"))

# Create an additional column to tag wins/losses in df1
index <- c("s/l", "t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
# Mutate df to review changes; attempt to count consecutive wins and losses
# and reset whenever the sign flips
df2 = df1 %>%
  mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # sign of the value
         sign_lag = lag(sign, default = sign[1]), # previous value (first row falls back to itself)
         change = ifelse(sign != sign_lag, 1, 0), # flag a sign change
         series_id = cumsum(change) + 1) %>%      # create the series id
  print()
I think you can use rle for this. By itself it doesn't immediately provide grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12-39.13,782.58,-41.04)
library(dplyr)

tibble(x9) %>%
  mutate(grp = myrleid(x9 > 0)) %>%
  group_by(grp) %>%
  mutate(row = row_number()) %>%
  ungroup()
# # A tibble: 17 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -78.2 1 15
# 16 783. 2 1
# 17 -41.0 3 1
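If what you ultimately want is the length of each win/loss streak rather than a running counter, the same grouping can feed summarise(); a small follow-on sketch reusing the grp column from above:
tibble(x9) %>%
  mutate(grp = myrleid(x9 > 0)) %>%
  group_by(grp) %>%
  summarise(win = first(x9 > 0),  # TRUE for a winning streak
            streak_length = n())  # how many trades the streak lasted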
Let's take the following sample dataset:
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
sample.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price)
Rows 1 and 4 actually give identical observations - the only issue is that row 1 says that "A" buys the asset (implying that "B" sells) and in row 4 it says that "B" sells the asset (implying that "A" buys).
I'd like code to create the following dataset:
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
transaction_number <- c(1,2,3,1,4)
duplicate <- c(1,0,0,1,0)
clean.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price,transaction_number,duplicate)
In reality of course my dataset is much, much larger so I can't hard-code.
Update: I added row 5, which is identical to row 2, including the fact that counterparty 1 and 2 are in the same order. I want the "duplicate" variable to only flag rows 1 and 4 as duplicates (since they are inverses), not rows 2 and 5.
Updated Answer:
Addressing OP's follow-up question: if the very same transaction happens twice (for instance, party B selling something to party C for $3.7K on two occasions), it should not be flagged as a duplicate; read the comments and the updated question.
library(dplyr)
sample.data %>%
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  group_by_all() %>%
  mutate(dup_dum = 1:n()) %>%
  group_by(transaction, dup_dum) %>%
  mutate(transaction_number = group_indices(),
         duplicate = +(n() != n_distinct(transaction, dup_dum))) %>%
  ungroup() %>% select(-transaction, -dup_dum)
#> # A tibble: 5 x 6
#> counterparty1 counterparty2 counterparty1_s~ price transaction_num~ duplicate
#> <fct> <fct> <fct> <dbl> <int> <int>
#> 1 A B buy 1.2 1 1
#> 2 B C sell 3.7 3 0
#> 3 B A buy 2.5 2 0
#> 4 B A sell 1.2 1 1
#> 5 B C sell 3.7 4 0
Original Answer:
Considering dupes regardless of whether they are dupes only because the counterparty roles are swapped or actual dupes (look at the edits to the question to see its first version).
library(dplyr)
sample.data %>%
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  group_by(transaction) %>%
  mutate(transaction_number = group_indices(),
         duplicate = +(n() != n_distinct(transaction))) %>%
  ungroup() %>% select(-transaction)
# # A tibble: 4 x 6
# counterparty1 counterparty2 counterparty1_side price transaction_number duplicate
# <fct> <fct> <fct> <dbl> <int> <int>
# 1 A B buy 1.2 1 1
# 2 B C sell 3.7 3 0
# 3 B A buy 2.5 2 0
# 4 B A sell 1.2 1 1
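As a side note for newer dplyr (>= 1.0): group_by_all() and group_indices() are superseded there, so an untested equivalent of the updated answer could use across() and cur_group_id() instead:
library(dplyr)

sample.data %>%
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  group_by(across(everything())) %>%           # replaces group_by_all()
  mutate(dup_dum = row_number()) %>%
  group_by(transaction, dup_dum) %>%
  mutate(transaction_number = cur_group_id(),  # replaces group_indices()
         duplicate = +(n() > 1)) %>%           # more than one row here means an inverse pair
  ungroup() %>%
  select(-transaction, -dup_dum)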
I have a dataframe:
df <- data.frame(
  Group = c('A','A','A','A','B','B','B','B'),
  Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
  TimeLine = c(1,2,3,4,1,2,3,4)
)
I want to keep only two activities per group, and only in the order in which I filter for them. For example, I am looking only for the activities EOSP and NOR, and in that order. This code:
df %>%
  group_by(Group) %>%
  filter(all(c('EOSP','NOR') %in% Activity) & Activity %in% c('EOSP','NOR'))
results in:
# A tibble: 6 x 3
# Groups: Group [2]
Group Activity TimeLine
<fct> <fct> <dbl>
1 A EOSP 1
2 A NOR 2
3 A EOSP 3
4 B NOR 1
5 B EOSP 2
6 B NOR 4
I don't want row 3, as that EOSP occurs after NOR. Similarly, for group B I don't want row 4, as that NOR occurs before EOSP. How do I achieve this?
You can use match to get the first instance of Activity == 'EOSP' and slice to remove everything before it. Once you do that, you can remove duplicates and filter on EOSP and NOR, i.e.
library(tidyverse)
df %>%
  group_by(Group) %>%
  mutate(new = match('EOSP', Activity)) %>%
  slice(new:n()) %>%
  distinct(Activity, .keep_all = TRUE) %>%
  filter(Activity %in% c('EOSP', 'NOR'))
which gives,
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine new
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2
NOTE 1: You can ungroup() and select(-new) to drop the helper column.
NOTE 2: The warning messages issued here,
Warning messages:
1: In new:4L : numerical expression has 4 elements: only the first used
2: In new:4L : numerical expression has 4 elements: only the first used
do not affect the result, since slice() only needs the first element of new and all elements within a group are the same anyway.
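If you would rather not see the warning at all, one untested tweak is to take the first element explicitly, since only one value per group is used anyway:
df %>%
  group_by(Group) %>%
  mutate(new = match('EOSP', Activity)) %>%
  slice(first(new):n()) %>%  # first() avoids the "only the first used" warning
  distinct(Activity, .keep_all = TRUE) %>%
  filter(Activity %in% c('EOSP', 'NOR'))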
Here is an option with the data.table package: join df with itself, subsetting to keep only the EOSP activity and computing the minimum TimeLine by group. Then keep only the rows with a TimeLine greater than or equal to that minimum, so that NOR is kept only when an EOSP comes before it. Finally, drop duplicated Group/Activity pairs to keep only two activities per group:
library(data.table)
setDT(df)

df[df[Activity == "EOSP", min(TimeLine), by = Group], on = "Group"][
  Activity %in% c("NOR", "EOSP") & TimeLine >= V1][
  !duplicated(paste(Group, Activity))]
# Group Activity TimeLine V1
#1: A EOSP 1 1
#2: A NOR 2 1
#3: B EOSP 2 2
#4: B NOR 4 2
Here is a dplyr idea:
df %>%
  filter(Activity %in% c('EOSP','NOR')) %>%
  group_by(Group) %>%
  mutate(tmp = which(Activity == 'EOSP' & !duplicated(Activity))) %>%
  filter(row_number() %in% c(tmp, tmp + 1))
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine tmp
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2