How to translate the Excel MINIFS function to dplyr code - r

I need to add two new columns; the Excel formulas are replicated below:
MinSell=IF(F2="SELL",0,MINIFS(C:C,B:B,B2,F:F,"SELL")-C2)
MaxSell=IF(F2="SELL",0,MAXIFS(C:C,B:B,B2,F:F,"SELL")-C2)
Column F contains transactionstatus
Column C contains Tradedate
Column B contains AccountNo
I have a df containing hundreds of columns and millions of rows. Below is a small snippet of the df containing details of just one account:
AccountNo <- c(11223344, 11223344, 11223344, 11223344)
transactionstatus <- c("BUY", "BUY", "SELL", "SELL")
Tradedate <- c("2020-01-17", "2020-01-16", "2020-01-13", "2020-01-12")
# note: cbind() coerces everything to character, which is why the answer below rebuilds df with proper types
df <- as.data.frame(cbind(AccountNo, transactionstatus, Tradedate))
Expected output:
MinSell = c(-5, -4, 0, 0)
MaxSell = c(-4, -3, 0, 0)

You can create variables containing the min and max SELL dates per account, then mutate new columns with your condition.
Setup
library(dplyr)
# Tradedate must have Date class
df <- tibble(
  AccountNo = c(11223344, 11223344, 11223344, 11223344),
  transactionstatus = c("BUY", "BUY", "SELL", "SELL"),
  Tradedate = as.Date(c("2020-01-17", "2020-01-16", "2020-01-13", "2020-01-12"))
)
df
# A tibble: 4 x 3
AccountNo transactionstatus Tradedate
<dbl> <chr> <date>
1 11223344 BUY 2020-01-17
2 11223344 BUY 2020-01-16
3 11223344 SELL 2020-01-13
4 11223344 SELL 2020-01-12
Solution
# final df
binded <- tibble()
for (account in unique(df$AccountNo)) {
  df_fltrd <- filter(df, AccountNo == account)
  mindate <- min(df_fltrd$Tradedate[df_fltrd$transactionstatus == "SELL"])
  maxdate <- max(df_fltrd$Tradedate[df_fltrd$transactionstatus == "SELL"])
  solution <- df_fltrd %>%
    mutate(minsell = if_else(transactionstatus == "SELL", 0, as.numeric(mindate - Tradedate)),
           maxsell = if_else(transactionstatus == "SELL", 0, as.numeric(maxdate - Tradedate)))
  binded <- bind_rows(binded, solution)
}
binded
# A tibble: 4 x 5
AccountNo transactionstatus Tradedate minsell maxsell
<dbl> <chr> <date> <dbl> <dbl>
1 11223344 BUY 2020-01-17 -5 -4
2 11223344 BUY 2020-01-16 -4 -3
3 11223344 SELL 2020-01-13 0 0
4 11223344 SELL 2020-01-12 0 0
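With millions of rows you probably don't want a loop over accounts at all; the same logic can be written as a single grouped mutate. A sketch, assuming Tradedate already has Date class and every account has at least one SELL row (accounts with no SELL rows would produce Inf/-Inf warnings from min()/max()):
# same logic as the loop, vectorised over accounts
df %>%
  group_by(AccountNo) %>%
  mutate(
    minsell = if_else(transactionstatus == "SELL", 0,
                      as.numeric(min(Tradedate[transactionstatus == "SELL"]) - Tradedate)),
    maxsell = if_else(transactionstatus == "SELL", 0,
                      as.numeric(max(Tradedate[transactionstatus == "SELL"]) - Tradedate))
  ) %>%
  ungroup()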

Related

Count instances of value within overlapping dates?

I have a dataframe that includes start_date and end_date for a given unit_id along with the unit's group.
in_df <- data.frame(unit_id = c(1, 2, 3),
                    start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                    end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                    group = c("pass", "fail", "pass"))
For each unit_id, I need to calculate the proportion of all units that pass within the duration (start_date to end_date) of the current unit_id.
Taking unit_id=1 as an example, I need to find all units that have start_date and/or end_date within the dates for unit 1, i.e. start_date = 2019-01-01 and end_date = 2019-02-06. Given my in_df, this would return two units, 1 and 2. One unit passes and one fails so the proportion of pass would be 0.5. desired_df shows the output I expect for this example.
desired_df <- data.frame(unit_id = c(1, 2, 3),
                         start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                         end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                         group = c("pass", "fail", "pass"),
                         pass_prop = c(0.5, 0.5, 1))
What I've tried
There are a lot of existing posts related to identifying overlapping dates. I've tried to work through some to see if I can figure this out but haven't been successful.
The following is the closest that I've gotten. It does what I want on my toy example but not on the real data (additional example data below).
library(dplyr)
library(ivs)
in_df <- data.frame(unit_id = c(1, 2, 3),
                    start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                    end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                    group = c("pass", "fail", "pass"))
desired_df <- data.frame(unit_id = c(1, 2, 3),
                         start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12")),
                         end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30")),
                         group = c("pass", "fail", "pass"),
                         pass_prop = c(0.5, 0.5, 1))
in_df <- in_df %>%
  mutate(
    start_dt = as.Date(start_date),
    end_dt = as.Date(end_date)
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )
in_df$row_n <- 1:nrow(in_df)
in_df <- in_df %>%
  group_by(group) %>%
  mutate(groupDate = iv_identify_group(range)) %>%
  group_by(groupDate, .add = TRUE)
groupCount <- in_df %>% group_by(groupDate) %>% dplyr::summarize(totalCount=n())
durationCount <- in_df %>% group_by(groupDate,group) %>% dplyr::summarize(groupCount=n())
durationCount <- dplyr::inner_join(groupCount,durationCount, by = "groupDate")
durationCount$pass_prop <- durationCount$groupCount/durationCount$totalCount
durationCount <- filter(durationCount, group == "pass")
desired_df <- dplyr::full_join(in_df,durationCount, by = "groupDate")
desired_df
The above displays exactly what I need under pass_prop. The problem with this is that iv_identify_group extends the groupDate too far when additional dates overlap as shown below.
Take unit = 1 as an example again. If I add another row (unit 4) to in_df that overlaps with unit 2 and unit 3, then the groupDate gets extended to include the ranges for units 1, 2, and 4. This happens because unit 1 overlaps with 2 and 2 overlaps with 4. I want it to stop at the overlap with unit 2, since the range of unit 1 does not overlap with unit 4. Below displays this undesired output.
in_df <- data.frame(unit_id = c(1, 2, 3, 4),
                    start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12", "2019-02-20")),
                    end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30", "2020-01-30")),
                    group = c("pass", "fail", "pass", "pass"))
# execute same code as above
Perhaps this?
library(dplyr)
in_df %>%
  fuzzyjoin::fuzzy_left_join(
    in_df, by = c("start_date" = "end_date", "end_date" = "start_date"),
    match_fun = list(`<=`, `>=`)) %>%
  group_by(unit_id = unit_id.x, start_date = start_date.x,
           end_date = end_date.x, group = group.x) %>%
  summarize(pass_prop = sum(group.y == "pass") / n(), .groups = "drop")
Result
unit_id start_date end_date group pass_prop
<dbl> <date> <date> <chr> <dbl>
1 1 2019-01-01 2019-02-06 pass 0.5
2 2 2019-02-05 2019-02-28 fail 0.5
3 3 2020-01-12 2020-01-30 pass 1
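Since dplyr 1.1.0 the same self-join can be written without fuzzyjoin, using join_by() with an overlaps() condition (closed bounds, matching the <=/>= logic above). A sketch under that version assumption:
library(dplyr) # >= 1.1.0 for join_by(overlaps()) and summarise(.by =)
in_df %>%
  inner_join(in_df,
             by = join_by(overlaps(start_date, end_date, start_date, end_date)),
             suffix = c("", ".other")) %>%
  summarise(pass_prop = mean(group.other == "pass"),
            .by = c(unit_id, start_date, end_date, group))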
I think ivs can help you here, but you might be looking for iv_locate_overlaps():
library(ivs)
library(tidyverse)
# Starting with the more complex example with the 4th row
in_df <- tibble(unit_id = c(1, 2, 3, 4),
                start_date = as.Date(c("2019-01-01", "2019-02-05", "2020-01-12", "2019-02-20")),
                end_date = as.Date(c("2019-02-06", "2019-02-28", "2020-01-30", "2020-01-30")),
                group = c("pass", "fail", "pass", "pass"))
in_df <- in_df %>%
  mutate(range = iv(start_date, end_date), .keep = "unused")
in_df
#> # A tibble: 4 × 3
#> unit_id group range
#> <dbl> <chr> <iv<date>>
#> 1 1 pass [2019-01-01, 2019-02-06)
#> 2 2 fail [2019-02-05, 2019-02-28)
#> 3 3 pass [2020-01-12, 2020-01-30)
#> 4 4 pass [2019-02-20, 2020-01-30)
# "find all units that have `start_date` and/or `end_date` within the dates for unit i"
# So you are looking for "any" kind of overlap.
# `iv_locate_overlaps()` does: "For each `needle`, find every location in `haystack`
# where that `needle` has ANY overlap at all"
locs <- iv_locate_overlaps(
  needles = in_df$range,
  haystack = in_df$range,
  type = "any"
)
# Note `needle` 1 overlaps `haystack` locations 1 and 2 (which is what you said
# you want for unit 1)
locs
#> needles haystack
#> 1 1 1
#> 2 1 2
#> 3 2 1
#> 4 2 2
#> 5 2 4
#> 6 3 3
#> 7 3 4
#> 8 4 2
#> 9 4 3
#> 10 4 4
# Slice `in_df` appropriately, keeping relevant columns needed to answer the question
needles <- in_df[locs$needles, c("unit_id", "range")]
haystack <- in_df[locs$haystack, c("group", "range")]
haystack <- rename(haystack, overlaps = range)
expanded_df <- bind_cols(needles, haystack)
expanded_df
#> # A tibble: 10 × 4
#> unit_id range group overlaps
#> <dbl> <iv<date>> <chr> <iv<date>>
#> 1 1 [2019-01-01, 2019-02-06) pass [2019-01-01, 2019-02-06)
#> 2 1 [2019-01-01, 2019-02-06) fail [2019-02-05, 2019-02-28)
#> 3 2 [2019-02-05, 2019-02-28) pass [2019-01-01, 2019-02-06)
#> 4 2 [2019-02-05, 2019-02-28) fail [2019-02-05, 2019-02-28)
#> 5 2 [2019-02-05, 2019-02-28) pass [2019-02-20, 2020-01-30)
#> 6 3 [2020-01-12, 2020-01-30) pass [2020-01-12, 2020-01-30)
#> 7 3 [2020-01-12, 2020-01-30) pass [2019-02-20, 2020-01-30)
#> 8 4 [2019-02-20, 2020-01-30) fail [2019-02-05, 2019-02-28)
#> 9 4 [2019-02-20, 2020-01-30) pass [2020-01-12, 2020-01-30)
#> 10 4 [2019-02-20, 2020-01-30) pass [2019-02-20, 2020-01-30)
# Compute the pass proportion per unit
expanded_df %>%
group_by(unit_id) %>%
summarise(pass_prop = sum(group == "pass") / length(group))
#> # A tibble: 4 × 2
#> unit_id pass_prop
#> <dbl> <dbl>
#> 1 1 0.5
#> 2 2 0.667
#> 3 3 1
#> 4 4 0.667
Created on 2022-07-19 by the reprex package (v2.0.1)
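If you only need the proportions and not the expanded rows, ivs also ships iv_count_overlaps(), which should let you skip the slice/bind step entirely. A sketch, reusing the in_df with the range column from above:
total <- iv_count_overlaps(in_df$range, in_df$range, type = "any")
pass  <- iv_count_overlaps(in_df$range, in_df$range[in_df$group == "pass"], type = "any")
mutate(in_df, pass_prop = pass / total)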

counting rows between two specific rows with a condition

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
                     ass = c("x", "x", "x", "x", "x"),
                     datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
                                  "2010-01-08", "2010-01-19"),
                     portfolio = c(10, 0, 5, 2, 0),
                     operation = c(10, -10, 5, -3, -2)),
                class = "data.frame", row.names = c(NA, -5L))
So I have 4000 investors with 6000 different assets, for each investor I have his trading operations in two different variables: operation tells me if he is buying/selling; portfolio tells me how much he has in the portfolio.
What I want to do is compute the number of days a position stays open in the portfolio, so I thought about computing the difference between the day on which the portfolio goes back to zero and the day on which it went positive (a negative portfolio is not possible).
so in the dataset above I would count row2 - row1 ==> 2010-01-02 - 2010-01-01
and row 5 - row 3 ==> 2010-01-19 - 2010-01-03 and so on...
I want to do this computation for all the investor & asset I have in my dataset for all the rows in which I find that portfolio > 0.
So my dataset will have a further column called duration which would be equal, in this case, to c(0, 1, 0, 5, 16) (so of course I also had to compute row1 - row1 and row3 - row3).
Hence my problem is to restart the count every time the portfolio goes back to zero.
library(dplyr)
df %>%
  mutate(datetime = as.Date(datetime, "%Y-%m-%d")) %>%
  group_by(investor, asset) %>%
  arrange(datetime) %>%
  mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>%
  group_by(investor, asset, grp.pos) %>%
  mutate(`Open (#days)` = datetime - datetime[1])
#> # A tibble: 5 x 6
#> # Groups: investor, asset, grp.pos [2]
#> investor asset datetime portfolio grp.pos `Open (#days)`
#> <chr> <chr> <date> <dbl> <int> <drtn>
#> 1 INV_1 x 2010-01-01 10 0 0 days
#> 2 INV_1 x 2010-01-02 0 0 1 days
#> 3 INV_1 x 2010-01-03 5 1 0 days
#> 4 INV_1 x 2010-01-08 2 1 5 days
#> 5 INV_1 x 2010-01-19 0 1 16 days
Data:
df <- structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
                     asset = c("x", "x", "x", "x", "x"),
                     datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
                                  "2010-01-08", "2010-01-19"),
                     portfolio = c(10, 0, 5, 2, 0)),
                operation = c(10, -10, 5, -3, -2),
                class = "data.frame", row.names = c(NA, -5L))
Here is a way we could do it, and it can be extended to group by ass as well if necessary.
First we group by inv on the original dataset. Then we transform datetime to Date format so calculations are easy (here we use the ymd() function).
The next step could be done in different ways:
The main idea is to group the portfolio column so that each group ends on the row where portfolio is 0. For this we arrange datetime in descending order, which lets us assign the grouping id with cumsum(portfolio == 0).
After rearranging datetime we can calculate the difference between the last and the first date as intended:
library(dplyr)
library(lubridate)
df %>%
  group_by(inv) %>%
  mutate(datetime = ymd(datetime)) %>%
  arrange(desc(datetime)) %>%
  group_by(position_Group = cumsum(portfolio == 0)) %>%
  arrange(datetime) %>%
  mutate(position_open = last(datetime) - first(datetime)) %>%
  ungroup()
inv ass datetime portfolio operation position_Group position_open
<chr> <chr> <date> <dbl> <dbl> <int> <drtn>
1 INV_1 x 2010-01-01 10 10 2 1 days
2 INV_1 x 2010-01-02 0 -10 2 1 days
3 INV_1 x 2010-01-03 5 5 1 16 days
4 INV_1 x 2010-01-08 2 -3 1 16 days
5 INV_1 x 2010-01-19 0 -2 1 16 days
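The same "the zero row closes the group" idea can also be written without the double arrange(), by reversing the cumulative sum. A sketch, assuming rows are already sorted by datetime within each inv/ass:
df %>%
  mutate(datetime = ymd(datetime)) %>%
  group_by(inv, ass) %>%
  # counting zero rows from the bottom up gives the same group ids (2, 2, 1, 1, 1)
  mutate(position_Group = rev(cumsum(rev(portfolio == 0)))) %>%
  group_by(inv, ass, position_Group) %>%
  mutate(position_open = last(datetime) - first(datetime)) %>%
  ungroup()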

Counting sequential dates in R to determine the length of an event

I have a dataframe containing dates when a given event occurred. Some events go on for several days, and I want to summarise each event based on its start date and its total length (in days).
I want to go from this:
Date
2020-01-01
2020-01-02
2020-01-03
2020-01-15
2020-01-20
2020-01-21
To this:
StartDate EventLength
2020-01-01 3
2020-01-15 1
2020-01-20 2
I've tried various approaches with aggregate, ave, seq_along and lag, but I haven't managed to get a count of event length that resets when the dates aren't sequential.
Code for the example data frame in case it's helpful:
Date <- c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-15", "2020-01-20", "2020-01-21")
df <- data.frame(Date)
df$Date <- as.Date(df$Date, origin = "1970-01-01")
You can split by cumsum(c(0, diff(df$Date) != 1)) and then take the first date and combine it with the length, assuming the dates are sorted.
do.call(rbind, lapply(split(df$Date, cumsum(c(0, diff(df$Date) != 1))),
                      function(x) data.frame(StartDate = x[1], EventLength = length(x))))
# StartDate EventLength
#0 2020-01-01 3
#1 2020-01-15 1
#2 2020-01-20 2
or another option using rle:
i <- cumsum(c(0, diff(df$Date) != 1))
data.frame(StartDate = df$Date[c(1, diff(i)) == 1], EventLength=rle(i)$lengths)
# StartDate EventLength
#1 2020-01-01 3
#2 2020-01-15 1
#3 2020-01-20 2
I propose a dplyr approach, which is incidentally very similar to @Rui's approach:
df %>%
  mutate(dummy = c(0, diff(Date))) %>%
  group_by(grp = cumsum(dummy != 1)) %>%
  summarise(Date = first(Date),
            event_count = n(), .groups = 'drop')
# A tibble: 3 x 3
grp Date event_count
<int> <date> <int>
1 1 2020-01-01 3
2 2 2020-01-15 1
3 3 2020-01-20 2
Here is a base R solution with a cumsum trick followed by ave/table.
d <- c(0, diff(df$Date) != 1)
res <- ave(df$Date, cumsum(d), FUN = function(x) x[1])
res <- as.data.frame(table(res))
names(res) <- c("Date", "EventLength")
res
# Date EventLength
#1 2020-01-01 3
#2 2020-01-15 1
#3 2020-01-20 2
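For completeness, the same cumsum trick translates directly to data.table if you prefer it for larger data (a sketch, assuming df$Date has Date class as in the setup):
library(data.table)
dt <- as.data.table(df)
# new group whenever the gap to the previous date is not exactly 1 day
dt[, grp := cumsum(c(TRUE, diff(Date) != 1))]
dt[, .(StartDate = first(Date), EventLength = .N), by = grp][, grp := NULL][]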

Grouping rows on multiple conditions

I have got a follow up question on my previous question about grouping rows on multiple conditions (Previous question).
I was wondering how I can group observations within 31 days of the first date. More importantly, after those 31 days have passed, the next date becomes the 'new' first date for the following group. Furthermore, after each purchase the grouping should also stop, and the next observation after the purchase becomes the 'new' first day of the next group.
Let me illustrate it with an example:
example <- structure(
  list(
    userID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
    date = structure(
      c(
        18168, # 2019-09-29
        18189, # 2019-10-20
        18197, # 2019-10-28
        18205, # 2019-11-05
        18205, # 2019-11-05
        18217, # 2019-11-17
        18239, # 2019-12-09
        18270, # 2020-01-09
        18271, # 2020-01-10
        18275  # 2020-01-14
      ),
      class = "Date"
    ),
    purchase = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0)
  ),
  row.names = c(NA, 10L),
  class = "data.frame"
)
Desired outcome:
Outcome <- data.frame(
  userID = c(1, 1, 2, 2, 2),
  date.start = c("2019-09-29", "2019-11-05", "2019-12-09", "2020-01-10", "2020-01-14"),
  date.end = c("2019-10-28", "2019-11-17", "2020-01-09", "2020-01-10", "2020-01-14"),
  purchase = c(0, 1, 0, 1, 0)
)
Thanks in advance! :)
As in my answer to the linked question, I again suggest an accumulate strategy here:
library(tidyverse)
example
#> userID date purchase
#> 1 1 2019-09-29 0
#> 2 1 2019-10-20 0
#> 3 1 2019-10-28 0
#> 4 1 2019-11-05 0
#> 5 1 2019-11-05 0
#> 6 1 2019-11-17 1
#> 7 2 2019-12-09 0
#> 8 2 2020-01-09 0
#> 9 2 2020-01-10 1
#> 10 2 2020-01-14 0
example %>%
  group_by(userID) %>%
  group_by(grp = unlist(accumulate2(date, purchase[-n()],
                                    ~ if (as.numeric(..2 - ..1) < 31 & ..3 != 1) ..1 else ..2)),
           grp = with(rle(grp), rep(seq_along(lengths), lengths)),
           .add = TRUE) %>%
  summarise(start.date = first(date),
            last.date = last(date), .groups = 'drop')
#> # A tibble: 5 x 4
#> userID grp start.date last.date
#> <dbl> <int> <date> <date>
#> 1 1 1 2019-09-29 2019-10-28
#> 2 1 2 2019-11-05 2019-11-17
#> 3 2 3 2019-12-09 2019-12-09
#> 4 2 4 2020-01-09 2020-01-10
#> 5 2 5 2020-01-14 2020-01-14
Created on 2021-06-13 by the reprex package (v2.0.0)
We could also use the following solution:
library(dplyr)
library(data.table)
example %>%
  group_by(grp = cumsum(ifelse(lag(purchase, default = 0) == 1, 1, 0))) %>%
  mutate(grp2 = cumsum(as.numeric(date - lag(date, default = first(date)))) > 30) %>%
  ungroup() %>%
  mutate(grp2 = data.table::rleid(grp2)) %>%
  group_by(userID, grp, grp2) %>%
  summarise(first = first(date), last = last(date), .groups = "drop") %>%
  select(-grp)
# A tibble: 5 x 4
userID grp2 first last
<dbl> <int> <date> <date>
1 1 1 2019-09-29 2019-10-28
2 1 2 2019-11-05 2019-11-17
3 2 3 2019-12-09 2019-12-09
4 2 4 2020-01-09 2020-01-10
5 2 5 2020-01-14 2020-01-14
Because there are dependencies between when one time period ends and the next one starts (given a date, you can only tell whether it is the start, middle, or end of a period after investigating every prior record), I cannot see any better way of doing this than using a for loop.
Something like the following:
# create output column
example = example %>% mutate(grouping = NA)

# set up tracking variables
current_date = as.Date('1900-01-01')
current_id = -1
prev_purchase = 0
current_group = 0

for (ii in 1:nrow(example)) {
  # reset on new identity OR on purchase OR on 31 days elapsed
  if (example$userID[ii] != current_id           # new identity
      || prev_purchase == 1                      # just had a purchase
      || example$date[ii] - current_date > 31) { # more than 31 days elapsed
    current_date = example$date[ii]
    current_id = example$userID[ii]
    prev_purchase = example$purchase[ii]
    current_group = current_group + 1
    example$grouping[ii] = current_group
  }
  # otherwise step forwards
  else {
    prev_purchase = example$purchase[ii]
    example$grouping[ii] = current_group
  }
}
One advantage of this approach is that you can pause after the for loop and check whether the groupings are as expected. The groups can then be collapsed to the requested output using:
output = example %>%
  group_by(userID, grouping) %>%
  summarise(date.start = min(date),
            date.end = max(date),
            purchase = max(purchase)) %>%
  select(-grouping)

dplyr: calculate the days for product replenishment

I am working on a dataset in which I need to calculate how long it takes for a retail store to replenish some products after a shortage, and here is a quick view of the dataset in its simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When the Net_Available_Qty becomes negative, it means there is a shortage. When it turns back to 0 or a positive qty, it means the supply has recovered. What I need to calculate is the number of days between when we first see the shortage and when it is recovered. In this case, the 1st shortage took 2 days to recover and the second shortage took 3 days.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution, but this produces diffDate, which assigns the date difference at the point where a negative quantity turns positive/zero.
sample_df %>%
  mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
         sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
         change = ifelse(sign != sign_lag, 1, 0), # check if there's a change
         sequence = sequence(rle(as.character(sign))$lengths)) %>%
  group_by(sequence) %>%
  mutate(diffDate = as.numeric(difftime(Date, lag(Date, 1))),
         diffDate = ifelse(Net_Available_Qty < 0, NA,
                           ifelse((sign == 'pos' | sign == 'zero') & sequence == 1, diffDate, NA))) %>%
  ungroup() %>%
  select(Date, Product, Net_Available_Qty, diffDate)
@Schilker had a great idea using rle. I am building on his answer and offering a slightly shorter version, including the use of cumsum:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
  mutate(
    diffDate = c(1, diff(as.Date(Date))),
    sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
    group = cumsum(c(TRUE, diff(sequence)) != 1L)
  ) %>%
  group_by(group) %>%
  mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups: group [4]
#> Date Product Net_Available_Qty diffDate sequence group n_days
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 2019-1-1 Product A -2 1 1 0 2
#> 2 2019-1-2 Product A -2 1 2 0 2
#> 3 2019-1-3 Product A 10 1 1 1 2
#> 4 2019-1-4 Product A 8 1 2 1 2
#> 5 2019-1-5 Product A -5 1 1 2 3
#> 6 2019-1-6 Product A -6 1 2 2 3
#> 7 2019-1-7 Product A -7 1 3 2 3
#> 8 2019-1-8 Product A 0 1 1 3 1
Created on 2020-02-23 by the reprex package (v0.3.0)
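If the goal is literally the "2 days" and "3 days" per shortage, you can also summarise one row per shortage spell. A sketch, under the assumption that a spell runs from the first negative day until the first day the quantity returns to >= 0:
library(dplyr)
sample_df %>%
  mutate(Date = as.Date(Date, "%Y-%m-%d"),
         shortage = Net_Available_Qty < 0,
         # a new spell starts whenever a negative row follows a non-negative one
         spell = cumsum(shortage & !lag(shortage, default = FALSE))) %>%
  filter(spell > 0) %>%
  group_by(Product, spell) %>%
  summarise(shortage_start = first(Date[shortage]),
            recovery_date = first(Date[!shortage]), # NA if never recovered
            days_to_recover = as.numeric(recovery_date - shortage_start),
            .groups = "drop")
On the example data this yields 2 days for the first spell and 3 for the second, matching the expectation stated in the question.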
