I'm new to R and I'm facing a problem. I have a date vector and a data frame containing sales values and coverage start and end dates.
I need to defer the sale value at each analysis date. For the first analysis period I can write code that gives me the desired answer; however, my real data has 200K+ rows and 50+ analysis periods.
I'm not able to build a loop, or find an alternative function in R, that creates the variables Aux[i] and Test[i] for each date present in the vec_date vector.
The following is an example of code that works for the first analysis period.
library(tidyverse)
library(lubridate)
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
             DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
             Premium = c(11000, 5000, 24500))

date <- ymd("2021-12-31")
vec_date <- date %m+% months(seq(0, 12, by = 6))

df_new <- df |>
  mutate(duration = as.numeric(DateFin - DateIn),
         Pr_day = Premium / duration,
         Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
         test1 = if_else(Aux1 == "Y" & DateFin > vec_date[1],
                         as.numeric(DateFin - vec_date[1]) * Pr_day,
                         if_else(DateIn > vec_date[1], Premium, 0)))
Does anyone have an idea how I could build this loop, or is there an R function/package that would let me perform this iteration between my df data frame and the vec_date vector?
Edit: an outline of the format I would need as a result:
df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
                   DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
                   Premium = c(11000, 5000, 24500),
                   Aux1 = c("Y", "Y", "N"),
                   test1 = c(5421.429, 4849.246, 24500.000),
                   Aux2 = c("N", "Y", "Y"),
                   test2 = c(0.0000, 301.5075, 20125.0000),
                   Aux3 = c("N", "N", "Y"),
                   test3 = c(0, 0, 4025))
Here, Aux1 and test1 are the results for vec_date[1], Aux2/test2 for vec_date[2], and Aux3/test3 for vec_date[3]. It is important for me to keep the resulting variables in the same data frame because further analysis will be done on them later.
As @Jon Spring suggests in the comments, probably the preferred approach here
would be to use tidyr::complete() to extend your data frame, repeating each
row in it for each of your analysis dates. Then, you can stick to vectorized
calculations and get the analysis date column in the resulting data, too.
Below is how to do just that with the example data you provided. I took the
liberty of renaming some columns and simplifying the control-flow-based
calculation, according to my understanding of the problem from what you
shared.
First, the example data slightly reframed:
library(tidyverse)
library(lubridate)
policies <- tibble(
  policy_id = seq_len(3),
  start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
  end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
  premium = c(11000, 5000, 24500)
)
policies
#> # A tibble: 3 x 4
#>   policy_id start      end        premium
#>       <int> <date>     <date>       <dbl>
#> 1         1 2021-10-21 2022-03-10   11000
#> 2         2 2021-12-25 2022-07-12    5000
#> 3         3 2022-05-11 2023-02-15   24500
Then, finding remaining prorated premiums for policies at given dates:
start_date <- ymd("2021-12-31")
dates <- start_date %m+% months(seq(0, 12, by = 6))
policies %>%
  mutate(
    days = as.numeric(end - start),
    daily_premium = premium / days
  ) %>%
  crossing(date = dates) %>%
  mutate(
    days_left = pmax(0, end - pmax(start, date)),
    premium_left = days_left * daily_premium
  ) %>%
  select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#>   policy_id date       days_left premium_left
#>       <int> <date>         <dbl>        <dbl>
#> 1         1 2021-12-31        69        5421.
#> 2         1 2022-06-30         0           0
#> 3         1 2022-12-31         0           0
#> 4         2 2021-12-31       193        4849.
#> 5         2 2022-06-30        12         302.
#> 6         2 2022-12-31         0           0
#> 7         3 2021-12-31       280       24500
#> 8         3 2022-06-30       230       20125
#> 9         3 2022-12-31        46        4025
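If you still need a wide layout like df_final in your question (one test column per analysis date), the long result above can be pivoted afterwards with tidyr::pivot_wider(). A minimal sketch, assuming the result of the pipeline above has been saved as prorated (a name introduced here just for illustration):

prorated %>%
  # label each analysis date with its position in the `dates` vector
  mutate(period = match(date, dates)) %>%
  select(policy_id, period, premium_left) %>%
  # one column per analysis period: test1, test2, test3, ...
  pivot_wider(names_from = period,
              values_from = premium_left,
              names_prefix = "test")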
I have daily-level data in the data frame below.
a = c("a","a","a","a","a","b","b","b","b","b")
a = as.data.frame(a)
a$date = seq.Date(as.Date("2022-06-01"), as.Date("2022-06-10"), by = 1)
a$value = c(8,7,7,7,8,9,9,9,7,8)
The desired output should be
a = c("a","a","a","b","b","b")
a = as.data.frame(a)
a$startdate = c("2022-06-01","2022-06-02","2022-06-05","2022-06-06","2022-06-09","2022-06-10")
a$enddate = c("2022-06-01","2022-06-04","2022-06-05","2022-06-08","2022-06-09","2022-06-10")
a$value = c(8,7,8,9,7,8)
Thanks
I have tried one solution involving two for loops and then aggregation, but it is very slow. It would be of great help to have a faster solution.
It looks like you want to filter to rows where there’s a change from the previous value?
library(dplyr)
a %>%
group_by(a) %>%
filter(value != lag(value, default = -Inf)) %>%
ungroup()
# A tibble: 6 × 3
  a     date       value
  <chr> <date>     <dbl>
1 a     2022-06-01     8
2 a     2022-06-02     7
3 a     2022-06-05     8
4 b     2022-06-06     9
5 b     2022-06-09     7
6 b     2022-06-10     8
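If you also need the end date of each run, as in your desired output, one option is to turn the same change flag into a run id and summarise each run. A minimal sketch, assuming the data frame a defined in the question:

a %>%
  group_by(a) %>%
  # a new run starts whenever the value changes within a group
  mutate(run = cumsum(value != lag(value, default = -Inf))) %>%
  group_by(a, run) %>%
  summarise(startdate = min(date),
            enddate = max(date),
            value = first(value),
            .groups = "drop") %>%
  select(-run)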
I have 2 tables
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
I would like to know the number of observation periods that occur on each day of df1, based on the start/stop dates in df2. E.g. on 1/1/2020, observations a and b were in progress, hence "2".
The expected output would be a "number" column in df1 giving, for each date, how many observation periods cover it.
I've tried using sums
df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end)>=df1$dates)
But that only sums over the entire column at once.
I've then tried to create a custom function for this:
df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))
But it returns an NA value.
I then tried to embed an ifelse() within it, but got the same issue with NAs:
apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))
Can anyone suggest what the issue is? Thanks!
Edit: an interval join was suggested, which is not what I'm trying to get; I think naming the observations with a numeric label is what caused the confusion. I am trying to find the TOTAL number of observations whose periods cover each day, rather than doing a 1:1 match.
Define the comparison in a function f and pass it to outer(); rowSums() of the result is what you're looking for.
f <- \(x, y) df1[x, 1] >= df2[y, 2] & df1[x, 1] <= df2[y, 3]
cbind(df1, number=rowSums(outer(1:nrow(df1), 1:nrow(df2), f)))
# dates number
# 1 2020-01-01 2
# 2 2020-01-02 2
# 3 2020-01-03 1
# 4 2020-01-04 0
# 5 2020-01-05 1
# 6 2020-01-06 1
# 7 2020-01-07 1
# 8 2020-01-08 1
# 9 2020-01-09 1
# 10 2020-01-10 2
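As a side note on the apply() attempts: apply() converts the data frame to a matrix first (here a character matrix, since the column is a Date), which can make the comparisons behave unexpectedly. Looping over row indices instead keeps the Date class intact. A minimal sketch, assuming df1 and df2 as defined in the question:

# count, for each date in df1, how many df2 intervals cover it
df1$number <- sapply(seq_len(nrow(df1)),
                     function(i) sum(df2$start <= df1$dates[i] & df2$end >= df1$dates[i]))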
Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. The approach is similar to Left Join Subset of Column Based on Date Interval, with some important differences: summarise() is used instead of filter() so that dates where "number" == 0 aren't 'lost', and the join is by character() because there are no common columns between the datasets:
library(dplyr)
library(lubridate)
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"),
as.Date("2020-1-10"),
by = "days")))
df2 = data.frame("observations" = c("1", "2", "3", "4"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
df1 %>%
full_join(df2, by = character()) %>%
mutate(number = dates %within% interval(start, end)) %>%
group_by(dates) %>%
summarise(number = sum(number))
#> # A tibble: 10 × 2
#> dates number
#> <date> <dbl>
#> 1 2020-01-01 2
#> 2 2020-01-02 2
#> 3 2020-01-03 1
#> 4 2020-01-04 0
#> 5 2020-01-05 1
#> 6 2020-01-06 1
#> 7 2020-01-07 1
#> 8 2020-01-08 1
#> 9 2020-01-09 1
#> 10 2020-01-10 2
Created on 2022-06-27 by the reprex package (v2.0.1)
Does this approach work with your actual data?
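Note that in dplyr 1.1.0 and later, joining with by = character() is superseded; cross_join() expresses the same cross join directly. A minimal sketch of the same pipeline under that assumption:

df1 %>%
  cross_join(df2) %>%
  mutate(number = dates %within% interval(start, end)) %>%
  group_by(dates) %>%
  summarise(number = sum(number))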
df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
                     ass = c("x", "x", "x", "x", "x"),
                     datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
                                  "2010-01-08", "2010-01-19"),
                     portfolio = c(10, 0, 5, 2, 0),
                     operation = c(10, -10, 5, -3, -2)),
                class = "data.frame", row.names = c(NA, -5L))
So I have 4000 investors with 6000 different assets, and for each investor I have his trading operations in two different variables: operation tells me if he is buying/selling; portfolio tells me how much he has in the portfolio.
What I want to do is compute the number of days a position stays open in the portfolio, so I thought about computing the difference between the day on which the portfolio goes back to zero and the day on which the portfolio went positive (a negative portfolio is not possible).
So in the dataset above I would count row 2 - row 1 ==> 2010-01-02 - 2010-01-01
and row 5 - row 3 ==> 2010-01-19 - 2010-01-03, and so on...
I want to do this computation for every investor & asset in my dataset, for all the rows in which portfolio > 0.
So my dataset will have a further column called duration which would be equal, in this case, to c(0, 1, 0, 5, 16) (so of course I also had to compute row 1 - row 1 and row 3 - row 3).
Hence my problem is to restart the count every time portfolio goes back to zero.
library(dplyr)
df %>%
mutate(datetime = as.Date(datetime, "%Y-%m-%d")) %>%
group_by(investor, asset) %>%
arrange(datetime) %>%
mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>%
group_by(investor, asset, grp.pos) %>%
mutate(`Open (#days)` = datetime - datetime[1])
#> # A tibble: 5 x 6
#> # Groups:   investor, asset, grp.pos [2]
#>   investor asset datetime   portfolio grp.pos `Open (#days)`
#>   <chr>    <chr> <date>         <dbl>   <int> <drtn>
#> 1 INV_1    x     2010-01-01        10       0  0 days
#> 2 INV_1    x     2010-01-02         0       0  1 days
#> 3 INV_1    x     2010-01-03         5       1  0 days
#> 4 INV_1    x     2010-01-08         2       1  5 days
#> 5 INV_1    x     2010-01-19         0       1 16 days
Data:
df <- structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
asset = c("x", "x", "x", "x", "x"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
"2010-01-08", "2010-01-19"),
portfolio = c(10, 0, 5, 2, 0)),
operation = c(10, -10, 5, -3, -2),
class = "data.frame", row.names = c(NA, -5L))
Here is a way we could do it; it can be extended to group by ass as well if necessary.
First we group by inv, as in the original dataset, and transform datetime to date format so calculations are easy (here we use the ymd() function).
The next step could be done in different ways:
The main idea is to group the rows into positions, where the last row of each group is the one where portfolio is 0. For this we arrange datetime in descending order, so we can easily build a grouping id with cumsum(portfolio == 0).
After rearranging datetime in ascending order again, we can calculate last minus first as intended:
library(dplyr)
library(lubridate)
df %>%
group_by(inv) %>%
mutate(datetime = ymd(datetime)) %>%
arrange(desc(datetime)) %>%
group_by(position_Group = cumsum(portfolio==0)) %>%
arrange(datetime) %>%
mutate(position_open = last(datetime)-first(datetime)) %>%
ungroup()
  inv   ass   datetime   portfolio operation position_Group position_open
  <chr> <chr> <date>         <dbl>     <dbl>          <int> <drtn>
1 INV_1 x     2010-01-01        10        10              2  1 days
2 INV_1 x     2010-01-02         0       -10              2  1 days
3 INV_1 x     2010-01-03         5         5              1 16 days
4 INV_1 x     2010-01-08         2        -3              1 16 days
5 INV_1 x     2010-01-19         0        -2              1 16 days
Background
I've got an R dataframe, d:
d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
birthdate = as.Date(c("1980-01-01","1980-01-01","2000-12-23","2000-12-23","1949-03-14","1949-03-14","1949-03-14")),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
stringsAsFactors=FALSE)
It consists of an ID code and two dates: a birthdate and an event_date. Everyone's got a consistent birthdate, but people have multiple events each, all of these occurring on different dates.
The Problem
I'm trying to calculate the average age of people (IDs) in d at their first event. In other words, I'd like to get R to calculate an "age at first event" by subtracting each ID's birthdate from their first event date, then summing those ages and dividing by n (3, in this case).
The answer (if my arithmetic isn't too far off this late at night) should be ~35.3 years old.
What I've tried
I'm not too familiar with date work in R, so I've only gotten so far as mutating a new column that calculates the difference between event_date and birthdate for that row:
d <- d %>%
mutate(date_difference = (event_date-birthdate)/365)
But I'm still a ways away from my summary calculation. I'm mainly hung up on how to tell R to find the first date difference for each ID. (Not to mention that dividing by 365 gives me correct years but they're still labeled "days" in the resulting df.)
As an alternative to dividing by 365, you can use the lubridate::time_length function. It computes the length of a period in different time units (seconds, minutes, days, years).
library("tidyverse")
dat <- data.frame(
ID = c("a", "a", "b", "b", "c", "c", "c"),
birthdate = as.Date(c("1980-01-01", "1980-01-01", "2000-12-23", "2000-12-23", "1949-03-14", "1949-03-14", "1949-03-14")),
event_date = as.Date(c("2011-01-01", "2012-08-21", "2011-12-23", "2011-12-31", "2013-03-14", "2013-04-07", "2014-07-14")),
stringsAsFactors = FALSE
)
dat_with_age <- dat %>%
group_by(ID) %>%
slice_min(
event_date,
n = 1
) %>%
ungroup() %>%
mutate(
age_at_first_event = lubridate::time_length(event_date - birthdate, unit = "year")
)
dat_with_age
#> # A tibble: 3 × 4
#> ID birthdate event_date age_at_first_event
#> <chr> <date> <date> <dbl>
#> 1 a 1980-01-01 2011-01-01 31.0
#> 2 b 2000-12-23 2011-12-23 11.0
#> 3 c 1949-03-14 2013-03-14 64
dat_with_age %>%
summarise(
mean(age_at_first_event)
)
#> # A tibble: 1 × 1
#> `mean(age_at_first_event)`
#> <dbl>
#> 1 35.3
Created on 2022-03-11 by the reprex package (v2.0.1)
You can use this code:
d <- d %>%
group_by(ID) %>%
arrange(event_date) %>%
slice(1) %>%
mutate(date_difference = as.numeric((event_date-birthdate)/365)) %>%
ungroup() %>%
mutate(average_age = mean(date_difference))
Output:
# A tibble: 3 × 5
ID birthdate event_date date_difference average_age
<chr> <date> <date> <dbl> <dbl>
1 a 1980-01-01 2011-01-01 31.0 35.4
2 b 2000-12-23 2011-12-23 11.0 35.4
3 c 1949-03-14 2013-03-14 64.0 35.4
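For completeness, the two steps can also be collapsed into a single pipeline if only the final average is needed. A minimal sketch, assuming d as defined in the question:

d %>%
  group_by(ID) %>%
  # age (in years) at each ID's earliest event
  summarise(age_at_first_event = lubridate::time_length(min(event_date) - first(birthdate),
                                                        unit = "year")) %>%
  summarise(mean_age = mean(age_at_first_event))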
I have a large data set of time periods, defined by a 'start' and an 'end' column. Some of the periods overlap.
I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.
Some example data:
   ID      start        end
1   A 2013-01-01 2013-01-05
2   A 2013-01-01 2013-01-05
3   A 2013-01-02 2013-01-03
4   A 2013-01-04 2013-01-06
5   A 2013-01-07 2013-01-09
6   A 2013-01-08 2013-01-11
7   A 2013-01-12 2013-01-15
Desired result:
   ID      start        end
1   A 2013-01-01 2013-01-06
2   A 2013-01-07 2013-01-11
3   A 2013-01-12 2013-01-15
What I have tried:
require(dplyr)
data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
                       start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
                                           1357516800, 1357603200, 1357948800),
                                         tzone = "UTC", class = c("POSIXct", "POSIXt")),
                       end = structure(c(1357344000, 1357344000, 1357171200, 1357430400,
                                         1357689600, 1357862400, 1358208000),
                                       tzone = "UTC", class = c("POSIXct", "POSIXt"))),
                  .Names = c("ID", "start", "end"), row.names = c(NA, -7L), class = "data.frame")
remove.overlaps <- function(data){
  data2 <- data
  for (i in 1:length(unique(data$start))) {
    x3 <- filter(data2, start >= data$start[i] & start <= data$end[i])
    x4 <- x3[1, ]
    x4$end <- max(x3$end)
    data2 <- filter(data2, start < data$start[i] | start > data$end[i])
    data2 <- rbind(data2, x4)
  }
  data2 <- na.omit(data2)
}
data <- remove.overlaps(data)
Here's a possible solution. The basic idea is to compare each start date with the maximum end date seen "until now" (using the cummax function) and create an index that separates the data into groups:
data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
@David Arenburg's answer is great, but I ran into an issue where an earlier interval ended after a later interval, and using last() in the summarise call then gave the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end):
data %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))
Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.
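Putting both points together, a minimal sketch of the adjusted pipeline (assuming the same data as above):

data %>%
  # sort by the grouping variable and start date first
  arrange(ID, start) %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))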
For the sake of completeness, the IRanges package on Bioconductor has some neat functions which can be used to deal with date or datetime ranges. One of them is the reduce() function, which merges overlapping or adjacent ranges.
However, there is a drawback: IRanges works on integer ranges (hence the name), so the convenience of using IRanges functions comes at the expense of converting Date or POSIXct objects back and forth.
Also, it seems that dplyr doesn't play well with IRanges (at least judging by my limited experience with dplyr), so I use data.table:
library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)
setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
ID start end
<fctr> <POSc> <POSc>
1: A 2013-01-01 2013-01-06
2: A 2013-01-07 2013-01-11
3: A 2013-01-12 2013-01-15
A code variant is
setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
, lapply(.SD, as_datetime), .SDcols = -"width"],
by = ID]
In both variants the as_datetime() function from the lubridate package is used, which saves us from having to specify the origin when converting numbers to POSIXct objects.
It would be interesting to see a benchmark comparison of the IRanges approaches vs David's answer.
It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRanges method because the package was not yet available for R 3.5.1.)
Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2018-07-14 is within 2018-05-10 to 2018-12-01). @zach's answer does capture that edge case.
library(data.table)
start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")
# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]
# set keys (also orders)
setkey(d2, ID, start_col, end_col)
# get rid of overlapping transactions and do the date math
squished <- d2[, .(START_DT = start_col,
                   END_DT = end_col,
                   # shift(type = "lead") is data.table's counterpart of dplyr::lead()
                   indx = c(0, cumsum(as.numeric(shift(start_col, type = "lead")) >
                                        cummax(as.numeric(end_col)))[-.N])),
               keyby = ID
               ][, .(start = min(START_DT),
                     end = max(END_DT)),
                 by = c("ID", "indx")]
I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).
The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.
It seems like you want to do this by ID, so I've also grouped by ID.
library(ivs)
library(dplyr)
data <- tribble(
~ID, ~start, ~end,
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-02", "2013-01-03",
"A", "2013-01-04", "2013-01-06",
"A", "2013-01-07", "2013-01-09",
"A", "2013-01-08", "2013-01-11",
"A", "2013-01-12", "2013-01-15"
) %>%
mutate(
start = as.Date(start),
end = as.Date(end)
)
data
#> # A tibble: 7 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 A 2013-01-01 2013-01-05
#> 2 A 2013-01-01 2013-01-05
#> 3 A 2013-01-02 2013-01-03
#> 4 A 2013-01-04 2013-01-06
#> 5 A 2013-01-07 2013-01-09
#> 6 A 2013-01-08 2013-01-11
#> 7 A 2013-01-12 2013-01-15
# Combine `start` and `end` into a single interval vector column
data <- data %>%
mutate(interval = iv(start, end), .keep = "unused")
# Note that this is a half-open interval!
data
#> # A tibble: 7 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-05)
#> 2 A [2013-01-01, 2013-01-05)
#> 3 A [2013-01-02, 2013-01-03)
#> 4 A [2013-01-04, 2013-01-06)
#> 5 A [2013-01-07, 2013-01-09)
#> 6 A [2013-01-08, 2013-01-11)
#> 7 A [2013-01-12, 2013-01-15)
# It seems like you'd want to group by ID, so lets do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
group_by(ID) %>%
summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-06)
#> 2 A [2013-01-07, 2013-01-11)
#> 3 A [2013-01-12, 2013-01-15)
Created on 2022-04-05 by the reprex package (v2.0.1)
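If you need plain start and end columns again after merging, the interval column can be unpacked with the ivs accessors. A minimal sketch, continuing from the summarised result above (and keeping in mind the half-open convention, so iv_end() returns the stored end dates):

data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop") %>%
  # pull the boundaries back out of the interval vector
  mutate(start = iv_start(interval),
         end = iv_end(interval)) %>%
  select(ID, start, end)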