if_else with sequence of conditions - r

I have the following data:
library(tidyverse)
library(lubridate)
df <- tibble(date = as_date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15", "2020-09-23", "2020-11-25", "2021-03-03")))
# A tibble: 7 x 1
date
<date>
1 2019-11-20
2 2019-11-27
3 2020-04-01
4 2020-04-15
5 2020-09-23
6 2020-11-25
7 2021-03-03
I also have an ordered comparison vector of dates:
comparison <- seq(as_date("2019-12-01"), today(), by = "months") - 1
I now want to compare my dates in df to those comparison dates and do something like:
if date in df is < comparison[1], then assign a 1
if date in df is < comparison[2], then assign a 2
and so on.
I know I could do it with a case_when, e.g.
df %>%
mutate(new_var = case_when(date < comparison[1] ~ 1,
date < comparison[2] ~ 2))
(of course filling this up with all comparisons).
However, this would require manually writing out all the sequential conditions, and I'm wondering if I could automate it. I thought about creating a match lookup first (i.e. take the comparison vector, then add the respective new_var number, i.e. 1, 2, and so on) and then matching it against my data, but I only know how to do that for exact matches and don't know how to add the "smaller than" condition.
Expected result:
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17

You can use findInterval as follows:
df %>% mutate(new_var = findInterval(date, comparison) + 1)
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
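To see why this works, here is a minimal base R sketch. It assumes a fixed set of month-end cutoffs instead of the question's today()-based vector, so the result does not change over time; the names dates and cutoffs are made up for illustration. findInterval() counts how many cutoffs are less than or equal to each date, so adding 1 gives the index of the first cutoff that is strictly greater than the date.

```r
# Assumed sample data: a few dates from the question and fixed month-end cutoffs
dates   <- as.Date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15"))
cutoffs <- seq(as.Date("2019-12-01"), as.Date("2020-06-01"), by = "month") - 1

# Number of cutoffs <= each date, plus 1 = index of the first cutoff > date
new_var <- findInterval(as.numeric(dates), as.numeric(cutoffs)) + 1
new_var
```

A date that falls exactly on a cutoff is assigned to the next index, which matches the strict "<" used in the question's case_when.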

Related

How to delete entire rows from a dataframe based on the date the data was collected?

Let's say I have this example dataframe (but a lot bigger)
df = data.frame(ID_number = c(111,111,111,22,22,33,33),
date = c('2021-06-14','2021-06-12','2021-03-11',
'2021-05-20','2021-05-14',
'2018-04-20','2017-03-14'),
answers = 1:7,
sex = c('F','M','F','M','M','M','F') )
The output
ID_number date answers sex
1 111 2021-06-14 1 F
2 111 2021-06-12 2 M
3 111 2021-03-11 3 F
4 22 2021-05-20 4 M
5 22 2021-05-14 5 M
6 33 2018-04-20 6 M
7 33 2017-03-14 7 F
We can see that there are 7 different members, but whoever created the dataframe made a mistake and assigned the same ID_number to members 1, 2 and 3, the same ID_number to members 4 and 5, and so on.
The dataframe contains the date on which each member's data was collected, and I wish to keep only the member with the earliest date for each ID_number. The resulting dataframe would look like this:
ID_number date answers sex
1 111 2021-03-11 3 F
2 22 2021-05-14 5 M
3 33 2017-03-14 7 F
Appreciate the help.
You could filter on the min date per group_by like this:
library(dplyr)
df %>%
group_by(ID_number) %>%
filter(date == min(date))
#> # A tibble: 3 × 4
#> # Groups: ID_number [3]
#> ID_number date answers sex
#> <dbl> <chr> <int> <chr>
#> 1 111 2021-03-11 3 F
#> 2 22 2021-05-14 5 M
#> 3 33 2017-03-14 7 F
Created on 2023-01-04 with reprex v2.0.2
With slice_min:
library(dplyr)
df %>%
group_by(ID_number) %>%
slice_min(date)
In dplyr 1.1.0 and later (the development version when this was written), you can use inline grouping. Note that the slice_*() helpers take by rather than .by:
# requires dplyr >= 1.1.0
df %>%
slice_min(date, by = ID_number)
Using base R
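# date is stored as character, so convert it to Date before taking the numeric value
subset(df, as.numeric(as.Date(date)) == ave(as.numeric(as.Date(date)), ID_number, FUN = min))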
ID_number date answers sex
3 111 2021-03-11 3 F
5 22 2021-05-14 5 M
7 33 2017-03-14 7 F
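Another base R option, sketched here under the assumption that the date column is first converted to Date, is to sort by ID and date and then keep the first row of each ID. The names ordered and earliest are made up for illustration. Unlike filter(date == min(date)), this keeps exactly one row per ID even when two rows share the earliest date.

```r
df <- data.frame(ID_number = c(111, 111, 111, 22, 22, 33, 33),
                 date = as.Date(c("2021-06-14", "2021-06-12", "2021-03-11",
                                  "2021-05-20", "2021-05-14",
                                  "2018-04-20", "2017-03-14")),
                 answers = 1:7,
                 sex = c("F", "M", "F", "M", "M", "M", "F"))

# Sort so that the earliest date comes first within each ID
ordered <- df[order(df$ID_number, df$date), ]

# duplicated() flags every row after the first one of each ID; drop those
earliest <- ordered[!duplicated(ordered$ID_number), ]
earliest
```

The rows come back sorted by ID_number (22, 33, 111) rather than in their original order.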

R - Find x days from start date while keeping dates inbetween

I am trying to find the first date of each category, then subtract 5 days, AND I want to keep the days in between! This is where I am struggling. I tried seq() but it gave me an error, so I'm not sure if this is the right way to do it.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!
The error I got is this (for the commented out part of the code):
Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
'from' must be of length 1
Any help would be greatly appreciated!
library ("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
"2020-06-09",
"2020-06-10",
"2020-06-11",
"2020-06-12",
"2021-07-13",
"2021-07-14",
"2021-07-15",
"2021-08-16",
"2021-08-17",
"2021-08-18",
"2021-09-19",
"2021-09-20"),
value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
group_by(category) %>%
arrange(date) %>%
slice(1L) %>% #takes first date
mutate(first_day = as.Date(date) - 5)#%>%
#seq(as.Date(first_day),by="day",length.out=5)
#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates, but in a column format! I'm also trying to put NA in the value column if the value doesn't already exist. I want to keep all possible columns, as the dataframe I need to use this on has about 20 columns.
Dates: "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2021-07-08", "2021-07-09", "2021-07-10", "2021-07-11", "2021-07-12", "2021-07-13", "2021-08-11", "2021-08-12", "2021-08-13", "2021-08-14", "2021-08-15", "2021-08-16", "2021-09-14", "2021-09-15", "2021-09-16", "2021-09-17", "2021-09-18", "2021-09-19"
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Here's one approach but kinda clunky:
bind_rows(
data,
data %>%
group_by(category) %>%
slice_min(date) %>%
uncount(6, .id = "id") %>%
mutate(date = date - id + 1) %>%
select(-id)) %>%
arrange(category, date)
Result
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for these rows)
library(data.table)
setDT(data)[data[, .(date=seq(min(date)-5,by="day", length.out=6)), category], on=.(category,date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with the data.table generated similarly as above:
rbind(
setDT(data)[order(date), .SD[1],category],
data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)
You indicate you have many columns, so if you are going to take this second approach, rather than explicitly setting value=NA in the second input to rbind, you can also just leave it out, and add fill=TRUE within the rbind()
A dplyr version of the same is:
bind_rows(
data %>%
group_by(category) %>%
slice_min(date) %>%
ungroup() %>%
mutate(date=as.Date(date)),
data %>%
group_by(category) %>%
summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
Update (9/21/22) -
If you want the NA values to be filled, simply add this to the end of either data.table pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
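For readers without data.table or dplyr, the same join idea can be sketched in base R. This is only an illustration on a shortened, assumed copy of the question's data (two categories); the names windows and result are made up. It also shows why the seq() call in the question failed: seq() needs a single start date, so it has to be called once per category.

```r
data <- data.frame(date = as.Date(c("2020-06-08", "2020-06-09",
                                    "2021-07-13", "2021-07-14")),
                   value = c(2, 1, 1, 2),
                   category = c(1, 1, 2, 2))

# One 6-day window per category: the first date and the 5 days before it
windows <- do.call(rbind, lapply(split(data, data$category), function(g) {
  data.frame(category = g$category[1],
             date = seq(min(g$date) - 5, by = "day", length.out = 6))
}))

# Left join: window days without a matching row get NA in the other columns
result <- merge(windows, data, by = c("category", "date"), all.x = TRUE)
result <- result[order(result$category, result$date), ]
result
```

Because the join is on category and date only, any additional columns in data are carried along and filled with NA for the added days.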
@Jon Spring's answer inspired this alternative approach:
Here we first compute the first day minus 5 for each category, as already presented in the question. Then we use bind_rows as Jon Spring does in his answer. The next step is to identify the original first dates within the date column (we use !duplicated within filter). The last main step is to use coalesce:
library(lubridate)
library(dplyr)
data %>%
group_by(category) %>%
mutate(x = min(ymd(date))-5) %>%
slice(1) %>%
bind_rows(data) %>%
mutate(date = ymd(date)) %>%
filter(!duplicated(date)) %>%
mutate(x = coalesce(x, date)) %>%
arrange(category) %>%
select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10

Find overlapping intervals in groups and retain largest non-overlapping periods

Issue
I have a grouped dataframe with overlapping intervals (date as ymd). I want to retain only the largest non-overlapping intervals in each group.
Example data
# Packages
library(tidyverse)
library(lubridate)
# Example data
df <- tibble(
group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
start = as_date(
c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30","2019-11-20")),
end = as_date(
c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10","2019-12-31"))) %>%
mutate(intval = interval(start, end),
intval_length = intval / days(1))
df
#> # A tibble: 9 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-01-10 2019-02-07 2019-01-10 UTC--2019-02-07 UTC 28
#> 2 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 3 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 4 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 5 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 6 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 7 3 2019-10-01 2019-10-07 2019-10-01 UTC--2019-10-07 UTC 6
#> 8 3 2019-11-30 2019-12-10 2019-11-30 UTC--2019-12-10 UTC 10
#> 9 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
# Goal
# Row: 1 and 2; 6 to 9 have overlaps; Keep rows with largest intervals (in days)
df1 <- df[-c(1, 7, 8),]
df1
#> # A tibble: 6 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 2 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 3 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 4 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 5 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 6 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
Current approach
I found a related question in another thread (see: Find dates within a period interval by group). However, the respective solution identifies all overlapping rows by group. In this way, I can't identify the largest non-overlapping intervals.
df$overlap <- unlist(tapply(df$intval, #loop through intervals
df$group, #grouped by id
function(x) rowSums(outer(x,x,int_overlaps)) > 1))
As an example, consider group 3 in my example data. Here row 6/7 and 8/9 overlap. With row 6 and 9 being the largest non-overlapping periods, I would like to remove row 7 and 8.
I would greatly appreciate it if someone could point me to a solution.
Having searched for related problems on stackoverflow, I found that the following approaches (here: Collapse and merge overlapping time intervals) and (here: How to flatten / merge overlapping time periods) could be adapted to my issue.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
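The indexing step is the least obvious part, so here is a base R sketch of it on group 3 only (already sorted by start). The vector names are made up for illustration. A new cluster begins whenever the next start date lies after the running maximum of all end dates seen so far; within each cluster the longest interval is kept.

```r
# Group 3 from the example data, sorted by start
start <- as.Date(c("2019-10-01", "2019-10-01", "2019-11-20", "2019-11-30"))
end   <- as.Date(c("2019-11-06", "2019-10-07", "2019-12-31", "2019-12-10"))

s <- as.numeric(start)
e <- as.numeric(end)
n <- length(s)

# TRUE where an interval starts after everything before it has ended
indx <- c(0, cumsum(s[-1] > cummax(e)[-n]))

# Keep the longest interval (in days) within each cluster of overlaps
len  <- e - s
keep <- len == ave(len, indx, FUN = max)
data.frame(start, end, indx, len, keep)
```

One caveat, as far as I can tell: a chain such as A overlapping B and B overlapping C is treated as a single cluster, so only one interval survives even if A and C do not overlap each other.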

Split a data frame at the date where the value of another variable reaches its max/min

I have a dataset that is similar to the following:
df <- data.frame(
date = c("2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04", "2020-02-05", "2020-02-06"),
value = c(0,1,2,7,3,4))
I would like to split my data frame into two smaller data frames such that the first data frame includes a part of the original data frame before the value reaches its max (i.e. 7) and the second part of the data frame includes the rest of the original data frame as follows:
df1 <- data.frame(
date = c("2020-02-01", "2020-02-02", "2020-02-03"),
value = c(0,1,2)
)
df2 <- data.frame(
date = c("2020-02-04", "2020-02-05", "2020-02-06"),
value = c(7, 3, 4)
)
*** The 2nd part of the question
Now assume that I have the following dataset, which includes more than one object identified by IDs. I would like to do the same thing as explained above, applied to all objects (IDs).
df <- data.frame( ID = c(1,1,1,1,1,1,2,2,2,2),
date = c("2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04", "2020-02-05", "2020-02-06", "2020-02-01", "2020-02-02","2020-02-03", "2020-02-04"),
value = c(0,1,2,7,3,4,10,16,11,12))
Thanks for your time.
You can use which.max to get the index of max value and use it to subset the dataframe.
ind <- which.max(df$value)
df1 <- df[seq_len(ind - 1), ]
df2 <- df[ind:nrow(df), ]
df1
# A tibble: 3 x 2
# date value
# <chr> <dbl>
#1 2020-02-01 0
#2 2020-02-02 1
#3 2020-02-03 2
df2
# A tibble: 3 x 2
# date value
# <chr> <dbl>
#1 2020-02-04 7
#2 2020-02-05 3
#3 2020-02-06 4
If there are many IDs and we have to do this for each one, we could create a list of dataframes.
result <- df %>%
group_split(ID) %>%
purrr::map(~{.x %>%
group_split(row_number() < which.max(value), .keep = FALSE)})
## In case, someone is interested you could make a data frame from the list above as follows:
result_df <- result %>%
bind_rows()
Another approach using base R:
> df
date value
1 2020-02-01 0
2 2020-02-02 1
3 2020-02-03 2
4 2020-02-04 7
5 2020-02-05 3
6 2020-02-06 4
> df1 <- df[seq_len(which.max(df$value) - 1), ]
> df2 <- df[which.max(df$value):nrow(df), ]
> df1
date value
1 2020-02-01 0
2 2020-02-02 1
3 2020-02-03 2
> df2
date value
4 2020-02-04 7
5 2020-02-05 3
6 2020-02-06 4
>
For the grouped data:
> mylist <- df %>% split(f = df$ID)
> mylist
$`1`
ID date value
1 1 2020-02-01 0
2 1 2020-02-02 1
3 1 2020-02-03 2
4 1 2020-02-04 7
5 1 2020-02-05 3
6 1 2020-02-06 4
$`2`
ID date value
7 2 2020-02-01 10
8 2 2020-02-02 16
9 2 2020-02-03 11
10 2 2020-02-04 12
> split_list <- lapply(mylist, function(x) x[seq_len(which.max(x$value) - 1),])
> split_list <- append(split_list, lapply(mylist, function(x) x[which.max(x$value): nrow(x),]))
> split_list
$`1`
ID date value
1 1 2020-02-01 0
2 1 2020-02-02 1
3 1 2020-02-03 2
$`2`
ID date value
7 2 2020-02-01 10
$`1`
ID date value
4 1 2020-02-04 7
5 1 2020-02-05 3
6 1 2020-02-06 4
$`2`
ID date value
8 2 2020-02-02 16
9 2 2020-02-03 11
10 2 2020-02-04 12
>
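The per-ID split can also be wrapped in a small helper, sketched below. The function name split_at_max and the list element names before and from_max are made up for illustration. It uses seq_len() so that an ID whose maximum sits in the first row gets an empty "before" part instead of a wrong one (1:0 counts down and would select row 1).

```r
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                 date = c("2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04",
                          "2020-02-05", "2020-02-06", "2020-02-01", "2020-02-02",
                          "2020-02-03", "2020-02-04"),
                 value = c(0, 1, 2, 7, 3, 4, 10, 16, 11, 12))

# Hypothetical helper: split one ID's rows at the first occurrence of the max
split_at_max <- function(d) {
  ind <- which.max(d$value)
  list(before   = d[seq_len(ind - 1), ],
       from_max = d[ind:nrow(d), ])
}

result <- lapply(split(df, df$ID), split_at_max)
result$`2`$from_max
```

Keeping both parts under the same ID in a nested list avoids the duplicated names ($`1`, $`2`, $`1`, $`2`) that append() produces.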

How to sum time intervals by days: How many intervals include a given day?

I have a very large data set of time intervals (start and end date values), and need to calculate for every single day within the entire range of those dates how many of the intervals include the date.
In essence, I want to know how many people are in jail any given day. I have when they entered, and when they left. I need to be able to determine how many were in jail on every single day over a period of many years.
Example data:
require(tidyverse)
require(lubridate)
x <- tribble(~start, ~end,
today()-5, today()-3,
today()-4, today()-2,
today()-3, today()-1)
x <- x %>% mutate(dtint = interval(start, end))
x
#> # A tibble: 3 x 3
#> start end dtint
#> <date> <date> <Interval>
#> 1 2019-10-13 2019-10-15 2019-10-13 UTC--2019-10-15 UTC
#> 2 2019-10-14 2019-10-16 2019-10-14 UTC--2019-10-16 UTC
#> 3 2019-10-15 2019-10-17 2019-10-15 UTC--2019-10-17 UTC
mydays <- seq(min(x$start), max(x$end), by = "day") %>% enframe(name = NULL, value = "eachday")
mydays
#> # A tibble: 5 x 1
#> eachday
#> <date>
#> 1 2019-10-13
#> 2 2019-10-14
#> 3 2019-10-15
#> 4 2019-10-16
#> 5 2019-10-17
#Expected result:
mydays %>% add_column(expected_result = c(1, 2, 3, 2, 1))
#> # A tibble: 5 x 2
#> eachday expected_result
#> <date> <dbl>
#> 1 2019-10-13 1
#> 2 2019-10-14 2
#> 3 2019-10-15 3
#> 4 2019-10-16 2
#> 5 2019-10-17 1
Created on 2019-10-18 by the reprex package (v0.3.0)
I will also need the ability to do this on grouped tibbles so I can calculate things like totals by demographic properties.
Is there an efficient way to do this in tidyverse/lubridate?
Here's an approach of how you could track totals using the cumulative count across each grouped demographic.
# Example data
library(tidyverse)
set.seed(42)
x <- tibble(demographic = sample(LETTERS[1:3], 100, replace = T),
start = as.Date("2019-01-01") + runif(100, 0, 30),
end = start + runif(100, 1, 50))
## A tibble: 6 x 3
# demographic start end
# <chr> <date> <date>
#1 C 2019-01-19 2019-03-05
#2 C 2019-01-07 2019-02-02
#3 A 2019-01-07 2019-02-19
#4 C 2019-01-12 2019-02-04
#5 B 2019-01-29 2019-02-07
#6 B 2019-01-29 2019-02-21
First we bring it into long format. Then count each start date as incrementing up, each end date incrementing down. Then we group by demographic and take the cumulative total of those increments up and down.
x %>%
pivot_longer(-demographic, names_to = "col", values_to = "date") %>%
mutate(change = if_else(col == "start", 1, -1)) %>%
arrange(demographic, date) %>%
group_by(demographic) %>%
mutate(count = cumsum(change)) %>%
ungroup() %>%
ggplot(aes(date, count, color = demographic)) +
geom_step()
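The same increment/decrement idea can produce the per-day counts from the question directly. The base R sketch below uses fixed dates matching the question's printed example, and assumes the end day itself still counts as "in", so the decrement is placed on the day after each end date. The names first_in, first_out and change are made up for illustration.

```r
start <- as.Date(c("2019-10-13", "2019-10-14", "2019-10-15"))
end   <- as.Date(c("2019-10-15", "2019-10-16", "2019-10-17"))
days  <- seq(min(start), max(end), by = "day")
n     <- length(days)

# 1-based day offsets of each start day and of the day AFTER each end day
first_in  <- as.integer(as.numeric(start) - as.numeric(min(start))) + 1L
first_out <- as.integer(as.numeric(end) - as.numeric(min(start))) + 2L

# +1 per interval entering on a day, -1 per interval that left the day before
change <- tabulate(first_in, nbins = n + 1L) - tabulate(first_out, nbins = n + 1L)

# Running total = number of intervals covering each day
count <- cumsum(change)[seq_len(n)]
data.frame(eachday = days, count = count)
```

This does one pass over the intervals and one over the days, rather than comparing every interval with every day, which should matter for a very large data set.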
One option involving dplyr, tidyr and lubridate could be:
x %>%
mutate(eachday = list(seq.Date(min(start), max(end), by = "1 day"))) %>%
unnest(eachday) %>%
group_by(eachday) %>%
summarise(overlap = sum(int_overlaps(dtint, interval(eachday, eachday))))
eachday overlap
<date> <int>
1 2019-10-13 1
2 2019-10-14 2
3 2019-10-15 3
4 2019-10-16 2
5 2019-10-17 1
First, it creates a list of dates between the first and last date in the data and unnest it. Then, it groups by the dates and sums the overlap between the dtint interval and the interval for dates.
Here is an option with data.table. Convert the 'data.frame' to 'data.table' (setDT), create the sequence of days from the min of 'start' to the max of 'end', then join with the original dataset using a non-equi join and get the count (.N) for each day.
library(data.table)
setDT(x)[x[, .(eachday = seq(min(start), max(end), by = '1 day'))],
.(eachday, overlap = .N), on = .(start <= eachday,
end >= eachday ), by = .EACHI][, .(eachday, overlap)]
# eachday overlap
#1: 2019-10-13 1
#2: 2019-10-14 2
#3: 2019-10-15 3
#4: 2019-10-16 2
#5: 2019-10-17 1
