Remove past observations up to nearest observation to given date by group - r

Although I have achieved what I want (see result below), I find my approach rather convoluted. I would like to remove, by group, all observations that come before the nearest observation on or before a cut-off date (here cut_off). I cannot simply find the nearest observation using min(abs(x - date)) because observations can be tied, equally distant on the positive and negative side (e.g., group "b" in mydf).
I solved it by looking for the index of the last observation where date - cut_off <=0, and then using this index in dplyr::slice(). I am very curious about other approaches though.
Very very open for non-dplyr solutions.
Every group contains at least one row where observation date - cutoff date <= 0
library(tidyverse)
set.seed(8)
mydf <- data.frame(group = rep(letters[1:3], each = 5), date1 = as.Date(sample(15), origin = '1970-01-01'), cut_off = as.Date(rep(sample(10, 3), each = 5), origin = '1970-01-01'))
mydf %>% arrange(group, date1) %>% group_by(group) %>%
mutate(diff = date1 - cut_off,
min_abs = min(abs(date1 - cut_off)))
#> # A tibble: 15 x 5
#> # Groups: group [3]
#> group date1 cut_off diff min_abs
#> <fct> <date> <date> <drtn> <drtn>
#> 1 a 1970-01-03 1970-01-05 -2 days 0 days
#> 2 a 1970-01-05 1970-01-05 0 days 0 days
#> 3 a 1970-01-08 1970-01-05 3 days 0 days
#> 4 a 1970-01-13 1970-01-05 8 days 0 days
#> 5 a 1970-01-15 1970-01-05 10 days 0 days
#> 6 b 1970-01-02 1970-01-09 -7 days 2 days
#> 7 b 1970-01-06 1970-01-09 -3 days 2 days
#> 8 b 1970-01-07 1970-01-09 -2 days 2 days
#> 9 b 1970-01-11 1970-01-09 2 days 2 days
#> 10 b 1970-01-12 1970-01-09 3 days 2 days
#> 11 c 1970-01-04 1970-01-11 -7 days 1 days
#> 12 c 1970-01-09 1970-01-11 -2 days 1 days
#> 13 c 1970-01-10 1970-01-11 -1 days 1 days
#> 14 c 1970-01-14 1970-01-11 3 days 1 days
#> 15 c 1970-01-16 1970-01-11 5 days 1 days
# min(abs(x)) does not help when the distance from neg and pos values is tied, see group b
Desired result (solution already pretty ok)
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
mutate(diff = date1 - cut_off) %>%
slice(max(which(diff <= 0)):n())
# finds the index of the last element in "diff" which fulfills the condition
#> # A tibble: 10 x 4
#> # Groups: group [3]
#> group date1 cut_off diff
#> <fct> <date> <date> <drtn>
#> 1 a 1970-01-05 1970-01-05 0 days
#> 2 a 1970-01-08 1970-01-05 3 days
#> 3 a 1970-01-13 1970-01-05 8 days
#> 4 a 1970-01-15 1970-01-05 10 days
#> 5 b 1970-01-07 1970-01-09 -2 days
#> 6 b 1970-01-11 1970-01-09 2 days
#> 7 b 1970-01-12 1970-01-09 3 days
#> 8 c 1970-01-10 1970-01-11 -1 days
#> 9 c 1970-01-14 1970-01-11 3 days
#> 10 c 1970-01-16 1970-01-11 5 days
Created on 2019-12-16 by the reprex package (v0.3.0)

Here are a couple of approaches with dplyr:
We can use top_n to select top n dates from each group where n is calculated for each group differently based on number of values which are greater than cut_off.
library(dplyr)
mydf %>% group_by(group) %>% top_n(sum(date1 > cut_off) + 1, date1)
# group date1 cut_off
# <fct> <date> <date>
# 1 a 1970-01-05 1970-01-05
# 2 a 1970-01-08 1970-01-05
# 3 a 1970-01-13 1970-01-05
# 4 a 1970-01-15 1970-01-05
# 5 b 1970-01-11 1970-01-09
# 6 b 1970-01-12 1970-01-09
# 7 b 1970-01-07 1970-01-09
# 8 c 1970-01-14 1970-01-11
# 9 c 1970-01-16 1970-01-11
#10 c 1970-01-10 1970-01-11
This selects the correct rows, but note that top_n does not sort the data, so you might want to add arrange(group, date1) at the end of the chain.
Another approach is similar to the one posted in OP using slice
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
slice((which.max(date1 > cut_off) - 1):n())
# group date1 cut_off
# <fct> <date> <date>
# 1 a 1970-01-05 1970-01-05
# 2 a 1970-01-08 1970-01-05
# 3 a 1970-01-13 1970-01-05
# 4 a 1970-01-15 1970-01-05
# 5 b 1970-01-07 1970-01-09
# 6 b 1970-01-11 1970-01-09
# 7 b 1970-01-12 1970-01-09
# 8 c 1970-01-10 1970-01-11
# 9 c 1970-01-14 1970-01-11
#10 c 1970-01-16 1970-01-11
We can also tweak this to use in filter.
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
filter(row_number() >= which.max(date1 > cut_off) - 1)
which can be translated in base R as :
new_df <- mydf[with(mydf, order(group, date1)), ]
subset(new_df, ave(date1 > cut_off, group, FUN = function(x)
seq_along(x) >= (which.max(x) - 1)))
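For readers who prefer to avoid dplyr entirely, the same selection can be sketched with split() and lapply(). This is only an illustration: the data frame is typed in from the printed output in the question (set.seed() results can differ between R versions), and keep_from_cutoff is a hypothetical helper name, not part of any package.

```r
# Data copied from the printed output in the question, not regenerated with sample()
mydf <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  date1 = as.Date(c(
    "1970-01-03", "1970-01-05", "1970-01-08", "1970-01-13", "1970-01-15",
    "1970-01-02", "1970-01-06", "1970-01-07", "1970-01-11", "1970-01-12",
    "1970-01-04", "1970-01-09", "1970-01-10", "1970-01-14", "1970-01-16"
  )),
  cut_off = as.Date(rep(c("1970-01-05", "1970-01-09", "1970-01-11"), each = 5))
)

# Hypothetical helper: within one group, keep the rows from the last date
# on or before the cut-off onwards. Assumes at least one such date exists,
# as stated in the question.
keep_from_cutoff <- function(d) {
  d <- d[order(d$date1), ]
  last_before <- max(which(d$date1 <= d$cut_off))
  d[last_before:nrow(d), ]
}

result <- do.call(rbind, lapply(split(mydf, mydf$group), keep_from_cutoff))
rownames(result) <- NULL
result
```

Because the helper tests date1 <= cut_off directly, it does not depend on any date lying after the cut-off: a group whose dates are all on or before the cut-off keeps only its last row.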

You could detect the rows where the dates are above your threshold, then use lead on them to keep one more value :
library(dplyr)
mydf %>%
arrange(group,date1) %>%
group_by(group) %>%
filter(lead(date1 > cut_off, default = TRUE)) %>%
ungroup()
#> # A tibble: 10 x 3
#> group date1 cut_off
#> <fct> <date> <date>
#> 1 a 1970-01-05 1970-01-05
#> 2 a 1970-01-08 1970-01-05
#> 3 a 1970-01-13 1970-01-05
#> 4 a 1970-01-15 1970-01-05
#> 5 b 1970-01-07 1970-01-09
#> 6 b 1970-01-11 1970-01-09
#> 7 b 1970-01-12 1970-01-09
#> 8 c 1970-01-10 1970-01-11
#> 9 c 1970-01-14 1970-01-11
#> 10 c 1970-01-16 1970-01-11
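To see why shifting the flag by one position works, here is a small base R sketch on group b only. It mimics lead() with c(flag[-1], TRUE), which is an assumption made for illustration and not dplyr's actual implementation; the dates are typed in from the question's output.

```r
# Group "b" from the question, already sorted by date
date1   <- as.Date(c("1970-01-02", "1970-01-06", "1970-01-07",
                     "1970-01-11", "1970-01-12"))
cut_off <- as.Date("1970-01-09")

# TRUE for rows strictly after the cut-off
flag <- date1 > cut_off

# Shift the flag one position towards the start, padding the end with TRUE,
# so the row just before the first TRUE is kept as well
keep <- c(flag[-1], TRUE)

date1[keep]
```

The padding value TRUE plays the role of default = TRUE: the last row of a group is always kept, which also covers a group with no date after the cut-off.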

I would like to remove all observations up to the nearest observation
to a given date, by group. I cannot simply calculate the nearest
observation using min(abs(x - date)) because there can be ties of
observations that are equally distant in the positive and negative
(e.g., group "b" in mydf).
Your criterion produces ties, so you need to deliberately choose a way to break them. You could pick whichever entry comes first (or last, check ?top_n):
mydf %>%
mutate(diff = abs(date1- cut_off)) %>%
arrange(group, diff) %>%
group_by(group) %>%
top_n(n = 1, wt = -diff )
# A tibble: 3 x 4
# Groups: group [3]
group date1 cut_off diff
<fct> <date> <date> <drtn>
1 a 1970-01-12 1970-01-11 1 days
2 b 1970-01-07 1970-01-02 5 days
3 c 1970-01-03 1970-01-04 1 days
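*I got different values even though I used your seed (8). A likely cause is a different R version: R 3.6.0 changed the default algorithm used by sample(), so the same seed can give different draws.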
If you need to filter only cases where diff <= 0, just add that to the pipe chain.
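The tie itself is easy to reproduce in base R. The sketch below uses group b as printed in the question (typed in by hand, not regenerated) and shows that which.min() silently returns only the first of the tied positions.

```r
# Group "b" from the question: two dates are exactly 2 days from the cut-off
date1   <- as.Date(c("1970-01-02", "1970-01-06", "1970-01-07",
                     "1970-01-11", "1970-01-12"))
cut_off <- as.Date("1970-01-09")

abs_diff <- abs(as.numeric(date1 - cut_off))

first_min <- which.min(abs_diff)                 # first minimum only
all_min   <- which(abs_diff == min(abs_diff))    # every tied position

first_min
all_min
```

Here first_min is 3 while all_min is 3 and 4, so any rule based on the absolute distance alone has to state which side of the cut-off wins.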

Sorting and computing the difference is a wise move. For the last part, since your dates are already sorted within each group, use which() to find the last row on or before the cut-off, and keep the rows from that position onwards:
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
mutate(delta = abs(date1 - cut_off)) %>%
# abs() makes every delta non-negative, so test the dates themselves
filter(row_number() >= max(which(date1 <= cut_off)))
# A tibble: 10 x 4
# Groups: group [3]
group date1 cut_off delta
<fct> <date> <date> <drtn>
1 a 1970-01-05 1970-01-05 0 days
2 a 1970-01-08 1970-01-05 3 days
3 a 1970-01-13 1970-01-05 8 days
4 a 1970-01-15 1970-01-05 10 days
5 b 1970-01-07 1970-01-09 2 days
6 b 1970-01-11 1970-01-09 2 days
7 b 1970-01-12 1970-01-09 3 days
8 c 1970-01-10 1970-01-11 1 days
9 c 1970-01-14 1970-01-11 3 days
10 c 1970-01-16 1970-01-11 5 days

Related

How to divide group depend on idx, diff in R?

This is my dataset. I want to assign group numbers based on idx and diff. Specifically, rows should keep the same group number until diff goes over 14 days: rows with the same idx and a diff under 14 days belong to the same group, while rows with the same idx but a diff over 14 days belong to a different group.
idx = c("a","a","a","a","b","b","b","c","c","c","c")
date = c(20201115, 20201116, 20201117, 20201105, 20201107, 20201110, 20210113, 20160930, 20160504, 20160913, 20160927)
group = c("1","1","1","1","2","2","3","4","5","6","6")
df = data.frame(idx,date,group)
df <- df %>% arrange(idx,date)
df$date <- as.Date(as.character(df$date), format='%Y%m%d')
df <- df %>% group_by(idx) %>%
mutate(diff = date - lag(date))
This is the result of what I want.
Use cumsum to create an additional grouping criterion, and then cur_group_id().
library(dplyr)
df %>%
group_by(idx) %>%
mutate(diff = difftime(date, lag(date, default = first(date)), units = "days"),
cu = cumsum(diff >= 14)) %>%
group_by(idx, cu) %>%
mutate(group = cur_group_id()) %>%
ungroup() %>%
select(-cu)
# A tibble: 11 × 4
idx date group diff
<chr> <date> <int> <drtn>
1 a 2020-11-05 1 0 days
2 a 2020-11-15 1 10 days
3 a 2020-11-16 1 1 days
4 a 2020-11-17 1 1 days
5 b 2020-11-07 2 0 days
6 b 2020-11-10 2 3 days
7 b 2021-01-13 3 64 days
8 c 2016-05-04 4 0 days
9 c 2016-09-13 5 132 days
10 c 2016-09-27 6 14 days
11 c 2016-09-30 6 3 days
Given that the first value of diff in each idx must be NA because of the use of lag(), you could use cumsum(diff >= 14 | is.na(diff)) without grouping to create the new group:
library(dplyr)
df %>%
group_by(idx) %>%
mutate(diff = date - lag(date)) %>%
ungroup() %>%
mutate(group = cumsum(diff >= 14 | is.na(diff)))
# # A tibble: 11 × 4
# idx date diff group
# <chr> <date> <drtn> <int>
# 1 a 2020-11-05 NA days 1
# 2 a 2020-11-15 10 days 1
# 3 a 2020-11-16 1 days 1
# 4 a 2020-11-17 1 days 1
# 5 b 2020-11-07 NA days 2
# 6 b 2020-11-10 3 days 2
# 7 b 2021-01-13 64 days 3
# 8 c 2016-05-04 NA days 4
# 9 c 2016-09-13 132 days 5
# 10 c 2016-09-27 14 days 6
# 11 c 2016-09-30 3 days 6
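The cumsum() idea can also be sketched in base R, which makes each step visible. The data below are the question's dates, typed in already sorted by idx and date; gap and new_block are illustrative names only.

```r
df <- data.frame(
  idx = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"),
  date = as.Date(c("2020-11-05", "2020-11-15", "2020-11-16", "2020-11-17",
                   "2020-11-07", "2020-11-10", "2021-01-13",
                   "2016-05-04", "2016-09-13", "2016-09-27", "2016-09-30"))
)

# Days since the previous row within the same idx (NA for the first row of each idx)
gap <- ave(as.numeric(df$date), df$idx, FUN = function(x) c(NA, diff(x)))

# A new group starts at the first row of an idx or after a gap of 14 days or more
new_block <- is.na(gap) | gap >= 14

# Running count of group starts gives the group number
df$group <- cumsum(new_block)
df
```

Each TRUE in new_block bumps the running total by one, so consecutive rows share a number until the next group start. The sketch assumes the rows are sorted by idx and then date.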

Dataframe with start & end date to daily data

I am trying to convert the data below to a daily basis, based on the range given by the start_date and end_date columns,
to this output (sum):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data
library(tidyverse)
# Save the result: the next step uses df$date
df <- df %>%
mutate(across(2:3, ~ as.Date(.x,
format = "%d/%m/%Y"))) %>%
pivot_longer(cols = c(start, end), values_to = "date") %>%
arrange(date) %>%
select(date, inventory)
df
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
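If the goal is the total inventory on every day covered by a range (an assumption, since the desired output is not shown in the text), one possible base R sketch is to expand each row into its daily sequence and then sum per day. The data frame is the example data above, typed in by hand.

```r
df <- data.frame(
  id = 1:4,
  start = as.Date(c("01/05/2022", "10/05/2022", "11/05/2022", "14/05/2022"),
                  format = "%d/%m/%Y"),
  end = as.Date(c("02/05/2022", "15/05/2022", "21/05/2022", "17/05/2022"),
                format = "%d/%m/%Y"),
  inventory = c(100, 50, 80, 10)
)

# One daily sequence per row, from start to end inclusive
days <- lapply(seq_len(nrow(df)), function(i) {
  seq(df$start[i], df$end[i], by = "day")
})

# Long format: one row per id and day, repeating the inventory value
daily <- data.frame(
  date = do.call(c, days),
  inventory = rep(df$inventory, lengths(days))
)

# Sum the inventory of all ranges that cover each day
totals <- aggregate(inventory ~ date, data = daily, FUN = sum)
totals
```

Days that no range covers (3 to 9 May here) do not appear in totals; joining against a full date sequence, as in the left_join step above, would add them back as NA.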

calculate number of frost change days (number of days) from the weather hourly data in r

I have to calculate the number of frost change days (NFCD) on a weekly basis.
That means the number of days on which the temperature crosses 0°C, i.e. the minimum is below and the maximum is above 0°C.
Let's say I work with the years 1957-1980 with hourly temperatures.
Example data (a couple of rows look like this):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
group_by(date) %>%
summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE
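Both answers flag individual days; the weekly count asked for in the question needs one more aggregation step. The sketch below is one way to do that in base R. The readings are invented for illustration, and the use of ISO week numbers ("%V") is an assumption about how weeks should be defined.

```r
# Hypothetical 3-hourly readings for three days; values are made up
obs <- data.frame(
  time = as.POSIXct("1957-01-07 00:00:00", tz = "UTC") + 3600 * 3 * (0:23),
  temperature = c(-3, -2, 1, 4, 3, 1, -1, -2,      # day 1 crosses 0
                  1, 2, 3, 5, 6, 4, 3, 2,          # day 2 stays above 0
                  -5, -4, -3, -1, -1, -2, -3, -4)  # day 3 stays below 0
)
obs$date <- as.Date(obs$time, tz = "UTC")

# Step 1: one logical flag per day
daily <- aggregate(temperature ~ date, data = obs,
                   FUN = function(z) min(z) < 0 && max(z) > 0)
names(daily)[2] <- "frost_change"

# Step 2: count the flagged days per year and week
daily$year <- format(daily$date, "%Y")
daily$week <- format(daily$date, "%V")  # ISO 8601 week number (assumption)
weekly <- aggregate(frost_change ~ year + week, data = daily, FUN = sum)
names(weekly)[3] <- "NFCD"
weekly
```

Summing a logical column counts its TRUE values, so NFCD is the number of frost change days in each week.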

Determine the number of process running each day and average days of commencing those projects, in R

I have a large dataset of processes (their IDs), start-dates and corresponding end dates.
What I want has two parts. First, how many processes are running each day. Second, the mean number of days the running processes have been running since they commenced.
A sample data set looks like this:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
arrange(dates, event) %>%
mutate(processes = ifelse(event == 'Start', 1, -1),
processes = cumsum(processes)) %>%
select(-Process, -event) %>%
complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output is like the column mean days in the following screenshot, with explanation:
A tidyverse approach would be preferred, please.
Here is one approach :
library(tidyverse)
df %>%
#Convert to date
mutate(across(c(Start, End), lubridate::dmy),
#Create a sequence of dates from start to end
Dates = map2(Start, End, seq, by = 'day')) %>%
#Get data in long format
unnest(Dates) %>%
#Remove columns
select(-Start, -End) %>%
#For each process
group_by(Process) %>%
#Count number of days spent on it
mutate(days_spent = row_number() - 1) %>%
#For each date
group_by(Dates) %>%
#Count number of process running and average days
summarise(process = n(),
mean_days = mean(days_spent))
This returns :
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
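A single row of this table can be checked by hand, which may help when verifying the logic. The base R sketch below recomputes the values for 2020-01-08 from the sample data; the variable names are illustrative only.

```r
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start = as.Date(c("01-01-2020", "02-01-2020", "03-01-2020",
                    "08-01-2020", "13-01-2020"), format = "%d-%m-%Y"),
  End = as.Date(c("10-01-2020", "09-01-2020", "04-01-2020",
                  "17-01-2020", "19-01-2020"), format = "%d-%m-%Y")
)

day <- as.Date("2020-01-08")

# A process is running on `day` if the day lies within [Start, End]
running <- df$Start <= day & day <= df$End

# Days elapsed since each running process started (0 on its start date)
days_spent <- as.numeric(day - df$Start[running])

c(processes = sum(running), mean_days = mean(days_spent))
```

On that date P001, P002 and P004 are running with 7, 6 and 0 days spent, which gives the mean of 4.33 shown in row 8.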

How to show missing dates in case of application of rolling function

Suppose I have a data df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
activation_dt = round(runif(10)*100,0) +d,
expiry_dt = d+round(runif(10)*100,0)+c(rep(180,5), rep(240,5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month. That I have done by the following method.
# Getting required result
df %>% arrange(activation_dt) %>%
pivot_longer(cols = c(activation_dt, expiry_dt),
names_to = "event",
values_to = "event_date") %>%
mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
mutate(dummy2 = floor_date(event_date, "month")) %>%
arrange(dummy2) %>% group_by(dummy2) %>%
summarise(dummy=sum(dummy)) %>%
mutate(dummy = cumsum(dummy)) %>%
select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
Now I have a problem: how do I deal with the missing months, e.g. April 2020 to June 2020?
A data.table solution :
generate the months sequence
use non equi joins to find policies active every month and count them
library(lubridate)
library(data.table)
setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt),'month'),lubridate::floor_date(max(df$expiry_dt),'month'),by='month')
months <- data.table(months)
df[,c("activation_dt_month","expiry_dt_month"):=.(lubridate::floor_date(activation_dt,'month'),
lubridate::floor_date(expiry_dt,'month'))]
df[months, .(months),on = .(activation_dt_month<=months,expiry_dt_month>=months)][,.(nb=.N),by=months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results with gaps in months.
First use map2 to create a sequence of months between activation and expiration for each row of data. This will allow you to group by month/year to count number of active policies for each month.
library(tidyverse)
library(lubridate)
df %>%
mutate(month = map2(floor_date(activation_dt, "month"),
floor_date(expiry_dt, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
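The same counts can be reproduced without extra packages, which is a handy cross-check. The sketch below is a base R illustration only: the policies are typed in from the printed data frame in the question, and first_of_month is a hypothetical helper standing in for floor_date().

```r
df <- data.frame(
  id = 1:10,
  activation_dt = as.Date(c("2020-03-12", "2020-02-14", "2020-01-21", "2020-03-18",
                            "2020-02-21", "2020-01-05", "2020-03-11", "2020-03-06",
                            "2020-01-05", "2020-01-12")),
  expiry_dt = as.Date(c("2020-08-07", "2020-07-26", "2020-09-01", "2020-07-07",
                        "2020-07-27", "2020-11-04", "2020-11-20", "2020-10-03",
                        "2020-09-04", "2020-09-14"))
)

# Hypothetical helper: first day of the month a date falls in
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

act_m <- first_of_month(df$activation_dt)
exp_m <- first_of_month(df$expiry_dt)

# Every month in the overall range, including months with no events
months <- seq(min(act_m), max(exp_m), by = "month")

# A policy is active in a month if that month lies between its
# activation month and its expiry month, inclusive
counts <- data.frame(
  month = months,
  active = vapply(seq_along(months), function(i) {
    sum(act_m <= months[i] & exp_m >= months[i])
  }, integer(1))
)
counts
```

Because the month sequence is generated first and the policies are counted against it, months without any activation or expiry (April to June here) still get a row.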

Resources