Fill in missing rows for dates by group [duplicate]

This question already has answers here:
Efficient way to Fill Time-Series per group
(2 answers)
Filling missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 4 years ago.
I have a data table like this, just much bigger:
library(data.table)
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                  "2017-02-01", "2017-04-01", "2017-05-01",
                  "2017-06-01", "2017-01-01", "2017-04-01",
                  "2017-05-01"), "%Y-%m-%d")
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
my_data <- data.table(customer_id,account_id,time,tenor,variable_x)
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-02-01 1 120
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
in which I should observe, for each customer_id, account_id pair, monthly observations from 2017-01-01 to 2017-06-01. For some pairs, however, some dates in this six-month sequence are missing. I would like to fill in those missing dates so that each customer_id, account_id pair has a row for all six months, with tenor and variable_x set to NA. That is, it should look like this:
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-02-01 NA NA
1 11 2017-03-01 NA NA
1 11 2017-04-01 NA NA
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-01-01 NA NA
2 55 2017-02-01 1 120
2 55 2017-03-01 NA NA
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-02-01 NA NA
3 38 2017-03-01 NA NA
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
3 38 2017-06-01 NA NA
I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 with
ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")
and then merging it into the original data with
ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by = "time", all.x = TRUE)
but it is not working. Do you know how to add such rows with dates for each customer_id, account_id pair?
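A diagnostic aside on why the merge falls short: ts carries no customer_id or account_id, so merging on "time" alone cannot expand the dates within each pair; it only tacks on a single NA row for 2017-03-01, the one month missing from my_data altogether:
merge(ts, my_data, by = "time", all.x = TRUE)  # 11 rows, not the 18 wanted
The answers below therefore build the per-group date grid first.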

We can do a join. Create the sequence of 'time' from min to max by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and join on those two columns together with 'time':
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time = ts1), .(customer_id, account_id)],
        on = .(customer_id, account_id, time)]
# customer_id account_id time tenor variable_x
# 1: 1 11 2017-01-01 1 87
# 2: 1 11 2017-02-01 NA NA
# 3: 1 11 2017-03-01 NA NA
# 4: 1 11 2017-04-01 NA NA
# 5: 1 11 2017-05-01 2 90
# 6: 1 11 2017-06-01 3 100
# 7: 2 55 2017-01-01 NA NA
# 8: 2 55 2017-02-01 1 120
# 9: 2 55 2017-03-01 NA NA
#10: 2 55 2017-04-01 2 130
#11: 2 55 2017-05-01 3 150
#12: 2 55 2017-06-01 4 12
#13: 3 38 2017-01-01 1 13
#14: 3 38 2017-02-01 NA NA
#15: 3 38 2017-03-01 NA NA
#16: 3 38 2017-04-01 2 15
#17: 3 38 2017-05-01 3 14
#18: 3 38 2017-06-01 NA NA
Or using the tidyverse:
library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
  mutate(time = list(ts1)) %>%
  unnest(time) %>%
  left_join(my_data)
Or with complete from tidyr:
my_data %>%
  complete(nesting(customer_id, account_id), time = ts1)
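A note on nesting() here: it expands only the customer_id, account_id pairs that actually occur in the data; listing the variables separately would cross every customer with every account. A sketch of the contrast (row counts assume the sample data):
my_data %>% complete(nesting(customer_id, account_id), time = ts1)  # 18 rows: observed pairs only
my_data %>% complete(customer_id, account_id, time = ts1)           # 54 rows: all 3 x 3 pairs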

A different data.table approach: expand each customer_id, account_id pair to the full date sequence, then merge the original data back in (with by= omitted, merge joins on all shared columns, here customer_id, account_id, and time):
my_data2 <- my_data[, .(time = seq(as.Date("2017/01/01"), as.Date("2017/06/01"),
                                   by = "month")),
                    by = list(customer_id, account_id)]
merge(my_data2, my_data, all.x = TRUE)
customer_id account_id time tenor variable_x
1: 1 11 2017-01-01 1 87
2: 1 11 2017-02-01 NA NA
3: 1 11 2017-03-01 NA NA
4: 1 11 2017-04-01 NA NA
5: 1 11 2017-05-01 2 90
6: 1 11 2017-06-01 3 100
7: 2 55 2017-01-01 NA NA
8: 2 55 2017-02-01 1 120
9: 2 55 2017-03-01 NA NA
10: 2 55 2017-04-01 2 130
11: 2 55 2017-05-01 3 150
12: 2 55 2017-06-01 4 12
13: 3 38 2017-01-01 1 13
14: 3 38 2017-02-01 NA NA
15: 3 38 2017-03-01 NA NA
16: 3 38 2017-04-01 2 15
17: 3 38 2017-05-01 3 14
18: 3 38 2017-06-01 NA NA

Related

Group two dfs based on dates that closely match

These are subsets of two dataframes.
df1:
plot  mean_first_flower_date  gdd
1     2019-07-15              60
1     2019-07-21              50
1     2019-07-23              78
2     2019-05-13              100
2     2019-05-22              173
2     2019-05-25              245
(cont.)
df2:
plot  date        flowers
1     2019-07-12  2
1     2019-07-13  9
1     2019-07-14  3
1     2019-07-15  3
2     2019-05-12  10
2     2019-05-13  10
2     2019-05-14  14
2     2019-05-15  17
(cont.)
df2 shares some dates with df1, but sometimes the dates are off by one or a couple of days.
I would like to join both dfs on 'plot' and 'date', keeping all rows of df2 without losing the 'gdd' data from df1.
An inner_join, for example, will drop rows because the dates do not match exactly.
So if a date in df1 is one to three days earlier or later than the closest possible match in df2, matching them is fine because the dates are relatively close. The tricky part is that I want this approximate matching only when there is no exact match available in df1 for that date range.
My goal is to have something like this:
plot  date        flowers  gdd
1     2019-07-12  2        60
1     2019-07-13  9        60
1     2019-07-14  3        60
1     2019-07-15  3        60
2     2019-05-12  10       100
2     2019-05-13  10       100
2     2019-05-14  14       100
2     2019-05-15  17       100
Is it possible to do this?
I greatly appreciate any help!
Thanks!
I think a 'rolling join' from the data.table package can handle this:
library(data.table)
setDT(df1)
setDT(df2)
df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]
df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]
# plot mean_first_flower_date gdd flowers
#1: 1 2019-07-12 60 2
#2: 1 2019-07-13 60 9
#3: 1 2019-07-14 60 3
#4: 1 2019-07-15 60 3
#5: 2 2019-05-12 100 10
#6: 2 2019-05-13 100 10
#7: 2 2019-05-14 100 14
#8: 2 2019-05-15 100 17
Using this data:
df1 <- read.table(text="plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245", header=TRUE)
df2 <- read.table(text="plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17", header=TRUE)
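How the roll arguments read (my gloss; see ?data.table for the authoritative description): roll = 3 lets each df1 value be carried forward for up to 3 days, and rollends = TRUE additionally rolls at the ends of each group, which is what lets 2019-07-12 through 2019-07-14 pick up the gdd from 2019-07-15. If matches may fall on either side without a fixed cap, roll = "nearest" is a possible variation:
df1[df2, on = c("plot", "mean_first_flower_date==date"), roll = "nearest"]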
Try fill from tidyr, using this syntax:
library(dplyr)
library(tidyr)
df2 %>%
  left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  fill(gdd, .direction = "up")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 NA
8 2 2019-05-15 17 NA
As you can see, there are two NAs in the last two rows. These shouldn't appear once you join your actual df2: those rows will be filled with 173, since there will be a match for 2019-05-22. Still, if you want to fill any trailing NA rows, you can apply fill again with .direction = "down":
df2 %>%
  left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  fill(gdd, .direction = "up") %>%
  fill(gdd, .direction = "down")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 100
8 2 2019-05-15 17 100
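One caveat with the fill() approach (my observation, not part of the original answer): fill() runs over the whole frame, so an upward fill can bleed a gdd value across a plot boundary when a plot's earliest rows have no match. Grouping first keeps the fill within each plot; a sketch:
df2 %>%
  left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  group_by(plot) %>%
  fill(gdd, .direction = "updown") %>%
  ungroup()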

How to split a data set with duplicated information based on date

I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
And I would like to split the dataset into two, like this:
1 - groups with complete dates:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2 - groups containing NA dates:
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for the second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2 <- dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs for which date is NA, and then subset based on that:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column that is TRUE/FALSE for each ID based on the presence of an NA date, and then use group_split to split into a list of two:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(NA_ID = any(is.na(date))) %>%
  ungroup() %>%
  group_split(NA_ID, keep = FALSE)
The above dplyr logic can also be implemented in base R using ave and split:
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)
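A small usage sketch for the base R version above: split() on a logical flag names the list elements "FALSE" and "TRUE", so the two subsets can be pulled out directly:
out <- split(df[-4], df$NA_ID)
out[["FALSE"]]  # groups whose dates are all present
out[["TRUE"]]   # groups containing an NA date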

creating a unique variable based on row differences of another variable considering groups

Using the data below, I want to create a new unique customer id based on the contact date.
Rule: I want each customer to get a new unique id every two days, carried over to the following record if that customer's next contact date falls within the following two days; if not, the same customer is assigned a new id.
I couldn't get any further than calculating the date differences.
The original dataset I work with is much bigger, so I would prefer a data.table solution if possible.
library(data.table)
treshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
"2019-01-08","2019-01-09","2019-02-02","2019-02-05",
"2019-02-05","2019-02-09","2019-02-12","2019-02-01",
"2019-02-01","2019-02-05")),
desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
class = "data.frame",
row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days := contact_date - shift(contact_date, type = "lag"), by = customer_id]
dt[, date_diff_in_days := as.numeric(date_diff_in_days)]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: I want each customer to get a new unique id every two days, carried over to the following record if that customer's next contact date falls within the following two days; if not, the same customer is assigned a new id.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check it on your actual data (comparing against results from, e.g., Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or when the threshold is exceeded:
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > treshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
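One caveat worth keeping in mind with this one-liner (my reading, not stated in the answer): it relies on the earlier setorder(dt, customer_id, contact_date) call, which guarantees that date_diff_in_days is NA exactly at each customer's first row, so the counter increments precisely when a new customer begins or a within-customer gap exceeds the threshold.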

Calculate maximum date interval - R

The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular, and I'm trying to calculate the uninterrupted interval in days, starting from the first start date per group.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would be a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap between rows 6 and 7 and to take that point as the end of the maximum interval (34 days). An interval from 2018-10-01 to 2018-10-01 would be counted as 1.
My usual lubridate approaches (interval %within% lag(interval)) don't work with this example.
Any idea?
library(magrittr)
library(data.table)
setDT(data)
first_int <- function(start, stop){
  # a gap occurs when an interval starts after the previous one stops;
  # rleid() numbers the runs between gaps, and == 1 keeps only the
  # first uninterrupted run per group
  ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
  list(start = min(start[ind]),
       stop  = max(stop[ind]))
}
newdata <-
  data[, first_int(start, stop), by = id] %>%
  .[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
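The result above is the collapsed second shape; for the first shape, with the duration repeated on every original row, one possible extension is an update join back onto data (a sketch, not part of the original answer):
data[newdata, duration_from_start := i.duration, on = "id"][]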

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/04/2001", "26/05/2001", "01/06/2001",
                   "30/06/2001", "02/07/2001", "15/07/2001", "21/12/2001"),
                 "%d/%m/%Y"),
  event_type = c(0, 4, 1, 2, 4, 1, 0, 2, 3, 3, 4, 3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers to these two previous posts, but have not been able to address my specific problem in R: multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to use the last event index to look up the last event date.
df <- cbind(df, as.vector(data.frame(count = ave(df$event_type == df$event_type,
                                                 df$event_type, FUN = cumsum))))
df <- rename(df, c("count" = "last_event_index"))  # rename() with this signature comes from plyr
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' values after grouping by 'event_type'. Here I am using the data.table approach: convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'event_type', take the diff of 'date':
library(data.table)
setDT(df)[, days_since_last_event := c(NA, diff(date)), by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or, as @Frank mentioned in the comments, we can use shift (available from v1.9.5 onwards) to get the lag of 'date' (type = "lag" is the default) and subtract it from 'date':
setDT(df)[, days_since_last_event := as.numeric(date - shift(date, type = "lag")),
          by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column:
do.call(rbind,
        lapply(
          split(df, df$event_type),
          function(d) {
            d$dsle <- c(NA, diff(d$date))
            d
          }
        ))
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date (see the sketch below) or save the original row indices if you want to preserve that order.
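For example, assuming the do.call(rbind, ...) result above was saved as res (an illustrative name), the original ordering can be restored with:
res[order(res$date), ]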
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event = date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days
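A possible tweak if the difftime class of days_since_last_event gets in the way: wrap the subtraction in as.numeric(), mirroring the data.table answer above:
df %>%
  group_by(event_type) %>%
  mutate(days_since_last_event = as.numeric(date - lag(date, 1)))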
