left_join (dplyr) the next available date - r

I have 2 datasets in R.
The first dataset, DB, contains specific dates:
Value Date
# 20 2017-10-19
# 19 2017-10-23
# 19 2017-11-03
# 20 2017-11-10
And the second, Hist, contains the level of a stock index over the last 5 years:
Date Index
# 2017-11-10 13.206,35
# 2017-11-03 13.378,96
# 2017-10-25 13.404,58
# 2017-10-19 13.517,98
Now I want to merge them: for each date in the first dataset DB, look up the corresponding Index value in the second dataset Hist.
What I did was use the left_join function:
DB <- left_join(DB, Hist, by = "Date")
The problem is that some dates in the first dataset are public holidays, for which no data is available in the second dataset Hist. So I get some NA values.
Value Date Index
# 20 2017-10-19 13.517,98
# 19 2017-10-23 NA
# 19 2017-11-03 13.378,96
# 20 2017-11-10 13.206,35
What I'm looking for is to take the value of the next available date instead of getting NA.
Example: instead of NA, take the index of 2017-10-25 (2 days later):
Value Date Index
# 20 2017-10-19 13.517,98
# 19 2017-10-23 13.404,58
# 19 2017-11-03 13.378,96
# 20 2017-11-10 13.206,35
Does anybody have an idea? Thanks in advance!

Original Request
The following is an option. It uses full_join, and then the fill function to impute the missing value.
library(tidyverse)
DB_final <- DB %>%
  full_join(Hist, by = "Date") %>%
  arrange(Date) %>%
  fill(Index, .direction = "up") %>%
  filter(!is.na(Value))
DB_final
# Value Date Index
# 1 20 2017-10-19 13.517,98
# 2 19 2017-10-23 13.404,58
# 3 19 2017-11-03 13.378,96
# 4 20 2017-11-10 13.206,35
However, the user needs to know the fill direction (up or down) in advance, so this may not help when that direction is unknown.
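As an aside, newer tidyr versions (>= 1.0.0) also accept the combined directions "updown" and "downup", which sidestep that choice. A minimal sketch of the same pipeline:
# "updown" fills upward first (taking the next available Index),
# then downward for any NAs left at the end of the series.
DB %>%
  full_join(Hist, by = "Date") %>%
  arrange(Date) %>%
  fill(Index, .direction = "updown") %>%
  filter(!is.na(Value))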
Impute Missing Value based on the Nearest Date
Here is another option, which I think is more robust. It imputes the missing value using the Index from the nearest date.
Step 1: Find the Nearest Date
# Collect all dates
Date_vec <- sort(unique(c(DB$Date, Hist$Date)))
# Create a distance matrix based on dates, then convert it to a data frame
dt <- Date_vec %>%
  dist() %>%
  as.matrix() %>%
  as.data.frame() %>%
  rowid_to_column(var = "ID") %>%
  gather(ID2, Value, -ID) %>%
  mutate(ID2 = as.integer(ID2)) %>%
  filter(ID != ID2) %>%
  arrange(ID, Value) %>%
  group_by(ID) %>%
  slice(1) %>%
  select(-Value)
dt$ID <- Date_vec[dt$ID]
dt$ID2 <- Date_vec[dt$ID2]
names(dt) <- c("Date1", "Date2")
dt
# # A tibble: 5 x 2
# # Groups: ID [5]
# Date1 Date2
# <date> <date>
# 1 2017-10-19 2017-10-23
# 2 2017-10-23 2017-10-25
# 3 2017-10-25 2017-10-23
# 4 2017-11-03 2017-11-10
# 5 2017-11-10 2017-11-03
dt shows, for each date, the nearest other date.
Step 2: Perform Multiple Joins
Join DB and dt, and then join Hist twice, once on each date column.
DB2 <- DB %>% left_join(dt, by = c("Date" = "Date1"))
DB3 <- DB2 %>%
  left_join(Hist, by = "Date") %>%
  left_join(Hist, by = c("Date2" = "Date"))
DB3
# Value Date Date2 Index.x Index.y
# 1 20 2017-10-19 2017-10-23 13.517,98 <NA>
# 2 19 2017-10-23 2017-10-25 <NA> 13.404,58
# 3 19 2017-11-03 2017-11-10 13.378,96 13.206,35
# 4 20 2017-11-10 2017-11-03 13.206,35 13.378,96
Step 3: Finalize the Index
If there is a value in Index.x, use it; otherwise, use the value in Index.y.
DB4 <- DB3 %>%
  mutate(Index = ifelse(is.na(Index.x), Index.y, Index.x)) %>%
  select(Value, Date, Index)
DB4
# Value Date Index
# 1 20 2017-10-19 13.517,98
# 2 19 2017-10-23 13.404,58
# 3 19 2017-11-03 13.378,96
# 4 20 2017-11-10 13.206,35
DB4 is the final output.
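As an alternative, if data.table is an option, a rolling join gets the same result in one line: roll = -Inf takes the value from the next available date, and roll = "nearest" from the closest date in either direction. A sketch, assuming the Date columns from the DATA section below:
library(data.table)
setDT(DB); setDT(Hist)
# roll = -Inf: for dates missing in Hist, carry the next Index backward
Hist[DB, on = "Date", roll = -Inf]
# roll = "nearest": take the Index from the nearest date instead
Hist[DB, on = "Date", roll = "nearest"]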
DATA
DB <- structure(list(Value = c(20L, 19L, 19L, 20L), Date = structure(c(17458,
17462, 17473, 17480), class = "Date")), class = "data.frame", .Names = c("Value",
"Date"), row.names = c(NA, -4L))
Hist <- structure(list(Date = structure(c(17480, 17473, 17464, 17458), class = "Date"),
Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98"
)), class = "data.frame", .Names = c("Date", "Index"), row.names = c(NA,
-4L))

A solution could be:
library(dplyr)
library(rlang)
clean_df <- function(df) {
  ix <- which(is.na(df$Index))
  df$Index[ix] <- df$Index[ix + 1]
  filter(df, !is.na(.data$Value))
}
full_join(DB, Hist) %>%
  arrange(Date) %>%
  clean_df()
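Note that clean_df shifts Index up by exactly one row, so two consecutive missing dates would still leave an NA. If that edge case matters, zoo::na.locf with fromLast = TRUE carries the next non-NA value back through an entire run; a sketch:
library(zoo)
clean_df2 <- function(df) {
  # fromLast = TRUE is "next observation carried backward": every NA,
  # even in a run of several, gets the next available Index.
  df$Index <- zoo::na.locf(df$Index, fromLast = TRUE, na.rm = FALSE)
  filter(df, !is.na(.data$Value))
}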

What you have done, plus as.Date() to format dates:
library(data.table)
library(dplyr)
DB = data.table(
  Value = c(20, 19, 19, 29),
  Date = c("2017-10-19", "2017-10-23", "2017-11-03", "2017-11-10")
)
Hist = data.table(
  Date = c("2017-11-10", "2017-11-03", "2017-10-25", "2017-10-19"),
  Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98")
)
DB[, Date := as.Date(Date)]
Hist[, Date := as.Date(Date)]
DB <- left_join(DB, Hist, by = "Date") %>% as.data.table()
Now perform the steps below:
# Get rows which are missing an Index.
DB_na <- DB[is.na(Index),]
DB <- DB[!is.na(Index),]
# Build function to find appropriate Index, given an na_date.
get_na_index <- function(na_date) {
  # Among the later dates that do have an Index, take the closest one.
  bigger_dates = DB[Date > na_date, ]
  index = bigger_dates[which.min(Date - na_date), Index]
  return(index)
}
# Use apply() to perform row-wise operation.
DB_na$Index <- apply(matrix(DB_na$Date), 1, get_na_index)
# Combine rows
DB <- rbind(DB, DB_na) %>% arrange(Date)
The Output:
DB
Value Date Index
1 20 2017-10-19 13.517,98
2 19 2017-10-23 13.378,96
3 19 2017-11-03 13.378,96
4 29 2017-11-10 13.206,35
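The row-wise apply() can also be replaced with a single vectorized lookup via findInterval(), assuming DB (the rows that do have an Index) is sorted by Date; like the answer above, this looks for the next date within DB itself:
# findInterval() returns, for each missing date, the position of the last
# Date at or before it; adding 1 points at the first strictly later Date.
setorder(DB, Date)
pos <- findInterval(DB_na$Date, DB$Date) + 1
DB_na$Index <- DB$Index[pos]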

Related

Subsetting data by time frame window for each row in a data frame after using group_by()

I have time-series data with three columns: a value column, a group_var column (used for grouping), and a date column. For each row in the data frame, I'd like to get the mean of that row's group after further subsetting by a specific timeframe. Here's an example of the code for subsetting:
df$value[df$date >= (current_row$date - 545) & df$date <= (current_row$date - 365)]
After I get this subset I can easily apply mean(), but where I'm stuck is how to get this code to work with something like this:
df %>%
  group_by(group_var) %>%
  mutate(subset_mean = mean(
    df$value[df$date >= (current_row$date - 545) & df$date <= (current_row$date - 365)]
  ))
The issue I see is that I don't think I can use 'df' inside the mutate() line after I group the original 'df'. Also, I'm not sure how I can create a 'current_row' variable for referencing the current row to calculate the data subset.
Edit:
Added example data and reproducible code
library(dplyr)
date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
date <- date %>% as.Date(format = "%Y-%m-%d")
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)
df %>%
  group_by(group_var)
I would suggest using slider::slide_index_dbl for this:
library(dplyr)
df %>%
  group_by(group_var) %>%
  arrange(group_var, date) %>% # slider 0.1.5 requires the window variable to be ascending
  mutate(subset_mean = slider::slide_index_dbl(
    val, date, mean, .before = 545, .after = -365
    # negative ".after" means the window ends before the current date
  )) %>%
  ungroup()
With the updated data, I get
# A tibble: 8 x 4
date val group_var subset_mean
<date> <dbl> <chr> <dbl>
1 2016-01-13 44 A NaN
2 2016-02-03 10 A NaN
3 2017-04-16 70 A 27
4 2017-04-24 67 A 27
5 2016-01-27 30 B NaN
6 2016-03-15 50 B NaN
7 2016-06-14 20 B NaN
8 2017-06-15 42 B 33.3
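As a quick sanity check of the window semantics: for the 2017-04-16 row in group A, the window [date - 545, date - 365] runs from 2015-10-19 to 2016-04-16, which catches the values 44 and 10, hence the mean of 27:
target <- as.Date("2017-04-16")
a <- df[df$group_var == "A", ]
mean(a$val[a$date >= target - 545 & a$date <= target - 365])
#> [1] 27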
1) This can be done with a self join using SQL:
library(sqldf)
sqldf("select a.date, a.val, a.group_var, avg(b.val) as mean
from df a
left join df b on a.group_var = b.group_var and
b.date between a.date - 595 and a.date - 365
group by a.rowid")
giving:
date val group_var mean
1 2016-02-03 10 A NA
2 2016-06-14 20 B NA
3 2016-03-15 50 B NA
4 2017-04-16 70 A 27.00000
5 2016-01-27 30 B NA
6 2016-01-13 44 A NA
7 2017-04-24 67 A 27.00000
8 2017-06-15 42 B 33.33333
2) or we can use SQL window functions:
sqldf("select date, val, group_var,
avg(val) over (partition by group_var
order by date
range between 595 preceding and 365 preceding) as mean
from df"
)
giving:
date val group_var mean
1 2016-01-13 44 A NA
2 2016-02-03 10 A NA
3 2017-04-16 70 A 27.00000
4 2017-04-24 67 A 27.00000
5 2016-01-27 30 B NA
6 2016-03-15 50 B NA
7 2016-06-14 20 B NA
8 2017-06-15 42 B 33.33333
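For completeness, dplyr >= 1.1.0 can express the same self join natively with join_by(); the window bounds must be precomputed as columns, since join_by() conditions reference columns rather than expressions. A sketch mirroring the 595/365-day window used above:
library(dplyr) # >= 1.1.0 for join_by()
df %>%
  mutate(lo = date - 595, hi = date - 365) %>%
  # match rows of the same group whose date falls in [lo, hi]
  left_join(df, by = join_by(group_var, between(y$date, x$lo, x$hi)),
            suffix = c("", ".b")) %>%
  summarise(mean = mean(val.b), .by = c(date, val, group_var))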
Lubridate provides a very elegant solution...
library(tidyverse)
library(lubridate)
df = tibble(
  value = runif(100, 1, 100),
  group = rep(1:4, 25),
  dt = as.Date(round(runif(100, 1000, 2000)), origin = "1970-01-01")
)
first_year <- interval(ymd("1972-01-01"), ymd("1972-12-31"))
sec_year <- interval(ymd("1973-01-01"), ymd("1973-12-31"))
further <- interval(ymd("1974-01-01"), ymd("1975-12-31"))
df <- df %>%
  mutate(
    range = case_when(
      dt %within% first_year ~ "1972",
      dt %within% sec_year ~ "1973",
      TRUE ~ "1974-1975"
    )
  )
mean_by_group_interval <- df %>%
  group_by(group, range) %>%
  summarise(mean = mean(value))
Here is a solution that utilizes the dplyr package.
library(dplyr)
date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
date <- date %>% as.Date(format = "%Y-%m-%d")
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)
df %>%
  group_by(group_var) %>%
  arrange(group_var, date) %>%
  mutate(
    # Determine if the current date minus the first date of each group is between 365 and 595 days.
    match = between(date - first(date), 365, 595),
    # Count the number of dates that are not within the range described above, for use in calculating the mean.
    count_false = sum(match == FALSE),
    # Calculate the cumulative sum for rows in each group that are not within the range described above.
    sum_match_false = ifelse(match == FALSE, cumsum(val), NA),
    # Calculate the mean.
    mean_match_true = ifelse(match == TRUE, max(sum_match_false, na.rm = TRUE) / count_false, NA)
  ) %>%
  # Return only these variables.
  select(date, val, group_var, mean_match_true)
#> date val group_var mean_match_true
#> <date> <dbl> <chr> <dbl>
#> 1 2016-01-13 44 A NA
#> 2 2016-02-03 10 A NA
#> 3 2017-04-16 70 A 27
#> 4 2017-04-24 67 A 27
#> 5 2016-01-27 30 B NA
#> 6 2016-03-15 50 B NA
#> 7 2016-06-14 20 B NA
#> 8 2017-06-15 42 B 33.3
Created on 2021-03-12 by the reprex package (v0.3.0)

Count active observations by week

I have a data frame of observations with a start and end date for each observation indicating the period it was active.
The duration active varies by observation, and can spread across multiple weeks.
Some observations are still active and do not have an end date.
For a given date range, how can I count the number of observations that were active during a week within that date range, including those still active?
I have a crude method that works, but is pretty slow. It seems like there has to be a more efficient and simpler way to do this.
EDIT: My first approach was similar to Ronak's solution, which is definitely better than mine for smaller data sets, but my real data set has more observations and longer date ranges, so I run into memory constraints.
#I'm primarily using tidyverse/lubridate, but definitely open to other solutions.
library(tidyverse)
library(lubridate)
# sample data frame of observations with start and end dates:
df_obs <- tibble(
  observation = c(1:10),
  date_start = as_date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04", "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11", "2020-03-13", "2020-02-07")),
  date_end = as_date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16", "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))
)
# to account for observations that are still active, NAs are converted to today's date:
df_obs <- mutate(df_obs, date_end = if_else(is.na(date_end), Sys.Date(), date_end))
# create a data frame of weeks by start and end date to count the active observations in a given week
# for this example I'm just using date ranges from the sample data:
df_weeks <-
  seq(min(df_obs$date_start), max(df_obs$date_start), by = 'day') %>%
  enframe(NULL, 'week_start') %>%
  mutate(week_start = as_date(cut(week_start, "week"))) %>%
  mutate(week_end = week_start + 6) %>%
  distinct()
# create a function that filters the observations data frame based on start and end dates:
check_active <- function(d, s, e){
  d %>%
    filter(date_start <= e) %>%
    filter(date_end >= s) %>%
    nrow()
}
# applying that function to each week in the date range data frame gives the expected results:
df_weeks %>%
  rowwise() %>%
  mutate(total_active = check_active(df_obs, week_start, week_end)) %>%
  select(-week_end) %>%
  ungroup()
# A tibble: 12 x 2
week_start total_active
<date> <int>
1 2019-12-30 1
2 2020-01-06 3
3 2020-01-13 3
4 2020-01-20 4
5 2020-01-27 4
6 2020-02-03 6
7 2020-02-10 7
8 2020-02-17 7
9 2020-02-24 6
10 2020-03-02 4
11 2020-03-09 4
12 2020-03-16 5
Here is one way:
library(tidyverse)
df_obs %>%
  # Replace NA with today's date and
  # create a sequence between start and end date
  mutate(date_end = replace(date_end, is.na(date_end), Sys.Date()),
         date = map2(date_start, date_end, seq, "day")) %>%
  # Get data in long format
  unnest(date) %>%
  # Drop start and end date
  select(-date_start, -date_end) %>%
  # Cut data by week
  mutate(date = cut(date, "week")) %>%
  # Get unique rows for observation and date
  distinct(observation, date) %>%
  # Count the number of observations in each week
  count(date)
which returns:
# A tibble: 14 x 2
# value n
# <fct> <int>
# 1 2019-12-30 1
# 2 2020-01-06 3
# 3 2020-01-13 3
# 4 2020-01-20 4
# 5 2020-01-27 4
# 6 2020-02-03 6
# 7 2020-02-10 7
# 8 2020-02-17 7
# 9 2020-02-24 6
#10 2020-03-02 4
#11 2020-03-09 4
#12 2020-03-16 5
#13 2020-03-23 4
#14 2020-03-30 3
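Since the original post mentions memory constraints with this expand-to-days approach, a non-equi join is a leaner alternative: it counts interval overlaps directly without materializing one row per active day. A sketch with data.table, reusing df_obs (after the NA replacement) and df_weeks from the question:
library(data.table)
obs <- as.data.table(df_obs)
weeks <- as.data.table(df_weeks)
# For each week, count observations whose [date_start, date_end]
# interval overlaps [week_start, week_end]; by = .EACHI returns one
# row per week, and the sum() counts matches (0 when there are none).
res <- obs[weeks,
           on = .(date_start <= week_end, date_end >= week_start),
           .(total_active = sum(!is.na(observation))),
           by = .EACHI]
weeks[, total_active := res$total_active]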

Count calendar days within a date interval using lubridate

I have a data set of hospital admission and discharge days, from which I want to generate an occupied-beds count for each calendar day over a period of three years. I am using the tidyverse and lubridate packages.
My approach so far has been to convert the admit/discharge columns into an interval (the data are sensitive so I can't share actual dates):
d <- d %>%
  mutate(duration = admit %--% discharge)
and then to create a tibble where each row corresponds to the time range, plus a column of zeroes that can be added to in a for loop:
t <-
  tibble(
    days = as.Date(date("2017-01-01"):date("2019-12-31")),
    count = 0
  )
Unfortunately, I can't figure out how to create a for loop that would sum days that fall within each interval. Here is my attempt thus far, which gives me uniform values of 24 throughout:
for(i in timeline$days) {
  if (i %within% d$duration)
    timeline$count = timeline$count + 1
}
Sample data.
library(dplyr)
set.seed(42)
d <- tibble(admit = Sys.Date() - sample(300, size = 1000, replace = TRUE)) %>%
  mutate(discharge = admit + sample(0:30, size = 1000, replace = TRUE))
d
# # A tibble: 1,000 x 2
# admit discharge
# <date> <date>
# 1 2019-06-18 2019-07-14
# 2 2019-06-11 2019-06-12
# 3 2019-12-24 2020-01-18
# 4 2019-07-13 2019-07-29
# 5 2019-09-08 2019-09-23
# 6 2019-10-15 2019-10-15
# 7 2019-08-11 2019-08-28
# 8 2020-02-07 2020-02-29
# 9 2019-09-03 2019-09-10
# 10 2019-08-20 2019-09-14
# # ... with 990 more rows
We can produce a list of date ranges/sequences with Map (or purrr::pmap):
Map(seq.Date, d$admit, d$discharge, list(by = "days"))[1:2]
# [[1]]
# [1] "2019-06-18" "2019-06-19" "2019-06-20" "2019-06-21" "2019-06-22" "2019-06-23" "2019-06-24"
# [8] "2019-06-25" "2019-06-26" "2019-06-27" "2019-06-28" "2019-06-29" "2019-06-30" "2019-07-01"
# [15] "2019-07-02" "2019-07-03" "2019-07-04" "2019-07-05" "2019-07-06" "2019-07-07" "2019-07-08"
# [22] "2019-07-09" "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14"
# [[2]]
# [1] "2019-06-11" "2019-06-12"
and then combine these, tabulate them (with table), and enframe them:
Map(seq.Date, d$admit, d$discharge, list(by = "days")) %>%
  do.call(c, .) %>%
  table() %>%
  tibble::enframe(name = "date", value = "count") %>%
  # because `table` preserves a *character* representation of the Date
  mutate(date = as.Date(date)) %>%
  arrange(date)
# # A tibble: 328 x 2
# date count
# <date> <table>
# 1 2019-05-24 1
# 2 2019-05-25 3
# 3 2019-05-26 7
# 4 2019-05-27 8
# 5 2019-05-28 9
# 6 2019-05-29 14
# 7 2019-05-30 20
# 8 2019-05-31 20
# 9 2019-06-01 20
# 10 2019-06-02 21
# # ... with 318 more rows
Here is another method using tidyverse functions.
library(tidyverse)
d %>%
  mutate(days = map2(admit, discharge, seq, by = "day")) %>%
  unnest(days) %>%
  count(days) %>%
  right_join(t, by = "days") %>%
  mutate(n = coalesce(n, as.integer(count))) %>%
  select(-count)
We create a sequence of dates between admit and discharge, count every unique date, and join the result with t so that all the dates in t remain intact.
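A cumulative-sum trick avoids expanding rows entirely: occupancy on a day equals the number of admissions up to that day minus the number of discharges strictly before it (so the discharge day itself still counts as occupied). A base R sketch on the sample d:
# One slot per calendar day in the observed range.
all_days <- seq(min(d$admit), max(d$discharge), by = "day")
# Tabulate admissions per day, and "exits" on the day after discharge.
admits <- table(factor(d$admit, levels = as.character(all_days)))
exits <- table(factor(d$discharge + 1, levels = as.character(all_days)))
occupied <- cumsum(as.integer(admits)) - cumsum(as.integer(exits))
head(data.frame(date = all_days, count = occupied))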

Expand start and end dates into a sequence of beginning and ending dates by calendar month

Given a table
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017
I'm trying to split it by calendar month, as in the following table:
id start end
1 22/03/2016 31/03/2016
1 01/04/2016 30/04/2016
1 01/05/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 30/09/2017
3 01/10/2017 31/10/2017
3 01/11/2017 30/11/2017
3 01/12/2017 25/12/2017
I'm trying to modify a code extract from how to split rows of a dataframe in multiple rows based on start date and end date?, but I have not been able to modify the code correctly. The problem generally occurs in months with 30 days; it may be easy, but I am not yet familiar with regular expressions.
#sample data
df <- data.frame("starting_date" = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
"end_date" = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
col3=c('1','2', '3'))
df1 <- df[,1:2] %>%
rowwise() %>%
do(rbind(data.frame(matrix(as.character(c(
.$starting_date,
seq(.$starting_date, .$end_date, by=1)[grep("\\d{4}-\\d{2}-31|\\d{4}-\\d{2}-01", seq(.$starting_date, .$end_date, by=1))],
.$end_date)), ncol=2, byrow=T))
)
) %>%
data.frame() %>%
`colnames<-`(c("starting_date", "end_date")) %>%
mutate(starting_date= as.Date(starting_date, format= "%Y-%m-%d"),
end_date= as.Date(end_date, format= "%Y-%m-%d"))
#add temporary columns to the original and expanded date column dataframes
df$row_idx <- seq_len(nrow(df))
df$temp_col <- (year(df$end_date) - year(df$starting_date)) + 1
df1 <- cbind(df1, row_idx = rep(df$row_idx, df$temp_col))
#join both dataframes to get the final result
final_df <- left_join(df1, df[, 3:(ncol(df) - 1)], by = "row_idx") %>%
  select(-row_idx)
final_df
If anyone knows how to modify the code, or a better way to do it, I would be very grateful.
We assume there is an error in the sample output in the question since the third row spans parts of two months and so should be split into two rows.
Define Seq, which, given single start and end Date values, produces a data.frame of start and end columns; then run it on each id using group_by:
library(dplyr)
library(zoo)
Seq <- function(start, end) {
  ym <- seq(as.yearmon(start), as.yearmon(end), 1/12)
  starts <- pmax(start, as.Date(ym, frac = 0))
  ends <- pmin(end, as.Date(ym, frac = 1))
  unique(data.frame(start = starts, end = ends))
}
fmt <- "%d/%m/%Y"
DF %>%
  mutate(start = as.Date(start, fmt), end = as.Date(end, fmt)) %>%
  group_by(id) %>%
  do(Seq(.$start, .$end)) %>%
  ungroup
giving:
# A tibble: 9 x 3
id start end
<int> <date> <date>
1 1 2016-03-22 2016-03-31
2 1 2016-04-01 2016-04-30
3 1 2016-05-01 2016-05-31
4 1 2016-06-01 2016-06-05
5 2 2016-08-17 2016-08-29
6 3 2017-09-22 2017-09-30
7 3 2017-10-01 2017-10-31
8 3 2017-11-01 2017-11-30
9 3 2017-12-01 2017-12-25
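The frac argument used inside Seq is what maps a yearmon back to a concrete day: frac = 0 gives the first day of the month and frac = 1 the last, e.g.
library(zoo)
ym <- as.yearmon("2016-03")
as.Date(ym, frac = 0) # "2016-03-01", first day of the month
as.Date(ym, frac = 1) # "2016-03-31", last day of the month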
Note
The input DF in reproducible form:
Lines <- "
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017"
DF <- read.table(text = Lines, header = TRUE)
So there's probably a more elegant way to accomplish this, and I feel like I've seen similar questions but could not find a duplicate quickly, so here goes...
SETUP
library(tidyverse)
library(lubridate)
df <- data.frame(
  id = c('1', '2', '3'),
  starting_date = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
  end_date = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
  stringsAsFactors = FALSE
)
df
#> id starting_date end_date
#> 1 1 2016-03-22 2016-06-05
#> 2 2 2016-08-17 2016-08-29
#> 3 3 2017-09-12 2017-12-25
SOLUTION
df %>%
  group_by(id) %>%
  mutate(
    date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
  ) %>%
  unnest() %>%
  mutate(row = row_number()) %>%
  mutate(
    new_end_date = if_else(row == max(row), end_date, date_seq),
    new_start_date = if_else(row == min(row), starting_date, floor_date(new_end_date, "month"))
  ) %>%
  select(id, new_start_date, new_end_date)
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id new_start_date new_end_date
#> <chr> <date> <date>
#> 1 1 2016-03-22 2016-03-31
#> 2 1 2016-04-01 2016-04-30
#> 3 1 2016-06-01 2016-06-05
#> 4 2 2016-08-17 2016-08-29
#> 5 3 2017-09-12 2017-09-30
#> 6 3 2017-10-01 2017-10-31
#> 7 3 2017-11-01 2017-11-30
#> 8 3 2017-12-01 2017-12-25
EXPLANATION
Much of what's going on here takes place in the first mutate call which creates date_seq. To understand it, consider the following:
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month")
# [1] "2016-03-22" "2016-04-22" "2016-05-22"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month")
# [1] "2016-04-01" "2016-05-01" "2016-06-01"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month") - 1
# [1] "2016-03-31" "2016-04-30" "2016-05-31"
So basically, create a sequence of "end-of-month" dates between the original start and end dates. Putting this in a list-column allows us to organize by the id so that we unnest appropriately. Check out the output after the unnest():
df %>%
  group_by(id) %>%
  mutate(
    date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
  ) %>%
  unnest()
From there I hope things are relatively straightforward. The row_number() could probably have been replaced with something fancier like first()/last(), but I thought this might be easier to follow.

Creating a ts time series with missing values from a data frame

I have a data frame containing a time series of monthly data, with some missing values.
dates <- seq(
  as.Date("2010-01-01"), as.Date("2017-12-01"), "1 month"
)
n_dates <- length(dates)
dates <- dates[runif(n_dates) < 0.5]
time_data <- data.frame(
  date = dates,
  value = rnorm(length(dates))
)
## date value
## 1 2010-02-01 1.3625419
## 2 2010-06-01 0.1512481
## etc.
In order to be able to make use of time series forecasting functionality in, e.g., forecast, I'd like to convert this to a ts object.
The dumb way to do this is to create a regular set of monthly dates over the whole time period, then left join back to the original data.
library(dplyr)
first_date <- min(time_data$date)
last_date <- max(time_data$date)
full_dates <- data.frame(
  date = seq(first_date, last_date, "1 month")
)
extended_time_data <- left_join(full_dates, time_data, by = "date")
## date value
## 1 2010-02-01 1.3625419
## 2 2010-03-01 NA
## etc.
Now I can create the time series using ts().
library(lubridate)
time_series <- ts(
  extended_time_data$value,
  start = c(year(first_date), month(first_date)),
  frequency = 12
)
For such a simple task, this is long-winded and pretty gross.
I also looked into first converting to xts, and using a convertor from the timetk package, but nothing jumped out at me as an easier way.
This question is a dupe of How to create time series with missing datetime values, but the answer there was even fuzzier.
How do I create a ts object from a time series with missing values?
Using the input data frame defined in the Note at the end, convert it to a zoo object with index of class yearmon. Then as.ts will convert it to ts.
library(zoo)
z <- read.zoo(DF, FUN = as.yearmon)
as.ts(z)
## Jan Feb Mar Apr May Jun Jul Aug
## 2000 1 NA NA 2 3 NA 4 5
If you prefer to express it in terms of pipes:
library(magrittr)
library(zoo)
DF %>% read.zoo(FUN = as.yearmon) %>% as.ts
If desired, interpolate the values in the time series using na.locf (last observation carried forward), na.approx (linear interpolation), na.spline, na.StructTS (seasonal Kalman filter) or another zoo NA-filling function, e.g.
library(forecast)
DF %>% read.zoo(FUN = as.yearmon) %>% as.ts %>% na.spline %>% forecast
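For example, na.approx linearly fills the two gaps in the small series defined in the Note below:
library(zoo)
z <- read.zoo(DF, FUN = as.yearmon)
na.approx(as.ts(z))
## Feb/Mar become 1.33/1.67 and Jun becomes 3.5 by linear interpolation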
Note
The data in the question is not reproducible because random numbers are used without set.seed and n_dates is undefined. Below we define a data frame DF reproducibly for purposes of example.
library(zoo)
dates <- as.Date(as.yearmon("2000-01") + c(0, 3, 4, 6, 7)/12)
DF <- data.frame(dates, values = seq_along(dates))
giving:
> DF
dates values
1 2000-01-01 1
2 2000-04-01 2
3 2000-05-01 3
4 2000-07-01 4
5 2000-08-01 5
Instead of using the left_join, an easier option is complete. Then convert it to a tsibble object, which is now compatible with the forecast package functions:
library(tidyverse)
library(tsibble)
time_data %>%
  complete(date = seq(min(date), max(date), by = "1 month"),
           fill = list(value = NA)) %>%
  as_tsibble(index = date)
# A tsibble: 94 x 2 [1D]
# date value
# <date> <dbl>
# 1 2010-02-01 1.02
# 2 2010-03-01 NA
# 3 2010-04-01 NA
# 4 2010-05-01 1.75
# 5 2010-06-01 NA
# 6 2010-07-01 NA
# 7 2010-08-01 -0.233
# 8 2010-09-01 NA
# 9 2010-10-01 NA
#10 2010-11-01 -0.987
# ... with 84 more rows
As mentioned above, it is compatible with the forecast functions:
library(fable)
time_data %>%
  complete(date = seq(min(date), max(date), by = "1 month"),
           fill = list(value = 0)) %>%
  as_tsibble(index = date) %>%
  ETS(value) %>%
  forecast %>%
  autoplot
NOTE: Here, the missing values are imputed as 0.
It can instead be imputed with the previous non-NA value using fill:
time_data %>%
  complete(date = seq(min(date), max(date), by = "1 month")) %>%
  fill(value) %>%
  as_tsibble(index = date) %>%
  ETS(value) %>%
  forecast %>%
  autoplot
data
n_dates <- 3
A base R option, using set.seed(789) before running your data generation:
temp <- which(full_dates$date%in%time_data$date)
full_dates$new[temp] <- time_data$value
head(full_dates, 20)
date new
1 2010-02-01 0.62589399
2 2010-03-01 0.98117664
3 2010-04-01 NA
4 2010-05-01 -0.04770986
5 2010-06-01 -1.51961483
6 2010-07-01 NA
7 2010-08-01 0.79493644
8 2010-09-01 -0.14423251
9 2010-10-01 -0.70649791
10 2010-11-01 0.61071247
11 2010-12-01 NA
12 2011-01-01 1.08506164
13 2011-02-01 -0.71134925
14 2011-03-01 1.15628805
15 2011-04-01 1.23556280
16 2011-05-01 -0.32245531
17 2011-06-01 NA
18 2011-07-01 NA
19 2011-08-01 0.73277540
20 2011-09-01 -0.28752883
or the same result, but using data.table:
setDT(full_dates)[temp, new := time_data$value]
Now to xts
xts::xts(full_dates[,-1], order.by = full_dates$date, frequency = 12 )
