Creating a ts time series with missing values from a data frame - r

I have a data frame containing a time series of monthly data, with some missing values.
dates <- seq(
as.Date("2010-01-01"), as.Date("2017-12-01"), "1 month"
)
n_dates <- length(dates)
dates <- dates[runif(n_dates) < 0.5]
time_data <- data.frame(
date = dates,
value = rnorm(length(dates))
)
## date value
## 1 2010-02-01 1.3625419
## 2 2010-06-01 0.1512481
## etc.
In order to be able to make use of time series forecasting functionality in, e.g., forecast, I'd like to convert this to a ts object.
The dumb way to do this is to create a regular set of monthly dates over the whole time period, then left join back to the original data.
library(dplyr)
first_date <- min(time_data$date)
last_date <- max(time_data$date)
full_dates <- data.frame(
date = seq(first_date, last_date, "1 month")
)
extended_time_data <- left_join(full_dates, time_data, by = "date")
## date value
## 1 2010-02-01 1.3625419
## 2 2010-03-01 NA
## etc.
Now I can create the time series using ts().
library(lubridate)
time_series <- ts(
extended_time_data$value,
start = c(year(first_date), month(first_date)),
frequency = 12
)
For such a simple task, this is long-winded and pretty gross.
I also looked into first converting to xts, and using a convertor from the timetk package, but nothing jumped out at me as an easier way.
This question is a dupe of How to create time series with missing datetime values, but the answer there was even fuzzier.
How do I create a ts object from a time series with missing values?

Using the input data frame defined in the Note at the end, convert it to a zoo object with index of class yearmon. Then as.ts will convert it to ts.
library(zoo)
z <- read.zoo(DF, FUN = as.yearmon)
as.ts(z)
## Jan Feb Mar Apr May Jun Jul Aug
## 2000 1 NA NA 2 3 NA 4 5
If you prefer to express it in terms of pipes:
library(magrittr)
library(zoo)
DF %>% read.zoo(FUN = as.yearmon) %>% as.ts
If desired, fill in the missing values in the time series using na.locf (last observation carried forward), na.approx (linear interpolation), na.spline, na.StructTS (seasonal Kalman filter) or another zoo NA-filling function. For example:
library(forecast)
DF %>% read.zoo(FUN = as.yearmon) %>% as.ts %>% na.spline %>% forecast
Note
The data in the question is not reproducible because random numbers are used without set.seed and n_dates is undefined. Below we define a data frame DF reproducibly for purposes of example.
library(zoo)
dates <- as.Date(as.yearmon("2000-01") + c(0, 3, 4, 6, 7)/12)
DF <- data.frame(dates, values = seq_along(dates))
giving:
> DF
dates values
1 2000-01-01 1
2 2000-04-01 2
3 2000-05-01 3
4 2000-07-01 4
5 2000-08-01 5
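As a minimal sketch of how two of the NA-filling functions differ, the following rebuilds DF from the Note and applies na.locf and na.approx to the underlying values. Working on the plain numeric vector and rebuilding the ts afterwards is a choice made here for illustration, and the names x, v, locf and lin are not from the answer.

```r
library(zoo)

# Example data from the Note: five observed months out of eight
dates <- as.Date(as.yearmon("2000-01") + c(0, 3, 4, 6, 7) / 12)
DF <- data.frame(dates, values = seq_along(dates))

# Monthly ts with NA for the missing months
x <- as.ts(read.zoo(DF, FUN = as.yearmon))
v <- as.numeric(x)

# Last observation carried forward vs. linear interpolation
locf <- ts(na.locf(v), start = start(x), frequency = frequency(x))
lin  <- ts(na.approx(v), start = start(x), frequency = frequency(x))
```

With this input, locf repeats the previous observed value across each gap, while lin places the filled values on a straight line between the neighbouring observations.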

Instead of using left_join, an easier option is complete from tidyr. The result can then be converted to a tsibble object, which is compatible with the forecasting functions of the fable package:
library(tidyverse)
library(tsibble)
time_data %>%
complete(date = seq(min(date), max(date), by = "1 month"),
fill = list(value = NA)) %>%
as_tsibble(index = date)
# A tsibble: 94 x 2 [1D]
# date value
# <date> <dbl>
# 1 2010-02-01 1.02
# 2 2010-03-01 NA
# 3 2010-04-01 NA
# 4 2010-05-01 1.75
# 5 2010-06-01 NA
# 6 2010-07-01 NA
# 7 2010-08-01 -0.233
# 8 2010-09-01 NA
# 9 2010-10-01 NA
#10 2010-11-01 -0.987
# ... with 84 more rows
As mentioned above, it is compatible with the forecast functions:
library(fable)
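time_data %>%
complete(date = seq(min(date), max(date), by = "1 month"),
fill = list(value = 0)) %>%
# a monthly index avoids implicit gaps between the first-of-month dates
mutate(date = yearmonth(date)) %>%
as_tsibble(index = date) %>%
# recent fable versions fit models through model()
model(ETS(value)) %>%
forecast() %>%
autoplot()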
NOTE: Here, the missing values are imputed as 0.
Alternatively, missing values can be imputed with the previous non-NA value using fill:
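time_data %>%
complete(date = seq(min(date), max(date), by = "1 month")) %>%
fill(value) %>%
# a monthly index avoids implicit gaps between the first-of-month dates
mutate(date = yearmonth(date)) %>%
as_tsibble(index = date) %>%
# recent fable versions fit models through model()
model(ETS(value)) %>%
forecast() %>%
autoplot()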
data
n_dates <- 3
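To make the effect of complete and fill concrete, here is a minimal sketch on a three-row data frame. The data frame small and its values are made up for illustration; only complete and fill come from the answer above.

```r
library(tidyr)

# Hypothetical monthly data with March and April missing
small <- data.frame(
  date  = as.Date(c("2010-02-01", "2010-05-01", "2010-06-01")),
  value = c(1.5, 2.5, 3.5)
)

# Add a row for every month in the range; new rows get NA in value
completed <- complete(small, date = seq(min(date), max(date), by = "1 month"))

# Carry the last observed value forward into the NA rows
filled <- fill(completed, value)
```

Here completed has five rows with NA for March and April, and filled replaces those two NAs with the February value.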

A base R option, using set.seed(789) before running your data generation:
temp <- which(full_dates$date%in%time_data$date)
full_dates$new[temp] <- time_data$value
head(full_dates, 20)
date new
1 2010-02-01 0.62589399
2 2010-03-01 0.98117664
3 2010-04-01 NA
4 2010-05-01 -0.04770986
5 2010-06-01 -1.51961483
6 2010-07-01 NA
7 2010-08-01 0.79493644
8 2010-09-01 -0.14423251
9 2010-10-01 -0.70649791
10 2010-11-01 0.61071247
11 2010-12-01 NA
12 2011-01-01 1.08506164
13 2011-02-01 -0.71134925
14 2011-03-01 1.15628805
15 2011-04-01 1.23556280
16 2011-05-01 -0.32245531
17 2011-06-01 NA
18 2011-07-01 NA
19 2011-08-01 0.73277540
20 2011-09-01 -0.28752883
Or the same result using data.table:
library(data.table)
setDT(full_dates)[temp, new := time_data$value]
Now to xts
xts::xts(full_dates[,-1], order.by = full_dates$date, frequency = 12 )
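The base approach can also be taken all the way to a ts object without any packages. The following is a sketch under a few assumptions: the seed of 1 is arbitrary, the names full, values and first are illustrative, and match is used here in place of which and %in% to line the observed values up with the full monthly sequence.

```r
# Reproducible version of the question's data (seed chosen arbitrarily)
set.seed(1)
dates <- seq(as.Date("2010-01-01"), as.Date("2017-12-01"), by = "1 month")
dates <- dates[runif(length(dates)) < 0.5]
time_data <- data.frame(date = dates, value = rnorm(length(dates)))

# Every month between the first and last observation
full <- seq(min(time_data$date), max(time_data$date), by = "1 month")

# match() returns NA for months with no observation, so indexing yields NA there
values <- time_data$value[match(full, time_data$date)]

# Year and month of the first observation, extracted without lubridate
first <- as.POSIXlt(full[1])
time_series <- ts(values, start = c(first$year + 1900, first$mon + 1), frequency = 12)
```

Because full starts and ends on observed dates, the series has one entry per calendar month and NA exactly where the data frame had no row.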

Related

Count active observations by week

I have a data frame of observations with a start and end date for each observation indicating the period it was active.
The duration active varies by observation, and can spread across multiple weeks.
Some observations are still active and do not have an end date.
For a given date range, how can I count the number of observations that were active during a week within that date range, including those still active?
I have a crude method that works, but is pretty slow. It seems like there has to be a more efficient and simpler way to do this.
EDIT: My first approach was similar to Ronak's solution, which is definitely better than mine for smaller data sets, but my real data set has more observations and longer date ranges, so I run into memory constraints.
#I'm primarily using tidyverse/lubridate, but definitely open to other solutions.
library(tidyverse)
library(lubridate)
# sample data frame of observations with start and end dates:
df_obs <- tibble(
observation = c(1:10),
date_start = as_date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04", "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11", "2020-03-13", "2020-02-07")),
date_end = as_date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16", "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))
)
# to account for observations that are still active, NAs are converted to today's date:
df_obs <- mutate(df_obs, date_end = if_else(is.na(date_end), Sys.Date(), date_end))
# create a data frame of weeks by start and end date to count the active observations in a given week
# for this example I'm just using date ranges from the sample data:
df_weeks <-
seq(min(df_obs$date_start), max(df_obs$date_start), by = 'day') %>%
enframe(NULL, 'week_start') %>%
mutate(week_start = as_date(cut(week_start, "week"))) %>%
mutate(week_end = week_start + 6) %>%
distinct()
# create a function that filters the observations data frame based on start and end dates:
check_active <- function(d, s, e){
d %>%
filter(date_start <= e) %>%
filter(date_end >= s) %>%
nrow()
}
# applying that function to each week in the date range data frame gives the expected results:
df_weeks %>%
rowwise() %>%
mutate(total_active = check_active(df_obs, week_start, week_end)) %>%
select(-week_end) %>%
ungroup()
# A tibble: 12 x 2
week_start total_active
<date> <int>
1 2019-12-30 1
2 2020-01-06 3
3 2020-01-13 3
4 2020-01-20 4
5 2020-01-27 4
6 2020-02-03 6
7 2020-02-10 7
8 2020-02-17 7
9 2020-02-24 6
10 2020-03-02 4
11 2020-03-09 4
12 2020-03-16 5
Here is one way:
library(tidyverse)
df_obs %>%
#Replace NA with today's date
#Create sequence between start and end date
mutate(date_end = replace(date_end, is.na(date_end), Sys.Date()),
date = map2(date_start, date_end, seq, "day")) %>%
#Get data in long format
unnest(date) %>%
#Unselect start and end date
select(-date_start, -date_end) %>%
#Cut data by week
mutate(date = cut(date, "week")) %>%
#Get unique rows for observation and date
distinct(observation, date) %>%
#Count number of observation in each week
count(date)
which returns:
# A tibble: 14 x 2
# value n
# <fct> <int>
# 1 2019-12-30 1
# 2 2020-01-06 3
# 3 2020-01-13 3
# 4 2020-01-20 4
# 5 2020-01-27 4
# 6 2020-02-03 6
# 7 2020-02-10 7
# 8 2020-02-17 7
# 9 2020-02-24 6
#10 2020-03-02 4
#11 2020-03-09 4
#12 2020-03-16 5
#13 2020-03-23 4
#14 2020-03-30 3
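Since the question mentions memory constraints when every observation is expanded to one row per day, a possible alternative is to test interval overlap per week with vectorized comparisons. This is only a sketch in base R, not the method of either post above; the fixed date 2020-04-01 stands in for Sys.Date() so the example is reproducible, and the week range is hard-coded to match the question's output.

```r
# Sample observations from the question
date_start <- as.Date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04",
                        "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11",
                        "2020-03-13", "2020-02-07"))
date_end <- as.Date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16",
                      "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))

# Assumption: a fixed stand-in for Sys.Date() marks still-active observations
date_end[is.na(date_end)] <- as.Date("2020-04-01")

# Monday of each week in the range used by the question
week_start <- seq(as.Date("2019-12-30"), as.Date("2020-03-16"), by = "week")

# An observation is active in a week if it starts on or before the week's
# last day and ends on or after the week's first day
total_active <- vapply(
  seq_along(week_start),
  function(i) sum(date_start <= week_start[i] + 6 & date_end >= week_start[i]),
  integer(1)
)

result <- data.frame(week_start, total_active)
```

This never materializes more than one logical vector of length nrow(df_obs) at a time, at the cost of one pass over the observations per week.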

Count calendar days within a date interval using lubridate

I have a data set of hospital admission and discharge days from which I want to generate an occupied beds count for each calendar day of a period of three years. I am using the tidyverse and lubridate packages.
My approach so far has been to convert the admit/discharge columns into an interval (the data are sensitive so I can't share actual dates):
d <- d %>%
mutate(duration = admit %--% discharge)
and then to create a tibble where each row corresponds to the time range, plus a column of zeroes that can be added to in a for loop:
t <-
tibble(
days = as.Date(date("2017-01-01"):date("2019-12-31")),
count = 0
)
Unfortunately, I can't figure out how to create a for loop that would sum days that fall within each interval. Here is my attempt thus far, which gives me uniform values of 24 throughout:
for(i in timeline$days) {
if (i %within% d$duration)
timeline$count = timeline$count + 1
}
Sample data.
library(dplyr)
set.seed(42)
d <- tibble(admit = Sys.Date() - sample(300, size = 1000, replace = TRUE)) %>%
mutate(discharge = admit + sample(0:30, size = 1000, replace = TRUE))
d
# # A tibble: 1,000 x 2
# admit discharge
# <date> <date>
# 1 2019-06-18 2019-07-14
# 2 2019-06-11 2019-06-12
# 3 2019-12-24 2020-01-18
# 4 2019-07-13 2019-07-29
# 5 2019-09-08 2019-09-23
# 6 2019-10-15 2019-10-15
# 7 2019-08-11 2019-08-28
# 8 2020-02-07 2020-02-29
# 9 2019-09-03 2019-09-10
# 10 2019-08-20 2019-09-14
# # ... with 990 more rows
We can produce a list of date ranges/sequences with Map (or purrr::pmap):
Map(seq.Date, d$admit, d$discharge, list(by = "days"))[1:2]
# [[1]]
# [1] "2019-06-18" "2019-06-19" "2019-06-20" "2019-06-21" "2019-06-22" "2019-06-23" "2019-06-24"
# [8] "2019-06-25" "2019-06-26" "2019-06-27" "2019-06-28" "2019-06-29" "2019-06-30" "2019-07-01"
# [15] "2019-07-02" "2019-07-03" "2019-07-04" "2019-07-05" "2019-07-06" "2019-07-07" "2019-07-08"
# [22] "2019-07-09" "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14"
# [[2]]
# [1] "2019-06-11" "2019-06-12"
and then combine these, tabulate them (with table), and enframe them:
Map(seq.Date, d$admit, d$discharge, list(by = "days")) %>%
do.call(c, .) %>%
table() %>%
tibble::enframe(name = "date", value = "count") %>%
# because `table` preserves a *character* representation of the Date
mutate(date = as.Date(date)) %>%
arrange(date)
# # A tibble: 328 x 2
# date count
# <date> <table>
# 1 2019-05-24 1
# 2 2019-05-25 3
# 3 2019-05-26 7
# 4 2019-05-27 8
# 5 2019-05-28 9
# 6 2019-05-29 14
# 7 2019-05-30 20
# 8 2019-05-31 20
# 9 2019-06-01 20
# 10 2019-06-02 21
# # ... with 318 more rows
Here is another method using tidyverse functions.
library(tidyverse)
d %>%
mutate(days = map2(admit, discharge, seq, by = "day")) %>%
unnest(days) %>%
count(days) %>%
right_join(t, by = "days") %>%
mutate(n = coalesce(n, as.integer(count))) %>%
select(-count)
We create a sequence of dates between admit and discharge, count every unique date, and join the result with t so that all the dates in t remain intact.
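Both answers above expand each stay into one element per day. A different technique, sketched here in base R and not taken from either answer, is to add one at each admission day, subtract one on the day after each discharge, and take a running sum. The three stays below are made-up illustration data.

```r
# Hypothetical stays: admission and discharge dates (both days count as occupied)
admit     <- as.Date(c("2019-01-01", "2019-01-02", "2019-01-02"))
discharge <- as.Date(c("2019-01-03", "2019-01-02", "2019-01-05"))

# Calendar days covered by the data
days <- seq(min(admit), max(discharge), by = "day")
n <- length(days)

# Count admissions per day, and departures on the day after discharge
starts <- tabulate(as.integer(admit - days[1]) + 1L, nbins = n + 1L)
ends   <- tabulate(as.integer(discharge - days[1]) + 2L, nbins = n + 1L)

# Running total of arrivals minus departures gives beds occupied each day
occupied <- cumsum(starts - ends)[seq_len(n)]

beds <- data.frame(days, occupied)
```

For these three stays, occupied is 1, 3, 2, 1, 1 for 1 to 5 January. The work grows with the number of stays plus the number of days rather than with the total number of bed-days.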

Is there a way to find daily maximum from hourly data with missing values

I have measured hourly data of ground-level O3, with some missing data (marked as NA). I want to calculate daily maximums, but only when there are more than 17 hourly measurements per date. If there are fewer than 18 measurements for a date, I want the result to be NA.
head(o3sat)
date hour O3
1/1/2010 0 50.2
1/1/2010 1 39.8
1/1/2010 2 41.8
1/1/2010 3 NA
1/1/2010 4 9.2
1/1/2010 5 6.0
Is it possible to add an argument to this function so that the value is calculated only if at least 75% of the data is available for the day, and the data is removed otherwise?
maximums <- aggregate(o3sat["dnevnik"], list(Date = as.Date(o3sat$datum)), max, na.rm = TRUE)
It is better to provide a reproducible example when asking a question. Here, I created an example data frame based on the information you provided. This data frame contains hourly O3 measurements from 2010-01-01 to 2010-01-03.
library(dplyr)
library(tidyr)
library(lubridate)
o3sat <- read.table(text = " date hour O3
'1/1/2010' 0 50.2
'1/1/2010' 1 39.8
'1/1/2010' 2 41.8
'1/1/2010' 3 NA
'1/1/2010' 4 9.2
'1/1/2010' 5 6.0 ",
stringsAsFactors = FALSE, header = TRUE)
set.seed(1234)
o3sat_ex <- o3sat %>%
mutate(date = mdy(date)) %>%
complete(date = seq.Date(ymd("2010-01-01"), ymd("2010-01-03"), 1), hour = 0:23) %>%
mutate(O3 = c(o3sat$O3, rnorm(66, 30, 10))) %>%
mutate(O3 = ifelse(row_number() %in% sample(7:72, 18), NA, O3))
We can count how many non-NA values there are per day using the following code.
o3sat_ex %>%
group_by(date) %>%
summarize(sum(!is.na(O3)))
# # A tibble: 3 x 2
# date `sum(!is.na(O3))`
# <date> <int>
# 1 2010-01-01 18
# 2 2010-01-02 17
# 3 2010-01-03 18
Based on your description, we would like to calculate the maximum for 2010-01-01 and 2010-01-03, but not 2010-01-02 as it only contains 17 non-NA values.
Here is one way to achieve the task: define a function, max_helper, that returns the maximum only if the count of non-NA values is at least the threshold (18 here, i.e. larger than 17).
max_helper <- function(x, threshold){
if (sum(!is.na(x)) >= threshold) {
r <- max(x, na.rm = TRUE)
} else {
r <- NA
}
return(r)
}
We can apply this function using dplyr code to get the answer.
o3sat_ex2 <- o3sat_ex %>%
group_by(date) %>%
summarize(O3 = max_helper(O3, 18))
o3sat_ex2
# # A tibble: 3 x 2
# date O3
# <date> <dbl>
# 1 2010-01-01 50.2
# 2 2010-01-02 NA
# 3 2010-01-03 47.8
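Since the question asks about aggregate specifically, the same idea can be sketched in base R by passing a thresholded helper as FUN. The helper name max_if_enough and the two-day data frame hourly are made up for illustration; the first day has 24 readings and the second only 17.

```r
# Return the maximum only when enough non-NA readings are present
max_if_enough <- function(x, threshold = 18) {
  if (sum(!is.na(x)) >= threshold) max(x, na.rm = TRUE) else NA_real_
}

# Hypothetical hourly data: day 1 complete, day 2 with 7 missing hours
hourly <- data.frame(
  date = rep(as.Date(c("2010-01-01", "2010-01-02")), each = 24),
  O3   = as.numeric(c(1:24, rep(NA, 7), 1:17))
)

# Same call pattern as in the question, with the helper in place of max
daily <- aggregate(hourly["O3"], list(Date = hourly$date), max_if_enough)
```

Extra arguments to aggregate are forwarded to FUN, so a different cut-off can be supplied as aggregate(..., max_if_enough, threshold = 20).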

How to generate monthly time series data with NA values for running examples?

I apologize in advance for the simplicity of this question.
How can I generate a monthly time series data set using something like set.seed? I have a question about results from two packages but need to create a sample data set to show as an example. My data set needs to have some NA values within it.
Regards,
Simon
Here's a random list of 1000 dates +- 5 years from today with some missing data using the simstudy package (please provide sample data and expected output for a more specific answer):
library(simstudy)
library(dplyr)
library(lubridate)
set.seed(1724)
# define data
def <- defData(varname = "tmp", dist = "uniform", formula = "0;1") # simstudy seems to crash when adding missing data with only 1 column
def <- defData(def, varname = "date", dist = "uniform", formula = "-5;5") # +- 5 years
df_full <- genData(1000, def)
##### missing data ----
defM <- defMiss(varname = "date", formula = 0.1, logit.link = FALSE)
df_missing <- genMiss(df_full, defM, idvars = "id")
# Create data with missing values
df <- genObs(df_full, df_missing, idvars = "id")
df %>%
as_tibble() %>%
select(-tmp) %>%
mutate(date = ymd(floor_date(as.POSIXct(Sys.Date()) + date * 365 * 24 * 60 * 60, unit = "day")), # +- 5 years from today
month = format(date, "%Y-%m"))
# A tibble: 1,000 x 3
id date month
<int> <date> <chr>
1 1 NA NA
2 2 2021-09-12 2021-09
3 3 2023-11-08 2023-11
4 4 2015-03-02 2015-03
5 5 2021-08-12 2021-08
6 6 2021-10-20 2021-10
7 7 2017-05-17 2017-05
8 8 2019-04-12 2019-04
9 9 NA NA
10 10 NA NA
# ... with 990 more rows
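If a regular monthly ts with a few NA values is all that is needed, a base R sketch without extra packages may be enough. The seed, the series length of 36 months, the start of January 2015 and the number of missing values are all arbitrary choices made here for illustration.

```r
set.seed(42)

# Three years of simulated monthly values
n_months <- 36
values <- rnorm(n_months)

# Blank out five randomly chosen months
values[sample(n_months, 5)] <- NA

# Monthly time series starting in January 2015
monthly <- ts(values, start = c(2015, 1), frequency = 12)
```

Because sample draws without replacement by default, exactly five months are missing, and rerunning the script with the same seed reproduces the same series.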

R: cumulative total at a daily level

I have the following dataset:
I want to measure the cumulative total at a daily level, so the result looks something like:
I can use dplyr's cumsum function but the count for "missing days" won't show up. As an example, the date 1/3/18 does not exist in the original dataframe. I want this missed date to be in the resultant dataframe and its cumulative sum should be the same as the last known date i.e. 1/2/18 with the sum being 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
dplyr:
library(dplyr)
x %>%
right_join(daterange, by = "Date") %>%
# sort by date so the running total accumulates in calendar order
arrange(Date) %>%
mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"), format="%m/%d/%y"),
Count = c(2,3,5,1,6))
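The same result can also be sketched without a join, by placing the known counts into a zero vector over the full date range and taking the running sum. The names all_days, daily and result are illustrative and not from the answer.

```r
# Data from the answer
x <- data.frame(
  Date  = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"), format = "%m/%d/%y"),
  Count = c(2, 3, 5, 1, 6)
)

# Every calendar day between the first and last date
all_days <- seq(min(x$Date), max(x$Date), by = "1 day")

# Days without a record contribute zero to the running total
daily <- numeric(length(all_days))
daily[match(x$Date, all_days)] <- x$Count

result <- data.frame(Date = all_days, Count = cumsum(daily))
```

This assumes each date appears at most once in x. If dates can repeat, the counts would need to be summed per date first.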
