Fill missing dates in several time series stored in same database - r

I'm a complete beginner to R and I just need to do some quick cleaning of my data, but I ran into a problem I can't wrap my head around.
I have a Postgres db with time series. The columns are ID, DATE and VALUE (temperature). Each ID is a separate measuring station, so I have one time series per ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016; some series overlap, some do not. If a measurement is missing for a week, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016. I understand why that happens. How can I make it fill only the weeks between the actual start and end date of each specific time series? I want a min and max that depend on the start date and end date of each ID, and then fill the missing dates between them.
library("RPostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1

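Using the data you provided, this works. I don't know why this works and your whole code does not, but possibly the data structure in your code is not what is needed. If so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't coming from the package you need; calling tidyr::complete explicitly works on the sample. Note that the sequence below runs from min(DATE) to max(DATE) of each ID rather than over a fixed 1915-2016 range, and that the filled rows get NA in ID as well as in VALUE.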
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
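If the filled rows should keep their ID, as in the expected output, one option is to group by ID instead of filtering each ID by hand. This is only a minimal sketch: it assumes a tidyr version in which complete() is evaluated per group on grouped data, and `filled` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
df <- read.table(text = "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1", header = TRUE)

filled <- df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  # min/max are taken within each ID, so every series
  # is only completed over its own date range
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()

filled
```

Because ID is the grouping variable, it is carried over to the inserted rows and only VALUE is NA there.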

Related

How to add a column with most recent recurring observation within a group, but within a certain time period, in R

If I had:
person_ID visit date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class and group by 'person_ID'. We then create a binary column that is 1 if the difference between the next and the current visit_date is less than 90 days, and 0 otherwise. Using this column, we take the corresponding next 'visit_date' where the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(i1 = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0),
date = case_when(as.logical(i1)~ lead(visit_date)), i1 = NULL ) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
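The answer does not show how df1 was built. A self-contained variant might look like the sketch below. It assumes the column is called visit_date and holds m/d/y strings; `nxt` and `res` are illustrative names, and if_else() is used in place of the temporary binary column.

```r
library(dplyr)
library(lubridate)

# Reconstructed sample data (column name visit_date is an assumption)
df1 <- data.frame(
  person_ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  visit_date = c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004",
                 "9/22/2004", "10/27/2004", "5/15/2008")
)

res <- df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  mutate(
    # next visit of the same person (NA for the last visit)
    nxt = lead(visit_date),
    # keep the next visit only if it is less than 90 days away
    date = if_else(
      !is.na(nxt) &
        as.numeric(difftime(nxt, visit_date, units = "days")) < 90,
      nxt, as.Date(NA)
    )
  ) %>%
  select(-nxt) %>%
  ungroup()

res
```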

Longitudinal dataset - difference between two dates

I have a longitudinal dataset that I imported in R from Excel that looks like this:
STUDYID VISIT# VISITDate
1 1 2012-12-19
1 2 2018-09-19
2 1 2013-04-03
2 2 2014-05-14
2 3 2016-05-12
In this dataset, each patient/study ID has a different number of visits to the hospital, and the first visit date is likely to differ from individual to individual. I want to create a new time variable which is essentially the time in years since the first visit, so the dataset will look like this:
STUDYID VISIT# VISITDate Time(years)
1 1 2012-12-19 0
1 2 2018-09-19 5
2 1 2013-04-03 0
2 2 2014-05-14 1
2 3 2016-05-12 3
The reason for creating a time variable like this is to assess differential regression effects over time (which is a continuous variable). Is there any way to create a new time variable like this in R so I can use it as an independent variable in my regression analyses?
Consider ave to calculate the minimum of VISITDate by STUDYID group, then take the date difference with conversion to integer years:
df <- within(df, {
minVISITDate <- ave(VISITDate, STUDYID, FUN=min)
Time <- floor(as.double(difftime(VISITDate, minVISITDate, unit="days") / 365))
rm(minVISITDate)
})
df
# STUDYID VISIT# VISITDate Time
# 1 1 1 2012-12-19 0
# 2 1 2 2018-09-19 5
# 3 2 1 2013-04-03 0
# 4 2 2 2014-05-14 1
# 5 2 3 2016-05-12 3
Loading up packages:
library(tibble)
library(dplyr)
library(lubridate)
Setting up the data:
dat <- tribble(~STUDYID , ~VISIT , ~VISITDate ,
1 , 1 , "2012-12-19",
1 , 2 , "2018-09-19",
2 , 1 , "2013-04-03",
2 , 2 , "2014-05-14",
2 , 3 , "2016-05-12") %>%
mutate(VISITDate = as.Date(VISITDate))
Creating the wanted variable:
dat %>%
group_by(STUDYID) %>%
mutate(Time = first(VISITDate) %--% VISITDate,
Time = as.numeric(Time, "years")) %>%
ungroup()
# A tibble: 5 x 4
STUDYID VISIT VISITDate Time
<dbl> <dbl> <date> <dbl>
1 1 1 2012-12-19 0
2 1 2 2018-09-19 5.75
3 2 1 2013-04-03 0
4 2 2 2014-05-14 1.11
5 2 3 2016-05-12 3.11
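If whole years are wanted, as in the question, the fractional value can be floored. This is a small sketch using lubridate's interval() and time_length(); the data is rebuilt here so the block runs on its own, and `res` is an illustrative name.

```r
library(dplyr)
library(lubridate)

dat <- data.frame(
  STUDYID = c(1, 1, 2, 2, 2),
  VISIT = c(1, 2, 1, 2, 3),
  VISITDate = as.Date(c("2012-12-19", "2018-09-19", "2013-04-03",
                        "2014-05-14", "2016-05-12"))
)

res <- dat %>%
  group_by(STUDYID) %>%
  # years elapsed since the earliest visit of each patient, rounded down
  mutate(Time = floor(time_length(interval(min(VISITDate), VISITDate), "years"))) %>%
  ungroup()

res$Time
# 0 5 0 1 3
```

Whether to floor or keep the fractional value depends on the regression: the fractional version keeps more information about elapsed time.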

Count consecutive prior dates per group

My sample data.frame (date format d/m/y), recording the dates a customer was active:
customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20
I would like to make a column "n_consecutive_days" like so:
customer date n_consecutive_days
1 10/1/20 2
1 9/1/20 1
1 6/1/20 N/A
2 10/1/20 1
2 8/1/20 3
2 7/1/20 2
2 6/1/20 N/A
The new column counts the number of previous consecutive dates per customer. I would like the customer's first date to be N/A as it makes no sense to talk about previous consecutive days if it is the first one.
Any help would be appreciated. I can calculate the difference between dates, but not the number of consecutive days as desired.
One way would be:
library(dplyr)
df %>%
group_by(customer, idx = cumsum(as.integer(c(0, diff(as.Date(date, '%d/%m/%y')))) != -1)) %>%
mutate(n_consecutive_days = rev(sequence(n()))) %>% ungroup() %>%
group_by(customer) %>%
mutate(n_consecutive_days = replace(n_consecutive_days, row_number() == n(), NA), idx = NULL)
Output:
# A tibble: 7 x 3
# Groups: customer [2]
customer date n_consecutive_days
<int> <fct> <int>
1 1 10/1/20 2
2 1 9/1/20 1
3 1 6/1/20 NA
4 2 10/1/20 1
5 2 8/1/20 3
6 2 7/1/20 2
7 2 6/1/20 NA
An option using data.table:
#ensure that data is sorted by customer and reverse chronological
setorder(DT, customer, -date)
#group by customer and consecutive dates and then create the sequence
DT[, ncd := .N:1L, .(customer, cumsum(c(0L, diff(date)!=-1L)))]
#set the first date in each customer to NA
DT[DT[, .I[.N], customer]$V1, ncd := NA]
output:
customer date ncd
1: 1 2020-01-10 2
2: 1 2020-01-09 1
3: 1 2020-01-06 NA
4: 2 2020-01-10 1
5: 2 2020-01-08 3
6: 2 2020-01-07 2
7: 2 2020-01-06 NA
data:
library(data.table)
DT <- fread("customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20")
DT[, date := as.IDate(date, format="%d/%m/%y")]

Find Consecutive Occurrences within a Group (of IDs)

My data set looks like this:
ID start.date end.date program
1 2016.05.05 2017.05.05 A
1 2017.05.06 2019.06.16 A
2 2012.06.05 2013.06.18 B
3 2014.09.09 2017.07.01 B
3 2017.09.09 2018.09.09 B
I want to identify the people who were present in a program (character variable) consecutively, and then calculate the time between each end.date and start.date (if the occurrence was consecutive).
So the resulting data should look like this:
ID start.date end.date program days
1 2016.05.05 2017.05.05 A NA
1 2017.05.06 2019.06.16 A 1
2 2012.06.05 2013.06.18 B NA
3 2014.09.09 2017.07.01 B NA
3 2017.09.09 2018.09.09 B 63
Don't know how to start on this!
library(dplyr)
dat %>%
group_by(ID, program) %>%
arrange(start.date) %>% # Added in case the data isn't sorted
mutate(days = start.date - lag(end.date))
I get slightly different results, though:
# A tibble: 5 x 5
# Groups: ID, program [3]
ID start.date end.date program days
<int> <date> <date> <chr> <time>
1 1 2016-05-05 2017-05-05 A NA
2 1 2017-05-06 2019-06-16 A 1
3 2 2012-06-05 2013-06-18 B NA
4 3 2014-09-09 2017-07-01 B NA
5 3 2017-09-09 2018-09-09 B 70
To bring the data in, I converted to dates:
dat <- read.table(header = TRUE, stringsAsFactors = FALSE,
text = "ID start.date end.date program
1 2016.05.05 2017.05.05 A
1 2017.05.06 2019.06.16 A
2 2012.06.05 2013.06.18 B
3 2014.09.09 2017.07.01 B
3 2017.09.09 2018.09.09 B") %>%
mutate_at(vars(matches("date")), lubridate::ymd)
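The value for ID 3 can be checked directly from the calendar. A quick base R check, with the dates taken from the sample data (`gap_id1` and `gap_id3` are illustrative names):

```r
# Gap between the end of one spell and the start of the next, in days
gap_id1 <- as.numeric(as.Date("2017-05-06") - as.Date("2017-05-05"))
gap_id3 <- as.numeric(as.Date("2017-09-09") - as.Date("2017-07-01"))

gap_id1
# 1
gap_id3
# 70
```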

dplyr: grouping and summarizing/mutating data with rolling time windows

I have irregular timeseries data representing a certain type of transaction for users. Each line of data is timestamped and represents a transaction at that time. By the irregular nature of the data some users might have 100 rows in a day and other users might have 0 or 1 transaction in a day.
The data might look something like this:
data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
date = c("2015-01-01",
"2015-01-01",
"2015-01-05",
"2015-01-25",
"2015-02-15",
"2015-05-05",
"2015-01-01",
"2015-08-01",
"2015-01-01"),
n_widgets = c(1,2,3,4,4,5,2,4,5)
)
id date n_widgets
1 1 2015-01-01 1
2 1 2015-01-01 2
3 1 2015-01-05 3
4 1 2015-01-25 4
5 1 2015-02-15 4
6 2 2015-05-05 5
7 2 2015-01-01 2
8 3 2015-08-01 4
9 4 2015-01-01 5
Often I'd like to know some rolling statistics about users. For example: for this user on a certain day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days etc.
Corresponding to the above example, the data should look like:
id date n_widgets n_trans_30 total_widgets_30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
If the time window is daily then the solution is simple: data %>% group_by(id, date) %>% summarize(...)
Similarly if the time window is monthly this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)
However, the challenge I'm having is how to set up a time window for an arbitrary period: 5 days, 10 days, etc.
There's also the RcppRoll library, but both RcppRoll and the rolling functions in zoo seem to be set up for regular time series. As far as I can tell, these window functions work on a number of rows instead of a specified time period. The key difference is that a given time period might contain a different number of rows depending on date and user.
For example, it's possible that user 1 has 100 transactions in the 5 days before 2015-01-01 and only 5 transactions in the 5 days before 2015-02-01. Thus looking back a set number of rows will simply not work.
Additionally, there is another SO thread discussing rolling dates for irregular time series type data (Create new column based on condition that exists within a rolling date) however the accepted solution was using data.table and I'm specifically looking for a dplyr way of achieving this.
I suppose at the heart of this issue, this problem can be solved by answering this question: how can I group_by arbitrary time periods in dplyr. Alternatively, if there's a different dplyr way to achieve above without a complicated group_by, how can I do it?
EDIT: updated example to make nature of the rolling window more clear.
This can be done using SQL:
library(sqldf)
dd <- transform(data, date = as.Date(date))
sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) 'total_widgets30'
from dd a
left join dd b on b.date between a.date - 30 and a.date
and b.id = a.id
and b.rowid <= a.rowid
group by a.rowid")
giving:
id date n_widgets n_trans30 total_widgets30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
Another approach is to expand your dataset to contain all possible days (using tidyr::complete), then use a rolling function (RcppRoll::roll_sum)
The fact that you have multiple observations per day is probably creating an issue though...
library(tidyr)
library(RcppRoll)
df2 <- df %>%
mutate(date=as.Date(date))
## create full dataset with all possible dates (go even 30 days back for first observation)
df_full<- df2 %>%
mutate(date=as.Date(date)) %>%
complete(id,
date=seq(from=min(.$date)-30,to=max(.$date), by=1),
fill=list(n_widgets=0))
## now use rolling function, and keep only original rows (left join)
df_roll <- df_full %>%
group_by(id) %>%
mutate(n_trans_30=roll_sum(x=n_widgets!=0, n=30, fill=0, align="right"),
total_widgets_30=roll_sum(x=n_widgets, n=30, fill=0, align="right")) %>%
ungroup() %>%
right_join(df2, by = c("date", "id", "n_widgets"))
The result is the same as yours (by chance)
id date n_widgets n_trans_30 total_widgets_30
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
But as said, it will fail for some days because it counts the last 30 observations, not the last 30 days. So you might first want to summarise the information by day, then apply this.
EDITED based on comment below.
You can try something like this for up to 5 days:
df %>%
arrange(id, date) %>%
group_by(id) %>%
filter(as.numeric(difftime(Sys.Date(), date, unit = 'days')) <= 5) %>%
summarise(n_total_widgets = sum(n_widgets))
In this case, no dates fall within five days of the current date, so it won't produce any output.
To get last five days for each ID, you can do something like this:
df %>%
arrange(id, date) %>%
group_by(id) %>%
filter(as.numeric(difftime(max(date), date, unit = 'days')) <= 5) %>%
summarise(n_total_widgets = sum(n_widgets))
Resulting output will be:
Source: local data frame [4 x 2]
id n_total_widgets
(dbl) (dbl)
1 1 4
2 2 5
3 3 4
4 4 5
I found a way to do this while working on this question
library(dplyr)
library(lubridate)
df <- data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
date = c("2015-01-01",
"2015-01-01",
"2015-01-05",
"2015-01-25",
"2015-02-15",
"2015-05-05",
"2015-01-01",
"2015-08-01",
"2015-01-01"),
n_widgets = c(1,2,3,4,4,5,2,4,5)
)
count_window <- function(df, date2, w, id2){
min_date <- date2 - w
df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
out <- length(df2$date)
return(out)
}
v_count_window <- Vectorize(count_window, vectorize.args = c("date2","id2"))
sum_window <- function(df, date2, w, id2){
min_date <- date2 - w
df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
out <- sum(df2$n_widgets)
return(out)
}
v_sum_window <- Vectorize(sum_window, vectorize.args = c("date2","id2"))
res <- df %>% mutate(date = ymd(date)) %>%
mutate(min_date = date - 30,
n_trans = v_count_window(., date, 30, id),
total_widgets = v_sum_window(., date, 30, id)) %>%
select(id, date, n_widgets, n_trans, total_widgets)
res
id date n_widgets n_trans total_widgets
1 1 2015-01-01 1 2 3
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
This version is fairly case specific but you could probably make a version of the functions that is more general.
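A more general version could take the summary function as an argument and work on plain vectors inside group_by(), so that the id no longer has to be passed in. The sketch below is one way to do that; `window_apply` is a hypothetical helper, not a function from dplyr or any other package, and like the functions above it includes every row of the same id whose date falls in [date - w, date].

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05",
                   "2015-01-25", "2015-02-15", "2015-05-05",
                   "2015-01-01", "2015-08-01", "2015-01-01")),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

# Hypothetical helper: for each element, apply `fun` to the values
# whose date lies in the window [date - w, date]
window_apply <- function(dates, values, w, fun) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - w & dates <= dates[i]
    fun(values[in_window])
  }, numeric(1))
}

res <- df %>%
  group_by(id) %>%
  mutate(
    n_trans = window_apply(date, n_widgets, 30, length),
    total_widgets = window_apply(date, n_widgets, 30, sum)
  ) %>%
  ungroup()

res
```

This compares every row with every other row of the same id, so it is quadratic in the group size; that is fine for small groups but may be slow for users with many transactions.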
For simplicity, I recommend the runner package, which handles sliding window operations. In the OP's request the window size is k = 30 and the windows depend on the date, so idx = date. You can use the runner function, which applies any R function on a given window, and sum_run:
library(runner)
library(dplyr)
df %>%
group_by(id) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
n_trans30 = runner(n_widgets, k = 30, idx = date, function(x) length(x)),
n_widgets30 = sum_run(n_widgets, k = 30, idx = date),
)
# id date n_widgets n_trans30 n_widgets30
#<dbl> <date> <dbl> <dbl> <dbl>
# 1 2015-01-01 1 1 1
# 1 2015-01-01 2 2 3
# 1 2015-01-05 3 3 6
# 1 2015-01-25 4 4 10
# 1 2015-02-15 4 2 8
# 2 2015-01-01 2 1 2
# 2 2015-05-05 5 1 5
# 3 2015-08-01 4 1 4
# 4 2015-01-01 5 1 5
Important: idx = date should be in ascending order.
For more, see the runner documentation and vignettes.

Resources