Fill incomplete time series in data.table

Question: In data.table, is there any way to fill an incomplete date sequence with zeros? For instance, in the toy example some dates do not show up, and I want a complete date sequence with y = 0 for those cases. Is there something like a forward fill?
Remark: Note I do not want to use merges where you first create the full date sequence and then merge it back onto the initial data.table object (I think this is inefficient and rather slow).
library(data.table)
dt <- data.table(
  x = c("2020-03-28", "2020-03-29", "2020-03-31", "2020-04-05"),
  y = c(1, 5, 3, 70)
)
## Output:
            x  y
1: 2020-03-28  1
2: 2020-03-29  5
3: 2020-03-31  3
4: 2020-04-05 70
## Desired Output:
            x  y
1: 2020-03-28  1
2: 2020-03-29  5
3: 2020-03-30  0
4: 2020-03-31  3
5: 2020-04-01  0
6: 2020-04-02  0
7: 2020-04-03  0
8: 2020-04-04  0
9: 2020-04-05 70

How about this?
# convert to data.table's integer date type
dt[ , x := as.IDate(x)]
# find the range of dates
date_bounds = range(dt$x)
# construct a sequence of all dates
# NB: this will be integers as attributes are stripped
all_dates = date_bounds[1L]:date_bounds[2L]
# construct a table with the missing dates,
# with y filled to 0
missing = data.table(
  # as.IDate uses the right origin for integer input
  x = as.IDate(setdiff(all_dates, dt$x)),
  y = 0
)
dt = rbind(dt, missing)
# x y
# <IDat> <num>
# 1: 2020-03-28 1
# 2: 2020-03-29 5
# 3: 2020-03-31 3
# 4: 2020-04-05 70
# 5: 2020-03-30 0
# 6: 2020-04-01 0
# 7: 2020-04-02 0
# 8: 2020-04-03 0
# 9: 2020-04-04 0
Afterwards you can setorder(dt, x) if you want the dates to be in order.
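For instance, a quick sketch of that last step (the unordered table above, built directly here, then sorted by reference):

```r
library(data.table)

# the unordered result of rbind()-ing the original rows and the filled-in dates
dt <- data.table(
  x = as.IDate(c("2020-03-28", "2020-03-29", "2020-03-31", "2020-04-05",
                 "2020-03-30", "2020-04-01", "2020-04-02", "2020-04-03",
                 "2020-04-04")),
  y = c(1, 5, 3, 70, 0, 0, 0, 0, 0)
)

# setorder() sorts by reference (no copy), here ascending by date
setorder(dt, x)
```

After this, dt matches the desired output in the question.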

Use CJ and tidyr::full_seq to create a join data table.
dt[, x := as.Date(x)] # convert x to the Date type
dt2 <- dt[CJ(x = tidyr::full_seq(x, 1)), on = .(x)] # create the full sequence
dt2[is.na(y), y := 0] # fill NAs with 0s
dt2
# x y
# 1: 2020-03-28 1
# 2: 2020-03-29 5
# 3: 2020-03-30 0
# 4: 2020-03-31 3
# 5: 2020-04-01 0
# 6: 2020-04-02 0
# 7: 2020-04-03 0
# 8: 2020-04-04 0
# 9: 2020-04-05 70
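A small variation on the fill step: instead of the is.na assignment, data.table's own nafill (with type = "const") can fill the NAs introduced by the join. A sketch using base seq in place of tidyr::full_seq:

```r
library(data.table)

dt <- data.table(
  x = as.Date(c("2020-03-28", "2020-03-29", "2020-03-31", "2020-04-05")),
  y = c(1, 5, 3, 70)
)

# join against the complete daily sequence; missing days get y = NA
dt2 <- dt[CJ(x = seq(min(x), max(x), by = "day")), on = .(x)]
# nafill(type = "const") replaces the NAs with a constant value
dt2[, y := nafill(y, type = "const", fill = 0)]
```

This gives the same 9-row result as the is.na approach above.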

You could use complete from tidyr:
library(dplyr)
library(tidyr)
dt %>%
  mutate(x = as.Date(x)) %>%
  complete(x = seq(min(x), max(x), by = "day"), fill = list(y = 0))
# x y
# <date> <dbl>
#1 2020-03-28 1
#2 2020-03-29 5
#3 2020-03-30 0
#4 2020-03-31 3
#5 2020-04-01 0
#6 2020-04-02 0
#7 2020-04-03 0
#8 2020-04-04 0
#9 2020-04-05 70

You could also try this:
dt[, x := as.IDate(x)]
dt[.(seq(min(x), max(x), 1)), .(y = fifelse(is.na(y), 0, y)), .EACHI, on = "x"]
# x y
# 1: 2020-03-28 1
# 2: 2020-03-29 5
# 3: 2020-03-30 0
# 4: 2020-03-31 3
# 5: 2020-04-01 0
# 6: 2020-04-02 0
# 7: 2020-04-03 0
# 8: 2020-04-04 0
# 9: 2020-04-05 70

Related

Dates subtraction from different rows of data frame

I have a big data frame (dim: 12867779 x 5) which looks like this:
id    group  date1       date2       icf
id1   2      2020-03-17  2019-06-05
id1   3      2020-04-03  2019-05-09
id2   2      2020-04-10  2019-07-04
id2   3      2021-04-1   2020-06-01
id3   1      2020-04-13  2019-07-07
id3   2      2021-04-10  2020-06-01
id3   3      2020-04-10  2019-07-04
id3   4      2021-04-13  2020-06-01
Desired output:
id    group  date1       date2        icf
id1   3      2020-04-03  2019-05-09     0
id2   2      2020-04-10  2019-07-04    52
id2   3      2021-04-01  2020-06-01     0
id3   1      2020-04-13  2019-07-07    49
id3   2      2021-04-10  2020-06-01  -646
id3   3      2020-04-10  2019-07-04    52
id3   4      2021-04-13  2020-06-01     0
To calculate icf I need to check if the ids from row i and i+1 are the same. If yes, icf = date2(i+1) - date1(i).
I wrote this function to calculate icf, but it's too slow. I'm looking for ways to speed it up; I was thinking about using the apply functions, but I have no idea how to rewrite this icfCalculation function.
icfCalculation <- function(dataFrame) {
  nr <- nrow(dataFrame) - 1
  for (i in 1:nr) {
    if (dataFrame[i, 1] == dataFrame[i + 1, 1]) {
      dataFrame[i, 5] <- dataFrame[i + 1, 4] - dataFrame[i, 3]
    } else {
      dataFrame[i, 5] <- 0
    }
  }
  return(dataFrame)
}
Thanks for putting the expected output. This is not the same as what you have put - but it does give the same results as your function and should be significantly quicker thanks to data.table's internal optimisations:
library(data.table)
# Read in data
dat <- read.table(text = "id group date1 date2
id1 2 2020-03-17 2019-06-05
id1 3 2020-04-03 2019-05-09
id2 2 2020-04-10 2019-07-04
id2 3 2021-04-1 2020-06-01
id3 1 2020-04-13 2019-07-07
id3 2 2021-04-10 2020-06-01
id3 3 2020-04-10 2019-07-04
id3 4 2021-04-13 2020-06-01",
h = T,
colClasses = c("character", "character", "Date", "Date")
)
# Make it a data.table
setDT(dat)
dat[, icf := fifelse(
id == shift(id, type = "lead"),
as.integer(
shift(date2, type = "lead") - date1
),
0)
]
dat
# id group date1 date2 icf
# 1: id1 2 2020-03-17 2019-06-05 -313
# 2: id1 3 2020-04-03 2019-05-09 0
# 3: id2 2 2020-04-10 2019-07-04 52
# 4: id2 3 2021-04-01 2020-06-01 0
# 5: id3 1 2020-04-13 2019-07-07 49
# 6: id3 2 2021-04-10 2020-06-01 -646
# 7: id3 3 2020-04-10 2019-07-04 52
# 8: id3 4 2021-04-13 2020-06-01 NA
If you want the last NA to be 0, just add dat$icf[nrow(dat)] <- 0.
library(dplyr)
library(tidyr)
df %>%
mutate(icf = replace_na(ifelse(id == lead(id), lead(date2) - date1, 0), 0))
Rather than use tidyr::replace_na you could also specify the default argument of lead.
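For example (a sketch of that variant on a small hypothetical data frame; the empty-string default is just a dummy value that can never match a real id, so the last comparison becomes FALSE instead of NA):

```r
library(dplyr)

df <- data.frame(
  id    = c("id1", "id1", "id2"),
  date1 = as.Date(c("2020-03-17", "2020-04-03", "2020-04-10")),
  date2 = as.Date(c("2019-06-05", "2019-05-09", "2019-07-04"))
)

# lead(id, default = "") makes the final row's comparison FALSE rather
# than NA, so no replace_na step is needed
df <- df %>%
  mutate(icf = ifelse(id == lead(id, default = ""),
                      as.integer(lead(date2) - date1), 0))
# icf is -313, 0, 0 for these three rows
```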
Base R
A base R approach would be something like:
df$icf <- with(df, ifelse(id == c(id[2:nrow(df)], NA), c(date2[2:nrow(df)], NA) - date1, 0))
Output
id group date1 date2 icf
1 id1 2 2020-03-17 2019-06-05 -313
2 id1 3 2020-04-03 2019-05-09 0
3 id2 2 2020-04-10 2019-07-04 52
4 id2 3 2021-04-01 2020-06-01 0
5 id3 1 2020-04-13 2019-07-07 49
6 id3 2 2021-04-10 2020-06-01 -646
7 id3 3 2020-04-10 2019-07-04 52
8 id3 4 2021-04-13 2020-06-01 0

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of different id by country over the last 8 months, but only if the id occurs at least twice during these 8 months. Hence, for the count it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care of, accounting for it is not necessary but still helpful.
My current best try is below, but it does not account for the restriction (the id must appear at least twice in the previous 8 months), and I also find its counting odd at the date "2017-12-12", where desired_unrestricted should be 4 according to my counting but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment.
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply over the dates; unfortunately, sapply drops the Date class, so we lapply it and then combine the results with do.call(c, ...) (assigning the list directly would create a list-column, and unlist would strip the class). (Perhaps this part can be improved.)
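As an aside, if lubridate is loaded anyway (as in the question), its %m-% operator subtracts calendar months and is vectorized, which sidesteps the lapply; this is an alternative sketch, not what the code above uses:

```r
library(lubridate)

d <- as.Date(c("2017-12-12", "2018-05-02"))

# vectorized "8 calendar months earlier"
d %m-% months(8)
# 2017-04-12, 2017-09-02

# per-element base-R equivalent for the first date:
seq(as.Date("2017-12-12"), length = 2, by = "-8 months")[2]
# 2017-04-12
```

The two dates produced here match the eightmthsago column shown in the other answer below.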
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) allows us to reduce those that have a count below 2, and we're left with a named-vector of ids that had 2 or more visits. Retrieving the length is what we need.
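On a toy vector (made-up ids, base R only), that counting step behaves like this:

```r
ids <- c("a", "a", "b", "c", "c", "c")

counts <- table(ids)                           # a: 2, b: 1, c: 3
repeated <- Filter(function(z) z > 1, counts)  # drops b (count 1)
length(repeated)                               # 2: only "a" and "c" occur twice or more
```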
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straightforward. This is an open/closed range; it can be changed to closed on both ends if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months which have a count greater than 1.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
  group_by(country) %>%
  group_modify(~ .x %>%
    rowwise() %>%
    mutate(totalids = .x %>%
      filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
      pull(id) %>%
      table() %>%
      `[`(. > 1) %>%
      length
  ))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)

R data.table cumulative sum over time intervals

I have a table with values that exist during specific time intervals. I want a field that will sum up over all values for a given ID that exist during the start time of that interval.
Here is a pared-down example:
x = data.table(ID = c(rep(1, 5), rep(2, 5)),
DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
'2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
'2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))
The table looks like this:
x
ID DT_START DT_END VALUE
1: 1 2017-05-28 2018-05-28 300
2: 1 2017-05-29 2018-05-29 400
3: 1 2017-07-03 2018-07-03 200
4: 1 2018-05-28 2018-05-29 100
5: 1 2018-05-29 2019-05-28 150
6: 2 2019-07-03 2019-10-08 250
7: 2 2019-10-08 2020-07-03 350
8: 2 2020-05-28 2021-05-28 50
9: 2 2020-05-29 2021-05-29 10
10: 2 2020-07-03 2020-10-03 45
In the first row, that's the first start date for that ID and there are no equal dates, so the cumulative value would be just 300. By the second row, we now add the 300+400 to get 700, because as of 5/29/2017, both the 400 and the 300 were active for ID 1. The full desired output vector is obtained using the following code:
x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)), by = .(ID, DT_START)]
x
ID DT_START DT_END VALUE VALUE_CUM
1: 1 2017-05-28 2018-05-28 300 300
2: 1 2017-05-29 2018-05-29 400 700
3: 1 2017-07-03 2018-07-03 200 900
4: 1 2018-05-28 2018-05-29 100 700
5: 1 2018-05-29 2019-05-28 150 350
6: 2 2019-07-03 2019-10-08 250 250
7: 2 2019-10-08 2020-07-03 350 350
8: 2 2020-05-28 2021-05-28 50 400
9: 2 2020-05-29 2021-05-29 10 410
10: 2 2020-07-03 2020-10-03 45 105
This is great but takes way too long on my huge data table with millions of rows. Any ideas for how to do this more elegantly so it runs faster?
Thanks!
Here is a possible way to do it:
y <- x[x, .(
DT_END2 = i.DT_END,
VALUE = i.VALUE, VALUE_CUM = sum(x.VALUE)),
on = .(ID, DT_START <= DT_START, DT_END > DT_START), by = .EACHI]
# DT_END is overwritten by values of DT_START, so we use DT_END2 to back up the correct DT_END values.
y[, DT_END := DT_END2][, DT_END2 := NULL]
# ID DT_START DT_END VALUE VALUE_CUM
# 1: 1 2017-05-28 2018-05-28 300 300
# 2: 1 2017-05-29 2018-05-29 400 700
# 3: 1 2017-07-03 2018-07-03 200 900
# 4: 1 2018-05-28 2018-05-29 100 700
# 5: 1 2018-05-29 2019-05-28 150 350
# 6: 2 2019-07-03 2019-10-08 250 250
# 7: 2 2019-10-08 2020-07-03 350 350
# 8: 2 2020-05-28 2021-05-28 50 400
# 9: 2 2020-05-29 2021-05-29 10 410
# 10: 2 2020-07-03 2020-10-03 45 105

Generating test data in R

I am trying to generate this table as one of the inputs to a test.
id diff d
1: 1 2 2020-07-31
2: 1 1 2020-08-01
3: 1 1 2020-08-02
4: 1 1 2020-08-03
5: 1 1 2020-08-04
6: 2 2 2020-07-31
7: 2 1 2020-08-01
8: 2 1 2020-08-02
9: 2 1 2020-08-03
10: 2 1 2020-08-04
11: 3 2 2020-07-31
12: 3 1 2020-08-01
13: 3 1 2020-08-02
14: 3 1 2020-08-03
15: 3 1 2020-08-04
16: 4 2 2020-07-31
17: 4 1 2020-08-01
18: 4 1 2020-08-02
19: 4 1 2020-08-03
20: 4 1 2020-08-04
21: 5 2 2020-07-31
22: 5 1 2020-08-01
23: 5 1 2020-08-02
24: 5 1 2020-08-03
25: 5 1 2020-08-04
id diff d
I have done it like this:
input1 = data.table(id=as.character(1:5), diff=1)
input1 = input1[,.(d=seq(as.Date('2020-07-31'), by='days', length.out = 5)),.(id, diff)]
input1[d == '2020-07-31']$diff = 2
diff is basically the number of days to the next weekday. E.g. 31st Jul 2020 is a Friday, hence diff is 2, which is the diff to the next weekday, Monday. For the others it will be 1.
Is there a more idiomatic R way of doing this?
I personally don't like that I had to generate the date sequence for each of the ids separately, or the hardcoding of the diff that I have to do in the input for 31st July. Is there a more generic way of doing this without the hardcoding?
We can create all combinations of dates and id using crossing and create the diff column based on whether the weekday is "Friday".
library(dplyr)
tidyr::crossing(id = 1:5, d = seq(as.Date('2020-07-31'),
by='days', length.out = 5)) %>%
mutate(diff = as.integer(weekdays(d) == 'Friday') + 1)
Similar logic using base R expand.grid:
transform(expand.grid(id = 1:5,
d = seq(as.Date('2020-07-31'), by='days', length.out = 5)),
diff = as.integer(weekdays(d) == 'Friday') + 1)
and CJ in data.table:
library(data.table)
df <- CJ(id = 1:5, d = seq(as.Date('2020-07-31'), by='days', length.out = 5))
df[, diff := as.integer(weekdays(d) == 'Friday') + 1]

Wrong column on data.table merge

Let's say I have these two tables:
library(data.table)
x <- data.table(Date = as.Date(c("1990-01-29", "1990-01-30",
                                 "1990-01-31", "1990-02-01",
                                 "1990-02-02", "1990-02-05",
                                 "1990-02-06", "1990-02-07",
                                 "1990-02-08", "1990-02-09")),
                a = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55))
y <- data.table(Date1 = as.Date(c("1990-01-31", "1990-02-06", "1990-02-07")),
                Date2 = as.Date(c("1990-02-06", "1990-02-07", "1990-02-09")),
                b = c(5, 2, 4))
Table y is really a descriptor of different "periods" starting at Date1 and ending at Date2 (such that one row's Date2 is the next row's Date1), with a (non-unique) descriptor of that period.
I'd now like to merge these tables, such that for each date of x I have both a and the respective y$b (dates outside of the periods should be dropped). I tried the following, but it's not right:
x[y, on = .(Date > Date1, Date <= Date2)]
# Date x Date.1 y
# 1: 1990-01-31 3 1990-02-06 5
# 2: 1990-01-31 5 1990-02-06 5
# 3: 1990-01-31 8 1990-02-06 5
# 4: 1990-01-31 13 1990-02-06 5
# 5: 1990-02-06 21 1990-02-07 2
# 6: 1990-02-07 34 1990-02-09 4
# 7: 1990-02-07 55 1990-02-09 4
Specifically, the Date column isn't x$Date, but actually y$Date1, repeated as necessary, while the Date.1 column is Date2.
Meanwhile, the expected output would be
# Date x y
# 1: 1990-02-01 3 5
# 2: 1990-02-02 5 5
# 3: 1990-02-05 8 5
# 4: 1990-02-06 13 5
# 5: 1990-02-07 21 2
# 6: 1990-02-08 34 4
# 7: 1990-02-09 55 4
It may be better to create a duplicate column:
x[,.(Daten = Date, Date, a)][y,
on = .(Date > Date1, Date <= Date2)][, .(Date = Daten, a, b)]
# Date a b
#1: 1990-02-01 3 5
#2: 1990-02-02 5 5
#3: 1990-02-05 8 5
#4: 1990-02-06 13 5
#5: 1990-02-07 21 2
#6: 1990-02-08 34 4
#7: 1990-02-09 55 4
You can refer to the columns of each table using the x. and i. prefixes:
x[y,
on = .(Date > Date1, Date <= Date2),
.(Date = x.Date, x = x.a, y = i.b)]
Date x y
1: 1990-02-01 3 5
2: 1990-02-02 5 5
3: 1990-02-05 8 5
4: 1990-02-06 13 5
5: 1990-02-07 21 2
6: 1990-02-08 34 4
7: 1990-02-09 55 4
