For different values of id I have start and end dates with an associated quantity, var.
For each record (within the same id), the start date is the same as the previous record's end date (hence the rolling structure).
These periods span multiple months and possibly years. I need to split the quantity in var into parts proportional to the number of days falling in each month, e.g.
start end var
30/01/2006 20/02/2006 104
Above I have 21 days; the lower limit belongs to the previous period and the upper to the current one, so 1/21 of 104 is assigned to Jan 2006 and the rest to Feb 2006.
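To make the expected split concrete, here is a minimal sketch for this single record (using data.table, with column names as in the example above):
library(data.table)
x <- data.table(start = as.Date("2006-01-30"),
                end   = as.Date("2006-02-20"),
                var   = 104)
# days attributed to this record: the day after start up to and including end
days <- seq(x$start + 1, x$end, by = "day")   # 21 days: 31 Jan .. 20 Feb
data.table(period = format(days, "%Y-%m"),
           var    = x$var / length(days))[, .(var = sum(var)), by = period]
#     period       var
# 1: 2006-01  4.952381   # 1/21 of 104
# 2: 2006-02 99.047619   # 20/21 of 104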
I currently have two methods, listed below with dummy data, but they are pretty slow and I was wondering if someone could help me speed them up.
library(data.table)
# data
set.seed(1)
nsample <- 200L # To increase the data size just change nsample
dt <- data.table(id= 1L:nsample)
dt <- dt[, list(date=sample(seq(as.Date("2006-01-01"), as.Date("2012-01-01"), "day"), 51, F)), by=id]
setkey(dt)
dt <- dt[, {
  tmp <- embed(as.vector(date), 2)
  list(start = structure(tmp[, 2], class = "Date"),
       end   = structure(tmp[, 1], class = "Date"),
       var   = rnorm(50, 100, 5))
}, by = id]
setkey(dt, id, end)
> dt[1:4]
id start end var
1: 1 2006-01-30 2006-02-20 104.41542
2: 1 2006-02-20 2006-05-15 106.89356
3: 1 2006-05-15 2006-08-21 106.71162
4: 1 2006-08-21 2006-09-30 96.21729
# Method 1
dt1 <- copy(dt)
system.time({
  dt1[, id2 := 1:.N]
  tmp <- dt1[, list(id   = id,
                    date = seq(start + 1, end, "day"),
                    var  = var), by = id2]
  tmp[, var := var / .N, by = id2]
  res1 <- tmp[, list(var = sum(var)),
              by = list(id, period = paste(year(date), month(date), sep = "-"))]
})
#user system elapsed
#1.92 0.00 1.92
# Method 2
dt2 <- copy(dt)
system.time({
  dt2[, Ndays := as.integer(end) - as.integer(start)]
  tmp <- dt2[, list(date = seq(min(start) + 1, max(end), "day")), by = id]
  setkey(tmp)
  res2 <- dt2[tmp, roll = -Inf][end >= start,
              list(var = sum(var / Ndays)),
              by = list(id, period = paste(year(end), month(end), sep = "-"))]
})
#user system elapsed
# 0.7 0.0 0.7
> sum(dt$var) == sum(res1$var)
[1] TRUE
> sum(dt$var) == sum(res2$var)
[1] TRUE
> all.equal(res1, res2)
[1] TRUE
> res2[1:4]
id period var
1: 1 2006-1 4.972163
2: 1 2006-2 109.623593
3: 1 2006-3 39.448815
4: 1 2006-4 38.176273
This will be a bit faster (about 3x faster for me than your second version). I optimized several things in your second version, which you can see below:
# let's just divide here instead of later
dt2[, var := var/(as.integer(end)-as.integer(start))]
tmp <- dt2[, list(date = seq(min(start)+1, max(end), "day")), by=id]
# data is sorted, so no need to sort again, just set key without sort
setattr(tmp, "sorted", c("id", "date"))
res2 <- dt2[tmp, roll = -Inf][,
  list(var = sum(var)),
  # doing the paste in by slows it down quite a bit, so let's postpone it
  by = list(id, year(end), month(end))][,
  `:=`(period = paste(year, month, sep = '-'), year = NULL, month = NULL)]
Re the comment about large sizes: you could do all of the above inside dt2. It'll be slower, but it won't create a large tmp:
dt2[, var := var / (as.integer(end) - as.integer(start))][, {
  tmp <- data.table(date = seq(min(start) + 1, max(end), "day"))
  setattr(tmp, 'sorted', 'date')
  setattr(.SD, 'sorted', 'end')
  .SD[tmp, roll = -Inf][,
    list(var = sum(var)), by = list(year(end), month(end))][,
    `:=`(period = paste(year, month, sep = '-'), year = NULL, month = NULL)]
}, by = id]
Related
I usually work with dplyr but am facing a rather large data set and my approach is very slow. I basically need to filter a df, group it by dates, and count the occurrences within each group.
Sample data (I have already turned everything into data.table):
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"), as.Date("2014-11-01"), by = "months"),
                 endmonth   = seq(as.Date("2014-08-01"), as.Date("2014-12-01"), by = "months") - 1)
df2 <- data.table(id    = sample(1:10, 5, replace = TRUE),
                  start = sample(seq(as.Date("2014-07-01"), as.Date("2014-10-01"), by = "days"), 5),
                  end   = df$startmonth + sample(10:90, 5, replace = TRUE))
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow
res %>% filter(start <=endmonth & end>= startmonth) %>%
group_by(startmonth,endmonth) %>%
summarise(countmonth=n())
My data.table knowledge is limited, but I guess we would setkey() on the date columns and use something like res[, `:=`(COUNT = .N, IDX = 1:.N), by = .(startmonth, endmonth)] to get the counts by group, but I'm not sure how the filter fits in there.
Appreciate your help!
You could do the counting inside the join:
df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]
start end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3
or add it as a new column in df:
df[, n :=
df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]
How it works. The syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up values in x. The symbol .EACHI indicates that the aggregation (j=.N) will be done for each row of i.
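As a minimal, self-contained illustration of by=.EACHI (toy tables, not the question's data):
library(data.table)
x <- data.table(start = c(1L, 3L, 6L), end = c(5L, 9L, 9L))
i <- data.table(point = c(4L, 7L))
# each row of i is looked up in x; .N counts the matches per row of i
x[i, on = .(start <= point, end >= point), .N, by = .EACHI]
#    start end N
# 1:     4   4 2    # point 4 falls in [1,5] and [3,9]
# 2:     7   7 2    # point 7 falls in [3,9] and [6,9]
# note: the join columns keep x's names but show i's values (here, the point)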
I have a data.table where each row is an event with a start date and end date, but the number of days between each start and end is variable.
Therefore, I am attempting to count how many other events have already ended at the time each one begins.
I can do this using lapply, but when I try to use data.table with the by functionality I don't get the expected output. Sample code below:
library(data.table)
DT <- data.table(
start = as.Date(c("2018-07-01","2018-07-03","2018-07-06","2018-07-08","2018-07-12","2018-07-15")),
end = as.Date(c("2018-07-10","2018-07-04","2018-07-09","2018-07-20","2018-07-14","2018-07-27")),
group_id = c("a", "a", "a", "b", "b", "b"))
# This produces the expected output (0,0,1,1,3,4):
lapply(DT$start, function(x) sum(x > DT$end))
# This also works using data.table:
DT[, count := lapply(DT$start, function(x) sum(x > DT$end))]
# However, I don't get the expected output (0,0,1,0,0,1) when I attempt to do this by group_id
DT[, count_by_group := lapply(DT$start, function(x) sum(x > DT$end)), by = group_id]
With the following output, where count_by_group is not the expected result:
start end group_id count count_by_group
1: 2018-07-01 2018-07-10 a 0 0
2: 2018-07-03 2018-07-04 a 0 0
3: 2018-07-06 2018-07-09 a 1 0
4: 2018-07-08 2018-07-20 b 1 0
5: 2018-07-12 2018-07-14 b 3 0
6: 2018-07-15 2018-07-27 b 4 0
Can someone help me understand how by changes the behavior? I've also tried to use different versions of the .SD feature, but wasn't able to get that to work either.
unlist()
unlist() works as well:
DT[, count_by_group := unlist(lapply(start, function(x) sum(x > end))), by = group_id]
Non-equi join
Alternatively, this can also be solved by aggregating in a non-equi self join:
DT[, count_by_group := DT[DT, on = .(group_id, end < start), .N, by = .EACHI]$N]
DT
start end group_id count_by_group
1: 2018-07-01 2018-07-10 a 0
2: 2018-07-03 2018-07-04 a 0
3: 2018-07-06 2018-07-09 a 1
4: 2018-07-08 2018-07-20 b 0
5: 2018-07-12 2018-07-14 b 0
6: 2018-07-15 2018-07-27 b 1
Benchmark
The non-equi join is also the fastest method for cases with more than a few hundred rows:
library(bench)
bm <- press(
n_grp = c(2L, 5L, 10L),
n_row = 10^(2:4),
{
set.seed(1L)
DT = data.table(
group_id = sample.int(n_grp, n_row, TRUE),
start = as.Date("2018-07-01") + rpois(n_row, 20L))
DT[, end := start + rpois(n_row, 10L)]
setorder(DT, group_id, start, end)
mark(
unlist = copy(DT)[, count_by_group := unlist(lapply(start, function(x) sum(x > end))), by = group_id],
sapply = copy(DT)[, count_by_group := sapply(start, function(x) sum(x > end)), by = group_id],
vapply = copy(DT)[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id],
nej = copy(DT)[, count_by_group := DT[DT, on = .(group_id, end < start), .N, by = .EACHI]$N]
)
}
)
ggplot2::autoplot(bm)
For 10000 rows, the non-equi join is about 10 times faster than the other methods.
As DT is being updated, copy() is used to create a fresh, unmodified copy of DT for each benchmark run.
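A minimal illustration of why the copy() is needed: := updates the original data.table by reference, even when the update happens inside a function (hypothetical example, not part of the benchmark):
library(data.table)
X <- data.table(a = 1:3)
f <- function(d) d[, b := a * 2]   # no copy is taken; d and X are the same object
f(X)
names(X)
# [1] "a" "b"   # X itself gained the column, hence copy(DT) before each run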
DT[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id]
To refer to start and end by group, we need to leave the DT$ prefix out.
We use vapply() rather than lapply() because if the right hand side of := is a list, it is interpreted as a list of columns (and since only one column is expected, only the first element, a 0, is taken into account and recycled).
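Two quick checks on a toy table dd illustrate both points (illustration only, not the question's data):
library(data.table)
dd <- data.table(g = c("a", "a", "b"), x = 1:3)
# lapply() returns a list, vapply() an atomic vector of the declared type
str(lapply(1:3, function(v) v + 1L))               # List of 3
str(vapply(1:3, function(v) v + 1L, integer(1)))   # int [1:3] 2 3 4
# inside j, dd$x is always the full column; bare x is just the current group
dd[, .(whole = length(dd$x), grouped = length(x)), by = g]
#    g whole grouped
# 1: a     3       2
# 2: b     3       1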
I'm trying to build an efficient for loop for this function proposed by minem here: (Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table)
My data are:
library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)
adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID"
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)
lsr <- cbind.data.frame(
c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"
lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)
adherence <- as.data.table(adherence)
The function proposed by minem is:
by_minem2 <- function(dt = lsr2) {
d <- as.numeric(as.Date("2013-02-01"))
dt[, ENDDATE2 := as.numeric(ENDDATE)]
x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
uid <- unique(dt$ID)
id2 <- setdiff(uid, x$ID)
id2 <- uid[!(uid %in% x$ID)]
x2 <- data.table(ID = id2, V1 = 0)
x <- rbind(x, x2)
setkey(x, ID)
x
}
This returns:
> by_minem2(lsr)
ID V1
1: 1 64
2: 2 0
3: 3 63
For the loop I need to include information about which time I evaluated at, so the ideal repeated output looks like this:
cbind(as.Date("2013-02-01"),by_minem2(lsr))
I then want to repeat this for different dates a few hundred times putting everything into the same data.table:
time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at
I'm trying to do this with a for loop like this:
for (d in min(time.months):max(time.months))
{
by_minem <- function(dt = lsr2) {
d <- as.numeric(d)
dt[, ENDDATE2 := as.numeric(ENDDATE)]
x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
uid <- unique(dt$ID)
id2 <- setdiff(uid, x$ID)
id2 <- uid[!(uid %in% x$ID)]
x2 <- data.table(ID = id2, V1 = 0)
x <- rbind(x, x2)
setkey(x, ID)
xtot <- append(xtot,x)
xtot <- cbind(d, xtot) # i need to know time of evaluation
xtot
}
}
As indicated in the answer to the related question Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table, this can be solved by updating in a non-equi join which is possible with data.table.
The difference to the linked question is that here we need to create the cross join CJ() of all unique IDs with the vector of dates on our own before joining with lsr.
The OP has provided a series of dates time.months whose definition
time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at
leads to "crooked" dates which is only visible if coerced to numeric or POSIXct:
head(lubridate::as_datetime(time.months))
[1] "2013-02-01 00:00:00 UTC" "2013-03-03 10:30:00 UTC" "2013-04-02 21:00:00 UTC"
[4] "2013-05-03 07:30:00 UTC" "2013-06-02 18:00:00 UTC" "2013-07-03 04:30:00 UTC"
The issue is that these "dates" are not aligned with midnight but start somewhere during the day. To avoid these ambiguities, the seq() function can be used
dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")
which creates a series of dates starting on the first day of each month.
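A quick check confirms that these dates are aligned with the first of each month:
head(dates)
# [1] "2013-02-01" "2013-03-01" "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01"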
In addition, data.table's IDate class is used, which stores dates as integers (4 bytes) instead of doubles (8 bytes). This saves memory as well as processing time because the usually faster integer arithmetic can be used.
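A quick check of the underlying storage (illustration only):
library(data.table)
d  <- as.Date("2013-02-01")
di <- as.IDate("2013-02-01")
typeof(d)    # "double"  -> 8 bytes per element
typeof(di)   # "integer" -> 4 bytes per element
identical(as.integer(d), as.integer(di))   # TRUE: same day count since 1970-01-01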
# coerce Date to IDate
idates <- as.IDate(dates)
setDT(lsr)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]
# cross join unique IDs with dates
CJ(ID = lsr$ID, date = idates, unique = TRUE)[
# initialize result column
, AH := 0L][
# non-equi join and ...
lsr, on = .(ID, date >= eksd, date < ENDDATE),
# ... update only matching rows
AH := as.integer(ENDDATE - x.date)][
# reshape from long to wide format
, dcast(.SD, ID ~ date)]
ID 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 [...]
1: 1 64 36 5 0 0 0 0
2: 2 0 0 110 80 49 19 0
3: 3 63 35 4 0 0 0 0
Caveat
Note that the above code assumes that the intervals [eksd, ENDDATE) for each ID do not overlap. This can be verified by
lsr[order(eksd), all(eksd - shift(ENDDATE, fill = 0) > 0), keyby = ID]
ID V1
1: 1 TRUE
2: 2 TRUE
3: 3 TRUE
In case there are overlaps, the above code can be modified to aggregate within the non-equi join using by = .EACHI.
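One possible way to do that, as an untested sketch (the flipped join direction, the x./i. prefixes, and the renaming are my own choices, not part of the original answer; it reuses lsr and idates from above):
# one (ID, date) cell may now match several rows of lsr, so sum the
# contributions per row of the cross join instead of assigning them
cj  <- CJ(ID = lsr$ID, date = idates, unique = TRUE)
res <- lsr[cj, on = .(ID, eksd <= date, ENDDATE > date),
           .(AH = sum(as.integer(x.ENDDATE - i.date), na.rm = TRUE)),
           by = .EACHI]
# the join columns are named after lsr's columns but hold cj's values,
# so 'eksd' in res is in fact the evaluation date
setnames(res, "eksd", "date")
dcast(res, ID ~ date, value.var = "AH")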
Benchmark
In another related question, "data.table by = xx How do i keep the groups of length 0 when i returns no match", the OP has pointed out that performance is crucial due to the size of his production data.
According to the OP's comments, lsr has 20 million rows and 12 columns, and the adherence dataset, which I'm trying not to use, has 1.5 million rows of 2 columns. In another question, the OP mentions that lsr is a few hundred million rows.
minem has responded to this by providing a benchmark in his answer. We can use this benchmark data to compare the different answers.
# create benchmark data
lsr <- data.frame(
ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD
n <- 5e4
lsr2 <- lapply(1:n, function(x) lsr)
lsr2 <- rbindlist(lsr2, use.names = T, fill = T, idcol = T)
lsr2[, ID := as.integer(paste0(.id, ID))]
Thus, the benchmark dataset consists of 400 k rows and 150 k unique IDs:
lsr2[, .(.N, uniqueN(ID))]
N V2
1: 400000 150000
# pull data preparation out of the benchmark
lsr2i <- copy(lsr2)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]
lsr2[, ENDDATE2 := as.numeric(ENDDATE)]
# define date series
dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")
idates <- seq(as.IDate("2013-02-01"), length.out = 193, by = "month")
# run benchmark
library(microbenchmark)
bm <- microbenchmark(
minem = {
dt <- copy(lsr2)
xtot <- lapply(dates, function(d) {
d <- as.numeric(d)
x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
uid <- unique(dt$ID)
id2 <- setdiff(uid, x$ID)
id2 <- uid[!(uid %in% x$ID)]
if (length(id2) > 0) {
x2 <- data.table(ID = id2, V1 = 0)
x <- rbind(x, x2)
}
setkey(x, ID)
x
})
for (x in seq_along(xtot)) {
setnames(xtot[[x]], c("ID", paste0("V", x)))
}
xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
xtot
},
uwe = {
dt <- copy(lsr2i)
CJ(ID = dt$ID, date = idates, unique = TRUE)[, AH := 0L][
dt, on = .(ID, date >= eksd, date < ENDDATE),
AH := as.integer(ENDDATE - x.date)][, dcast(.SD, ID ~ date)]
},
times = 1L
)
print(bm)
The result for one run shows that the non-equi join is more than 4 times faster than the lapply() approach.
Unit: seconds
expr min lq mean median uq max neval
minem 27.654703 27.654703 27.654703 27.654703 27.654703 27.654703 1
uwe 5.958907 5.958907 5.958907 5.958907 5.958907 5.958907 1
Something like this:
dt <- lsr
dt[, ENDDATE2 := as.numeric(ENDDATE)]
s <- time.months
xtot <- lapply(s, function(d) {
d <- as.numeric(d)
x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
uid <- unique(dt$ID)
id2 <- setdiff(uid, x$ID)
id2 <- uid[!(uid %in% x$ID)]
if (length(id2) > 0) {
x2 <- data.table(ID = id2, V1 = 0)
x <- rbind(x, x2)
}
setkey(x, ID)
x
})
for (x in seq_along(xtot)) {
setnames(xtot[[x]], c("ID", paste0("V", x)))
}
xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
xtot
I have a dataframe loaded in RStudio with information about numerous events (millions).
Each row is an entry for a single event and, apart from other information, it includes two attributes with date information: the first contains the date when the event began and the second the date when it ended. The events are not sequential, so they might overlap in time.
fecha fecha_fin
7510607 2014-02-13 20:09:59.8270000 2014-02-27 09:55:40.9700000
7510608 2014-02-13 20:10:01.1870000 2014-02-27 09:55:42.5630000
7557931 2014-02-16 05:32:08.6230000 2014-02-16 14:03:19.4970000
What could be the best and most efficient option to find which calendar days had no activity (without any event in process)? Please keep in mind that the duration of the events must be taken into consideration.
I tend to use foverlaps from the data.table package for such cases, e.g.:
library(data.table)
dt <- fread("id,fecha,fecha_fin
7510607,2014-02-01 20:09:59.8270000,2014-02-10 09:55:40.9700000
7510607,2014-02-13 20:09:59.8270000,2014-02-27 09:55:40.9700000
7510608,2014-02-13 20:10:01.1870000,2014-02-27 09:55:42.5630000
7557931,2014-02-16 05:32:08.6230000,2014-02-16 14:03:19.4970000")
setkey(dt, fecha, fecha_fin)
set(dt, j = 1L, value = NULL)
dt <- dt[,lapply(.SD, as.POSIXct, tz = "CET"),.SDcols=1:2]
dt2 <- data.table(fecha=as.POSIXct(seq(min(as.Date(dt$fecha)), max(as.Date(dt$fecha_fin)), "1 day")))[,fecha_fin:=fecha+60*60*24-1]
as.Date(foverlaps(dt2, dt)[is.na(fecha) & is.na(fecha_fin),i.fecha])
# [1] "2014-02-11" "2014-02-12"
Update, with slightly modified code from lukeA:
I hope there is nothing wrong with my benchmarking here...
library(data.table)
library(lubridate)
library(microbenchmark)
# Create dt ---------------------------------------------------------------
size = 99999
# With this size the result is an empty set; check smaller sizes like 999 to
# confirm that the results are the same for both functions
create_dt <- function() {
set.seed(2016)
dt <- data.table(
ID = 1:size,
fecha = sample(
seq(ymd('2000/01/01'), ymd('2016/11/16'), by="day"),
size, replace = TRUE)
)
dt[, fecha_fin := fecha + sample(1:3, size, replace = TRUE)]
setkey(dt, fecha, fecha_fin)
set(dt, j = 1L, value = NULL)
dt <- dt[,lapply(.SD, as.POSIXct, tz = "CET"),.SDcols=1:2]
}
dt <- create_dt()
# Declare functions -------------------------------------------------------
f_mdz <- function() {
dt_2 <- data.table(
fecha = seq(min(dt$fecha), max(dt$fecha_fin), by = '1 day')
# Function simplified here!!!
)[, fecha_fin := fecha]
# ---------------------------
as.Date(
foverlaps(dt_2, dt)[is.na(fecha) & is.na(fecha_fin),i.fecha])#,
# origin = '1970-01-01')
}
f_lukeA <- function() {
dt2 <- data.table(
fecha = seq(min(dt$fecha), max(dt$fecha_fin), "1 day")
)[,fecha_fin:=fecha+60*60*24-1]
as.Date(
foverlaps(dt2, dt)[is.na(fecha) & is.na(fecha_fin),i.fecha])
}
# Benchmark! --------------------------------------------------------------
microbenchmark(
dt_mdz <- f_mdz(),
dt_lukeA <- f_lukeA(),
times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dt_mdz <- f_mdz() 46.96793 55.11631 95.59214 60.33659 191.5536 212.4523 100 a
# dt_lukeA <- f_lukeA() 50.57496 56.42464 105.07356 60.81974 194.0779 211.8037 100 a
identical(dt_mdz, dt_lukeA)
Old answer here:
A point of departure (far from being efficient, e.g. row-wise operations on data.table...) for further investigation could be:
library(data.table)
library(lubridate)
library(magrittr)
dt <- data.table(
ID = c(7510607L, 7510608L, 7557931L),
fecha = ymd(c('2014-02-15', '2014-02-16', '2014-02-11')),
fecha_fin = ymd(c('2014-02-27', '2014-02-27', '2014-02-12'))
)
# ID fecha fecha_fin
# 1: 7510607 2014-02-15 2014-02-27
# 2: 7510608 2014-02-16 2014-02-27
# 3: 7557931 2014-02-11 2014-02-12
# Make the data "long"
long_dt <- dt[, .(days = seq(fecha, fecha_fin, by = '1 day')), by = ID]
# Get the diff with days sequence from min to max date
setdiff(
seq(long_dt[, min(days)], long_dt[, max(days)], by = '1 day'),
long_dt[, sort(unique(days))]
) %>% as.Date(origin = '1970-01-01')
# [1] "2014-02-13" "2014-02-14"
Please note I have changed your data to actually have two days (2014-02-13 and 2014-02-14) without any activity.
A base R solution would be this:
df$fecha <- strptime(df$fecha, "%Y-%m-%d")
df$fecha_fin <- strptime(df$fecha_fin, "%Y-%m-%d")
dates_list <- lapply(seq_len(nrow(df)), function(x) {
  seq(from = df$fecha[x], to = df$fecha_fin[x], by = "days")
})
interval_events <- unique(do.call("c", dates_list))
interval_complete <- seq(from = min(df$fecha), max(df$fecha_fin), by = "days")
interval_complete[!(interval_complete %in% interval_events)]
#[1] "2014-02-13 CET" "2014-02-14 CET"
Here is a simple one! You just expand the dates and take a union of all the dates.
## Data
dt1=as.Date(c('2014/01/01','2014/01/08','2014/01/05'))
dt2=as.Date(c('2014/01/10','2014/01/14','2014/01/05'))
df=data.frame(id=sample(1:3), dt1=dt1, dt2=dt2)
## Code
date=apply(df, 1, function(x) seq(as.Date(x[2]), as.Date(x[3]), by="day"))
event_dates=as.Date(Reduce(union, date), origin = "1970-01-01")
I hope this is an acceptable R/data.table problem.
I have a 3-column table with:
id: geographic location IDs (303,453 locations)
month: month over 25 years, 1990-2014
spei: a climatic index that varies between -7 and 7.
I need to count the occurrence of droughts at each location over the entire 1990-2014 period. A drought event is defined as "a period in which the SPEI is continuously negative and the SPEI reaches a value of -1.0 or less. Drought starts when the SPEI first falls below zero and ends with the first positive SPEI value following a value of -1.0 or less".
I know this should be feasible using shift() and rolling joins but would very much welcome some help!
# Sample table structure
dt <- data.table(
id = rep(1:303453, each=25*12),
month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), 303453),
spei = runif(303453*25*12, -7, 7))
# A minimal example with 1 location over 12 months
library(data.table)
library(xts)
dt <- data.table(
id = rep("loc1", each=12),
month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))
spei.ts <- xts(dt$spei, order.by=dt$month, frequency="month")
plot(spei.ts, type="bars")
This shows 3 drought events over a 1-year period. This is what I need to identify and count.
Hoping some of you are more used to working with time series.
Many thanks, --Mel.
Here is a starting point to get the result you want.
Probably experts can suggest improvements in speed.
EDIT: improved speed ~8x by removing paste.
library(data.table)
set.seed(42)
n <- 300 # 303453 will be ~1000 times slower
dt <- data.table(
id = rep(1:n, each=25*12),
month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), n),
spei = runif(n*25*12, -7, 7))
system.time({
dt[, `:=`(neg = (spei < 0), neg1 = (spei <= -1))]
dt[, runid := ifelse(neg, rleid(neg), NA)]
res <- dt[!is.na(runid),
.(length = .N[any(neg1)], start = min(month), end = max(month)),
by = .(id, runid)][!is.na(length)]
})
# user system elapsed
# 0.345 0.000 0.344
# counts of droughts per id:
res[, .(nDroughts = .N), by = id]
# list of droughts per id: (NB: don't include 1st positive value after)
res[, .(droughtN = seq_len(.N), start, end), by = id]
Update based on comment...
If all that was needed was the counts then
# Let 'sp' = starting point of potential drought
# Let 'dv' = drought level validation
# The cumsum just gives unique ids to group by.
dt[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
dt[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]
dt[sp & dv, .N, by = id]
Yet, as stated in the comments, you've already been there, so you've seen how shift can be used. Since you like the idea of identifying the dates as well, why not use shift there too?
# Extending the previous columns...
dt[, ep := (shift(spei, type = "lead", fill = 1) > 0) & (spei <= 0), by = id]
cbind(dt[sp & dv, .(start = month), by = id],
dt[ep & dv, .(end = month), by = id][,id := NULL])
If you want the dates to be as indicated by the red lines in the plot, just add a month unless it's the last one. We can also get the lengths...
# Extending the previous columns again...
dt[, end.month := shift(month, type = "lead", fill = month[.N]), by = id]
dt[, orig.id := .I]
starts <- dt[sp & dv][, did := .I]
ends <- dt[ep & dv][, did := .I]
starts[ends, on = "did"][
,.(id = id, length = 1 + i.orig.id - orig.id, start = month, end = i.end.month)]
Would yield
id length start end
1: loc1 3 2014-01-01 2014-04-01
2: loc1 1 2014-05-01 2014-06-01
3: loc1 2 2014-07-01 2014-09-01
And it is still fast! With n=300
> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+ thellcounts = thell.counts(copy(dt)),
+ thell .... [TRUNCATED]
Unit: milliseconds
expr min lq mean median uq max neval
max 218.19152 220.30895 342.18605 222.75507 250.36644 1350.15847 10
thellcounts 20.36785 22.27349 28.45167 23.39313 24.38610 78.25046 10
thelldates 28.24378 28.64849 30.59897 30.57793 31.25352 34.51569 10
thelldates2 36.19724 39.79588 42.34457 41.52455 42.41872 57.28073 10
With n=3000
> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+ thellcounts = thell.counts(copy(dt)),
+ thell .... [TRUNCATED]
Unit: milliseconds
expr min lq mean median uq max neval
max 2126.1138 2148.3453 2207.7801 2205.3536 2241.2848 2340.1203 10
thellcounts 197.7312 202.4817 234.2949 205.4828 304.1556 309.1028 10
thelldates 261.9889 264.5597 283.9970 266.1244 267.8603 374.6406 10
thelldates2 320.6352 331.7558 374.4110 340.2668 439.1490 441.8473 10