I'm stuck trying to find a relatively simple way to count occurrences within a date range, by group, using R. I suspect there has to be an easier way than what I'm trying.
I have over 6,000 groups; each group contains anywhere from 1 to 100 IDs, each with a start date and an end date somewhere between Jan 1, 1990 and today. I want to make a data frame with one group per column and one day per row, counting the number of IDs active per day from April 1, 2013 until March 31, 2018. For obvious reasons, COUNTIFS in Excel will not cut it.
I was trying to use this question as a starting point, like so:
df1 <- data.frame(group = c(1, 1, 2, 3, 3),
                  id = c(1, 2, 1, 1, 2),
                  startdate = c("2016-01-01", "2016-04-04", "2016-03-02", "2016-08-01", "2016-04-01"),
                  enddate = c("2016-04-04", "2999-01-01", "2016-05-02", "2016-08-05", "2999-01-01"))
report <- data.frame(date = seq(from = as.Date("2016-04-01"), by = "1 day", length.out = 7))
report <- cbind(report, matrix(data = NA, nrow = 7, ncol = 3))
names(report) <- c('date', as.vector(unique(df1$group)))
daily <- function(i, ...) {
  report[, i + 1] <- sapply(report$date, function(x)
    sum(as.Date(df1$startdate) < as.Date(x) &
        as.Date(df1$enddate) > as.Date(x) &
        df1$group == unique(df1$group)[i]))
}
for (i in unique(df1$group))
  daily(i)
However, this doesn't seem to do anything (nor does it throw errors). Is there an easier way to do this? Am I way off base? Any help is appreciated for this non-programmer!
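(Aside: a likely reason the loop appears to do nothing is that the assignment inside daily() modifies a local copy of report, which is discarded when the function returns. A minimal sketch of a fix, reusing the df1 and report objects above, is to return the counts from the function and do the assignment outside it:)
daily <- function(i, ...) {
  sapply(report$date, function(x)
    sum(as.Date(df1$startdate) < as.Date(x) &
        as.Date(df1$enddate) > as.Date(x) &
        df1$group == unique(df1$group)[i]))
}
for (i in seq_along(unique(df1$group)))
  report[, i + 1] <- daily(i)  # i is now a position, so it also indexes unique(df1$group) correctly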
Additional help requested: I'm trying to modify Jaap's code in the answer below to include group start and end dates, so that the data table shows NA when the group is not active.
Example data:
df2 <- data.frame(group = c(1, 1, 2, 3, 3),
                  groupopendate = c("2016-04-02", "2016-04-02", "2016-04-01", "2016-04-02", "2016-04-02"),
                  groupclosedate = c("2016-04-08", "2016-04-08", "2016-04-10", "2016-04-09", "2016-04-09"),
                  id = c(1, 2, 1, 1, 2),
                  startdate = c("2016-04-02", "2016-04-04", "2016-04-03", "2016-04-02", "2016-04-05"),
                  enddate = c("2016-04-04", "2016-04-06", "2016-04-10", "2016-04-08", "2016-04-08"))
Jaap's solution gives me this:
active grp1 grp2 grp3
1: 2016-04-02 1 0 1
2: 2016-04-03 1 1 1
3: 2016-04-04 1 1 1
4: 2016-04-05 1 1 2
5: 2016-04-06 0 1 2
6: 2016-04-07 0 1 2
However, what I want is this:
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 1 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 0
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA NA NA
Any help is appreciated!
A possible alternative solution using data.table:
# load the package & convert 'df1' to a data.table
library(data.table)
setDT(df1)
# convert the date columns to a date format
# not needed if they are already of class Date
df1[, `:=` (startdate = as.Date(startdate), enddate = as.Date(enddate))]
# create a new data.table with the 'active' days
DT <- data.table(active = seq(from = as.Date("2016-04-01"), by = "day", length.out = 7))
# use a join and dcast to get the desired result
DT[df1
, on = .(active > startdate, active < enddate)
  , allow.cartesian = TRUE
, nomatch = 0
, .(active = x.active, group, id)
][, dcast(.SD, active ~ paste0("grp",group), value.var = "id", fun = length)]
which gives:
active grp1 grp2 grp3
1: 2016-04-01 1 1 0
2: 2016-04-02 1 1 1
3: 2016-04-03 1 1 1
4: 2016-04-04 0 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 1
7: 2016-04-07 1 1 1
NOTE: I've used paste0("grp", group) instead of just group in the dcast step because it leads to better column names (it is better not to use purely numeric values as column names; a quick illustration follows).
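For example (illustrative only; DT2 is a throwaway object, not part of the answer):
DT2 <- data.table(grp1 = 1:2, `1` = 10:11)
DT2$`1`       # a purely numeric name has to be back-quoted ...
DT2[["1"]]    # ... or indexed as a character string
DT2[, 1]      # and this selects the first column (grp1) by position, not the column named "1"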
With regard to your additional example, you could solve that as follows:
setDT(df2)
df2[, c(2:3,5:6) := lapply(.SD, as.Date), .SDcols = c(2:3,5:6)]
DT <- data.table(active = seq(from = min(df2$groupopendate),
to = max(df2$groupclosedate),
by = "day"))
df2new <- df2[, .(active = seq.Date(startdate, enddate, by = "day"))
, by = .(group, id)
][, .N, by = .(group, active)
][df2[, .(active = seq.Date(groupopendate[1], groupclosedate[.N] - 1, by = "day"))
, by = .(group)]
, on = .(group, active)
][is.na(N), N := 0
][, dcast(.SD, active ~ paste0("grp",group))]
nms <- setdiff(names(df2new), "active")
DT[df2new
, on = .(active)
, (nms) := mget(paste0("i.",nms))][]
which gives:
> DT
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 2 1 1
5: 2016-04-05 1 1 2
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 2
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA 1 NA
I've figured it out! As usual, as soon as you post a question, you figure out the answer. I was overcomplicating it by wrapping the sapply in a function, when I could just put it directly in the for loop.
If anyone is interested:
for (i in unique(df1$group)) {
  report[, i + 1] <-
    sapply(report$date, function(x)
      sum(as.Date(df1$startdate) < as.Date(x) &
          as.Date(df1$enddate) > as.Date(x) &
          df1$group == unique(df1$group)[i]))
}
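(A small speed note on this loop, as an aside: with ~6,000 groups and ~1,800 report dates, calling as.Date() inside sapply repeats the conversion work many times. Converting the columns once up front should give the same result, e.g. this sketch:)
df1$startdate <- as.Date(df1$startdate)
df1$enddate   <- as.Date(df1$enddate)
grps <- unique(df1$group)
for (i in seq_along(grps))
  report[, i + 1] <- sapply(report$date, function(x)
    sum(df1$startdate < x & df1$enddate > x & df1$group == grps[i]))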
Suppose that I have two tables (DT_sportA and DT_sportB) that measure time periods in which two children (id) played sport "A" and "B".
library(data.table)
library(lubridate)
DT_sportA <- data.table(id = rep(1:2, each = 2),
                        start_date = ymd(c("2000-01-01", "2002-01-15", "2014-03-12", "2016-10-14")),
                        end_date = ymd(c("2000-02-03", "2003-03-01", "2014-04-03", "2017-05-19")))
DT_sportA
# id start_date end_date
# 1: 1 2000-01-01 2000-02-03
# 2: 1 2002-01-15 2003-03-01
# 3: 2 2014-03-12 2014-04-03
# 4: 2 2016-10-14 2017-05-19
DT_sportB <- data.table(id = c(1L, 1L, 2L),
                        start_date = ymd(c("2000-01-15", "2002-01-15", "2017-02-10")),
                        end_date = ymd(c("2000-02-01", "2006-03-19", "2017-02-20")))
DT_sportB
# id start_date end_date
# 1: 1 2000-01-15 2000-02-01
# 2: 1 2002-01-15 2006-03-19
# 3: 2 2017-02-10 2017-02-20
I would like to generate a new table with all of the unique and overlapping date ranges with two categorical indicators denoting the sport played by the children. The desired DT should look like this:
id start_date end_date sportA sportB
1: 1 2000-01-01 2000-01-14 1 0
2: 1 2000-01-15 2000-02-01 1 1
3: 1 2000-02-02 2000-02-03 1 0
4: 1 2002-01-15 2002-03-01 1 1
5: 1 2002-03-02 2002-03-19 0 1
6: 2 2014-03-12 2014-04-03 1 0
7: 2 2016-10-14 2017-02-09 1 0
8: 2 2017-02-10 2017-02-20 1 1
9: 2 2017-02-21 2017-05-19 1 0
This is a fairly trivial toy example. The real data spans several million rows and approximately 20 "sports", which is why I am looking for a data.table solution.
Notes:
when doing similar/same things to multiple tables, I find it is almost always preferable to operate on them as a list of tables instead of as individual objects; this solution will work without that (with some adaptation), but I believe the mind-shift makes many things easier;
further, I actually think a long-format is better than a list-of-tables here, as we can still differentiate id and sport with ease;
your expected output is a little inconsistent in how it avoids overlap between rows; for example, "2000-01-14" is not in the data, but it appears as an end_date, suggesting that "2000-01-15" was pulled back because the next row starts on that date ... yet there is a start on "2000-02-02" for apparently similar (but reversed) reasons; one way around this is to subtract a very small number from end_date so that no id/sport/date range will match multiple rows. I say "small number" and not 1 because Date-class objects are really numeric, and dates can be fractional: it is not displayed fractionally, but it is still fractional; compare Sys.Date() - 0.1 with dput(Sys.Date() - 0.1) (a short illustration follows these notes).
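For example (illustrative; the exact number depends on the day you run it):
Sys.Date() - 0.1         # still prints as a whole date
dput(Sys.Date() - 0.1)   # but dput reveals the fractional underlying numeric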
sports <- rbindlist(mget(ls(pattern = "DT_sport.*")), idcol = "sport")
sports[, sport := gsub("^DT_", "", sport) ] # primarily aesthetics
# sport id start_date end_date
# <char> <int> <Date> <Date>
# 1: sportA 1 2000-01-01 2000-02-03
# 2: sportA 1 2002-01-15 2003-03-01
# 3: sportA 2 2014-03-12 2014-04-03
# 4: sportA 2 2016-10-14 2017-05-19
# 5: sportB 1 2000-01-15 2000-02-01
# 6: sportB 1 2002-01-15 2006-03-19
# 7: sportB 2 2017-02-10 2017-02-20
I tend to like piping data.table, and since I'm still on R-4.0.5, I use magrittr::%>% for this; it is not strictly required, but I feel it helps readability (and therefore maintainability, etc). (I don't know if this will work as easily in R-4.1's native |> pipe, as that has more restrictions on the RHS data placement.)
library(magrittr)
out <- sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
sports[., on = .(id, start_date <= sd, end_date >= ed) ] %>%
.[ !is.na(sport), ] %>%
.[, val := 1L ] %>%
dcast(id + start_date + end_date ~ sport, value.var = "val", fill = 0)
out
# id start_date end_date sportA sportB
# <int> <Date> <Date> <int> <int>
# 1: 1 2000-01-01 2000-01-14 1 0
# 2: 1 2000-01-15 2000-01-31 1 1
# 3: 1 2000-02-01 2000-02-02 1 0
# 4: 1 2002-01-15 2003-02-28 1 1
# 5: 1 2003-03-01 2006-03-19 0 1
# 6: 2 2014-03-12 2014-04-02 1 0
# 7: 2 2016-10-14 2017-02-09 1 0
# 8: 2 2017-02-10 2017-02-19 1 1
# 9: 2 2017-02-20 2017-05-19 1 0
Walk-through:
the first sports[, {...}] produces just the feasible date ranges, per id; it produces more than needed, and the extras are filtered out a little later; I combine this with a slight offset to end_date so that rows are mutually exclusive (see the overlap note above); while the ranges appear to be separated by full days, they are only separated by just under 1 second; I add secdiff to show this here:
sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
.[, secdiff := c(as.numeric(sd[-1] - ed[-.N], units="secs"), NA), by = .(id) ]
# id sd ed secdiff
# <int> <Date> <Date> <num>
# 1: 1 2000-01-01 2000-01-14 0.8640000
# 2: 1 2000-01-15 2000-01-31 0.8640000
# 3: 1 2000-02-01 2000-02-02 0.8640000
# 4: 1 2000-02-03 2002-01-14 0.8640000 # will be empty post-join
# 5: 1 2002-01-15 2003-02-28 0.8640000
# 6: 1 2003-03-01 2006-03-19 NA
# 7: 2 2014-03-12 2014-04-02 0.8640001
# 8: 2 2014-04-03 2016-10-13 0.8640001 # will be empty post-join
# 9: 2 2016-10-14 2017-02-09 0.8640001
# 10: 2 2017-02-10 2017-02-19 0.8640001
# 11: 2 2017-02-20 2017-05-19 NA
by the way, the first operation on sports[..] in the previous bullet is wrapped in a { block for a slight efficiency gain, so that sort(unique(c(start_date, end_date))) is only computed once;
left-join sports onto this, on id and the date ranges; this produces NA values in the sport column for the date ranges that were generated programmatically (as a simple sequence of dates) but have no sport assigned; these unneeded rows are removed by the !is.na(sport) filter;
assigning val := 1L is purely so that we have a value column during reshaping;
dcast reshapes and fills the missing values with 0 (a tiny standalone illustration follows).
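For instance, on a hypothetical toy table (not the sports data), the fill argument supplies the 0 for combinations that never occur:
tmp <- data.table(id = c(1L, 1L, 2L), sport = c("sportA", "sportB", "sportA"), val = 1L)
dcast(tmp, id ~ sport, value.var = "val", fill = 0)
#    id sportA sportB
# 1:  1      1      1
# 2:  2      1      0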
For some reason, the i variable in the for loop cannot be used as the grouping name. I have 40 elements in the for loop; I am showing just 2 here as an example.
data = data.table(id = c(1,1,1,1,1), a = c(1,1,2,3,NA), b = c(1,2,2,NA,3))
> data
id a b
1: 1 1 1
2: 1 1 2
3: 1 2 2
4: 1 3 NA
5: 1 NA 3
categories = data.table(CATEGORY = c(1,2,3,NA))
> categories
CATEGORY
1: 1
2: 2
3: 3
4: NA
What I have done:
for (i in colnames(data)[2:3]) {
  dt = data[, .N, i][order(i)]
  setnames(dt, "N", i)
  categories = cbind(categories, dt[, 2])
}
> categories
CATEGORY a b
1: 1 2 1
2: 2 2 1
3: 3 2 1
4: NA 2 1
I have also tried dplyr piping instead of the data.table .N approach, and it did not work:
data %>% count(i)
What I need:
> categories
CATEGORY a b
1: 1 2 1
2: 2 1 2
3: 3 1 1
4: NA 1 1
You could reshape the data instead, which makes this easier to calculate.
In data.table :
library(data.table)
long_data <- melt(data, 'id')
dcast(long_data[, .N, .(variable, value)], value~variable, value.var = 'N')
# value a b
#1: NA 1 1
#2: 1 2 1
#3: 2 1 2
#4: 3 1 1
Or in tidyverse :
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = -id) %>%
count(name, value) %>%
pivot_wider(names_from = name, values_from = n)
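As an aside (not part of the answer above): if you do want to keep the original loop, one sketch is to use get() so the loop variable refers to the column, and a join to align the counts to the CATEGORY rows. This assumes data and categories as defined in the question, and relies on data.table joins matching NA to NA:
for (i in colnames(data)[2:3]) {
  counts <- data[, .N, by = .(CATEGORY = get(i))]              # counts per value of column i
  categories[, (i) := counts[categories, N, on = "CATEGORY"]]  # align counts to the CATEGORY order
}
categories
#    CATEGORY a b
# 1:        1 2 1
# 2:        2 1 2
# 3:        3 1 1
# 4:       NA 1 1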
I ran into an issue with the RcppRoll package. I want to use it to sum the values of the past 3 months; however, sometimes there is no data for 1 or more months. With n = 3 it considers the last three observations rather than the last 3 months. I couldn't find a solid solution, so I am trying my luck here. Thank you in advance for any suggestions.
P.S. I prefer to work with data.table and RcppRoll, as my dataset is large and I am familiar with those.
Code:
library("data.table")
library("RcppRoll")
test = data.table(id = rep(1, 8),
                  date = c("2015-01","2015-02","2015-03","2015-04","2015-08","2015-09","2015-10","2015-11"),
                  value = 1:8)
test = test[, var:= roll_sumr(value, n = 3, na.rm = TRUE), by = id]
id date value var
1: 1 2015-01 1 NA
2: 1 2015-02 2 NA
3: 1 2015-03 3 6
4: 1 2015-04 4 9
5: 1 2015-08 5 12
6: 1 2015-09 6 15
7: 1 2015-10 7 18
8: 1 2015-11 8 21
Expected output
prefered_outcome = data.table(id = rep(1, 8),
                              date = c("2015-01","2015-02","2015-03","2015-04","2015-08","2015-09","2015-10","2015-11"),
                              value = 1:8,
                              var = c(NA, NA, 6, 9, NA, NA, 18, 21))
id date value var
1: 1 2015-01 1 NA
2: 1 2015-02 2 NA
3: 1 2015-03 3 6
4: 1 2015-04 4 9
5: 1 2015-08 5 NA
6: 1 2015-09 6 NA
7: 1 2015-10 7 18
8: 1 2015-11 8 21
Define ym of class yearmon and check whether the prior and second-prior ym values are one and two months back; if so, use roll_sumr, otherwise use NA.
library(zoo)
ym <- test[, as.yearmon(date)]
test[, roll := ifelse(ym - 1/12 == shift(ym) & ym - 2/12 == shift(ym, 2),
roll_sumr(value, 3, na.rm = TRUE), NA), by = id ]
giving:
> test
id date value roll
1: 1 2015-01 1 NA
2: 1 2015-02 2 NA
3: 1 2015-03 3 6
4: 1 2015-04 4 9
5: 1 2015-08 5 NA
6: 1 2015-09 6 NA
7: 1 2015-10 7 18
8: 1 2015-11 8 21
You can add the missing months first and then apply the function. Afterwards, the added months can be removed again (see the sketch after the code).
library(data.table)
library("RcppRoll")
library(zoo)
test = data.table(id = rep(1, 8),
                  date = c("2015-01","2015-02","2015-03","2015-04","2015-08","2015-09","2015-10","2015-11"),
                  value = 1:8)
test$date <- as.yearmon(test$date)
allMonths <- seq.Date(from = as.Date(test$date[1]),
                      to = as.Date(test$date[length(test$date)]),
                      by = "month")
df2 <- data.frame(date=as.yearmon(allMonths))
df3 <- merge(test,df2, all=TRUE)
df3 <- df3[, var:= roll_sumr(value, n = 3, na.rm = TRUE), by = id]
df3
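The padding rows can then be dropped again, for example (a sketch; the padded months are the ones with a missing value):
df3 <- df3[!is.na(value)]   # keep only the months that were in the original data
df3
Note that with na.rm = TRUE the padded months are ignored inside the rolling window; to reproduce the NA entries in the expected output you would likely want na.rm = FALSE in the roll_sumr call here.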
I am interested in finding the number of days since the last event per ID. The data looks like this:
df <- data.frame(date = as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
                                  "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"),
                                "%d/%m/%Y"),
                 event = c(0,0,1,0,1, 1,0,0,0,1),
                 id = c(rep(1,5), rep(2,5)))
date event id
1 2000-07-06 0 1
2 2000-09-15 0 1
3 2000-10-15 1 1
4 2001-01-03 0 1
5 2001-03-17 1 1
6 2010-08-06 1 2
7 2010-09-15 0 2
8 2010-10-15 0 2
9 2011-01-03 0 2
10 2011-03-17 1 2
I am borrowing heavily from a data.table solution here, but it does not consider IDs.
library(data.table)
setDT(df)
setkey(df, date,id)
df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
df[, tae := difftime(lastevent, shift(lastevent, 1L, "lag"), unit = "days")]
df[event == 0, tae:= difftime(date, lastevent, unit = "days")]
It generates the following output
date lastevent event id tae
1: 2000-07-06 <NA> 0 1 NA days
2: 2000-09-15 <NA> 0 1 NA days
3: 2000-10-15 2000-10-15 1 1 NA days
4: 2001-01-03 2000-10-15 0 1 80 days
5: 2001-03-17 2001-03-17 1 1 153 days
6: 2010-08-06 2010-08-06 1 2 3429 days
7: 2010-09-15 2010-08-06 0 2 40 days
8: 2010-10-15 2010-08-06 0 2 70 days
9: 2011-01-03 2010-08-06 0 2 150 days
10: 2011-03-17 2011-03-17 1 2 223 days
My desired output however is as follows:
date lastevent event id tae
1: 2000-07-06 <NA> 0 1 NA days
2: 2000-09-15 <NA> 0 1 NA days
3: 2000-10-15 2000-10-15 1 1 NA days
4: 2001-01-03 2000-10-15 0 1 80 days
5: 2001-03-17 2001-03-17 1 1 153 days
6: 2010-08-06 2010-08-06 1 2 NA days
7: 2010-09-15 2010-08-06 0 2 40 days
8: 2010-10-15 2010-08-06 0 2 70 days
9: 2011-01-03 2010-08-06 0 2 150 days
10: 2011-03-17 2011-03-17 1 2 223 days
The only difference is the NA in row 6 and column tae. This is a related post that is unanswered. I have looked here, but the solution does not work in my case. There are many other questions like this but not for calculations per ID. Thank you!
df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"),
"%d/%m/%Y"), event=c(0,0,1,0,1, 1,0,1,0,1),id = c(rep(1,5),rep(2,5)))
tempdt <- df[event==1,]
tempdt[,tae := date - shift(date), by = id]
df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)
df[, tae := ifelse(shift(event)==1, date - shift(date), tae), by = id]
EDIT
More general solution
df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001", "18/03/2001",
"06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011","19/03/2011"),
"%d/%m/%Y"),
event=c(1,0,0,0,0,0,1,1,1,0,1,0),id = c(rep(1,6),rep(5,6)))
##for event = 1 observations
tempdt <- df[event==1,]
tempdt[,tae := date - shift(date), by = id]
df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)
##for event = 0 observations
for (d in df[event == 0, date]) {
  # print(as.Date(d, origin = "1970-01-01"))
  df[date == d & event == 0,
     tae := as.Date(d, origin = "1970-01-01") - max(df[date < d & event == 1, date]),
     by = id]
}
EDIT 2
Now, there must be a faster way to do this, but if the first observation is event = 0, this version won't produce any warning:
df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"),
"%d/%m/%Y"), event=c(0,0,1,0,1, 1,0,0,0,1),id = c(rep(1,5),rep(2,5)))
tempdt <- df[event==1,]
tempdt[,tae := date - shift(date), by = id]
df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)
for (i in unique(df[, id])) {
  # print(i)
  for (d in df[date > df[id == i & event == 1, min(date)] & event == 0, date]) {
    # print(as.Date(d, origin = "1970-01-01"))
    df[id == i & date == d & event == 0,
       tae := as.Date(d, origin = "1970-01-01") - max(df[date < d & event == 1, date])]
  }
}
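For reference, a loop-free sketch of the same idea using a rolling join keyed on id (an aside, not part of the answer above; it assumes the df from the original question and relies on roll = TRUE carrying the most recent event date forward within each id):
events <- df[event == 1, .(id, date, lastevent = date)]
df2 <- events[df, on = .(id, date), roll = TRUE]            # most recent event date on/before each row's date, per id
df2[, tae := date - lastevent]                              # days since the last event
df2[event == 1, tae := date - shift(lastevent), by = id]    # event rows: gap to the previous event within the id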
How to get the startdate and enddate (cumulative) from days?
Variable  days  startdate  enddate (cumulative)
A            2          0                     2
A            3          2                     5
B            4          5                     9
A            3          0                     3
B            8          3                    11
Using a combination of cumsum and shift:
dt1[, grp := cumsum(shift(Variable, fill='A')=='B')
][, startdate := cumsum(shift(days, fill = 0)), grp
][, enddate := startdate + days][]
which gives:
> dt1
Variable days grp startdate enddate
1: A 2 0 0 2
2: A 3 0 2 5
3: B 4 0 5 9
4: A 3 1 0 3
5: B 8 1 3 11
Used data:
dt1 <- data.table(Variable = c("A","A","B","A","B"), days = c(2,3,4,3,8))
If I understand correctly, you want the days + the startdate to be in the enddate column. Then you can simply use this:
df <- data.table(Variable=c("A", "A", "B", "A", "B"), days=c(2,3,4,3,8), startdate=c(0,2,5,9,12), enddate=c(0))
df$enddate <- as.numeric(df$days + df$startdate)