Suppose that I have two tables (DT_sportA and DT_sportB) that measure time periods in which two children (id) played sport "A" and "B".
library(data.table)
library(lubridate)
DT_sportA <- data.table(id = rep(1:2,each=2),
start_date = ymd(c("2000-01-01","2002-01-15","2014-03-12","2016-10-14")),
end_date = ymd(c("2000-02-03","2003-03-01","2014-04-03","2017-05-19")))
DT_sportA
# id start_date end_date
# 1: 1 2000-01-01 2000-02-03
# 2: 1 2002-01-15 2003-03-01
# 3: 2 2014-03-12 2014-04-03
# 4: 2 2016-10-14 2017-05-19
DT_sportB <- data.table(id = c(1L,1L,2L),
start_date = ymd(c("2000-01-15","2002-01-15","2017-02-10")),
end_date = ymd(c("2000-02-01","2006-03-19","2017-02-20")))
DT_sportB
# id start_date end_date
# 1: 1 2000-01-15 2000-02-01
# 2: 1 2002-01-15 2006-03-19
# 3: 2 2017-02-10 2017-02-20
I would like to generate a new table with all of the unique and overlapping date ranges with two categorical indicators denoting the sport played by the children. The desired DT should look like this:
id start_date end_date sportA sportB
1: 1 2000-01-01 2000-01-14 1 0
2: 1 2000-01-15 2000-02-01 1 1
3: 1 2000-02-02 2000-02-03 1 0
4: 1 2002-01-15 2002-03-01 1 1
5: 1 2002-03-02 2002-03-19 0 1
6: 2 2014-03-12 2014-04-03 1 0
7: 2 2016-10-14 2017-02-09 1 0
8: 2 2017-02-10 2017-02-20 1 1
9: 2 2017-02-21 2017-05-19 1 0
This is a fairly trivial toy example. The real data spans several million rows and approximately 20 "sports", which is why I am looking for a data.table solution.
Notes:
when doing similar/same things to multiple tables, I find it is almost always preferable to operate on them as a list of tables instead of individual objects; while this solution will work in general without this (some adaptation required), I believe it makes many things worth the mind-shift;
further, I actually think a long-format is better than a list-of-tables here, as we can still differentiate id and sport with ease;
your expected output is a little inconsistent in how it avoids overlap between rows: for example, "2000-01-14" is not in the data but appears as an end_date, suggesting that the range ending on "2000-01-15" was shortened because the next row starts on that date; yet another row starts on "2000-02-02" for apparently similar (but reversed) reasons. One way around this is to subtract a very small number from end_date so that no id/sport/date range will match multiple rows. I say "small number" and not 1 because Date-class objects are really numeric and can be fractional: a fractional date is not displayed fractionally, but it is still stored that way; compare Sys.Date()-0.1 with dput(Sys.Date()-0.1).
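To make that last note concrete, here is a minimal sketch (the names d, d2, shown and secs are mine, not from the question) showing that a Date is stored as a number of days, that a fractional offset is kept, and that printing hides it:

```r
d  <- as.Date("2000-01-15")
unclass(d)                                   # days since 1970-01-01, a plain number
d2 <- d - 1e-5                               # shift back by 1e-5 days (under one second)
shown <- format(d2)                          # printed form drops the fraction: previous day
secs  <- as.numeric(d - d2, units = "secs")  # the real gap between d2 and d, in seconds
```

So d2 prints as the previous day while still being less than a second away from d, which is what keeps the ranges in the join mutually exclusive.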
sports <- rbindlist(mget(ls(pattern = "DT_sport.*")), idcol = "sport")
sports[, sport := gsub("^DT_", "", sport) ] # primarily aesthetics
# sport id start_date end_date
# <char> <int> <Date> <Date>
# 1: sportA 1 2000-01-01 2000-02-03
# 2: sportA 1 2002-01-15 2003-03-01
# 3: sportA 2 2014-03-12 2014-04-03
# 4: sportA 2 2016-10-14 2017-05-19
# 5: sportB 1 2000-01-15 2000-02-01
# 6: sportB 1 2002-01-15 2006-03-19
# 7: sportB 2 2017-02-10 2017-02-20
I tend to like piping data.table, and since I'm still on R-4.0.5, I use magrittr::%>% for this; it is not strictly required, but I feel it helps readability (and therefore maintainability, etc). (I don't know if this will work as easily in R-4.1's native |> pipe, as that has more restrictions on the RHS data placement.)
library(magrittr)
out <- sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
sports[., on = .(id, start_date <= sd, end_date >= ed) ] %>%
.[ !is.na(sport), ] %>%
.[, val := 1L ] %>%
dcast(id + start_date + end_date ~ sport, value.var = "val", fill = 0)
out
# id start_date end_date sportA sportB
# <int> <Date> <Date> <int> <int>
# 1: 1 2000-01-01 2000-01-14 1 0
# 2: 1 2000-01-15 2000-01-31 1 1
# 3: 1 2000-02-01 2000-02-02 1 0
# 4: 1 2002-01-15 2003-02-28 1 1
# 5: 1 2003-03-01 2006-03-19 0 1
# 6: 2 2014-03-12 2014-04-02 1 0
# 7: 2 2016-10-14 2017-02-09 1 0
# 8: 2 2017-02-10 2017-02-19 1 1
# 9: 2 2017-02-20 2017-05-19 1 0
Walk-through:
the first sports[, {...}] produces just the feasible date-ranges, per-id; it will produce more than needed, and the extras are filtered out a little later; I combine this with a slight offset to end_date so that rows are mutually exclusive (third note above); while the rows appear to be separated by full days, they are actually separated by less than 1 second; I add secdiff to show this here:
sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
.[, secdiff := c(as.numeric(sd[-1] - ed[-.N], units="secs"), NA), by = .(id) ]
# id sd ed secdiff
# <int> <Date> <Date> <num>
# 1: 1 2000-01-01 2000-01-14 0.8640000
# 2: 1 2000-01-15 2000-01-31 0.8640000
# 3: 1 2000-02-01 2000-02-02 0.8640000
# 4: 1 2000-02-03 2002-01-14 0.8640000 # will be empty post-join
# 5: 1 2002-01-15 2003-02-28 0.8640000
# 6: 1 2003-03-01 2006-03-19 NA
# 7: 2 2014-03-12 2014-04-02 0.8640001
# 8: 2 2014-04-03 2016-10-13 0.8640001 # will be empty post-join
# 9: 2 2016-10-14 2017-02-09 0.8640001
# 10: 2 2017-02-10 2017-02-19 0.8640001
# 11: 2 2017-02-20 2017-05-19 NA
btw, the first operation on sports[..] in the previous bullet is {-blockized for a slight boost in efficiency, choosing to not sort(unique(c(start_date, end_date))) twice;
left join sports onto this, on id and the date-ranges; this will produce NA values in the sport column, which indicates the date ranges that were programmatically made (with a simple sequence of dates) but no sports are assigned; these not-needed rows are removed by the !is.na(sport);
assigning val := 1L is purely so that we have a value column during reshaping;
dcast reshapes and fills the missing values with 0.
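The first step of the walk-through can also be seen in isolation. This is only a base-R sketch on the first two ranges of id 1 (the names starts, ends and candidates are made up for the illustration): every distinct start or end date becomes a breakpoint, and consecutive breakpoints form the candidate ranges.

```r
# id 1: one sportA range and one sportB range from the example data
starts <- as.Date(c("2000-01-01", "2000-01-15"))
ends   <- as.Date(c("2000-02-03", "2000-02-01"))

# all distinct boundary dates, in order
vec <- sort(unique(c(starts, ends)))

# pair each breakpoint with the next one to get candidate ranges
candidates <- data.frame(sd = vec[-length(vec)], ed = vec[-1])
candidates
```

Each ed here still equals the next sd; the small offset applied in the second step of the pipeline is what removes that shared boundary.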
Related
I have a set of patient ids and date column. I want to update date1 column with -1 day from the date column. for example :
ID Date Date1
1 23-10-2017 23-09-2018
1 24-09-2018 28-08-2019
1 29-08-2019 -
2 30-05-2016 11-06-2017
2 12-06-2017 12-07-2018
2 13-07-2018 -
I don't know if I understand what you want. But if you just want each date minus one day, this is the code.
x <- data.frame(ID = c(1,1,1,2,2,2), Date = as.Date(c("23-10-2017", "24-09-2018", "29-08-2019", "30-05-2016", "12-06-2017", "13-07-2018"),"%d-%m-%Y"))
x$Date1 <- x$Date-1
Shift by one row by group, then subtract one day:
library(data.table)
dt1 <- fread("
ID Date
1 23-10-2017
1 24-09-2018
1 29-08-2019
2 30-05-2016
2 12-06-2017
2 13-07-2018")
# convert to date (%Y, not %y: the years have four digits)
dt1[, Date := as.Date(Date, "%d-%m-%Y")]
# shift per group, then minus 1 day
dt1[, Date1 := shift(Date, type = "lead") - 1, by = ID]
dt1
# ID Date Date1
# 1: 1 2017-10-23 2018-09-23
# 2: 1 2018-09-24 2019-08-28
# 3: 1 2019-08-29 <NA>
# 4: 2 2016-05-30 2017-06-11
# 5: 2 2017-06-12 2018-07-12
# 6: 2 2018-07-13 <NA>
Try using lead:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Date1 = lead(Date)-1)
# A tibble: 6 x 3
# Groups: ID [2]
ID Date Date1
<int> <date> <date>
1 1 2017-10-23 2018-09-23
2 1 2018-09-24 2019-08-28
3 1 2019-08-29 NA
4 2 2016-05-30 2017-06-11
5 2 2017-06-12 2018-07-12
6 2 2018-07-13 NA
My sample data.frame (date format d/m/y), recording the dates a customer was active:
customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20
I would like to make a column "n_consecutive_days" like so:
customer date n_consecutive_days
1 10/1/20 2
1 9/1/20 1
1 6/1/20 N/A
2 10/1/20 1
2 8/1/20 3
2 7/1/20 2
2 6/1/20 N/A
The new column counts the number of previous consecutive dates per customer. I would like the customer's first date to be N/A as it makes no sense to talk about previous consecutive days if it is the first one.
Any help would be appreciated. I can calculate the difference between dates, but not the number of consecutive days as desired.
One way would be:
library(dplyr)
df %>%
group_by(customer, idx = cumsum(as.integer(c(0, diff(as.Date(date, '%d/%m/%y')))) != -1)) %>%
mutate(n_consecutive_days = rev(sequence(n()))) %>% ungroup() %>%
group_by(customer) %>%
mutate(n_consecutive_days = replace(n_consecutive_days, row_number() == n(), NA), idx = NULL)
Output:
# A tibble: 7 x 3
# Groups: customer [2]
customer date n_consecutive_days
<int> <fct> <int>
1 1 10/1/20 2
2 1 9/1/20 1
3 1 6/1/20 NA
4 2 10/1/20 1
5 2 8/1/20 3
6 2 7/1/20 2
7 2 6/1/20 NA
An option using data.table:
#ensure that data is sorted by customer and reverse chronological
setorder(DT, customer, -date)
#group by customer and consecutive dates and then create the sequence
DT[, ncd := .N:1L, .(customer, cumsum(c(0L, diff(date)!=-1L)))]
#set the first date in each customer to NA
DT[DT[, .I[.N], customer]$V1, ncd := NA]
output:
customer date ncd
1: 1 2020-01-10 2
2: 1 2020-01-09 1
3: 1 2020-01-06 NA
4: 2 2020-01-10 1
5: 2 2020-01-08 3
6: 2 2020-01-07 2
7: 2 2020-01-06 NA
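Both answers rely on the same grouping trick, which may be easier to follow on a bare vector. This is a base-R sketch using customer 2's dates only (step, run_id and ncd are illustrative names): a new run starts wherever the step to the previous row is not exactly -1 day.

```r
# customer 2, newest date first
dates <- as.Date(c("2020-01-10", "2020-01-08", "2020-01-07", "2020-01-06"))

# day difference to the previous row; 0 is a placeholder for the first row
step <- c(0, diff(as.integer(dates)))

# increment the run counter whenever the step is not exactly one day back
run_id <- cumsum(step != -1)

# count down within each run, as rev(sequence(n())) and .N:1L do above
ncd <- ave(seq_along(dates), run_id, FUN = function(i) rev(seq_along(i)))
ncd
```

The last row of each customer still has to be set to NA afterwards, as both answers do.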
data:
library(data.table)
DT <- fread("customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20")
DT[, date := as.IDate(date, format="%d/%m/%y")]
***NOTE: please do not link a similar post. I have found several other similar postings, but their responses have not resolved the errors I get: "Error in seq.int(0, to0 - from, by) : 'to' must be a finite number" or "from must be of length 1".... I'm looking to understand why these error statements occur and how to prevent them from occurring... Thanks!
I have a data frame like the following
id startdate enddate
1 01/01/2011 01/05/2011
1 02/03/2012 02/05/2012
2 03/04/2013 03/06/2013
3 04/06/2014 04/09/2014
I want to transform the data frame so as to create the following:
id date
1 01/01/2011
1 01/02/2011
1 01/03/2011
1 01/04/2011
1 01/05/2011
1 02/03/2012
1 02/04/2012
1 02/05/2012
2 03/04/2013
2 03/05/2013
2 03/06/2013
.... and so on to fill in the sequence of dates between startdate and enddate
I have tried the following....
one<-as.data.table(one)
one[, startdate:=as.character(startdate)]
one[, enddate:=as.character(enddate)]
one[, startdate:=as.Date(startdate, format="%m/%d/%Y")]
one[, enddate:=as.Date(enddate, format="%m/%d/%Y")]
one<-as.data.frame(one)
one%>%
rowwise() %>%
do(data.frame(id=.$id, date=seq(.$startdate,.$enddate,by="day")))
When I run this, I get the following error: Error in seq.int(0, to0 - from, by) : 'to' must be a finite number
Why is this? And how can I fix this piece of code?
With data.table, we can use Map. Convert the 'startdate' and 'enddate' columns to Date class, use Map to get the sequence between corresponding elements, replicate 'id' by the lengths of the resulting list of dates, and concatenate the list of dates to create the two-column output.
library(data.table)
one[, {lst1 <- Map(seq, as.IDate(startdate, "%m/%d/%Y"),
as.IDate(enddate, "%m/%d/%Y"),
MoreArgs = list(by = "day"))
.(id = rep(id, lengths(lst1)), date = do.call(c, lst1))}]
# id date
# 1: 1 2011-01-01
# 2: 1 2011-01-02
# 3: 1 2011-01-03
# 4: 1 2011-01-04
# 5: 1 2011-01-05
# 6: 1 2012-02-03
# 7: 1 2012-02-04
# 8: 1 2012-02-05
# 9: 2 2013-03-04
#10: 2 2013-03-05
#11: 2 2013-03-06
#12: 3 2014-04-06
#13: 3 2014-04-07
#14: 3 2014-04-08
#15: 3 2014-04-09
If there are multiple formats in 'date' columns, one option is anydate from anytime to automatically convert some of the formats to Date class
library(anytime)
one[, {lst1 <- Map(seq, anydate(startdate),
anydate(enddate),
MoreArgs = list(by = "day"))
.(id = rep(id, lengths(lst1)), date = do.call(c, lst1))}]
Or using tidyverse
library(dplyr)
library(purrr)
library(tidyr)
library(lubridate)
one %>%
transmute(id, date = map2(mdy(startdate), mdy(enddate), seq, by = 'day')) %>%
unnest(c(date))
# A tibble: 15 x 2
# id date
# <int> <date>
# 1 1 2011-01-01
# 2 1 2011-01-02
# 3 1 2011-01-03
# 4 1 2011-01-04
# 5 1 2011-01-05
# 6 1 2012-02-03
# 7 1 2012-02-04
# 8 1 2012-02-05
# 9 2 2013-03-04
#10 2 2013-03-05
#11 2 2013-03-06
#12 3 2014-04-06
#13 3 2014-04-07
#14 3 2014-04-08
#15 3 2014-04-09
data
one <- structure(list(id = c(1L, 1L, 2L, 3L), startdate = c("01/01/2011",
"02/03/2012", "03/04/2013", "04/06/2014"), enddate = c("01/05/2011",
"02/05/2012", "03/06/2013", "04/09/2014")), class = "data.frame", row.names = c(NA,
-4L))
setDT(one)
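As for why the two errors in the question occur: the asker's real data is not shown, so the following is only a guess at the cause. seq() on dates needs a single, non-missing from and to. A date string that does not match the format passed to as.Date() becomes NA, and passing a whole column as from gives a vector of length greater than 1. The sketch below triggers both situations on made-up values (ok, bad, err_na and err_len are hypothetical names) and captures the errors instead of stopping:

```r
ok  <- as.Date("01/05/2011", format = "%m/%d/%Y")
bad <- as.Date("2011-01-05", format = "%m/%d/%Y")  # format mismatch, so this is NA

# an NA endpoint: seq() cannot build a sequence up to a missing date
err_na <- tryCatch(seq(ok, bad, by = "day"),
                   error = function(e) conditionMessage(e))

# a vector as 'from': seq() is not vectorised over its endpoints
err_len <- tryCatch(seq(as.Date(c("2011-01-01", "2012-02-03")), ok, by = "day"),
                    error = function(e) conditionMessage(e))

err_na
err_len
```

If that is what is happening, checking anyNA() on the converted columns before building the sequences, and using Map or map2 as in the answers above to pair the endpoints row by row, should avoid both messages.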
I am searching for an efficient and fast approach to fill missing data in a table with missing dates.
library(data.table)
dt <- as.data.table(read.csv(textConnection('"date","gr1","gr2","x"
"2017-01-01","A","a",1
"2017-02-01","A","b",2
"2017-02-01","B","a",4
"2017-04-01","B","a",5
"2017-05-01","A","b",3')))
dt[,date := as.Date(date)]
Suppose that this table has all the information for x by date and groups gr1 and gr2. I want to fill the missing dates and expand this table by repeating the last known values of x by gr1 and gr2. My approach is as follows:
# define the period to expand
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
dates <- setDT(list(ddate = seq.Date(date_min, date_max,by = 'month')))
# cast the data
dt.c <- dcast(dt, date~gr1+gr2, value.var = "x")
# fill missing dates
dt.c <- dt.c[dates, roll=Inf]
# melt the data to return to original table format
dt.m <- melt(dt.c, id.vars = "date", value.name = "x")
# split column - the slowest part of my code
dt.m[,c("gr1","gr2") := tstrsplit(variable,'_')][,variable:=NULL]
# remove unnecessary NAs
dt.m <- dt.m[complete.cases(dt.m[,x])][,.(date,gr1,gr2,x)]
setkey(dt.m)
This is the output that I expect to see:
> dt.m
date gr1 gr2 x
1: 2017-01-01 A a 1
2: 2017-02-01 A b 2
3: 2017-02-01 B a 4
4: 2017-03-01 A b 2
5: 2017-03-01 B a 4
6: 2017-04-01 B a 5
7: 2017-05-01 A b 3
8: 2017-06-01 A b 3
Now the problem is that tstrsplit is very slow on large data sets with a lot of groups.
This approach is very close to what I need but if I follow it I could not get the desired output as it fills not only the missing dates but the NAs as well. This is my modification of the example:
# the desired dates by group
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
indx <- dt[,.(date=seq(date_min,date_max,"months")),.(gr1,gr2)]
# key the tables and join them using a rolling join
setkey(dt,gr1,gr2,date)
setkey(indx,gr1,gr2,date)
dt0 <- dt[indx,roll=TRUE][,.(date,gr1,gr2,x)]
setkey(dt0,date)
And this is not the output that I expect to see:
> dt0
date gr1 gr2 x
1: 2017-01-01 A a 1
2: 2017-01-01 A b NA
3: 2017-01-01 B a NA
4: 2017-02-01 A a 1
5: 2017-02-01 A b 2
6: 2017-02-01 B a 4
7: 2017-03-01 A a 1
8: 2017-03-01 A b 2
9: 2017-03-01 B a 4
10: 2017-04-01 A a 1
11: 2017-04-01 A b 2
12: 2017-04-01 B a 5
13: 2017-05-01 A a 1
14: 2017-05-01 A b 3
15: 2017-05-01 B a 5
16: 2017-06-01 A a 1
17: 2017-06-01 A b 3
18: 2017-06-01 B a 5
What is the best (fastest) way to reproduce my output above (dt.m)?
One rolling join, one 'normal' join and some column switching, aaaand you're done :)
temp <- dates[, near.date := dt[dates, x.date, on = .(date=ddate), roll = TRUE, mult = "first"]][]
dt[temp, on = .(date = near.date)][, date := ddate][,ddate := NULL][]
# date gr1 gr2 x
# 1: 2017-01-01 A a 1
# 2: 2017-02-01 A b 2
# 3: 2017-02-01 B a 4
# 4: 2017-03-01 A b 2
# 5: 2017-03-01 B a 4
# 6: 2017-04-01 B a 5
# 7: 2017-05-01 A b 3
# 8: 2017-06-01 A b 3
You can (of course) make it a one-liner by integrating the first row into the last.
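In case roll = TRUE is unfamiliar, here is a stripped-down sketch of what a rolling join does, on made-up tables (obs, grid and filled are not from the question): each date in grid picks up the most recent earlier observation when there is no exact match.

```r
library(data.table)

# two observations, three months apart
obs  <- data.table(date = as.Date(c("2017-01-01", "2017-04-01")), x = c(1, 5))

# the full monthly grid we want values for
grid <- data.table(date = seq(as.Date("2017-01-01"), as.Date("2017-05-01"), by = "month"))

# for each grid date, take the exact match or roll the last earlier value forward
filled <- obs[grid, on = .(date), roll = TRUE]
filled
```

February and March inherit the January value, and May inherits the April one.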
I'd use IDate and an integer counter for the sequence of dates:
dt[, date := as.IDate(date)]
dates = seq(as.IDate("2017-01-01"), as.IDate("2017-06-01"), by="month")
dDT = data.table(date = dates)[, dseq := .I][]
dt[dDT, on=.(date), dseq := i.dseq]
Then enumerate all desired combos (gr1, gr2, dseq) and do a couple update joins:
cDT = CJ(dseq = dDT$dseq, gr1 = unique(dt$gr1), gr2 = unique(dt$gr2))
cDT[, x := dt[cDT, on=.(gr1, gr2, dseq), x.x]]
cDT[is.na(x), x := dt[copy(.SD), on=.(gr1, gr2, dseq), roll=1L, x.x]]
res = cDT[!is.na(x)]
res[dDT, on=.(dseq), date := i.date]
dseq gr1 gr2 x date
1: 1 A a 1 2017-01-01
2: 2 A a 1 2017-02-01
3: 2 A b 2 2017-02-01
4: 2 B a 4 2017-02-01
5: 3 A b 2 2017-03-01
6: 3 B a 4 2017-03-01
7: 4 B a 5 2017-04-01
8: 5 A b 3 2017-05-01
9: 5 B a 5 2017-05-01
10: 6 A b 3 2017-06-01
There are two extra rows here compared with what the OP expected
res[!dt.m, on=.(date, gr1, gr2)]
dseq gr1 gr2 x date
1: 2 A a 1 2017-02-01
2: 5 B a 5 2017-05-01
since I am treating each missing gr1 x gr2 value independently, rather than filling it iff the date is not in dt at all (as in the OP). To apply that rule...
drop_rows = res[!dt, on=.(gr1,gr2,date)][date %in% dt$date, .(gr1,gr2,date)]
res[!drop_rows, on=names(drop_rows)]
(The copy(.SD) is needed because of a likely bug.)
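For reference, CJ (cross join) is what enumerates the combos above. A tiny sketch with made-up values (combos is an illustrative name):

```r
library(data.table)

# every combination of the supplied vectors, one row each, sorted
combos <- CJ(dseq = 1:3, gr1 = c("A", "B"), gr2 = c("a", "b"))
nrow(combos)   # 3 * 2 * 2 combinations
```

With 20 or so groups and many dates this table grows multiplicatively, so it is worth checking its size before joining onto it.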
dt has no row for some dates within each combination of gr*, where it should have an NA. Hence, we use CJ and a join to add those missing dates with NA for x.
After that, expand the dataset to cover all of the required dates in ddate.
Finally, filter out rows where x is NA and order by date so that the output has the same characteristics as the original dt.
dt[, g := .GRP, .(gr1, gr2)][
CJ(date=date, g=g, unique=TRUE), on=.(date, g)][,
.SD[.(date=ddate), on=.(date), roll=Inf], .(g)][
!is.na(x)][order(date)]
output:
g date gr1 gr2 x
1: 1 2017-01-01 A a 1
2: 2 2017-02-01 A b 2
3: 3 2017-02-01 B a 4
4: 2 2017-03-01 A b 2
5: 3 2017-03-01 B a 4
6: 3 2017-04-01 B a 5
7: 2 2017-05-01 A b 3
8: 2 2017-06-01 A b 3
data:
library(data.table)
dt <- fread('date,gr1,gr2,x
2017-01-01,A,a,1
2017-02-01,A,b,2
2017-02-01,B,a,4
2017-04-01,B,a,5
2017-05-01,A,b,3')
dt[,date := as.Date(date)]
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
ddate = seq.Date(date_min, date_max,by = 'month')
Please try on your actual dataset.
This is a bit similar to another question, although not precisely a duplicate. The approach is similar, but with data.tables and with multiple columns. See also: Fill in missing date and fill with the data above
Here, it's unclear whether you're seeking to fill in columns gr2 and x, or what role gr2 plays. I'm assuming you want to fill in gaps with dates in 1-month increments. Also, the input data's latest month is 5 (May) while the example desired output runs until 6 (June), so it's unclear how June is reached if the goal is to fill in between input dates; if there is an external maximum, it can be used instead of the max of the input dates.
library(data.table)
library(tidyr)
dt <- as.data.table(read.csv(textConnection('"date","gr1","gr2","x"
"2017-01-01","A","a",1
"2017-02-01","A","b",2
"2017-02-01","B","a",4
"2017-04-01","B","a",5
"2017-05-01","A","b",3')))
dt[,date := as.Date(date)]
setkeyv(dt,"date")
all_date_groups <- dt[,list(date=seq.Date(from=min(.SD$date),to=max(.SD$date),by="1 month")),by="gr1"]
setkeyv(all_date_groups,"date")
all_dates_dt <- dt[all_date_groups,on=c("date","gr1")]
setorderv(all_dates_dt,c("gr1","date"))
all_dates_dt <- fill(all_dates_dt,c("gr2","x"))
setorderv(all_dates_dt,c("date","gr1"))
all_dates_dt
Results:
> all_dates_dt
date gr1 gr2 x
1: 2017-01-01 A a 1
2: 2017-02-01 A b 2
3: 2017-02-01 B a 4
4: 2017-03-01 A b 2
5: 2017-03-01 B a 4
6: 2017-04-01 A b 2
7: 2017-04-01 B a 5
8: 2017-05-01 A b 3
I'm stuck trying to find a relatively simple way to count occurrences within a date range by group using R. I get the idea there has to be an easier way than what I'm trying.
I have over 6,000 groups, each group has anywhere from 1 to 100 IDs within, each with a start date and an end date anywhere from Jan 1, 1990 to today. I want to make a dataframe, one group per column, and one day per row, counting the number of IDs active per day from April 1, 2013 until March 31, 2018. For obvious reasons, using countifs in excel will not cut it.
I was trying to use this question as a starting point, as such:
df1 <- data.frame(group = c(1,1,2,3,3),
id = c(1,2,1,1,2),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01","2016-04-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05","2999-01-01"))
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 7))
report <- cbind(report,matrix(data=NA,nrow=7,ncol=3))
names(report) <- c('date',as.vector(unique(df1$group)))
daily <- function(i,...){
report[,i+1] <- sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$group == unique(df1$group)[i]))
}
for (i in unique(df1$group))
daily(i)
However, this doesn't seem to do anything (nor does it throw errors). Is there an easier way to do this? Am I way off base? Any help is appreciated for this non-programmer!
Additional help requested: I'm trying to modify Jaap's code in the answer below to include group start and group end times, so that the data table displays an NA when the group is not active.
Example data:
df2 <- data.frame(group = c(1,1,2,3,3),
groupopendate = c("2016-04-02","2016-04-02","2016-04-01","2016-04-02","2016-04-02"),
groupclosedate = c("2016-04-08","2016-04-08","2016-04-10","2016-04-09","2016-04-09"),
id = c(1,2,1,1,2),
startdate = c("2016-04-02","2016-04-04","2016-04-03","2016-04-02","2016-04-05"),
enddate = c("2016-04-04","2016-04-06","2016-04-10","2016-04-08","2016-04-08"))
Jaap's solution gives me this:
active grp1 grp2 grp3
1: 2016-04-02 1 0 1
2: 2016-04-03 1 1 1
3: 2016-04-04 1 1 1
4: 2016-04-05 1 1 2
5: 2016-04-06 0 1 2
6: 2016-04-07 0 1 2
However, what I want is such:
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 1 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 0
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA NA NA
Any help is appreciated!
A possible alternative solution using data.table:
# load the package & convert 'df1' to a data.table
library(data.table)
setDT(df1)
# convert the date columns to a date format
# not needed if they already are dates
df1[, `:=` (startdate = as.Date(startdate), enddate = as.Date(enddate))]
# create a new data.table with the 'active' days
DT <- data.table(active = seq(from = as.Date("2016-04-01"), by = "day", length.out = 7))
# use a join and dcast to get the desired result
DT[df1
, on = .(active > startdate, active < enddate)
, allow = TRUE
, nomatch = 0
, .(active = x.active, group, id)
][, dcast(.SD, active ~ paste0("grp",group), value.var = "id", fun = length)]
which gives:
active grp1 grp2 grp3
1: 2016-04-01 1 1 0
2: 2016-04-02 1 1 1
3: 2016-04-03 1 1 1
4: 2016-04-04 0 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 1
7: 2016-04-07 1 1 1
NOTE: I've used paste0("grp",group) instead of just group in the dcast step as it leads to better column names (it is better not to use plain numeric values as column names).
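One thing to be aware of: the join above uses strict inequalities, as in the question's own code, so an id is not counted on its start date or its end date. If those boundary days should count (an assumption on my part, not something the question states), the operators can be switched to <= and >=. A sketch on made-up data (spans, days and counts are illustrative names), counting per day with by = .EACHI:

```r
library(data.table)

# two ids with overlapping active periods
spans <- data.table(id = 1:2,
                    startdate = as.Date(c("2016-04-01", "2016-04-03")),
                    enddate   = as.Date(c("2016-04-04", "2016-04-05")))

# the reporting days
days <- data.table(active = seq(as.Date("2016-04-01"), by = "day", length.out = 5))

# for each day, count the spans that contain it, boundaries included
counts <- spans[days, on = .(startdate <= active, enddate >= active),
                .(n = .N), by = .EACHI]
counts$n
```

Here 2016-04-04 counts both ids, whereas with < and > it would count only id 2.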
With regard to your additional example, you could solve that as follows:
setDT(df2)
df2[, c(2:3,5:6) := lapply(.SD, as.Date), .SDcols = c(2:3,5:6)]
DT <- data.table(active = seq(from = min(df2$groupopendate),
to = max(df2$groupclosedate),
by = "day"))
df2new <- df2[, .(active = seq.Date(startdate, enddate, by = "day"))
, by = .(group, id)
][, .N, by = .(group, active)
][df2[, .(active = seq.Date(groupopendate[1], groupclosedate[.N] - 1, by = "day"))
, by = .(group)]
, on = .(group, active)
][is.na(N), N := 0
][, dcast(.SD, active ~ paste0("grp",group))]
nms <- setdiff(names(df2new), "active")
DT[df2new
, on = .(active)
, (nms) := mget(paste0("i.",nms))][]
which gives:
> DT
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 2 1 1
5: 2016-04-05 1 1 2
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 2
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA 1 NA
I've figured it out! As usual, as soon as you post a question, you figure out the answer. I was overcomplicating it by putting in the function, when I could just put the sapply in the for loop.
If anyone is interested:
for (i in unique(df1$group))
{report[,i+1] <-
sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$group == unique(df1$group)[i]))}