I've this function to generate monthly ranges, it should consider years where february has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But gives error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
seq(date, by = paste (n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
starts <- seq(from = start,
to = Sys.Date()-1 ,
by = "1 month")
ends <- c((seq(from = add_months(start, 1),
to = end,
by = "1 month" ))-1,
(Sys.Date()-1))
data.frame(starts,ends)
}
## useage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start of month, som, and end of month, eom functions which take a Date class object, date string in standard Date format or yearmon object and produce a Date class object giving the start or end of its year/months.
Using those, create a monthly Date series s using the start of each month from the month/year of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case it assumes the user wanted the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
if (inherits(to, "yearmon")) to <- eom(to)
s <- seq(som(from), eom(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
date_ranges3 <- function(from, to) {
s <- seq(som(from), as.Date(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
You don't need to use seq twice -- you can subtract 1 day from the firsts of each month to get the ends, and generate one too many starts, then shift & subset:
make_date_ranges = function(start, end) {
# format(end, "%Y-%m-01") essentially truncates end to
# the first day of end's month; 32 days later is guaranteed to be
# in the subsequent month
starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31
Related
I have a large dataset that includes date periods with different disease states per id and reference date. I would like to add a 'healthy' state for all missing date periods within +/- 5 years from the reference date per id.
I have tried to modify the solution here: Fill in missing date ranges but failed. Preferably, I would like to keep to the data.table framework. Any advice is greatly appreciated!
Sample data:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
Desired output:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2005-03-16 2008-10-10 0
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2008-10-13 2014-11-04 0
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2008-05-10 2012-01-31 0
2 2013-05-10 2012-02-01 2012-03-01 2
2 2013-05-10 2012-03-02 2018-05-10 0
2 2014-06-11 2009-06-11 2012-01-31 0
2 2014-06-11 2012-02-01 2012-03-01 2
2 2014-06-11 2012-03-02 2019-06-11 0
3 2011-08-14 2006-08-14 2016-08-14 0
")
Comment:
For the first row, the +/-5 year date interval is from 2005-01-10 to 2015-01-10. However, because of the ongoing disease state that ends 2005-03-15, the "healthy" period starts at 2005-03-16. Because there can be several reference dates per id, duplicate date periods (as observed for id 2: 2012-02-01-2012-03-01) will be present and are OK. Finally, ids with no disease states are represented by NA (as id 3).
EDIT: I had some problems with the real data, so I tweaked the solution a bit; also added so that the status is collapsed per date interval:
DT2 <- DT[,{
# +/-5 years from t0
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(start[1L])) {
# replace NA with full time interval for 'healthy'
.(period_start=sdt, period_end=edt, status='notsick')
} else{
# Add date for -5 years if it is the minimum, otherwise use existing minimum
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
# Add date for +5 years if it is the maximum, otherwise use existing maximum
if (edt > period_end[.N]) {
period_end <- c(period_end,edt)
}
dates=unique(sort(c(period_start, period_end+1L)))
.(start=dates[-length(dates)],end=dates[-1L]-1,status='')
}
},
.(id,reference_date)]
## (c). Collapse status for overlapping periods
DT <- DT[DT2, on = .(id,reference_date, period_start <= period_start, period_end >= period_end), {
status <- paste(status, collapse = ";")
.(status=status)},
by = .EACHI, allow.cartesian = TRUE]
here is an option:
interweave <- function(x, y) c(rbind(x, y)) #see ref
ans <- DT[, {
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(period_start[1L])) {
.(period_start=sdt, period_end=edt, Status=0L)
} else {
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
ps <- as.IDate(sort(interweave(period_start, period_end+1L)))
if (period_end[.N] > edt) {
ps <- ps[-length(ps)]
pe <- period_end[.N]
} else {
pe <- edt
}
.(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
}
},
.(id, reference_date)]
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans
data:
library(data.table)
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
Reference:
Alternate, interweave or interlace two vectors
I am trying to get a count of active clients per month, using data that has a start and end date to each client's episode. The code I am using I can't work out how to count per month, rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which current counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, start.month, end.month, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
I have a df like that (head):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, for that reason I need the max of every period. The max it´s normally the last day of the month but sometime It could be some days after and before (28,29,30,31,01,02).
In order to calculate it properly I would like to assign to my reference date (the last day of the month) the max value of this group of days to be sure that the ratio reflects what it supossed to.
This could be a reproducible example:
Start<-as.Date("2016-12-31")
End<-Sys.Date()
window<-data.table(seq(Start,End,by='1 day'))
dt<-cbind(window,rep(rnorm(nrow(window))))
colnames(dt)<-c("date","value")
# Create a Dateseq
DateSeq <- function(st, en, freq) {
st <- as.Date(as.yearmon(st))
en <- as.Date(as.yearmon(en))
as.Date(as.yearmon(seq(st, en, by = paste(as.character(12/freq),
"months"))), frac = 1)
}
# df to be fulfilled with the group max.
Value.Max.Month<-data.frame(DateSeq(Start,End,12))
colnames(Value.Max.Month)<-c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start,End,by='1 day'))
dt <- cbind(window,rep(rnorm(nrow(window))))
colnames(dt) <- c("date","value")
dt <- data.table(dt)
dt[,period := as.Date(as.yearmon(date)) %m+% months(1) - 1,][, maximum:=max(value), by=period][, unique(maximum), by=period]
In the first expression we create a new column called period. Then we group by this new column and look for the maximum in value. In the last expression we just output these unique rows.
Notice that to get the last day of each period we add one month using lubridate and then substract 1 day.
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341
I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count number of observations which falls within each interval. I have been using between from data.table, which has been working without any problem when dates are included.
However, now I only have time stamps, without date. This causes some problems for the times which occurs in the interval which spans midnight (20:00 - 05:59). These times are not counted in the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Row 4-8 & 10-11 falls in 'midnight interval', but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, and their helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but, there should only ever be one of those. And as you're interested in the number of observations per 'group' of intervals, you can treat this group as the equivalent of the 'Not' of the others.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval that each time falls within. Noting that those times that reutrn 'NA' are actually those that fall inside the midnight-spanning interval
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15.59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then to get the number of observatins for the groups of intervals, you just .N grouped by each time point, which can just be chained onto the end of the above statement
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
Where the NA group corresponds to the one that spans midnight
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y)
if(x[1] > x[2]) return(sum(y[, 1] %between% c(x[1], 23:59), y[, 1] %between% c(00:00, x[2]))) else return(sum(y[, 1] %between% c(x[1], x[2]))), y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
I have daily data for 1 year having 0 and 1 values. I want to calculate monthly events, there is consecutive 1 value for 3 on more days using R?
set.seed(123)
abts1 <- sample(0:1, 366, replace=TRUE)
library(xts)
d16 <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), 1)
ax16 <- as.Date(d16,"%y-%m-%d")
abts12 <- xts(abts1, ax16)
# but it gives events for complete period, not as monthly.
apply.monthly(abts12, function(x) sum(with(rle(c(x!=0)), lengths*values)>=3))
The last line of your code throws an error for me when I use xts_0.9-7.
R> apply.monthly(abts12, function(x) sum(with(rle(x!=0), lengths*values)>=3))
Error in rle(x != 0) : 'x' must be a vector of an atomic type
That's easy to fix though. You just need to wrap x != 0 in as.logical.
R> apply.monthly(abts12, function(x) sum(with(rle(as.logical(x!=0)), lengths*values)>=3))
[,1]
2016-01-31 2
2016-02-29 1
2016-03-31 3
2016-04-30 2
2016-05-31 1
2016-06-30 2
2016-07-31 3
2016-08-31 3
2016-09-30 2
2016-10-31 3
2016-11-30 0
2016-12-31 2
That seems like the output you expect. The number of times there are 3 or more consecutive days with a value of 1.