R: converting start/end dates into data series

I have the following data frame representing user subscriptions:
User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25
...
How can I transform it into a time series that gives the count of active subscriptions per month (a subscription counts as active in a month if it is active for at least one day of that month)? Something like this (based on the example above, assuming only 2 records):
Month Count
2015-08 0
2015-09 1
2015-10 2
2015-11 1
2015-12 1
2016-01 0
Note: I took some arbitrary start and end dates for the time series, to make the example clear.

Prepare the data and make sure that the date columns are actually stored as dates:
data <- read.table(text = "User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25", header = TRUE)
data$StartDate <- as.Date(data$StartDate)
data$EndDate <- as.Date(data$EndDate)
This function returns a vector with all months that fall within a subscription:
library(lubridate)
subscr_month <- function(start, end) {
start <- floor_date(start, "month")
seq <- seq(start, end, by = "1 month")
months <- format(seq, format = "%Y-%m")
return(months)
}
It uses the function floor_date() from the lubridate package. It is necessary to round the start date down to the first of the month, because otherwise the last month might be missing. For example, for user 2, adding two months to the start date gives 2015-12-27, which is after the end date, so no date from December would be included in seq. The last line converts the dates to character strings that contain only the year and month.
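To see the effect of the rounding, the following minimal sketch compares the sequence for user 2 with and without flooring. It uses only base R; the helper first_of_month() is a hypothetical stand-in for floor_date(start, "month"), introduced here just for illustration.

```r
# Hypothetical base-R stand-in for lubridate::floor_date(x, "month")
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

start <- as.Date("2015-10-27")
end   <- as.Date("2015-12-25")

# Without flooring: the third step would be 2015-12-27, which is past `end`
unfloored <- format(seq(start, end, by = "1 month"), "%Y-%m")

# With flooring: every month touched by the subscription is included
floored <- format(seq(first_of_month(start), end, by = "1 month"), "%Y-%m")

print(unfloored)
print(floored)
```

The unfloored sequence stops at November, while the floored one also contains December.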
Now, you can apply this function to each start and end date from your data using mapply(). Afterwards, table() creates a table of counts of all dates in the resulting list:
all_month <- mapply(subscr_month, data$StartDate, data$EndDate, SIMPLIFY = FALSE)
table(unlist(all_month))
## 2015-09 2015-10 2015-11 2015-12
## 1 2 1 1
You can also convert the table to a data frame:
as.data.frame(table(unlist(all_month)))
## Var1 Freq
## 1 2015-09 1
## 2 2015-10 2
## 3 2015-11 1
## 4 2015-12 1
Your example output also includes the counts for months that do not appear in the data set. If you want to have this, you can convert the vector of months to a factor and set the levels to all the months you want to include:
month_list <- format(seq(as.Date("2015-08-01"), as.Date("2016-01-01"), by = "1 month"), format = "%Y-%m")
all_month_factor <- factor(unlist(all_month), levels = month_list)
table(all_month_factor)
## all_month_factor
## 2015-08 2015-09 2015-10 2015-11 2015-12 2016-01
## 0 1 2 1 1 0
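If the result should have the Month and Count columns from the question, the factor-based table can be converted directly. This is a small sketch, not part of the original answer; the vector active_months is typed in by hand here and simply mirrors what unlist(all_month) contains for the two example users.

```r
# All months the final series should cover
month_list <- format(seq(as.Date("2015-08-01"), as.Date("2016-01-01"),
                         by = "1 month"), "%Y-%m")

# Assumed input: one entry per (subscription, active month) pair
active_months <- c("2015-09", "2015-10", "2015-10", "2015-11", "2015-12")

# Months without any subscription keep a count of 0 because they are factor levels
counts <- as.data.frame(
  table(Month = factor(active_months, levels = month_list)),
  responseName = "Count",
  stringsAsFactors = FALSE
)
print(counts)
```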

Read the data frame mentioned:
df = structure(list(StartDate = structure(c(16681, 16735), class = "Date"),
EndDate = structure(c(16735, 16794), class = "Date")), class = "data.frame", .Names = c("StartDate",
"EndDate"), row.names = c(NA, -2L))
You could make good use of do() from the dplyr package together with seq():
library(dplyr)
df %>%
  rowwise() %>% do({
    # step by 15 days so that no month between start and end is skipped
    w <- seq(.$StartDate, .$EndDate, by = "15 days")
    # append the end date so that its month is always included
    m <- format(c(w, .$EndDate), "%Y-%m") %>% unique
    data.frame(Month = m)
  }) %>%
  group_by(Month) %>%
  summarise(Count = n())

Split hms column into 30 minute intervals

I have participant data from an exercise test, which includes participant ID, the condition (either Experimental or Control) and the total time taken to complete the test. A small example of my data:
RawData <- data.frame(
ParticipantID = c (1:6),
Condition = c("Control","Experimental","Experimental","Control","Experimental","Control"),
Time = c("04:34:22","02:48:47","04:22:06","02:57:11","02:07:11","05:34:22"))
I then used the hms package to store the time as an hms object via:
RawData <- RawData %>%
mutate(TotalTime = hms::as_hms(Time))
Now I wish to create a new column, that bins each RawData$TotalTime result into a category including: Sub2, Sub230, Sub3, Sub330, Sub4, Sub430, Sub5, Sub530 and Sub6. I could probably do this via a long case_when statement but is there an easy way to do this in lubridate given I am after 30 minute intervals?
My desired output would be:
RawData <- data.frame(
ParticipantID = c (1:6),
Condition = c("Control","Experimental","Experimental","Control","Experimental","Control"),
Time = c("04:34:22","02:48:47","04:22:06","02:57:11","02:07:11","05:34:22"),
Category = c("Sub5","Sub3","Sub430","Sub3","Sub230","Sub6"))
Thank you!
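You can use the ceiling_date() function from lubridate with the unit "30 mins". Note that this returns the upper edge of each 30-minute bin as an HHMM string rather than the Sub... labels from the question: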
library(dplyr)
library(lubridate)
RawData %>%
mutate(TotalTime = as.POSIXct(Time, format = '%T'),
Category = format(ceiling_date(TotalTime, '30 mins'), "%H%M")) %>%
select(-TotalTime)
# ParticipantID Condition Time Category
#1 1 Control 04:34:22 0500
#2 2 Experimental 02:48:47 0300
#3 3 Experimental 04:22:06 0430
#4 4 Control 02:57:11 0300
#5 5 Experimental 02:07:11 0230
#6 6 Control 05:34:22 0600
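To get the exact labels from the question (Sub5, Sub430, ...), one option is to round the elapsed seconds up to the next half hour and build the label from that. The following is a base-R sketch under that assumption, not the answerer's method; the variable names are illustrative. A time that falls exactly on a boundary (for example 03:00:00) is assigned to the bin it ends, which may or may not be what you want.

```r
time_chr <- c("04:34:22", "02:48:47", "04:22:06", "02:57:11", "02:07:11", "05:34:22")

# Convert "HH:MM:SS" strings to seconds
parts <- do.call(rbind, lapply(strsplit(time_chr, ":"), as.numeric))
secs  <- parts[, 1] * 3600 + parts[, 2] * 60 + parts[, 3]

# Number of 30-minute steps up to the upper edge of each bin (rounded up)
half_hours <- ceiling(secs / 1800)

# Even step counts are whole hours ("Sub5"); odd ones are half hours ("Sub430")
category <- paste0("Sub", half_hours %/% 2,
                   ifelse(half_hours %% 2 == 1, "30", ""))
print(category)
```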

R convert yy-mm string to date format [duplicate]

I have a data frame (df) like the following:
Date Arrivals
2014-07 100
2014-08 150
2014-09 200
I know that I can convert the yearmon dates to the first date of each month as follows:
df$Date <- as.POSIXct(paste0(as.character(df[,1]),"-01"), format = "%Y-%m-%d")
However, given that my data is not available until the end of the month I want to index it to the end rather than the beginning, and I cannot figure it out. Any help appreciated.
If the Date variable is an actual yearmon class vector, from the zoo package, the as.Date.yearmon method can do what you want via its argument frac.
Using your data, and assuming that the Date was originally a character vector:
library("zoo")
df <- data.frame(Date = c("2014-07", "2014-08", "2014-09"),
Arrivals = c(100, 150, 200))
I convert this to a yearmon vector:
df <- transform(df, Date2 = as.yearmon(Date))
Assuming this is what you have, then you can achieve what you want using as.Date() with frac = 1:
df <- transform(df, Date3 = as.Date(Date2, frac = 1))
which gives:
> df
Date Arrivals Date2 Date3
1 2014-07 100 Jul 2014 2014-07-31
2 2014-08 150 Aug 2014 2014-08-31
3 2014-09 200 Sep 2014 2014-09-30
That shows the individual steps. If you only want the final Date, this is a one-liner:
## assuming `Date` is a `yearmon` object
df <- transform(df, Date = as.Date(Date, frac = 1))
## or if not a `yearmon`
df <- transform(df, Date = as.Date(as.yearmon(Date), frac = 1))
The argument frac is the fraction of the month to assign to the resulting dates when converting from yearmon objects to Date objects. Hence, to get the first day of the month, rather than converting to a character and pasting on "-01" as your question showed, it's better to coerce to a Date object with frac = 0.
If the Date in your df is not a yearmon class object, then you can solve your problem by converting it to one and then using the as.Date() method as described above.
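For comparison, the two dates that frac = 0 and frac = 1 select (first and last day of the month) can also be computed without zoo. This is a minimal base-R sketch, not part of the original answers; the variable names are illustrative.

```r
ym <- c("2014-07", "2014-08", "2014-09")

# First day of each month (what frac = 0 corresponds to)
first_day <- as.Date(paste0(ym, "-01"))

# Step one month ahead from the first day, then go back one day
next_month <- do.call(c, lapply(first_day, function(d) {
  seq(d, by = "month", length.out = 2)[2]
}))

# Last day of each month (what frac = 1 corresponds to)
last_day <- next_month - 1
print(last_day)
```

Stepping from the first of the month avoids the month-length problems that arise when adding a month to a date such as the 31st.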
Here is a way to do it using the zoo package.
R code:
library(zoo)
df
# Date Arrivals
# 1 2014-07 100
# 2 2014-08 150
# 3 2014-09 200
df$Date <- as.Date(as.yearmon(df$Date), frac = 1)
# output
# Date Arrivals
# 1 2014-07-31 100
# 2 2014-08-31 150
# 3 2014-09-30 200
Using lubridate, you can add a month and subtract a day to get the last day of the month:
library(lubridate)
ymd(paste0(df$Date, '-01')) + months(1) - days(1)
# [1] "2014-07-31" "2014-08-31" "2014-09-30"

Count number of occurrences in date range in R

I have a dataframe with a number of accounts, their status and the start and end time for that status. I would like to report on the number of accounts in each of these statuses over a date range. The data looks like the df below, with the resulting report. (Actual data contains more state values. N/A values are shown with a dummy date far in the future.)
df <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6),
number.open = c(2,2,2,1,1,1)
)
I have looked at options involving rowwise() and mutate from dplyr and foverlaps from data.table, but haven't been able to code it up so it works.
(See Checking if Date is Between two Dates in R)
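We can use sapply() to do this for us (df1 and report are defined under "data" below; both comparisons are strict, so a record is not counted on its own start or end date):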
report$NumberOpen <-
sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$state == 'Open'))
# report
# date NumberOpen
# 1 2016-04-01 2
# 2 2016-04-02 2
# 3 2016-04-03 2
# 4 2016-04-04 1
# 5 2016-04-05 1
# 6 2016-04-06 1
data
df1 <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6)
)
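The same count can be written without a per-date loop by comparing every report date against every record at once. This is a sketch of an alternative, not the answerer's code: it keeps the same strict comparisons and converts the dates only once. The names dates, active and number_open are illustrative.

```r
df1 <- data.frame(account = c(1, 1, 2, 3),
                  state = c("Open", "Closed", "Open", "Open"),
                  startdate = c("2016-01-01", "2016-04-04", "2016-03-02", "2016-08-01"),
                  enddate = c("2016-04-04", "2999-01-01", "2016-05-02", "2016-08-05"))

dates <- seq(from = as.Date("2016-04-01"), by = "1 day", length.out = 6)

# Work on day numbers so that outer() compares plain numeric vectors
d     <- as.numeric(dates)
start <- as.numeric(as.Date(df1$startdate))
end   <- as.numeric(as.Date(df1$enddate))

# Rows are report dates, columns are records; TRUE where the date lies strictly inside
active <- outer(d, start, ">") & outer(d, end, "<")

# Count only the records whose state is "Open"
number_open <- rowSums(active[, df1$state == "Open", drop = FALSE])

report <- data.frame(date = dates, NumberOpen = number_open)
print(report)
```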

How to get sum of values every 8 days by date in data frame in R

I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data as below for several years with some values, and for each 8-day period I want to get the sum of the related values. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and aggregates in the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the xts package for such manipulations.
I read your data as a zoo object; note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the end index of each 8-day period:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply my function to the time series between those indices:
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
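The grouping idea can be looked at in isolation: integer division of the zero-based row number by 8 gives a block number that is constant within each run of 8 rows. The sketch below is a variation on the answer that uses the block number itself as the group, with a made-up vector x instead of the temp data frame.

```r
# Made-up data: 20 daily values
x <- 1:20

# 0 for rows 1-8, 1 for rows 9-16, 2 for the remaining rows 17-20
block <- (seq_along(x) - 1) %/% 8

# One sum per block; the last block is simply shorter
sums <- tapply(x, block, sum)
print(sums)
```

Because the group is a block number rather than a looked-up date, an incomplete final block is kept instead of being dropped.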
Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked since :
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
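For reference, here is a self-contained version of the cut.Date approach that rebuilds the nine rows shown in the question, so it can be run as is. The data frame is typed in from the head(temp) listing, with Date as a character column (an assumption, since the question shows two different versions of temp).

```r
temp <- data.frame(
  Date = c("24/03/2001", "25/03/2001", "26/03/2001", "27/03/2001", "28/03/2001",
           "29/03/2001", "30/03/2001", "31/03/2001", "1/04/2001"),
  amount = c(-3.927020472, -6.4427004, -5.477592528, -6.09462162, -7.45666902,
             -6.731540928, -6.855206184, -6.807210228, -5.40278802)
)

# Parse the day/month/year strings, then bin them into 8-day intervals
# starting at the earliest date
dts <- cut(as.Date(temp$Date, format = "%d/%m/%Y"), breaks = "8 days")

# Sum the amounts within each interval
z <- aggregate(temp$amount, list(Dts = dts), FUN = sum)
print(z)
```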
A cleaner approach that extends @G. Grothendieck's approach. Note: it does not check whether the dates are continuous or discontinuous; the sum is calculated over fixed-width blocks of rows.
code
library(zoo) # for rollapply()
interval = 8 # your desired date interval: 2 days, 3 days or whatever
enddate = interval-1 # offset from the first row of a block to its last row
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows = nrow(z) # must be computed after z has been created
data.frame ( Start.date = (z[seq(1, nrows, interval),1]),
End.date = z[seq(1, nrows, interval)+enddate,1],
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))
