Subset data based on pentad dates with leap year - r

I'm trying to subset the following data by pentad dates. A pentad is a non-overlapping 5-day average. In leap years, pentad 12 includes February 29 (a 6-day average instead of 5):
Link to Data
Link to pentad dates
Here's my code:
library(stringr)
dat <- read.csv("tc_filt_1981-2007.csv",header = T,sep = ",")
dat$Date = paste(dat$Year, str_pad(dat$Month,2,'left','0'), str_pad(dat$Day,2,'left','0'), sep='-')
dat$yday = as.POSIXlt(dat$Date)$yday + 1
dat$pentad = ceiling(dat$yday/5)
df<-split(dat, dat$pentad)
Problem:
The dat$yday line only works for 365-day years. In any given year there should be only 73 pentads, but my code above produces 74 pentads in leap years when I check dat$pentad. df contains the data frames for each pentad.
I did the following for checking:
test<-dat[which(dat$pentad == 74),]
Output:
SN CY Year Month Day Hour Lat Lon Cat Date yday pentad
200034 34 2000 12 31 0 12.7 128.2 TS 2000-12-31 366 74
200034 34 2000 12 31 6 13.3 128.8 TS 2000-12-31 366 74
200034 34 2000 12 31 12 13.9 129.7 TS 2000-12-31 366 74
200034 34 2000 12 31 18 14.4 130.6 TS 2000-12-31 366 74
Question:
How do I account for the leap year in my code?
Can anyone suggest how I can do this?
Many thanks,

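Minor adjustment: in leap years, shift every day after February 28 back by one before dividing by 5, so that February 29 falls into pentad 12: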
library(lubridate)
dat$pentad = ceiling( (dat$yday - leap_year(dat$Year)*(dat$yday > 59)) / 5 )
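The adjustment above can be checked on a full leap year. The following is a minimal base R sketch, not the answerer's code: `is_leap` and `pentad_of` are hypothetical helper names introduced here, and `is_leap` stands in for lubridate's `leap_year()` so that the example needs no packages.

```r
# Hypothetical helper (an assumption, replacing lubridate::leap_year):
# TRUE for Gregorian leap years.
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

# Hypothetical helper: pentad number (1..73) for a vector of dates.
pentad_of <- function(date) {
  lt <- as.POSIXlt(date)
  yday <- lt$yday + 1          # day of year, 1..366
  year <- lt$year + 1900
  # In leap years, days after February 28 (yday > 59) are shifted back by one,
  # so February 29 shares a day number with February 28 and lands in pentad 12.
  ceiling((yday - is_leap(year) * (yday > 59)) / 5)
}

days_2000 <- seq(as.Date("2000-01-01"), as.Date("2000-12-31"), by = "day")
pentads_2000 <- pentad_of(days_2000)

max(pentads_2000)         # highest pentad number in the leap year 2000
sum(pentads_2000 == 12)   # number of days assigned to pentad 12
```

With this shift, December 31 of a leap year gets day number 365 and therefore stays in pentad 73, while pentad 12 is the only one that holds six days.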

Related

Getting Mean of all aggregated values for every quarter hour in dataframe and assigning

I have some sampled data from a sensor with no particular time differences between samples looking like this:
> Y_cl[[1]]
index Date time Glucose POS
10 11 2017-06-10 03:01:00 136 2017-06-10 00:01:00
14 15 2017-06-10 03:06:00 132 2017-06-10 00:06:00
18 19 2017-06-10 03:11:00 133 2017-06-10 00:11:00
22 23 2017-06-10 03:16:00 130 2017-06-10 00:16:00
26 27 2017-06-10 03:20:59 119 2017-06-10 00:20:59
30 31 2017-06-10 03:26:00 115 2017-06-10 00:26:00
34 35 2017-06-10 03:30:59 117 2017-06-10 00:30:59
38 39 2017-06-10 03:36:00 114 2017-06-10 00:36:00
42 43 2017-06-10 03:40:59 113 2017-06-10 00:40:59
The data are stored as data frames in the list Y_cl, one list element per day. I am trying to select ALL samples within each quarter hour of the clock and take their mean, resulting in 4 points for each hour of each day, mathematically defined (NOT CODE) as:
mean(Glucose(H:00 <Y_cl[[1]]$time< H:15))==> Glucose_av(H:00),
mean(Glucose(H:15 <Y_cl[[1]]$time< H:30))==> Glucose_av(H:15),
mean(Glucose(H:30 <Y_cl[[1]]$time< H:45))==> Glucose_av(H:30),
mean(Glucose(H:45 <Y_cl[[1]]$time< (H+1):00))==>Glucose_av(H:45)
I have tried searching but have only found links on how to select or cut at 15-minute differences, whereas I need to group each hour's data by the quarter of the hour it falls in, average it, and assign the result to the corresponding quarter. Y_cl[[1]]['POS'] is in standard POSIXct format. Any help would be appreciated.
Here is a solution using lubridate and plyr packages :
data$POS <- NULL
data$POS = as.POSIXct(paste(data$Date, data$time)) # POS correction
library(lubridate)
library(plyr)
data$day <- day(data$POS) # extract day
data$hour <- hour(data$POS) # extract hour
data$minute <- minute(data$POS) # extract minute
Create a new factor according to the quarter :
data$quarter <- NA
data$quarter[data$minute >= 0 & data$minute < 15] <- "q1" # 1st quarter
data$quarter[data$minute >= 15 & data$minute < 30] <- "q2" # 2nd quarter
data$quarter[data$minute >= 30 & data$minute < 45] <- "q3" # 3rd quarter
data$quarter[data$minute >= 45 & data$minute < 60] <- "q4" # 4th quarter
Summarize data for each quarter (compute mean of Glucose for each combination of day, hour and quarter) :
output <- ddply(data, c("day", "hour", "quarter"), summarise, result = mean(Glucose))
Result :
> output
day hour quarter result
1 10 3 q1 133.6667
2 10 3 q2 121.3333
3 10 3 q3 114.6667
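The same grouping can also be sketched in base R by flooring each time stamp to the start of its quarter hour. This is only an illustration under stated assumptions: the data frame `glucose` re-enters the nine sample rows from the question, the time zone is fixed to UTC, and the names `quarter_start`, `quarter_label` and `quarter_means` are made up for the example.

```r
# Sample rows from the question, re-entered by hand (time zone assumed to be UTC).
glucose <- data.frame(
  POS = as.POSIXct(c("2017-06-10 03:01:00", "2017-06-10 03:06:00",
                     "2017-06-10 03:11:00", "2017-06-10 03:16:00",
                     "2017-06-10 03:20:59", "2017-06-10 03:26:00",
                     "2017-06-10 03:30:59", "2017-06-10 03:36:00",
                     "2017-06-10 03:40:59"), tz = "UTC"),
  Glucose = c(136, 132, 133, 130, 119, 115, 117, 114, 113)
)

# Floor each time stamp to a multiple of 900 seconds (15 minutes).
quarter_start <- as.POSIXct(floor(as.numeric(glucose$POS) / 900) * 900,
                            origin = "1970-01-01", tz = "UTC")

# Use a text label as the grouping key and average Glucose within each quarter.
quarter_label <- format(quarter_start, "%Y-%m-%d %H:%M")
quarter_means <- tapply(glucose$Glucose, quarter_label, mean)
quarter_means
```

Because the label carries the date and hour as well as the quarter, the same code works unchanged on several hours or days of data.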
I did it by flooring the minutes of each time stamp divided by 15, where YPOS is a list holding the time stamps for each day i of the list Y_cl:
SeI <- function(i) {
  # get the minutes, divide by 15, keep the floor, multiply by 15; store in K1
  K1 <- floor(as.numeric(strftime(YPOS[[i]], format = "%M", tz = "GMT")) / 15) * 15
  # separate the date and hour from the minutes for use later; store in K2
  K2 <- strftime(YPOS[[i]], format = "%Y-%m-%d %H", tz = "GMT")
  # paste K2 and K1 together and return the result in POSIXct format
  TT <- paste0(K2, ":", K1)
  as.POSIXct(TT, format = "%Y-%m-%d %H:%M", tz = "GMT")
}
and then applying it over all days in the list:
lapply(1:length(Y_cl), function(i) SeI(i) )
My solution included taking the time stamps from the list Y_cl and saving it in YPOS.

how to group sales data by 4 days from yesterday to start date in r?

Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question is how to group the data from 19/04/2017 back to 11/03/2017 into 4 equal parts and sum the sales for each part in R.
Eg :
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working: it takes the data from the start date to the current date, but I need to gather the data from yesterday (19/4/2017) back to the start date and split it into 4 equal parts.
Kindly guide me, anyone.
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in Q and additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might have a different intention: to group the data in periods of four days' length each.)
Summarize the sales figures by period
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation.
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the current date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: If you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, by mdy("4/20/2017"), which is the last day in the sample data set supplied by the OP.
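For the other reading of the question (periods of four days each, counted backwards from yesterday), a base R sketch could look like the following. It is an illustration only: the data frame `sales` re-enters the question's figures by hand, the current day is fixed to 2017-04-20 so the result is reproducible, and the column name `block` is made up for the example.

```r
# Sales figures from the question, re-entered by hand (2017-03-11 to 2017-04-20).
sales <- data.frame(
  Date = as.Date("2017-03-11") + 0:40,
  Sales = c(1, 0, 40, 47, 83, 62, 13, 58, 27, 17,
            71, 76, 8, 13, 97, 58, 80, 77, 31, 78,
            0, 40, 58, 32, 31, 90, 35, 88, 16, 72,
            39, 8, 88, 93, 57, 23, 15, 6, 91, 87, 44)
)

# Fixed "today" (an assumption) so that the example does not depend on the run date.
current_day <- as.Date("2017-04-20")

# Keep completed days only, then number 4-day blocks backwards from yesterday:
# block 0 = the last four days, block 1 = the four days before that, and so on.
completed <- sales[sales$Date < current_day, ]
yesterday <- max(completed$Date)
completed$block <- as.integer(yesterday - completed$Date) %/% 4

totals <- aggregate(Sales ~ block, data = completed, FUN = sum)
totals
```

Counting backwards guarantees that the most recent block is always complete; any partial block ends up at the start of the series instead.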

convert day-number within the year to month/day format

I am trying to convert a day-number within the year to month/day format.
With this df:
set.seed(123)
df1 <- data.frame(Year = rep(15,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df2 <- data.frame(Year = rep(16,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df <- rbind(df1, df2)
> head(df)
Year DayNum Hour
1 15 78 6
2 15 79 18
3 15 80 9
4 15 81 21
5 15 82 22
6 15 83 1
> tail(df)
Year DayNum Hour
195 16 172 22
196 16 173 11
197 16 174 9
198 16 175 15
199 16 176 3
200 16 177 13
which has 100 records for 2015 and 2016, how can I make a POSIXct date/time column?
While there are a number of related posts with a Julian date from a beginning origin (usually 1970-01-01), I could not find any posts with a day-number within a year and with a variable year (i.e. 2015 and 2016).
The as.POSIXct function has an option to specify the origin when converting a number to a date/time object. The number is interpreted as seconds since the origin, so the day number (which starts at 1 on January 1) has to be converted to seconds first:
#calculate the origin date based on the year column
df$origin<-as.Date(paste0("20", df$Year,"-01-01"))
#convert the day number to a date/time object
as.POSIXct((df$DayNum - 1) * 86400, origin=df$origin)
One may need to consider adding the timezone option for completeness:
as.POSIXct((df$DayNum - 1) * 86400, origin=df$origin, tz="GMT")
You might need something like this, which uses %j to specify the day of the year:
strptime(with(df, paste(Year, DayNum, Hour)), "%y %j %H")
# [1] "2015-03-19 06:00:00 EDT"
# [2] "2015-03-20 18:00:00 EDT"
# [3] "2015-03-21 09:00:00 EDT"
# [4] "2015-03-22 21:00:00 EDT"
# [5] "2015-03-23 22:00:00 EDT"
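Since the question asks for a POSIXct column and strptime() returns POSIXlt, a small sketch of the conversion may help. It assumes a fixed UTC time zone and uses a two-row data frame `obs` invented for the example, with the same day number in a non-leap year (2015) and a leap year (2016).

```r
# Two illustrative rows (an assumption): day 78 in 2015 and in the leap year 2016.
obs <- data.frame(Year = c(15, 16), DayNum = c(78, 78), Hour = c(6, 18))

# %y = two-digit year, %j = day of the year, %H = hour.
stamp <- strptime(paste(obs$Year, obs$DayNum, obs$Hour),
                  format = "%y %j %H", tz = "UTC")

# Convert the POSIXlt result to POSIXct so it can be stored as a data frame column.
obs$datetime <- as.POSIXct(stamp)
format(obs$datetime, "%Y-%m-%d %H:%M")
```

Day 78 falls one calendar day earlier in the leap year, which is why the year has to be part of the parsed string rather than a fixed origin.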

How do I group my date variable into month/year in R?

I have a "date" vector, that contains dates in mm/dd/yyyy format:
head(Entered_Date,5)
[1] 1/5/1998 1/5/1998 1/5/1998 1/5/1998 1/5/1998
I am trying to plot a frequency variable against the date, but I want to group the dates by month or year. As it is now, there is a frequency per day, but I want to plot the frequency by month or year. So instead of having a frequency of 1 for 1/5/1998, 1 for 1/7/1998, and 3 for 1/8/1998, I would like to display it as 5 for 1/1998. It is a relatively large data set, with dates from 1998 to present, and I would like to find some automated way to accomplish this.
> dput(head(Entered_Date))
structure(c(260L, 260L, 260L, 260L, 260L, 260L), .Label = c("1/1/1998",
"1/1/1999", "1/1/2001", "1/1/2002", "1/10/2000", "1/10/2001",
"1/10/2002", "1/10/2003", "1/10/2005", "1/10/2006", "1/10/2007",
"1/10/2008", "1/10/2011", "1/10/2012", "1/10/2013", "1/11/1999",
"1/11/2000", "1/11/2001", "1/11/2002", "1/11/2005", "1/11/2006",
"1/11/2008", "1/11/2010", "1/11/2011", "1/11/2012", "1/11/2013",
"1/12/1998", "1/12/1999", "1/12/2001", "1/12/2004", "1/12/2005", ...
The floor_date() function from the lubridate package does this nicely (dplyr supplies %>%, group_by() and summarize()):
library(dplyr)
data %>%
group_by(month = lubridate::floor_date(date, "month")) %>%
summarize(summary_variable = sum(value))
Thanks to Roman Cheplyaka
https://ro-che.info/articles/2017-02-22-group_by_month_r
See more on how to use the function: https://lubridate.tidyverse.org/reference/round_date.html
Here is an example using dplyr. You simply use the corresponding date format string for month %m or year %Y in the format statement.
set.seed(123)
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
head(df)
date value
1 1998-01-01 2
2 1998-01-02 4
3 1998-01-03 3
4 1998-01-04 5
5 1998-01-05 5
6 1998-01-06 1
library(dplyr)
df %>%
mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Source: local data frame [25 x 3]
Groups: month [?]
month year total
(chr) (chr) (int)
1 01 1998 105
2 01 1999 91
3 01 2000 3
4 02 1998 74
5 02 1999 77
6 03 1998 96
7 03 1999 86
8 04 1998 91
9 04 1999 95
10 05 1998 93
.. ... ... ...
Just to add to @cdeterman's answer, you can use lubridate along with dplyr to make this even easier:
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
library(dplyr)
library(lubridate)
df %>%
mutate(month = month(date), year = year(date)) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Maybe you just add a column to your data like this (the dates are in mm/dd/yyyy format, hence "%m/%d/%Y"):
Year <- format(as.Date(Entered_Date, "%m/%d/%Y"), "%Y")
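You don't need dplyr. Look at ?as.POSIXlt. Note that $mon is zero-based (0 = January) and $year counts years since 1900, which is why labels such as 0/98 appear in the output below.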
df$date<-as.POSIXlt(df$date)
mon<-df$date$mon
yr<-df$date$year
monyr<-as.factor(paste(mon,yr,sep="/"))
df$date<-monyr
You don't need to use ggplot2, but it's nice for this kind of thing.
library(ggplot2)
c <- ggplot(df, aes(factor(date)))
c + geom_bar()
If you want to see the actual numbers
aggregate(. ~ date,data = df,FUN=length )
df2<-aggregate(. ~ date,data = df,FUN=length )
df2
date value
1 0/98 31
2 0/99 31
3 1/98 28
4 1/99 28
5 10/98 30
6 10/99 30
7 11/97 1
8 11/98 31
9 11/99 31
10 2/98 31
11 2/99 31
12 3/98 30
13 3/99 30
14 4/98 31
15 4/99 31
16 5/98 30
17 5/99 30
18 6/98 31
19 6/99 31
20 7/98 31
21 7/99 31
22 8/98 30
23 8/99 30
24 9/98 31
25 9/99 31
There is a super easy way using the cut() function:
list = as.Date(c("1998-5-2", "1993-4-16", "1998-5-10"))
cut(list, breaks = "month")
and you will get this:
[1] 1998-05-01 1993-04-01 1998-05-01
62 Levels: 1993-04-01 1993-05-01 1993-06-01 1993-07-01 1993-08-01 ... 1998-05-01
Another solution is slider::slide_period:
library(slider)
library(dplyr)
monthly_summary <- function(data) summarise(data, date = format(max(date), "%Y-%m"), value = sum(value))
slide_period_dfr(df, df$date, "month", monthly_summary)
date value
1 1998-01 92
2 1998-02 82
3 1998-03 113
4 1998-04 94
5 1998-05 92
6 1998-06 74
7 1998-07 89
8 1998-08 92
9 1998-09 91
10 1998-10 100
...
There is also group_by(month_yr = cut(date, breaks = "1 month")), where cut() comes from base R, so lubridate or other date packages are not needed.
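For the counting case described in the question (1 + 1 + 3 entries in January 1998 becoming 5 for 1/1998), a base R sketch without any packages could look like this. The vector `entered` is illustrative: it reproduces the three dates named in the question and adds one made-up February entry so that a second month appears.

```r
# Illustrative input in mm/dd/yyyy format; the February date is an assumption.
entered <- c("1/5/1998", "1/7/1998", "1/8/1998", "1/8/1998", "1/8/1998",
             "2/2/1998")

# Parse the text as dates, then reduce every date to a year-month label.
dates <- as.Date(entered, format = "%m/%d/%Y")
month_label <- format(dates, "%Y-%m")

# Count entries per month; use format(dates, "%Y") instead to count per year.
monthly_counts <- table(month_label)
monthly_counts
```

The resulting table can be passed straight to barplot() or converted with as.data.frame() for plotting with other tools.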

a daily time series - calculating totals in R

I have a time series with 30 years of daily data ( two columns labelled date and value)
Date Value
01-01-1975 0.051
02-01-1975 0.051
03-01-1975 0.051
04-01-1975 0.051
05-01-1975 0.051
06-01-1975 0.051
07-01-1975 0.051
08-01-1975 0.051
09-01-1975 0.051
10-01-1975 0.048
11-01-1975 0.048
12-01-1975 0.048
.........
I am trying to aggregate the data into 5-day totals, so that each year yields 73 values; in a leap year the last value would be a 6-day total rather than a 5-day total. In other words, I always want to start on January 1 and always end on December 31 of each year, but I need to deal with the leap-year case somehow, e.g. by treating each year separately or by finding leap years and treating them differently. But I am having issues.
I did the following,
test <- read.csv("~/H/x.csv")
test$Date <- as.Date(test$Date, format = "%d-%m-%Y")
output <- aggregate(Flow ~ cut(Date, "5 days"), test, sum)
But it didn't quite give me the results I wanted, which is 73 computed values for each year.
This is my first go at programming and R, so your guidance would be most welcome
Cut by 5 days, but use ave to do it by year so that the 5-day periods do not cross year boundaries. This gives Date5. Now aggregate over the cut values:
# test data
DF <- data.frame(Date = seq(as.Date("1975-01-01"), length = 2000, by = "day"),
Value = 1:2000)
to.yr <- function(x) as.numeric(format(x, "%Y"))
Date5 <- ave(DF$Date, to.yr(DF$Date), FUN = function(x) cut(x, "5 day"))
ag <- aggregate(Value ~ Date5, DF, sum)
To count the number of 5-day periods (full or partial) used in each year:
> table(to.yr(ag$Date5))
1975 1976 1977 1978 1979 1980
73 74 73 73 73 35
Some sample data to play with:
test = data.frame(Date=seq(as.Date("1975-01-01"),as.Date("2005-01-01"),1))
test$value = runif(nrow(test))
head(test)
Date value
1 1975-01-01 0.2929824
2 1975-01-02 0.2222665
3 1975-01-03 0.2659065
4 1975-01-04 0.5511573
Now use lubridate package's yday function to set the day of year from 1 to 366:
> require(lubridate)
> test$yday = yday(test$Date)
Now integer divide year day minus 1 by five to give our grouping (from 0 to 73 in this case):
> test$grp = (test$yday-1) %/% 5
head(test,10)
Date value yday grp
1 1975-01-01 0.29298243 1 0
2 1975-01-02 0.22226646 2 0
3 1975-01-03 0.26590648 3 0
4 1975-01-04 0.55115730 4 0
5 1975-01-05 0.55990854 5 0
6 1975-01-06 0.70054357 6 1
7 1975-01-07 0.27184097 7 1
8 1975-01-08 0.47779337 8 1
9 1975-01-09 0.09127241 9 1
10 1975-01-10 0.65023465 10 1
So we have the odd days of the leap years in group 73. Which ones?
test[test$grp==73,]
Date value yday grp
731 1976-12-31 0.6636329 366 73
2192 1980-12-31 0.4586537 366 73
3653 1984-12-31 0.3473794 366 73
5114 1988-12-31 0.9160449 366 73
6575 1992-12-31 0.3215585 366 73
8036 1996-12-31 0.1965876 366 73
9497 2000-12-31 0.6795412 366 73
10958 2004-12-31 0.3622685 366 73
We want to put these in group 72:
test$grp[test$grp==73]=72
Now we can do an analysis based on that group variable, and we should only get 73 values (remember we started at zero). I'll use dplyr because it's cool:
require(dplyr)
test %>% group_by(grp) %>% summarise(mean=mean(value))
Source: local data frame [73 x 2]
grp mean
1 0 0.5052336
2 1 0.5178286
3 2 0.4844037
4 3 0.5368534
5 4 0.4900208
6 5 0.5078784
7 6 0.4754043
....
73 x 2 looks right!
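The grouping above pools all years together. If one total per year and 5-day group is wanted, as the question describes, the year can be added as a second grouping key. The following base R sketch is an illustration under stated assumptions: the data frame `daily` is made up, and every daily value is set to 1 so that each total simply equals the number of days in its group.

```r
# Illustrative data (an assumption): four years of daily values, all equal to 1.
daily <- data.frame(
  Date = seq(as.Date("1975-01-01"), as.Date("1978-12-31"), by = "day")
)
daily$Value <- 1

lt <- as.POSIXlt(daily$Date)
daily$year <- lt$year + 1900

# lt$yday is zero-based (0..365). Day 365 exists only in leap years; pmin()
# folds it into the last group (72), which then covers six days.
daily$grp <- pmin(lt$yday %/% 5, 72)

# One total per year and 5-day group.
totals <- aggregate(Value ~ year + grp, data = daily, FUN = sum)
table(totals$year)
```

Every year then contributes exactly 73 rows, and only the last group of a leap year (1976 in this data) sums over six days.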
