Let's say I have several years' worth of data which look like the following:
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
                 wday = wday(date),
                 wday.name = wday(date, label = TRUE, abbr = TRUE),
                 income = round(runif(21, 0, 100)),
                 week = format(date, format = "%Y-%U"),
                 stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Week ending 2011-01-01 = sum(df$income[1:7]) = 487
Week ending 2011-01-08 = sum(df$income[8:14]) = 387
Week ending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, particularly the bit that defines %U as the week number.
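As a quick illustration (a minimal sketch using dates from the question's data), %U numbers weeks with Sunday as the first day, so the days before the year's first Sunday fall into week 00:
# %U is the week of the year (00-53), with Sunday as the first day of the week
format(as.Date(c("2010-12-26", "2011-01-01", "2011-01-02")), "%Y-%U")
# [1] "2010-52" "2011-00" "2011-01"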
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week. (Or, more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default.)
In code:
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
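For example, here is one way to pin the boundaries to Sunday (a sketch assuming df$date is of class Date, i.e. a count of days since the epoch; for POSIXct, divide by seconds as above):
# 1970-01-01 was a Thursday, so trunc(days/7) gives Thursday-based weeks.
# Shifting the day count by +4 before dividing moves each boundary back to
# the preceding Sunday (1969-12-28 was a Sunday):
d <- as.numeric(df$date)   # days since the epoch
df$week <- as.Date("1969-12-28") + 7 * trunc((d + 4) / 7)
weekdays(unique(df$week))  # should all be "Sunday"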
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
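Two caveats: cut.Date() starts its weeks on Monday, while the question asks for Sunday through Saturday, and mutate() keeps one row per day rather than one per week. A sketch addressing both, assuming lubridate is available (floor_date() uses Sunday-based weeks by default):
library(dplyr)
library(lubridate)
df %>%
  group_by(week = floor_date(date, unit = "week")) %>%  # weeks begin on Sunday by default
  summarise(weekly_income = sum(income))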
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep up with both the week number and also the year, so I modified the answers there like so: format(date, format = "%y%U")
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by = df$date),
                               which(df$wday %in% 7)),
           week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
In pandas (Python), the equivalent is to put the dates on the index and resample:
df.index = df['date']        # set the date column as the index
df.resample('W-SAT').sum()   # weekly sums; 'W-SAT' ends each week on Saturday
With dplyr:
df %>%
  arrange(date) %>%
  mutate(week = as.numeric(date - date[1]) %/% 7) %>%
  group_by(week) %>%
  summarise(weekincome = sum(income))
Instead of date[1] you can use any date from which you want to start your weekly study.
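For Sunday-to-Saturday weeks specifically, a fixed Sunday anchor works; a sketch using the first Sunday in the question's data:
library(dplyr)
anchor <- as.Date("2010-12-26")  # a Sunday on or before the first observation
df %>%
  mutate(week = as.numeric(date - anchor) %/% 7) %>%
  group_by(week) %>%
  summarise(weekincome = sum(income))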
Related
I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (for 48 months) represent the time to maturity for each loan as of the respective month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
  # Make a row for each combination of original data and the `months` list
  crossing(months) %>%
  # Format dates as MonYr and make into an ordered factor
  mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
  # Calculate months remaining -- this task is harder than it sounds! This
  # approach isn't perfect, but it's hard to accomplish more simply, since
  # months are different lengths.
  mutate(months_remaining =
           round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
         months_remaining = if_else(months_remaining < 0,
                                    NA_real_, months_remaining)) %>%
  # Drop the Date format of months now that calcs done
  select(-months) %>%
  # Spread into wide format
  spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
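As an aside, lubridate can count whole calendar months directly, which sidesteps the 30.5-day approximation; a minimal sketch (note this counts complete months, so the answer's subtract-one convention would still need to be applied on top):
library(lubridate)
# Whole calendar months from the start of Feb 2013 to A_333's maturity:
interval(as.Date("2013-02-01"), as.Date("2015-02-13")) %/% months(1)
# [1] 24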
I am using a dataset which is grouped by the group_by function of the dplyr package.
Each group has its own time index, which is supposed to consist of 12-month sequences.
This means it can start in January and end in December, or in other cases it can start in June of one year and end in May of the next year.
Here is the dataset example:
ID DATE
8 2017-01-31
8 2017-02-28
8 2017-03-31
8 2017-04-30
8 2017-05-31
8 2017-06-30
8 2017-07-31
8 2017-08-31
8 2017-09-30
8 2017-10-31
8 2017-11-30
8 2017-12-31
32 2017-01-31
32 2017-02-28
32 2017-03-31
32 2017-04-30
32 2017-05-31
32 2017-06-30
32 2017-07-31
32 2017-08-31
32 2017-09-30
32 2017-10-31
32 2017-11-30
32 2017-12-31
45 2016-09-30
45 2016-10-31
45 2016-11-30
45 2016-12-31
45 2017-01-31
45 2017-02-28
45 2017-03-31
45 2017-04-30
45 2017-05-31
45 2017-06-30
45 2017-07-31
45 2017-08-31
The problem is that, because of the dataset's dimensions, I can't visually confirm or validate whether there are so-called "jumps", in other words whether the dates are consistent. Is there any simple way in R to do that, perhaps some modification/combination of functions from the tibbletime package?
Any help will be appreciated.
Thank you in advance.
Here's how I would typically approach this problem using data.table -- the cut.Date() and seq.Date() functions from base R are the meat of the logic, so you could use the same approach with dplyr if desired.
library(data.table)
## Convert to data.table
setDT(df)
## Convert DATE to a date in case it wasn't already
df[,DATE := as.Date(DATE)]
## Order by ID and Date
setkey(df,ID,DATE)
## Create a column with the month of each date
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
## Generate a sequence of Dates by month for the number of observations
## in each group -- .N
df[,ExpectedMonth := seq.Date(from = min(Month),
                              by = "months",
                              length.out = .N), by = .(ID)]
## Create a summary table to test whether an ID had 12 observations where
## the actual month was equal to the expected month
Test <- df[Month == ExpectedMonth, .(Valid = ifelse(.N == 12L,TRUE,FALSE)), by = .(ID)]
print(Test)
# ID Valid
# 1: 8 TRUE
# 2: 32 TRUE
# 3: 45 TRUE
## Do a no-copy join of Test to df based on ID
## and create a column in df based on the 'Valid' column in Test
df[Test, Valid := i.Valid, on = "ID"]
## The final output:
head(df)
# ID DATE Month ExpectedMonth Valid
# 1: 8 2017-01-31 2017-01-01 2017-01-01 TRUE
# 2: 8 2017-02-28 2017-02-01 2017-02-01 TRUE
# 3: 8 2017-03-31 2017-03-01 2017-03-01 TRUE
# 4: 8 2017-04-30 2017-04-01 2017-04-01 TRUE
# 5: 8 2017-05-31 2017-05-01 2017-05-01 TRUE
# 6: 8 2017-06-30 2017-06-01 2017-06-01 TRUE
You could also do things a little more compactly, if you really wanted to, by using a self-join and skipping the creation of Test:
setDT(df)
df[,DATE := as.Date(DATE)]
setkey(df,ID,DATE)
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), keyby = .(ID)]
df[df[Month == ExpectedMonth,.(Valid = ifelse(.N == 12L,TRUE,FALSE)),keyby = .(ID)], Valid := i.Valid]
You can use the summarise function from dplyr to return a logical value of whether there are any day differences greater than 31 within each ID. You do this by first constructing a temporary date using only the year and month and attaching "-01" as the fake day:
library(dplyr)
library(lubridate)
df %>%
  group_by(ID) %>%
  mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE), '-01')),
         DATE_diff = c(0, diff(DATE2))) %>%
  summarise(Valid = !any(DATE_diff > 31))
Result:
# A tibble: 3 x 2
ID Valid
<int> <lgl>
1 8 TRUE
2 32 TRUE
3 45 TRUE
You can also visually check if there are any gaps by plotting your dates for each ID:
library(ggplot2)
df %>%
  mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE), '-01')),
         ID = as.factor(ID)) %>%
  ggplot(aes(x = DATE, y = ID, group = ID)) +
  geom_point(aes(color = ID)) +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%b-%Y") +
  labs(title = "Time Line by ID")
I have a large dataset over many years which has several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should hopefully illustrate my point; my dateTime didn't work out to be hourly data, but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
                as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
                by = 60*24)
WS <- sample(0:20, 1738, rep = TRUE)
WD <- sample(0:390, 1738, rep = TRUE)
Temp <- sample(0:40, 1738, rep = TRUE)
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[WS > 15] <- NA
I have previously tried creating a new column with just a POSIX date (minus time) to allow for day isolation; however, the things I have tried (aggregate, splitting, xts) have mostly returned a shortened data frame with only date and WS. aggregate was the only one that didn't do this; however, it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution; however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
  mutate(date = as.Date(dateTime)) %>%
  left_join(
    df %>%
      mutate(date = as.Date(dateTime)) %>%
      group_by(date) %>%
      summarise(max_ws = max(WS, na.rm = TRUE)) %>%
      ungroup(),
    by = "date"
  ) %>%
  select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
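As an aside, the self-join is not strictly necessary here: mutate() on a grouped data frame broadcasts a per-group summary back onto every row, so the same result can be had more directly (a sketch):
library(dplyr)
df %>%
  group_by(date = as.Date(dateTime)) %>%
  mutate(max_ws = max(WS, na.rm = TRUE)) %>%  # daily max, repeated on each row
  ungroup() %>%
  select(-date)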
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not the hour at which it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate)  # needed for hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
                as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
                by = 60*24)
WS <- sample(0:20, 1738, rep = TRUE)
WD <- sample(0:390, 1738, rep = TRUE)
Temp <- sample(0:40, 1738, rep = TRUE)
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[WS > 15] <- NA
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  mutate(Hour = hour(dateTime),
         Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if there are several hours with the same maximal wind speed (in the example below: 15), only the first hour with max(WS) will be shown as the result, even though the wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
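For instance, one possible sketch that keeps every hour tying for the daily maximum (rows with NA wind speed drop out of the filter, since NA == max(WS) is NA):
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  filter(WS == max(WS, na.rm = TRUE)) %>%
  summarise(Hours_with_max_ws = paste(sort(unique(hour(dateTime))), collapse = ", "))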
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
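If the time of the daily maximum is wanted rather than its value, the same grouping works with which.max(), which skips NAs (a sketch):
library(data.table)
setDT(df)[, .(time_of_max_ws = dateTime[which.max(WS)]),
          by = .(date = as.IDate(dateTime))]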
I have a "date" vector, that contains dates in mm/dd/yyyy format:
head(Entered_Date,5)
[1] 1/5/1998 1/5/1998 1/5/1998 1/5/1998 1/5/1998
I am trying to plot a frequency variable against the date, but I want to group the dates by month or year. As it is now, there is a frequency per day, but I want to plot the frequency by month or year. So instead of having a frequency of 1 for 1/5/1998, 1 for 1/7/1998, and 3 for 1/8/1998, I would like to display it as 5 for 1/1998. It is a relatively large data set, with dates from 1998 to present, and I would like to find some automated way to accomplish this.
> dput(head(Entered_Date))
structure(c(260L, 260L, 260L, 260L, 260L, 260L), .Label = c("1/1/1998",
"1/1/1999", "1/1/2001", "1/1/2002", "1/10/2000", "1/10/2001",
"1/10/2002", "1/10/2003", "1/10/2005", "1/10/2006", "1/10/2007",
"1/10/2008", "1/10/2011", "1/10/2012", "1/10/2013", "1/11/1999",
"1/11/2000", "1/11/2001", "1/11/2002", "1/11/2005", "1/11/2006",
"1/11/2008", "1/11/2010", "1/11/2011", "1/11/2012", "1/11/2013",
"1/12/1998", "1/12/1999", "1/12/2001", "1/12/2004", "1/12/2005", ...
The floor_date() function from the lubridate package does this nicely.
data %>%
  group_by(month = lubridate::floor_date(date, "month")) %>%
  summarize(summary_variable = sum(value))
Thanks to Roman Cheplyaka
https://ro-che.info/articles/2017-02-22-group_by_month_r
See more on how to use the function: https://lubridate.tidyverse.org/reference/round_date.html
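Applied to the question's data it might look like this (a sketch, assuming the mm/dd/yyyy strings are first parsed with lubridate::mdy()):
library(dplyr)
library(lubridate)
df <- data.frame(date = mdy(c("1/5/1998", "1/7/1998", "1/8/1998",
                              "1/8/1998", "1/8/1998")))
df %>%
  count(month = floor_date(date, "month"))
#        month n
# 1 1998-01-01 5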
Here is an example using dplyr. You simply use the corresponding date format string for month %m or year %Y in the format statement.
set.seed(123)
df <- data.frame(date = seq.Date(from = as.Date("01/01/1998", "%d/%m/%Y"),
                                 to = as.Date("01/01/2000", "%d/%m/%Y"), by = "day"),
                 value = sample(seq(5), 731, replace = TRUE))
head(df)
date value
1 1998-01-01 2
2 1998-01-02 4
3 1998-01-03 3
4 1998-01-04 5
5 1998-01-05 5
6 1998-01-06 1
library(dplyr)
df %>%
  mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
  group_by(month, year) %>%
  summarise(total = sum(value))
Source: local data frame [25 x 3]
Groups: month [?]
month year total
(chr) (chr) (int)
1 01 1998 105
2 01 1999 91
3 01 2000 3
4 02 1998 74
5 02 1999 77
6 03 1998 96
7 03 1999 86
8 04 1998 91
9 04 1999 95
10 05 1998 93
.. ... ... ...
Just to add to @cdeterman's answer, you can use lubridate along with dplyr to make this even easier:
df <- data.frame(date = seq.Date(from = as.Date("01/01/1998", "%d/%m/%Y"),
                                 to = as.Date("01/01/2000", "%d/%m/%Y"), by = "day"),
                 value = sample(seq(5), 731, replace = TRUE))
library(dplyr)
library(lubridate)
df %>%
  mutate(month = month(date), year = year(date)) %>%
  group_by(month, year) %>%
  summarise(total = sum(value))
Maybe you can just add a column to your data like this (note the dates are in mm/dd/yyyy format, so the format string is "%m/%d/%Y"):
Year <- format(as.Date(Entered_Date, "%m/%d/%Y"), "%Y")
You don't need dplyr. Look at ?as.POSIXlt:
df$date <- as.POSIXlt(df$date)
mon <- df$date$mon    # month of the year, 0-11
yr <- df$date$year    # years since 1900
monyr <- as.factor(paste(mon, yr, sep = "/"))
df$date <- monyr
You don't need to use ggplot2, but it's nice for this kind of thing.
p <- ggplot(df, aes(factor(date)))
p + geom_bar()
If you want to see the actual numbers:
df2 <- aggregate(. ~ date, data = df, FUN = length)
df2
date value
1 0/98 31
2 0/99 31
3 1/98 28
4 1/99 28
5 10/98 30
6 10/99 30
7 11/97 1
8 11/98 31
9 11/99 31
10 2/98 31
11 2/99 31
12 3/98 30
13 3/99 30
14 4/98 31
15 4/99 31
16 5/98 30
17 5/99 30
18 6/98 31
19 6/99 31
20 7/98 31
21 7/99 31
22 8/98 30
23 8/99 30
24 9/98 31
25 9/99 31
There is a super easy way using the cut() function:
dates <- as.Date(c("1998-5-2", "1993-4-16", "1998-5-10"))
cut(dates, breaks = "month")
and you will get this:
[1] 1998-05-01 1993-04-01 1998-05-01
62 Levels: 1993-04-01 1993-05-01 1993-06-01 1993-07-01 1993-08-01 ... 1998-05-01
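The resulting factor drops straight into table() or aggregate() for per-month counts or sums; for example (droplevels() removes the empty months in between):
table(droplevels(cut(dates, breaks = "month")))
# 1993-04-01 1998-05-01
#          1          2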
Another solution is slider::slide_period:
library(slider)
library(dplyr)
monthly_summary <- function(data) summarise(data, date = format(max(date), "%Y-%m"), value = sum(value))
slide_period_dfr(df, df$date, "month", monthly_summary)
date value
1 1998-01 92
2 1998-02 82
3 1998-03 113
4 1998-04 94
5 1998-05 92
6 1998-06 74
7 1998-07 89
8 1998-08 92
9 1998-09 91
10 1998-10 100
...
You can also use base R's cut() inside group_by(), i.e. group_by(month_yr = cut(date, breaks = "1 month")), without needing lubridate or other packages.
I have a time series with 30 years of daily data (two columns, labelled Date and Value):
Date Value
01-01-1975 0.051
02-01-1975 0.051
03-01-1975 0.051
04-01-1975 0.051
05-01-1975 0.051
06-01-1975 0.051
07-01-1975 0.051
08-01-1975 0.051
09-01-1975 0.051
10-01-1975 0.048
11-01-1975 0.048
12-01-1975 0.048
.........
I am trying to aggregate the data into 5-day totals, so for each year I would get 73 values; if it is a leap year then the last value would be a 6-day total rather than a 5-day one. In other words, I always want to start on January 1 and end on December 31 for each year, but I need to deal with the leap-year case somehow, e.g. by treating each year separately or by finding leap years and treating them differently. But I am having issues.
I did the following:
test <- read.csv("~/H/x.csv")
test$Date <- as.Date(test$Date, format = "%d-%m-%Y")
output <- aggregate(Value ~ cut(Date, "5 days"), test, sum)
But it didn't quite give me the results I wanted, which is 73 values computed for each year.
This is my first go at programming and R, so your guidance would be most welcome
Cut by 5 days, but use ave to do it by year so that bins do not cross year boundaries. This gives Date5. Now aggregate over the cut values:
# test data
DF <- data.frame(Date = seq(as.Date("1975-01-01"), length = 2000, by = "day"),
                 Value = 1:2000)
to.yr <- function(x) as.numeric(format(x, "%Y"))
Date5 <- ave(DF$Date, to.yr(DF$Date), FUN = function(x) cut(x, "5 day"))
ag <- aggregate(Value ~ Date5, DF, sum)
To count the number of bins (full or partial) used per year:
> table(to.yr(ag$Date5))
1975 1976 1977 1978 1979 1980
73 74 73 73 73 35
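If the leap-year remainder should instead be folded into the last bin, so every year yields exactly 73 totals with the last one covering 6 days in a leap year, the extra 74th bin can be capped before aggregating; a sketch building on the DF and to.yr() objects above:
DF$Year <- to.yr(DF$Date)
# (x - min(x)) %/% 5 numbers the bins 0..72 within a year (0..73 in a leap
# year); pmin() folds the lone extra day (Dec 31) into bin 72:
DF$grp <- ave(as.numeric(DF$Date), DF$Year,
              FUN = function(x) pmin((x - min(x)) %/% 5, 72))
ag73 <- aggregate(Value ~ Year + grp, DF, sum)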
Some sample data to play with:
test = data.frame(Date=seq(as.Date("1975-01-01"),as.Date("2005-01-01"),1))
test$value = runif(nrow(test))
head(test)
Date value
1 1975-01-01 0.2929824
2 1975-01-02 0.2222665
3 1975-01-03 0.2659065
4 1975-01-04 0.5511573
Now use the lubridate package's yday function to get the day of the year, from 1 to 366:
> require(lubridate)
> test$yday = yday(test$Date)
Now integer divide year day minus 1 by five to give our grouping (from 0 to 73 in this case):
> test$grp = (test$yday-1) %/% 5
head(test, 10)
         Date      value yday grp
1  1975-01-01 0.29298243    1   0
2  1975-01-02 0.22226646    2   0
3  1975-01-03 0.26590648    3   0
4  1975-01-04 0.55115730    4   0
5  1975-01-05 0.55990854    5   0
6  1975-01-06 0.70054357    6   1
7  1975-01-07 0.27184097    7   1
8  1975-01-08 0.47779337    8   1
9  1975-01-09 0.09127241    9   1
10 1975-01-10 0.65023465   10   1
So the leftover day of each leap year ends up in group 73. Which ones?
test[test$grp == 73,]
            Date     value yday grp
731   1976-12-31 0.6636329  366  73
2192  1980-12-31 0.4586537  366  73
3653  1984-12-31 0.3473794  366  73
5114  1988-12-31 0.9160449  366  73
6575  1992-12-31 0.3215585  366  73
8036  1996-12-31 0.1965876  366  73
9497  2000-12-31 0.6795412  366  73
10958 2004-12-31 0.3622685  366  73
We want to put these in group 72:
test$grp[test$grp==73]=72
Now we can do an analysis based on that group variable, and we should only get 73 values (remember we started at zero). I'll use dplyr because it's cool:
require(dplyr)
test %>% group_by(grp) %>% summarise(mean=mean(value))
Source: local data frame [73 x 2]
grp mean
1 0 0.5052336
2 1 0.5178286
3 2 0.4844037
4 3 0.5368534
5 4 0.4900208
6 5 0.5078784
7 6 0.4754043
....
73 x 2 looks right!