I have data measuring precipitation daily using R. My dates are in format 2008-01-01 and range for 10 years. I am trying to aggregate from 2008-10-01 to 2009-09-31 but I am not sure how. Is there a way in aggregate to set a start date of aggregation and group.
My current code is
data<- aggregate(data$total_snow_cm, by=list(data$year), FUN = 'sum')
but this output gives me a sum total of the snowfall for each year from jan - dec but I want it to include oct / 08 to sept / 09.
Assuming your data are in long format, I'd do something like this:
library(tidyverse)
#make sure R knows your dates are dates - you mention they're 'yyyy-mm-dd', so
yourdataframe <- yourdataframe %>%
mutate(yourcolumnforprecipdate = ymd(yourcolumnforprecipdate)
#in this script or another, define a water year function
water_year <- function(date) {
ifelse(month(date) < 10, year(date), year(date)+1)}
#new wateryear column for your data, using your new function
yourdataframe <- yourdataframe %>%
mutate(wateryear = water_year(yourcolumnforprecipdate)
#now group by water year (and location if there's more than one)
#and sum and create new data.frame
wy_sums <- yourdataframe %>% group_by(locationcolumn, wateryear) %>%
summarize(wy_totalprecip = sum(dailyprecip))
For more info, read up on the tidyverse 's great sublibrary called lubridate -
where the ymd() function is from. There are others like ymd_hms(). mutate() is from the tidyverse's dplyr libary. Both libraries are extremely useful!
I'd like to give the actual answer to the question, where the aggregate() way was asked.
You may use with() to wrap the data specification around aggregate(). In the with() you can define date intervals as you can with numbers.
df1.agg <- with(df1[as.Date("2008-10-01") <= df1$year & df1$year <= as.Date("2009-09-30"), ],
aggregate(total_snow_cm, by=list(year), FUN=sum))
Another way is to use aggregate()'s formula interface, where data and, hence, also the interval can be specified inside the aggregate() call.
df1.agg <- aggregate(total_snow_cm ~ year,
data=df1[as.Date("2008-10-01") <= df1$year &
df1$year <= as.Date("2009-09-30"), ], FUN=sum)
Result
head(df1.agg)
# year total_snow_cm
# 1 2008-10-01 171
# 2 2008-10-02 226
# 3 2008-10-03 182
# 4 2008-10-04 129
# 5 2008-10-05 135
# 6 2008-10-06 222
Data
set.seed(42)
df1 <- data.frame(total_snow_cm=sample(120:240, 4018, replace=TRUE),
year=seq(as.Date("2000-01-01"),as.Date("2010-12-31"), by="day"))
Related
This seems really simple, yet I can't find an easy solution. I'm working with future streamflow projections for every day of a 25 year period (2024-2050). I'm only interested in streamflow during the 61 day period between 11th of April and 10th of June each year. I want to extract the data from the seq and Data column that are within this period for each year and have it in a data frame together.
Example data:
library(xts)
seq <- timeBasedSeq('2024-01-01/2050-12-31')
Data <- xts(1:length(seq),seq)
I want to achieve something like this BUT with all the dates between April 11 and June 10th and for all years (2024-2050). This is a shortened sample output:
seq_x <- c("2024-04-11","2024-06-10","2025-04-11","2025-06-10","2026-04-11","2027-06-10",
"2027-04-11", "2027-06-10")
Data_x <- c(102, 162, 467, 527, 832, 892, 1197, 1257)
output <- data.frame(seq_x, Data_x)
This question is similar to:
Calculating average for certain time period in every year
and
select date ranges for multiple years in r
but doesn't provide an efficient answer to my question on how to extract the same period over multiple years.
Here is a base R approach :
dates <- index(Data)
month <- as.integer(format(dates, '%m'))
day <- as.integer(format(dates, '%d'))
result <- Data[month == 4 & day >= 11 | month == 5 | month == 6 & day <= 10]
result
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
#...
#...
#2024-06-07 159
#2024-06-08 160
#2024-06-09 161
#2024-06-10 162
#2025-04-11 467
#2025-04-12 468
#...
#...
Create an mmdd character string and subset using it:
mmdd <- format(time(Data), "%m%d")
Data1 <- Data[mmdd >= "0411" & mmdd <= "0610"]
These would also work. They shift the dates back by 10 days in which case it coincides with April and May
Data2 <- Data[format(time(Data)-10, "%m") %in% c("04", "05")]
or
Data3 <- Data[ cycle(as.yearmon(time(Data)-10)) %in% 4:5 ]
The command fortify.zoo(x) can be used to convert an xts object x to a data frame.
Here is an option. Do a group by year of the 'seq_x', then summarise to create a list column by subsetting 'Data' based on the first and last elements of 'seq_x' and select the column
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
output %>%
group_by(year = year(seq_x)) %>%
summarise(new = list(Data[str_c(first(seq_x), last(seq_x), sep="::")]),
.groups = 'drop') %>%
pull(new) %>%
invoke(rbind, .)
# [,1]
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
# ...
I have the following data:
Date,Rain
1979_8_9_0,8.775
1979_8_9_6,8.775
1979_8_9_12,8.775
1979_8_9_18,8.775
1979_8_10_0,0
1979_8_10_6,0
1979_8_10_12,0
1979_8_10_18,0
1979_8_11_0,8.025
1979_8_12_12,0
1979_8_12_18,0
1979_8_13_0,8.025
[1] The data is six hourly but some dates have incomplete 6 hourly data. For example, August 11 1979 has only one value at 00H. I would like to get the daily accumulated from this kind of data using R. Any suggestion on how to do this easily in R?
I'll appreciate any help.
You can transform your data to dates very easily with:
dat$Date <- as.Date(strptime(dat$Date, '%Y_%m_%d_%H'))
After that you should aggregate with:
aggregate(Rain ~ Date, dat, sum)
The result:
Date Rain
1 1979-08-09 35.100
2 1979-08-10 0.000
3 1979-08-11 8.025
4 1979-08-12 0.000
5 1979-08-13 8.025
Based on the comment of Henrik, you can also transform to dates with:
dat$Date <- as.Date(dat$Date, '%Y_%m_%d')
# split the "date" variable into new, separate variable
splitDate <- stringr::str_split_fixed(string = df$Date, pattern = "_", n = 4)
df$Day <- splitDate[,3]
# split data by Day, loop over each split and add rain variable
unlist(lapply(split(df$Rain, df$Day), sum))
I have a data frame with 3 years worth of sales data that I'm trying to convert to a time series. Manually creating subsets for each of the 36 months:
mydfJan2011 <- subset(myDataFrame,
as.Date("2011-01-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2011-01-31"))
...
mydfDec2013 <- subset(myDataFrame,
as.Date("2013-12-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2013-12-31"))
and then summing them up and putting them into a vector
counts[1] <- sum(mydfJan2011$itemsSold)
...
counts[36] <- sum(mydfDec2013$itemsSold))
to get the values for the time series works fine, but I'd like to make it a little more automatic as I have to create more than one time series, so I'm trying to turn it into a loop.
In order to do that, I need to create a string with a subset command like this:
"subset(myDataFrame,
as.Date("2011-01-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2011-01-31"))"
But when I use paste, the result is this:
myString
>"subset(myDataFrame, as.Date(\"2011-02-01\") <= myDataFrame$Dates & myDataFrame$Dates <= as.Date(\"2011-02-28\"))"
and
eval(parse(text = myString))
results in the following error message:
Error in charToDate(x) :
character string is not in a standard unambiguous format
whereas just typing in the command (without escapes) results in the subset I'm trying to create.
I've tried playing around with single and double quotes, substitute and deparse, but none of it results in any kind of subset of my data frame.
Any suggestions?
Even another way of splitting up the data by month and summing it up would be welcome.
Thanks,
Signe
Here is a solution using tapply:
with(sales, tapply(itemsSold, substr(Dates, 1, 7), sum))
Produces monthly sums (I limited my data to 9 months for illustrative purposes, but this extends to longer periods):
2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09
1592.097 1468.427 1594.386 1563.014 1595.489 1560.361 1553.128 1663.705 1325.519
tapply computes the sum of values in a vector (sales$sales) grouped by the values of another vector (substr(sales$date, 1, 7), which is basically "yyyy-mm"). with allows me to avoid me typing sales$ repeatedly. You should almost never have to use eval(parse(...)). There is almost always a better, faster way to do it without resorting to that.
And here is the data I used:
set.seed(1)
sales <- data.frame(Dates=seq(as.Date("2011-01-01"), as.Date("2011-09-30"), by="+1 day"))
sales$itemsSold <- runif(nrow(sales), 1, 100)
For reference, there are also several 3rd party packages that simplify this type of computation (see data.table, dplyr).
Here's a data.table approach that aggregates by year and month, using the first of the month as the respective group label:
library(data.table)
##
mDt <- Dt[
,list(monthSold=sum(itemsSold)),
keyby=list(mDay=as.Date(paste0(
year(Dates),"-",month(Dates),"-01")))]
##
R> head(mDt)
mDay monthSold
1: 2012-01-01 179
2: 2012-02-01 128
3: 2012-03-01 152
4: 2012-04-01 160
5: 2012-05-01 152
6: 2012-06-01 141
Data:
set.seed(123)
Dt <- data.table(
Dates=seq.Date(
from=as.Date("2012-01-01"),
to=as.Date("2014-12-31"),
by="day"),
itemsSold=rpois(1096,5))
I'm trying to load time series in R with the 'zoo' library.
The observations I have varying precision. Some have the day/month/year, others only month and year, and others year:
02/10/1915
1917
07/1917
07/1918
30/08/2018
Subsequently, I need to aggregate the rows by year, year and month.
The basic R as.Date function doesn't handle that.
How can I model this data with zoo?
Thanks,
Mulone
We use the test data formed from the index data in the question followed by a number:
# test data
Lines <- "02/10/1915 1
1917 2
07/1917 3
07/1918 4
30/08/2018 5"
yearly aggregation
library(zoo)
to.year <- function(x) as.numeric(sub(".*/", "", as.character(x)))
read.zoo(text = Lines, FUN = to.year, aggregate = mean)
The last line returns:
1915 1917 1918 2018
1.0 2.5 4.0 5.0
year/month aggregation
Since year/month aggregation of data with no months makes no sense we first drop the year only data and aggregate the rest:
DF <- read.table(text = Lines, as.is = TRUE)
# remove year-only records. DF.ym has at least year and month.
yr <- suppressWarnings(as.numeric(DF[[1]]))
DF.ym <- DF[is.na(yr), ]
# remove day, if present, and convert to yearmon.
to.yearmon <- function(x) as.yearmon( sub("\\d{1,2}/(\\d{1,2}/)", "\\1", x), "%m/%Y" )
read.zoo(DF.ym, FUN = to.yearmon, aggregate = mean)
The last line gives:
Oct 1915 Jul 1917 Jul 1918 Aug 2018
1 3 4 5
UPDATE: simplifications
I struggle mightily with dates in R and could do this pretty easily in SPSS, but I would love to stay within R for my project.
I have a date column in my data frame and want to remove the year completely in order to leave the month and day. Here is a peak at my original data.
> head(ds$date)
[1] "2003-10-09" "2003-10-11" "2003-10-13" "2003-10-15" "2003-10-18" "2003-10-20"
> class((ds$date))
[1] "Date"
I "want" it to be.
> head(ds$date)
[1] "10-09" "10-11" "10-13" "10-15" "10-18" "10-20"
> class((ds$date))
[1] "Date"
If possible, I would love to set the first date to be October 1st instead of January 1st.
Any help you can provide will be greatly appreciated.
EDIT: I felt like I should add some context. I want to plot an NHL player's performance over the course of a season which starts in October and ends in April. To add to this, I would like to facet the plots by each season which is a separate column in my data frame. Because I want to compare cumulative performance over the course of the season, I believe that I need to remove the year portion, but maybe I don't; as I indicated, I struggle with dates in R. What I am looking to accomplish is a plot that compares cumulative performance over relative dates by season and have the x-axis start in October and end in April.
> d = as.Date("2003-10-09", format="%Y-%m-%d")
> format(d, "%m-%d")
[1] "10-09"
Is this what you are looking for?
library(ggplot2)
## make up data for two seasons a and b
a = as.Date("2010/10/1")
b = as.Date("2011/10/1")
a.date <- seq(a, by='1 week', length=28)
b.date <- seq(b, by='1 week', length=28)
## make up some score data
a.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
b.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
## create a data frame
df <- data.frame(a.date, b.date, a.score, b.score)
df
## Since I am using ggplot I better create a "long formated" data frame
df.molt <- melt(df, measure.vars = c("a.score", "b.score"))
levels(df.molt$variable) <- c("First season", "Second season")
df.molt
Then, I am using ggplot2 for plotting the data:
## plot it
ggplot(aes(y = value, x = a.date), data = df.molt) + geom_point() +
geom_line() + facet_wrap(~variable, ncol = 1) +
scale_x_date("Date", format = "%m-%d")
If you want to modify the x-axis (e.g., display format), then you'll probably be interested in scale_date.
You have to remember Date is a numeric format, representing the number of days passed since the "origin" of the internal date counting :
> str(Date)
Class 'Date' num [1:10] 14245 14360 14475 14590 14705 ...
This is the same as in EXCEL, if you want a reference. Hence the solution with format as perfectly valid.
Now if you want to set the first date of a year as October 1st, you can construct some year index like this :
redefine.year <- function(x,start="10-1"){
year <- as.numeric(strftime(x,"%Y"))
yearstart <- as.Date(paste(year,start,sep="-"))
year + (x >= yearstart) - min(year) + 1
}
Testing code :
Start <- as.Date("2009-1-1")
Stop <- as.Date("2011-11-1")
Date <- seq(Start,Stop,length.out=10)
data.frame( Date=as.character(Date),
year=redefine.year(Date))
gives
Date year
1 2009-01-01 1
2 2009-04-25 1
3 2009-08-18 1
4 2009-12-11 2
5 2010-04-05 2
6 2010-07-29 2
7 2010-11-21 3
8 2011-03-16 3
9 2011-07-09 3
10 2011-11-01 4