Aggregate Daily Data to Month/Year intervals - datetime

I don't often have to work with dates in R, but I imagine this is fairly easy. I have a column that represents a date in a dataframe. I simply want to create a new dataframe that summarizes a 2nd column by Month/Year using the date. What is the best approach?
I want a second dataframe so I can feed it to a plot.
Any help you can provide will be greatly appreciated!
EDIT: For reference:
> str(temp)
'data.frame': 215746 obs. of 2 variables:
$ date : POSIXct, format: "2011-02-01" "2011-02-01" "2011-02-01" ...
$ amount: num 1.67 83.55 24.4 21.99 98.88 ...
> head(temp)
date amount
1 2011-02-01 1.670
2 2011-02-01 83.550
3 2011-02-01 24.400
4 2011-02-01 21.990
5 2011-02-03 98.882
6 2011-02-03 24.900

I'd do it with lubridate and plyr, rounding dates down to the nearest month to make them easier to plot:
library(lubridate)
df <- data.frame(
date = today() + days(1:300),
x = runif(300)
)
df$my <- floor_date(df$date, "month")
library(plyr)
ddply(df, "my", summarise, x = mean(x))

There is probably a more elegant solution, but splitting into months and years with strftime() and then aggregate()ing should do it. Then reassemble the date for plotting.
x <- as.POSIXct(c("2011-02-01", "2011-02-01", "2011-02-01"))
mo <- strftime(x, "%m")
yr <- strftime(x, "%Y")
amt <- runif(3)
dd <- data.frame(mo, yr, amt)
dd.agg <- aggregate(amt ~ mo + yr, dd, FUN = sum)
dd.agg$date <- as.POSIXct(paste(dd.agg$yr, dd.agg$mo, "01", sep = "-"))

A bit late to the game, but another option would be using data.table:
library(data.table)
setDT(temp)[, .(mn_amt = mean(amount)), by = .(yr = year(date), mon = month(date))]
# or if you want to apply the 'mean' function to several columns:
# setDT(temp)[, lapply(.SD, mean), by=.(year(date), month(date))]
this gives:
yr mon mn_amt
1: 2011 2 42.610
2: 2011 3 23.195
3: 2011 4 61.891
If you want names instead of numbers for the months, you can use:
setDT(temp)[, date := as.IDate(date)
][, .(mn_amt = mean(amount)), by = .(yr = year(date), mon = months(date))]
this gives:
yr mon mn_amt
1: 2011 februari 42.610
2: 2011 maart 23.195
3: 2011 april 61.891
As you see this will give the month names in your system language (which is Dutch in my case).
Or using a combination of lubridate and dplyr:
library(lubridate)
library(dplyr)
temp %>%
group_by(yr = year(date), mon = month(date)) %>%
summarise(mn_amt = mean(amount))
Used data:
# example data (modified the OP's data a bit)
temp <- structure(list(date = structure(1:6, .Label = c("2011-02-01", "2011-02-02", "2011-03-03", "2011-03-04", "2011-04-05", "2011-04-06"), class = "factor"),
amount = c(1.67, 83.55, 24.4, 21.99, 98.882, 24.9)),
.Names = c("date", "amount"), class = c("data.frame"), row.names = c(NA, -6L))

You can do it as:
short.date = strftime(temp$date, "%Y/%m")
aggr.stat = aggregate(temp$amount ~ short.date, FUN = sum)

Just use xts package for this.
library(xts)
ts <- xts(temp$amount, as.Date(temp$date, "%Y-%m-%d"))
# convert daily data
ts_m = apply.monthly(ts, FUN)
ts_y = apply.yearly(ts, FUN)
ts_q = apply.quarterly(ts, FUN)
where FUN is a function which you aggregate data with (for example sum)

Here's a dplyr option:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
mutate(ym = format(date, '%Y-%m')) %>%
group_by(ym) %>%
summarize(ym_mean = mean(x))

I have a function monyr that I use for this kind of stuff:
monyr <- function(x)
{
x <- as.POSIXlt(x)
x$mday <- 1
as.Date(x)
}
n <- as.Date(1:500, "1970-01-01")
nn <- monyr(n)
You can change the as.Date at the end to as.POSIXct to match the date format in your data. Summarising by month is then just a matter of using aggregate/by/etc.
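As a rough sketch of that last step, assuming a data frame shaped like the question's temp (the toy values below are made up for illustration), the month-floored dates from monyr can be handed to aggregate as the grouping variable:

```r
# monyr as defined above: snap each date to the first day of its month
monyr <- function(x) {
  x <- as.POSIXlt(x)
  x$mday <- 1
  as.Date(x)
}

# toy data standing in for the question's 'temp' (values are illustrative)
temp <- data.frame(
  date   = as.Date(c("2011-02-01", "2011-02-03", "2011-03-04", "2011-03-10")),
  amount = c(1.5, 2.5, 10, 20)
)

# group on the first-of-month date and sum the amounts
temp$month <- monyr(temp$date)
monthly <- aggregate(amount ~ month, data = temp, FUN = sum)
monthly
```

Because the grouping column is still a Date, the result can go straight to a plotting function without rebuilding the date from separate month and year parts.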

One more solution:
rowsum(temp$amount, format(temp$date,"%Y-%m"))
For plot you could use barplot:
barplot(t(rowsum(temp$amount, format(temp$date,"%Y-%m"))), las=2)

Also, given that your time series seem to be in xts format, you can aggregate your daily time series to a monthly time series using the mean function like this:
d2m <- function(x) {
aggregate(x, format(as.Date(zoo::index(x)), "%Y-%m"), FUN=mean)
}


I want to return a season and year value from a continuous list of dates

I have a continuous list of dates (yyyy-mm-dd) from 1985 to 2018 in one column (Colname = date). What I wish to do is generate another column which outputs a water season and year given the date.
To make it clearer I have two water season:
Summer = yyyy-04-01 to yyyy-09-30;
Winter = yyyy-10-01 to yyyy(+1)-03-31.
So for 2018 - Summer = 2018-04-01 to 2018-09-30; Winter 2018-10-01 to 2019-03-31.
What I would like to output is something like the following:
Many thanks.
A tidyverse approach
library(tidyverse)
df <-tibble(date = seq(from = as.Date('2000-01-01'), to = as.Date('2001-12-31'), by = '1 month'))
df
df %>%
mutate(water_season_year = case_when(
lubridate::month(date) %in% c(4:9) ~str_c('Su_', lubridate::year(date)),
lubridate::month(date) %in% c(10:12) ~str_c('Wi_', lubridate::year(date)),
lubridate::month(date) %in% c(1:3)~str_c('Wi_', lubridate::year(date) -1),
TRUE ~ 'Error'))
You can compare just the month part of the data to get the season, in base R consider doing
month <- as.integer(format(df$date, "%m"))
year <- format(df$date, "%Y")
inds <- month >= 4 & month <= 9
df$water_season_year <- NA
df$water_season_year[inds] <- paste("Su", year[inds], sep = "_")
df$water_season_year[!inds] <- paste("Wi", year[!inds], sep = "_")
#To add previous year for month <= 3 do
df$water_season_year[month <= 3] <- paste("Wi",
as.integer(year[month <= 3]) - 1, sep = "_")
df
# date water_season_year
#1 2019-01-03 Wi_2018
#2 2000-06-01 Su_2000
Make sure that date variable is of "Date" class.
data
df <-data.frame(date = as.Date(c("2019-01-03", "2000-06-01")))
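The same logic can be wrapped in one small vectorised function. This is only a sketch: water_season is a hypothetical helper name, and it assumes the season rules from the question (April to September is summer, October to March is winter, with January to March counted towards the previous year's winter).

```r
# hypothetical helper: label each date with its water season and season year
water_season <- function(d) {
  d <- as.Date(d)
  m <- as.integer(format(d, "%m"))
  y <- as.integer(format(d, "%Y"))
  season <- ifelse(m >= 4 & m <= 9, "Su", "Wi")
  # January-March belong to the winter that started the previous October
  season_year <- ifelse(m <= 3, y - 1L, y)
  paste(season, season_year, sep = "_")
}

water_season(c("2019-01-03", "2000-06-01", "2018-10-15"))
# [1] "Wi_2018" "Su_2000" "Wi_2018"
```

It can then be applied to a column directly, e.g. df$water_season_year <- water_season(df$date).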

Format Date to Year-Month in R

I would like to retain my current date column in year-month format as date. It currently gets converted to chr format. I have tried as_datetime but it coerces all values to NA.
The format I am looking for is: "2017-01"
library(lubridate)
df<- data.frame(Date=c("2017-01-01","2017-01-02","2017-01-03","2017-01-04",
"2018-01-01","2018-01-02","2018-02-01","2018-03-02"),
N=c(24,10,13,12,10,10,33,45))
df$Date <- as_datetime(df$Date)
df$Date <- ymd(df$Date)
df$Date <- strftime(df$Date,format="%Y-%m")
Thanks in advance!
lubridate only handles dates, and dates have days. However, as alistaire mentions, you can floor them by month if you want to work monthly:
library(tidyverse)
df_month <-
df %>%
mutate(Date = floor_date(as_date(Date), "month"))
If you e.g. want to aggregate by month, just group_by() and summarize().
df_month %>%
group_by(Date) %>%
summarize(N = sum(N)) %>%
ungroup()
#> # A tibble: 4 x 2
#> Date N
#> <date> <dbl>
#>1 2017-01-01 59
#>2 2018-01-01 20
#>3 2018-02-01 33
#>4 2018-03-01 45
You can solve this with zoo::as.yearmon() function. Follows the solution:
library(tidyquant)
library(magrittr)
library(dplyr)
df <- data.frame(Date=c("2017-01-01","2017-01-02","2017-01-03","2017-01-04",
"2018-01-01","2018-01-02","2018-02-01","2018-03-02"),
N=c(24,10,13,12,10,10,33,45))
df %<>% mutate(Date = zoo::as.yearmon(Date))
You can use the cut function with breaks="month" to map every date to the first day of its month, so any two dates within the same month get the same value in the newly created column.
This is useful for grouping the other variables in your data frame by month (essentially what you are trying to do). Note that cut creates a factor, but this can be converted back to a date, so you can still have the date class in your data frame.
You just can't get rid of the day in a date (because then it is not a date any more). Afterwards you can create a nice format for axes or tables. For example:
true_date <-
as.POSIXlt(
c(
"2017-01-01",
"2017-01-02",
"2017-01-03",
"2017-01-04",
"2018-01-01",
"2018-01-02",
"2018-02-01",
"2018-03-02"
),
format = "%F"
)
df <-
data.frame(
Date = cut(true_date, breaks = "month"),
N = c(24, 10, 13, 12, 10, 10, 33, 45)
)
## here df$Date is a 'factor'. You could use substr to create a formatted column
df$formated_date <- substr(df$Date, start = 1, stop = 7)
## and you can convert back to date class. format = "%F", is ISO 8601 standard date format
df$true_date <- strptime(x = as.character(df$Date), format = "%F")
str(df)
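A shorter sketch of the same round trip, assuming the dates are stored as Date rather than POSIXlt (the variable names below are illustrative only):

```r
dates <- as.Date(c("2017-01-01", "2017-01-04", "2018-02-01", "2018-03-02"))

# cut() returns a factor whose labels are the first day of each month
month_factor <- cut(dates, breaks = "month")

# back to Date class, one value per original row
month_start <- as.Date(as.character(month_factor))

# display-only label; this is a character vector, not a date
month_label <- format(month_start, "%Y-%m")
month_label
# [1] "2017-01" "2017-01" "2018-02" "2018-03"
```

Keeping month_start for grouping and month_label only for display avoids the problem from the question, where the column silently turns into chr.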

Summarize daily data which lacks explicit grouping variable (month)

I have dataframe that has 6000 locations. For each location, I have 36 years daily data of rainfall in wide format.
A sample data:
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
What I want to do is for each location, generate the long term average rainfall for each month. For e.g. for loc.id = 1, mean rainfall in Jan, Feb, March... Dec.
Let's say this data is called dat, which is a data.table.
library(data.table)
library(dplyr)
library(tidyr)
Here's what I did:
loc.list <- unique(dat$loc.id)
my.list <- list() # a list to store results
ptm <- proc.time()
for(i in seq_along(loc.list)){
n <- loc.list[i]
df1 <- dat[dat$loc.id == n,]
df2 <- gather(df1, day, rain, -year) # this melts the data in long format
df3 <- df2 %>% mutate(day = gsub("d_","", day)) %>% # since the day column was in "d_1" format, I converted into integer (1,2,3..365)
mutate(day = as.numeric(as.character(day))) %>% # ensure that day column is numeric. For some reason, some NAs appear.
arrange(year,day) %>% # ensure that they are arranged in order
mutate(month = strptime(paste(year, day), format = "%Y %j")$mon + 1) %>% # assign each day to a month
group_by(year,month) %>% # group by year and month
summarise(month.rain = sum(rain)) %>% # calculate for each location, year and month, total rainfall
group_by(month) %>% # group by month
summarise(month.mean = round(mean(month.rain), digits = 2)) # calculate for each month, the long term mean
my.list[[i]] <- df3
}
proc.time() - ptm
user system elapsed
1036.17 0.20 1040.68
I wanted to ask if there are more efficient and faster way to achieve this task
Another data.table alternative:
# change column names to month, grabbed from 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month)]
This seems ~7 times faster than the answer by caw5cs. The result by Martin Morgan is in a different format though, which prevents a direct comparison of timings.
If you would rather have unique column names in 'dat', you may use %b_%d (month-day) instead of %b only. Then use substr in by to grab the month part:
# change column names to month_day, using 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b_%d"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month_day", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month = substr(month_day, 1, 3))]
Use the cryptically named rowsum() to sum daily rainfall at each site, over all years
loc.id = rep(1:6000, each = 36)
daily.by.loc = rowsum(mat, loc.id)
and use the same trick on the transposed matrix to sum by month (since there are 365 columns leap years must be ignored).
month = factor(
months(as.Date(0:364, origin="1970-01-01")),
levels = month.name
)
loc.by.month = rowsum(t(daily.by.loc), month)
Calculate the average by dividing by number of observations; R's column-major matrix representation and recycling rules apply. Transpose so the orientation is the same as the data.
days.per.month = tabulate(month)
ans = t(loc.by.month / (36 * days.per.month))
The result is a 6000 x 12 matrix
> dim(ans)
[1] 6000 12
> head(ans, 3)
January February March April May June
1 0.01554659 0.002043651 -0.02950717 -0.02700926 0.003521505 -0.011268519
2 0.04953405 0.032926587 -0.04959677 0.02808333 0.022051971 0.009768519
3 -0.01125448 -0.023343254 -0.02672939 0.04012963 0.018530466 0.035583333
July August September October November December
1 0.009874552 -0.030824373 -0.04958333 -0.03366487 -0.07390741 -0.07899642
2 -0.011630824 -0.003369176 -0.00100000 -0.00594086 -0.02817593 -0.01161290
3 0.031810036 0.059641577 -0.01109259 0.04646953 -0.01601852 0.03103943
in less than a second.
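To convince yourself that the recycling in the division does what is claimed, here is a scaled-down sketch (3 locations and 4 years are arbitrary toy sizes) that compares one cell against a brute-force mean. Note that it swaps months() for month numbers from format(..., "%m"), an assumption made here so the example does not depend on the system locale.

```r
set.seed(1)
n_loc <- 3  # toy number of locations
n_yr  <- 4  # toy number of years per location
mat <- matrix(rnorm(n_loc * n_yr * 365), nrow = n_loc * n_yr, ncol = 365)
loc.id <- rep(seq_len(n_loc), each = n_yr)

# month number (1-12) of each day column in a non-leap year
month <- as.integer(format(as.Date(0:364, origin = "1970-01-01"), "%m"))

daily.by.loc   <- rowsum(mat, loc.id)             # n_loc x 365: summed over years
loc.by.month   <- rowsum(t(daily.by.loc), month)  # 12 x n_loc: summed over days
days.per.month <- tabulate(month)

# the length-12 divisor is recycled down each column, i.e. one value per month row
ans <- t(loc.by.month / (n_yr * days.per.month))  # n_loc x 12

# brute-force check: location 2, March
check <- mean(mat[loc.id == 2, month == 3])
all.equal(ans[2, 3], check)
```

The division only lines up because the months sit in the rows of loc.by.month at that point; dividing after the transpose would recycle the divisor over locations instead.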
Grossly misread the question the first time, oops! Seems to be working as intended this time.
library(data.table)
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
system.time({
# convert to long format with month # as column name
date_cols <- colnames(dat)[1:365]
setnames(dat, date_cols, as.character(1:365))
dat.long <- melt(dat, measure.vars=as.character(1:365), variable="day", value="rainfall")
# R date starts at 0 for Jan 1, so we offset the day by 1
dat.long[, day := as.numeric(day) - 1]
setkey(dat.long, year, day)
# Make table for merging year/day/month
months <- CJ(year=1980:2015, day=0:365)
months[, date := as.Date(day, origin=paste(year, "-01-01", sep=""))]
months[, month := tstrsplit(date, "-")[2]]
setkey(months, year, day)
# Merge tables to get month column
dat.merge <- merge(dat.long, months)
# aggregate by location and month
dat.ag <- dat.merge[, list(mean_rainfall = mean(rainfall)), by=list(loc.id, month)]
})
Yielding
user system elapsed
14.420 4.205 18.626
> dat.ag
loc.id month mean_rainfall
1: 1 01 0.015546595
2: 2 01 0.049534050
3: 3 01 -0.011254480
4: 4 01 -0.019453405
5: 5 01 0.005860215
---
71996: 5996 12 0.027407407
71997: 5997 12 0.020334237
71998: 5998 12 0.043360434
71999: 5999 12 -0.006856369
72000: 6000 12 0.040542005

Format date as Year/Quarter

I have the following dataframe:
Data <- data.frame(
date = c("2001-01-01", "2001-02-01", "2001-03-01", "2001-04-01", "2001-05-01", "2001-06-01"),
qtr = c("NA", "NA","NA","NA","NA","NA")
)
I want to fill Data$qtr with Year/Quarter, e.g. 01/01 (I need this format!).
I wrote a function:
fun <- function(x) {
if(x == "2001-01-01" | x == "2001-02-01" | x == "2001-03-01") y <- "01/01"
if(x == "2001-04-01" | x == "2001-05-01" | x == "2001-06-01") y <- "01/02"
return(y)
}
Data$qtr <- sapply(Data$date, fun)
But it does not work. I always get the error message:
Error in FUN(X[[1L]], ...) : Object 'y' not found
Why?
You need to explicitly Vectorize your function:
fun_v <- Vectorize(fun, "x")
fun_v(Data$date)
#[1] "01/01" "01/01" "01/01" "01/02" "01/02" "01/02"
However, when it comes to more or less standard tasks (such as datetime manipulations), there's always a solution already available:
library(zoo)
yq <- as.yearqtr(Data$date, format = "%Y-%m-%d")
yq
#[1] "2001 Q1" "2001 Q1" "2001 Q1" "2001 Q2" "2001 Q2" "2001 Q2"
To convert to your specific format, use
format(yq, format = "%y/0%q")
#[1] "01/01" "01/01" "01/01" "01/02" "01/02" "01/02"
I have been loving the lubridate package for working with dates. Super slick. The quarter function finds the quarter (of course) and then just pair that with the year.
library(lubridate)
library(dplyr)
Data <- Data %>%
mutate(qtr = paste0(substring(year(date),3,4),"/0",quarter(date)))
If you are not familiar with the %>% from magrittr the first line basically says "use data frame called Data" and the second line says "mutate (or add) a column called qtr"
EDIT 2021-Q2
If the "YY/QQ" format is not critical, then a quick and safe way to get the year and the quarter is:
library(lubridate)
Data %>%
mutate(qtr = quarter(date, with_year = TRUE))
Using base functions:
Data$date <- as.Date(Data$date)
Data$qtr <- paste(format(Data$date, "%y"),
sprintf("%02i", (as.POSIXlt(Data$date)$mon) %/% 3L + 1L),
sep="/")
# date qtr
# 1 2001-01-01 01/01
# 2 2001-02-01 01/01
# 3 2001-03-01 01/01
# 4 2001-04-01 01/02
# 5 2001-05-01 01/02
# 6 2001-06-01 01/02
another option would be:
Data$qtr <- lubridate::quarter(Data$date, with_year = TRUE)
I made a similar format using quarters() and sub() in R:
Data$qtr <- paste(format(Data$date, "%y/"), 0,
sub( "Q", "", quarters(Data$date) ), sep = "")
The data.table package and its IDate class have some nice convenience functions (e.g. quarter(), year()), similar to those of lubridate available. paste0() them together as you please.
Data <- data.frame(
date = c("2001-01-01", "2001-02-01", "2001-03-01",
"2001-04-01", "2001-05-01", "2001-06-01")
)
require(data.table)
setDT(Data)
Data[ , date := as.IDate(date) ] # data.table's integer-based date class
Data[ , qtr := paste0(year(date), '/', quarter(date)) ]
# your specific format
Data[ , qt2 := paste0(substr(year(date),3,4), '/', '0', quarter(date)) ]
tidyverse's clock gives an alternative, with the advantage of not losing the original day, with calendar_narrow and date precision:
library(clock)
library(dplyr)
#Function to convert a date to year-quarter
toQ <- . %>%
date_parse() %>%
as_year_quarter_day() %>%
calendar_narrow("quarter")
Data %>%
mutate(qtr = toQ(date))
output
date qtr
1 2001-01-01 2001-Q1
2 2001-02-01 2001-Q1
3 2001-03-01 2001-Q1
4 2001-04-01 2001-Q2
5 2001-05-01 2001-Q2
6 2001-06-01 2001-Q2
Another (longer) way of doing it using if statements is this:
month <- as.numeric(format(date, format = "%m"))[1]
if (month < 4) {
quarter <- paste( format(date, format = "%Y")[1], "Q1", sep="-")
} else if (month > 3 & month < 7) {
quarter <- paste( format(date, format = "%Y")[1], "Q2", sep="-")
} else if (month > 6 & month < 10) {
quarter <- paste( format(date, format = "%Y")[1], "Q3", sep="-")
} else if (month > 9) {
quarter <- paste( format(date, format = "%Y")[1], "Q4", sep="-")
}
Returns a string in the format:
> quarter
[1] "2001-Q1"
Then you could extend that using a loop.
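If the loop is only there to handle a whole column of dates, a vectorised sketch may be enough. quarter_label is a hypothetical helper name, and it relies on base R's quarters(), which returns "Q1" to "Q4":

```r
# hypothetical helper: "YYYY-Qn" labels for a whole vector of dates
quarter_label <- function(d) {
  d <- as.Date(d)
  paste(format(d, "%Y"), quarters(d), sep = "-")
}

quarter_label(c("2001-02-01", "2001-05-15", "2001-11-30"))
# [1] "2001-Q1" "2001-Q2" "2001-Q4"
```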
yq <- function(x, prefix = "%Y", combine = "Q")
  paste(if (is.null(prefix)) "" else format(x, prefix), floor(as.numeric(format(x, "%m")) / 3 - 1e-3) + 1, sep = combine)
this gives the flexibility of returning any format back that has the quarter in it
no need for chron or zoo
as for your example
yq(as.Date("2013-04-30"),prefix="%y",combine="/0")
> [1] "13/02"
In case someone is looking for a format such as 1Q21 for the 1st quarter of 2021, I took @Roland's answer above and made a small change:
paste(sprintf("%i", (as.POSIXlt(Notional$Date)$mon) %/% 3L + 1L), format(Notional$Date, "%y"),sep="Q")

R: ddply repeats yearly cumulative data

Related to this question here, but I decided to ask another question for the sake of clarity as the 'new' question is not directly related to the original. Briefly, I am using ddply to cumulatively sum a value for each of three years. My code takes data from the first year and repeats it in the second and third-year rows of the column. My guess is that each 1-year chunk is being copied to the whole of the column, but I don't understand why.
Q. How can I get a cumulatively summed value for each year, in the right rows of the designated column?
[Edit: the for loop - or something similar - is important, as ultimately I want to automagically calculate new columns based on a list of column names, rather than calculating each new column by hand. The loop iterates over the list of column names.]
I use the ddply and cumsum combination frequently so it is rather vexing to suddenly be having problems with it.
[Edit: this code has been updated to the solution I settled on, which is based on @Chase's answer below]
require(lubridate)
require(plyr)
require(xts)
require(reshape)
require(reshape2)
set.seed(12345)
# create dummy time series data
monthsback <- 24
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback),
myvalue1 = runif(monthsback, min = 600, max = 800),
myvalue2 = runif(monthsback, min = 1900, max = 2400),
myvalue3 = runif(monthsback, min = 50, max = 80),
myvalue4 = runif(monthsback, min = 200, max = 300))
mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))
# Select columns to process
newcolnames <- c('myvalue1','myvalue4','myvalue2')
# melt n' cast
mydf.m <- mydf[,c('mydate','year',newcolnames)]
mydf.m <- melt(mydf.m, measure.vars = newcolnames)
mydf.m <- ddply(mydf.m, c("year", "variable"), transform, newcol = cumsum(value))
mydf.m <- dcast(mydate ~ variable, data = mydf.m, value.var = "newcol")
colnames(mydf.m) <- c('mydate',paste(newcolnames, "_cum", sep = ""))
mydf <- merge(mydf, mydf.m, by = 'mydate', all = FALSE)
mydf
I don't really follow your for loop there, but are you overcomplicating things? Can't you just directly use transform and ddply?
#Make sure it's ordered properly
mydf <- mydf[order(mydf$year, mydf$month),]
#Use ddply to calculate the cumsum by year:
ddply(mydf, "year", transform,
cumsum1 = cumsum(myvalue1),
cumsum2 = cumsum(myvalue2))
#----------
mydate myvalue1 myvalue2 year month cumsum1 cumsum2
1 2010-05-01 744.1808 264.4543 2010 5 744.1808 264.4543
2 2010-06-01 775.1546 238.9828 2010 6 1519.3354 503.4371
3 2010-07-01 752.1965 269.8544 2010 7 2271.5319 773.2915
....
9 2011-01-01 745.5411 218.7712 2011 1 745.5411 218.7712
10 2011-02-01 797.9474 268.1834 2011 2 1543.4884 486.9546
11 2011-03-01 606.9071 237.0104 2011 3 2150.3955 723.9650
...
21 2012-01-01 690.7456 225.9681 2012 1 690.7456 225.9681
22 2012-02-01 665.3505 232.1225 2012 2 1356.0961 458.0906
23 2012-03-01 793.0831 206.0195 2012 3 2149.1792 664.1101
EDIT - this is untested as I don't have R on this machine, but this is what I had in mind:
require(reshape2)
mydf.m <- melt(mydf, measure.vars = newcolnames)
mydf.m <- ddply(mydf.m, c("year", "variable"), transform, newcol = cumsum(value))
dcast(mydate + year + month ~ variable, data = mydf.m, value.var = "newcol")
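If the melt/cast round trip feels heavy, a base R sketch with ave() may also do the job while still looping over a list of column names. This is a swapped-in alternative to ddply, not what the answer above uses, and the toy data and the "_cum" suffix are assumptions for illustration:

```r
# toy data: two years of monthly values (illustrative numbers)
mydf <- data.frame(
  year     = rep(c(2010, 2011), each = 3),
  month    = rep(1:3, times = 2),
  myvalue1 = c(1, 2, 3, 10, 20, 30),
  myvalue2 = c(5, 5, 5, 1, 1, 1)
)
newcolnames <- c("myvalue1", "myvalue2")

# make sure rows are in calendar order before accumulating
mydf <- mydf[order(mydf$year, mydf$month), ]

# ave() applies cumsum within each year and keeps the original row order
for (col in newcolnames) {
  mydf[[paste0(col, "_cum")]] <- ave(mydf[[col]], mydf$year, FUN = cumsum)
}
mydf
```

Each new column restarts at the first row of every year, which is the behaviour the question was after.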
