How to aggregate/sum values by time in R

There are two issues:
I have time data in factor format and I want to change it into date format for later manipulation.
The goal is to sum precipitation values over the same time unit, e.g. precipitation per hour.
I tried to convert the time using as.POSIXct() or lubridate's as_date(), but I always get NA values after defining the format. This is the code I used:
tt <- as.POSIXct(FixUseNew$StartTimestamp)
df$time <- as.Date(df$time, "%d-%m-%Y")
If I leave out the format and do the following:
tt=as.POSIXct(df$time)
tt
hour(tt)
The date data looks like this now: "0010-07-14 00:38:00 LMT"
I wanted to use the aggregate function to sum the precipitation within the same hour or day, but couldn't because I am stuck on the date format.
Just a brain dump: I was going to change the factor date into character and then into date format, as follows. Please advise if that is a good idea.
df$time <- paste(substr(df$time, 6, 7),
                 substr(df$time, 9, 10),
                 substr(df$time, 1, 4), sep = "/")
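A simpler route (a sketch, assuming the factor levels look like the sample below, e.g. "10/7/2014 12:30:00 am") may be to convert the factor to character and hand as.POSIXct() the full format string, including the time and am/pm fields:
# hypothetical column df$time holding text such as "10/7/2014 12:30:00 am"
df$time <- as.POSIXct(as.character(df$time), format = "%d/%m/%Y %I:%M:%S %p")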
Here is a subset of the data, hope this helps to illustrate the question better:
Id <- c(1,2,3,4)
Time <- c("10/7/2014 12:30:00 am", "10/7/2014 01:00:05 am","10/7/2014 01:30:10 am", "10/7/2014 02:00:15 am")
Precipitation <- c(0.06, 0.02,0,0.25)
cbind(Id, Time, Precipitation)
Thank you so much.
Here is the outcome:
It seems like the order is distorted:
6 1/1/15 0:35 602
7 1/1/15 0:36 582
8 1/1/15 0:37 958
9 1/1/15 0:38 872
10 1/10/14 0:31 500
11 1/10/14 0:32 571
12 1/10/14 0:33 487
13 1/10/14 0:34 220
14 1/10/14 0:35 550
15 1/10/14 0:36 582
16 1/10/14 0:37 524
17 1/10/14 0:38 487
⋮
106 10/10/14 15:16 494
107 10/10/14 7:53 37
108 10/10/14 7:56 24
109 10/10/14 8:01 3
110 10/11/14 0:30 686
111 10/11/14 0:31 592
112 10/11/14 0:32 368
113 10/11/14 0:33 702
114 10/11/14 0:34 540
115 10/11/14 0:35 564
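The order looks distorted because the timestamps are still text, so "1/1/15" sorts before "1/10/14" alphabetically. Parsing them first restores chronological order (a sketch, assuming day/month/year order):
library(dplyr)
library(lubridate)

df %>%
  mutate(Time = dmy_hm(Time)) %>%  # parse text such as "1/10/14 0:31" into POSIXct
  arrange(Time)                    # sorts by actual time once parsed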

Using the dplyr and lubridate packages, we can extract the hour from each Time and sum.
library(dplyr)
library(lubridate)
df %>%
  mutate(hour = hour(dmy_hms(Time))) %>%
  group_by(hour) %>%
  summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
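For reference, with the sample data at the bottom of this answer (where 12:30 am falls in hour 0), this should give approximately:
#  hour Precipitation
#1    0          0.06
#2    1          0.02
#3    2          0.25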
For aggregation by date, we can do
df %>%
  mutate(day = as.Date(dmy_hms(Time))) %>%
  group_by(day) %>%
  summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
Using base R, we could do
df$Hour <- format(as.POSIXct(df$Time, format = "%d/%m/%Y %I:%M:%S %p"), "%H")
df$Day <- as.Date(as.POSIXct(df$Time, format = "%d/%m/%Y %I:%M:%S %p"))
#Aggregation by hour
aggregate(Precipitation~Hour, df, sum, na.rm = TRUE)
#Aggregation by date
aggregate(Precipitation~Day, df, sum, na.rm = TRUE)
EDIT
Based on updated data and information, we can do
df <- readxl::read_xlsx("/path/to/file/df (1).xlsx")

hour_df <- df %>%
  group_by(hour = hour(Time)) %>%
  summarise(Precipitation = sum(Precipitation, na.rm = TRUE))

day_df <- df %>%
  mutate(day = as.Date(Time)) %>%
  group_by(day) %>%
  summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
So hour_df has the hourly sum of values without taking the date into consideration, and day_df has the sum of Precipitation for each day.
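If you instead want hour-within-day totals rather than pooling the same hour across all dates, a sketch is to group by both:
day_hour_df <- df %>%
  mutate(day = as.Date(Time), hour = hour(Time)) %>%
  group_by(day, hour) %>%
  summarise(Precipitation = sum(Precipitation, na.rm = TRUE))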
data
Id <- c(1,2,3,4)
Time <- c("10/7/2014 12:30:00 am", "10/7/2014 01:00:05 am",
          "10/7/2014 01:30:10 am", "10/7/2014 02:00:15 am")
Precipitation <- c(0.06, 0.02,0,0.25)
df <- data.frame(Id, Time, Precipitation)

Related

How to calculate a mean for a date interval in R?

I have a dataset (data.weather) with one weather variable (TMAX) for two locations (combinations of LAT and LON) and two years. TMAX is available for ten days per year and location in this mock example. I need to calculate the mean TMAX (mean_TMAX) for each of the four rows in data.locs. This last dataset indicates the range of dates for which I need to calculate the mean, that is, between DATE_0 and DATE_1.
Here is the code of what I did:
library(dplyr)
library(lubridate)
data.weather <- read.csv(text = "
LAT,LON,YEAR,DATE,TMAX
36,-89,2010,1/1/2010,25
36,-89,2010,1/2/2010,25
36,-89,2010,1/3/2010,25
36,-89,2010,1/4/2010,28
36,-89,2010,1/5/2010,28
36,-89,2010,1/6/2010,29
36,-89,2010,1/7/2010,25
36,-89,2010,1/8/2010,25
36,-89,2010,1/9/2010,25
36,-89,2010,1/10/2010,28
36,-89,2011,1/1/2011,26
36,-89,2011,1/2/2011,25
36,-89,2011,1/3/2011,28
36,-89,2011,1/4/2011,26
36,-89,2011,1/5/2011,27
36,-89,2011,1/6/2011,27
36,-89,2011,1/7/2011,28
36,-89,2011,1/8/2011,29
36,-89,2011,1/9/2011,27
36,-89,2011,1/10/2011,26
40,-96,2010,1/1/2010,29
40,-96,2010,1/2/2010,28
40,-96,2010,1/3/2010,25
40,-96,2010,1/4/2010,25
40,-96,2010,1/5/2010,28
40,-96,2010,1/6/2010,29
40,-96,2010,1/7/2010,26
40,-96,2010,1/8/2010,28
40,-96,2010,1/9/2010,26
40,-96,2010,1/10/2010,25
40,-96,2011,1/1/2011,29
40,-96,2011,1/2/2011,27
40,-96,2011,1/3/2011,29
40,-96,2011,1/4/2011,25
40,-96,2011,1/5/2011,28
40,-96,2011,1/6/2011,29
40,-96,2011,1/7/2011,29
40,-96,2011,1/8/2011,25
40,-96,2011,1/9/2011,25
40,-96,2011,1/10/2011,26
") %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y"))
data.locs <- read.csv(text = "
LAT,LON,YEAR,DATE_0,DATE_1,GEN,PR
36,-89,2010,1/2/2010,1/9/2010,MN103,35
36,-89,2011,1/1/2011,1/10/2011,IA100,33
40,-96,2010,1/4/2010,1/8/2010,MN103,36
40,-96,2011,1/2/2011,1/6/2011,IA100,34
") %>%
  mutate(DATE_0 = as.Date(DATE_0, format = "%m/%d/%Y"),
         DATE_1 = as.Date(DATE_1, format = "%m/%d/%Y"))
tmax.calculation <- data.locs %>%
  group_by(LAT, LON, YEAR, GEN) %>%
  mutate(mean_TMAX = mean(data.weather$TMAX[data.weather$DATE %within% interval(DATE_0, DATE_1)]))
This is the expected result:
LAT LON YEAR DATE_0 DATE_1 GEN PR mean_TMAX
36 -89 2010 1/2/2010 1/9/2010 MN103 35 26.25
36 -89 2011 1/1/2011 1/10/2011 IA100 33 26.90
40 -96 2010 1/4/2010 1/8/2010 MN103 36 27.20
40 -96 2011 1/2/2011 1/6/2011 IA100 34 27.60
However, this is what I am getting:
LAT LON YEAR DATE_0 DATE_1 GEN PR mean_TMAX
36 -89 2010 1/2/2010 1/9/2010 MN103 35 26.5625
36 -89 2011 1/1/2011 1/10/2011 IA100 33 27.0500
40 -96 2010 1/4/2010 1/8/2010 MN103 36 27.1000
40 -96 2011 1/2/2011 1/6/2011 IA100 34 27.1000
The problem is that, when reading the date interval from data.weather, the calculation is made over the correct interval BUT across both locations (combinations of LAT and LON). I couldn't find a way to calculate the mean for each LAT/LON combination separately.
This should do it:
Using the same data.weather and data.locs as defined in the question:
tmax.calculation <- data.locs %>%
  group_by(LAT, LON, YEAR, GEN) %>%
  do(data.frame(LAT = .$LAT,
                LON = .$LON,
                YEAR = .$YEAR,
                GEN = .$GEN,
                DATE = seq(.$DATE_0, .$DATE_1, by = "days"))) %>%
  left_join(data.weather, by = c("LAT", "LON", "YEAR", "DATE")) %>%
  summarise(mean_TMAX = mean(TMAX))
Result: mean_TMAX comes out as 26.25, 26.90, 27.20, and 27.60 for the four location-year rows, matching the expected output above.
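Alternatively, the original attempt can be repaired by also restricting the subset to the matching LAT/LON (a sketch of the same idea, row by row):
tmax.calculation <- data.locs %>%
  rowwise() %>%
  mutate(mean_TMAX = mean(data.weather$TMAX[data.weather$LAT == LAT &
                                            data.weather$LON == LON &
                                            data.weather$DATE %within% interval(DATE_0, DATE_1)])) %>%
  ungroup()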

Convert HMM/HHMM time column to timestamp in R

I am new here please be gentle ;)
I have two time columns in a dataframe in R that use the HMM/HHMM format as a numeric. For example, 03:13 would be 313 and 14:14 would be 1414. An example would be sched_arr_time and sched_dep_time in the nycflights13 package.
I need to calculate the time difference in minutes. My SQL instinct would be to substring this with a CASE WHEN and glue it back together as a time format somehow, but I was hoping there is a more elegant way to deal with this in R.
Many thanks for your help!
This would explain the data:
library(nycflights13)
library(dplyr)
flights %>% select(sched_dep_time, sched_arr_time)
We can convert to a time class with as.ITime after changing the format to HH:MM with str_pad and str_replace, and then take the difference using difftime:
library(dplyr)
library(stringr)
library(data.table)
flights %>%
  head %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate_all(~ str_pad(., width = 4, pad = 0) %>%
               str_replace('^(..)', '\\1:') %>%
               as.ITime) %>%
  mutate(diff = difftime(sched_arr_time, sched_dep_time, unit = 'min'))
# A tibble: 6 x 3
# sched_dep_time sched_arr_time diff
# <ITime> <ITime> <drtn>
#1 05:15:00 08:19:00 184 mins
#2 05:29:00 08:30:00 181 mins
#3 05:40:00 08:50:00 190 mins
#4 05:45:00 10:22:00 277 mins
#5 06:00:00 08:37:00 157 mins
#6 05:58:00 07:28:00 90 mins
If we want to add a 'Date' as well, we can paste in the current date and parse with ymd_hms:
library(lubridate)
flights %>%
  head %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate_all(~ str_pad(., width = 4, pad = 0) %>%
               str_replace("^(..)(..)", "\\1:\\2:00") %>%
               str_c(Sys.Date(), ., sep = ' ') %>%
               ymd_hms) %>%
  mutate(diff = difftime(sched_arr_time, sched_dep_time, unit = 'min'))
Here is another option using strptime:
as_time <- function(x) {
  as.POSIXct(strptime(if_else(nchar(x) == 3, paste0("0", x), as.character(x)), "%H%M"))
}

flights %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate(diff_in_mins = difftime(as_time(sched_arr_time), as_time(sched_dep_time), units = "mins"))
## A tibble: 336,776 x 3
# sched_dep_time sched_arr_time diff_in_mins
# <int> <int> <drtn>
# 1 515 819 184 mins
# 2 529 830 181 mins
# 3 540 850 190 mins
# 4 545 1022 277 mins
# 5 600 837 157 mins
# 6 558 728 90 mins
# 7 600 854 174 mins
# 8 600 723 83 mins
# 9 600 846 166 mins
#10 600 745 105 mins
## … with 336,766 more rows
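Since the columns are plain integers, another option (a sketch; to_minutes is a hypothetical helper) is pure arithmetic, skipping string handling altogether:
library(dplyr)
library(nycflights13)

# convert an HMM/HHMM integer to minutes after midnight, e.g. 313 -> 193, 1414 -> 854
to_minutes <- function(x) (x %/% 100) * 60 + x %% 100

flights %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate(diff_in_mins = to_minutes(sched_arr_time) - to_minutes(sched_dep_time))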

How to find runs of a given length in a series of data?

I'm trying to study times in which flow was operating at a given level. I would like to find when flows were above a given level for 4 or more hours. How would I go about doing this?
Sample code:
Date <- format(seq(as.POSIXct("2014-01-01 01:00"), as.POSIXct("2015-01-01 00:00"), by = "hour"),
               "%Y-%m-%d %H:%M", usetz = FALSE)
Flow <- runif(8760, 0, 2300)

IsHigh <- function(x) {
  if (x < 1600) return(0)
  if (1600 <= x) return(1)
}

isHighFlow <- unlist(lapply(Flow, IsHigh))
df <- data.frame(Date, Flow, isHighFlow)
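As an aside, the IsHigh()/lapply() step can be collapsed into a single vectorized comparison (a sketch):
isHighFlow <- as.integer(Flow >= 1600)
df <- data.frame(Date, Flow, isHighFlow)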
I was asked to edit my question to supply what I would like to see as output.
I would like to see a data frame such as the one below. The only issue is that hoursHighFlow is incorrect; I'm not sure how to fix the code to generate the correct hoursHighFlow.
library(dplyr)
library(sqldf)

temp <- df %>%
  mutate(highFlowInterval = cumsum(isHighFlow == 1)) %>%
  group_by(highFlowInterval) %>%
  summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)))

# Then join the two tables together.
temp2 <- sqldf("SELECT *
               FROM temp LEFT JOIN df
               ON df.Date BETWEEN temp.minDate AND temp.maxDate")
I am then able to use subset to select stretches running at a high flow rate:
t <- subset(temp2, isHighFlow == 1)
t <- subset(t, hoursHighFlow >= 4)
Put it in a data.table:
require(data.table)
DT <- data.table(df)
Mark runs and lengths:
DT[, `:=`(r = .GRP, rlen = .N), by = {r <- rle(isHighFlow); rep(1:length(r[[1]]), r$lengths)}]
Subset to runs of 4 or more:
DT[rlen >= 4L]
How it works:
New columns are created in the second argument of DT[i,j,by] with :=.
.GRP and .N are special variables for, respectively, the index and size of the by group.
A data.table can be subset simply with DT[i], unlike a data.frame.
Apart from subsetting, most of what works with a data.frame works the same on a data.table.
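Recent data.table versions also provide rleid(), which assigns each run its own group id directly; a sketch of the same idea:
library(data.table)
DT <- data.table(df)
# one group id per run of identical isHighFlow values, then keep high-flow runs of 4+
DT[, rlen := .N, by = rleid(isHighFlow)][isHighFlow == 1 & rlen >= 4L]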
Here is a solution using the dplyr package:
df %>%
  mutate(interval = cumsum(isHighFlow != lag(isHighFlow, default = 0))) %>%
  group_by(interval) %>%
  summarise(hoursHighFlow = n(),
            minDate = min(as.character(Date)),
            maxDate = max(as.character(Date)),
            isHighFlow = mean(isHighFlow)) %>%
  filter(hoursHighFlow >= 4, isHighFlow == 1)
Result:
interval hoursHighFlow minDate maxDate isHighFlow
1 25 4 2014-01-03 07:00 2014-01-03 10:00 1
2 117 4 2014-01-12 01:00 2014-01-12 04:00 1
3 245 6 2014-01-23 13:00 2014-01-23 18:00 1
4 401 6 2014-02-07 03:00 2014-02-07 08:00 1
5 437 5 2014-02-11 02:00 2014-02-11 06:00 1
6 441 4 2014-02-11 21:00 2014-02-12 00:00 1
7 459 4 2014-02-13 09:00 2014-02-13 12:00 1
8 487 4 2014-02-16 03:00 2014-02-16 06:00 1
9 539 7 2014-02-21 08:00 2014-02-21 14:00 1
10 567 4 2014-02-24 11:00 2014-02-24 14:00 1
.. ... ... ... ... ...
As Frank notes, you could achieve the same result by using rle to set the intervals, replacing the mutate line with:
mutate(interval = rep(1:length(rle(df$isHighFlow)[[2]]), rle(df$isHighFlow)[[1]])) %>%

Grouping every n minutes with dplyr

I have a dataset containing 10 events occurring at certain times on a given day, with a corresponding value for each event:
d1 <- data.frame(date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30",
"21/05/2010 09:21:00", "21/05/2010 22:29:50", "21/05/2010 11:27:34",
"21/05/2010 18:25:14", "21/05/2010 15:16:01", "21/05/2010 09:41:53",
"21/05/2010 15:01:29", "21/05/2010 09:02:06"), format ="%d/%m/%Y %H:%M:%S"),
value = c(11313,42423,64645,643426,1313313,1313,3535,6476,11313,9875))
I want to aggregate the results every 3 minutes, in a standard dataframe format (from "21/05/2010 00:00:00" to "21/05/2010 23:57:00", so that the dataframe has 480 bins of 3 minutes each)
First, I create a dataframe containing bins of 3 minutes each:
d2 <- data.frame(date = seq(as.POSIXct("2010-05-21 00:00:00"),
by="3 min", length.out=(1440/3)))
Then, I merge the two dataframes together and remove NAs:
library(dplyr)
m <- merge(d1, d2, all=TRUE) %>% mutate(value = ifelse(is.na(value),0,value))
Finally, I use period.apply() from the xts package to sum the values for each bin:
library(xts)
a <- period.apply(m$value, endpoints(m$date, "minutes", 3), sum)
Is there a more efficient way to do this? It does not feel optimal.
Update #1
I adjusted my code after Joshua's answer:
library(xts)
startpoints <- function(x, on = "months", k = 1) {
  head(endpoints(x, on, k) + 1, -1)
}
m <- seq(as.POSIXct("2010-05-21 00:00:00"), by = "3 min", length.out = 1440/3)
x <- merge(value = xts(d1$value, d1$date), xts(, m))
y <- period.apply(x, c(0, startpoints(x, "minutes", 3)), sum, na.rm = TRUE)
I wasn't aware that na.rm = TRUE could be used with period.apply(), which now allows me to skip mutate(value = ifelse(is.na(value), 0, value)). It's a step forward, and I'm actually pleased with the xts approach here, but I would like to know if there is a pure dplyr solution I could use in such a situation.
Update #2
After trying Khashaa's answer, the bins were off because my timezone was not specified. So I had:
> tail(d4)
interval sumvalue
476 2010-05-21 23:45:00 NA
477 2010-05-21 23:48:00 NA
478 2010-05-21 23:51:00 NA
479 2010-05-21 23:54:00 NA
480 2010-05-21 23:57:00 11313
481 2010-05-22 02:27:00 643426
> d4[450,]
interval sumvalue
450 2010-05-21 22:27:00 NA
Now, after Sys.setenv(TZ="UTC"), it all works fine.
A lubridate/dplyr-esque solution:
library(lubridate)
library(dplyr)
d2 <- data.frame(interval = seq(ymd_hms('2010-05-21 00:00:00'), by = '3 min', length.out = (1440/3)))
d3 <- d1 %>%
  mutate(interval = floor_date(date, unit = "hour") + minutes(floor(minute(date)/3)*3)) %>%
  group_by(interval) %>%
  mutate(sumvalue = sum(value)) %>%
  select(interval, sumvalue)
d4 <- merge(d2,d3, all=TRUE) # better if left_join is used
tail(d4)
# interval sumvalue
#475 2010-05-21 23:42:00 NA
#476 2010-05-21 23:45:00 NA
#477 2010-05-21 23:48:00 NA
#478 2010-05-21 23:51:00 NA
#479 2010-05-21 23:54:00 NA
#480 2010-05-21 23:57:00 NA
d4[450,]
# interval sumvalue
#450 2010-05-21 22:27:00 643426
If you are comfortable working with Date (I am not), you can dispense with lubridate, and replace the final merge with left_join.
If you need to group data into n-minute bins, the floor_date function allows a multiple of a unit to be specified within its unit argument. For example:
library(lubridate)
x <- ymd_hms("2009-08-03 12:25:59.23")
floor_date(x, unit = "3minutes")
"2009-08-03 12:24:00 UTC"
Using your example:
library(lubridate)
library(tidyverse)
# make complete time sequence
d2 <- data.frame(timePeriod = seq(as.POSIXct("2010-05-21 00:00:00"),
by="3 min", length.out=(1440/3)))
d1 %>%
  mutate(timePeriod = floor_date(date, "3minutes")) %>%
  group_by(timePeriod) %>%
  summarise(sum = sum(value)) %>%
  right_join(d2)
I'm not sure about a dplyr solution, but here's an xts solution:
startpoints <- function(x, on = "months", k = 1) {
  head(endpoints(x, on, k) + 1, -1)
}

m3 <- seq(as.POSIXct("2010-05-21 00:00:00"),
          by = "3 min", length.out = 1440/3)
x <- merge(value = xts(d1$value, d1$date), xts(, m3))
y <- period.apply(x, c(0, startpoints(x, "minutes", 3)), sum, na.rm = TRUE)
Update: Here's another xts solution that is a bit more careful about correctly aligning the aggregated values. Not to suggest the prior solution was wrong, but this one is easier to follow and to repeat in other analyses.
m3 <- seq(as.POSIXct("2010-05-20 23:59:59.999"),
          by = "3 min", length.out = 1440/3)
x <- merge(value = xts(d1$value, d1$date), xts(, m3))
y <- period.apply(x, endpoints(x, "minutes", 3), sum, na.rm = TRUE)
y <- align.time(y, 60*3)
Recently, the padr package has been developed which can also solve this in a clean way.
library(lubridate)
library(dplyr)
library(padr)
d1 <- data.frame(date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30",
"21/05/2010 09:21:00", "21/05/2010 22:29:50", "21/05/2010 11:27:34",
"21/05/2010 18:25:14", "21/05/2010 15:16:01", "21/05/2010 09:41:53",
"21/05/2010 15:01:29", "21/05/2010 09:02:06"), format ="%d/%m/%Y %H:%M:%S"),
value = c(11313,42423,64645,643426,1313313,1313,3535,6476,11313,9875))
res <- d1 %>%
  as_tibble() %>%
  arrange(date) %>%
  # Thicken the results to fall in 3-minute buckets
  thicken(
    interval = '3 min',
    start_val = as.POSIXct('2010-05-21 00:00:00'),
    colname = "date_pad") %>%
  # Pad the results to fill in the rest of the 3-minute buckets
  pad(
    interval = '3 min',
    by = 'date_pad',
    start_val = as.POSIXct('2010-05-21 00:00:00'),
    end_val = as.POSIXct('2010-05-21 23:57:00')) %>%
  select(date_pad, value)
res
res
#> # A tibble: 480 x 2
#> date_pad value
#> <dttm> <dbl>
#> 1 2010-05-21 00:00:00 NA
#> 2 2010-05-21 00:03:00 NA
#> 3 2010-05-21 00:06:00 NA
#> 4 2010-05-21 00:09:00 NA
#> 5 2010-05-21 00:12:00 NA
#> 6 2010-05-21 00:15:00 NA
#> 7 2010-05-21 00:18:00 NA
#> 8 2010-05-21 00:21:00 NA
#> 9 2010-05-21 00:24:00 NA
#> 10 2010-05-21 00:27:00 NA
#> # ... with 470 more rows
res[450,]
#> # A tibble: 1 x 2
#> date_pad value
#> <dttm> <dbl>
#> 1 2010-05-21 22:27:00 643426

How to subset data.frame by weeks and then sum?

Let's say I have several years' worth of data which looks like the following:
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
                 wday = wday(date),
                 wday.name = wday(date, label = TRUE, abbr = TRUE),
                 income = round(runif(21, 0, 100)),
                 week = format(date, format = "%Y-%U"),
                 stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the bit that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a numeric number indicating the week. (More precisely, as.numeric() on a Date gives the number of days since the epoch, 1970-01-01 by default, so dividing by 7 and truncating gives the number of whole weeks since the epoch.)
In code:
df$week <- as.Date("1970-01-01") + 7*trunc(as.numeric(df$date)/7)
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday (the epoch, 1970-01-01, was a Thursday, so as written the boundaries fall on Thursdays). You will have to insert an appropriate offset into the formula to shift them.
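For example, to shift the boundaries back to Sundays (a sketch; the offset of 4 days moves from the epoch's Thursday to the previous Sunday):
df$week <- as.Date("1970-01-01") + 7*trunc((as.numeric(df$date) + 4)/7) - 4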
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
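Note that mutate() keeps one row per day with the weekly total repeated, and cut.Date()'s weeks start on Monday rather than Sunday. To collapse to one row per week, swap in summarise() (a sketch):
df %>%
  group_by(week = cut(date, "week")) %>%
  summarise(weekly_income = sum(income))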
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep up with both the week number and the year, so I modified the answers there like so: format(date, format = "%y%U")
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
library(zoo)
rollapply(df$income, width = 7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
library(xts)
period.sum(xts(df$income, order.by = df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by = df$date),
                               which(df$wday %in% 7)),
           week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by #Andrie and #Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
As a pandas (Python) aside, the equivalent idea is to index by the date column and resample:
df.index = df['date']  # set the date column as the index
df.resample('W').sum()  # sum using resample
With dplyr:
df %>%
  arrange(date) %>%
  mutate(week = as.numeric(date - date[1]) %/% 7) %>%
  group_by(week) %>%
  summarise(weekincome = sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.
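A lubridate variant of the same idea (a sketch): floor_date() with unit = "week" floors to each week's Sunday by default, which matches the Sunday-through-Saturday requirement:
library(dplyr)
library(lubridate)

df %>%
  group_by(week = floor_date(date, unit = "week")) %>%
  summarise(weekincome = sum(income))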
