I have measurement data sampled every 2 seconds, and I would like to compute the mean and standard deviation of this time series over every 2-minute window. Any help will be appreciated.
Date        Times     Pos  Date and time        pressure  temp
01.01.2013  02:20:01  A    2013-01-01 02:20:25  .335      140.741
01.01.2013  02:20:02  A    2013-01-01 02:20:26  .091      140.741
01.01.2013  02:20:03  A    2013-01-01 02:20:26  .091      140.741
# example data
set.seed(1)
df <- data.frame(dates = sort(Sys.time() + sample(1:1000, size=100)),
                 values = rnorm(100, 100, 50))
# 2 minute groups
df$groups <- cut.POSIXt(df$dates, breaks="2 min")
# summary
require(plyr)
ddply(df, "groups", summarise, mean=mean(values), sd=sd(values))
# groups mean sd
# 1 2014-02-03 14:35:00 114.60027 55.67169
# 2 2014-02-03 14:37:00 107.16711 57.97990
# 3 2014-02-03 14:39:00 99.36876 45.03428
# 4 2014-02-03 14:41:00 111.37508 44.37829
# 5 2014-02-03 14:43:00 93.33474 46.33670
# 6 2014-02-03 14:45:00 108.71795 40.43259
# 7 2014-02-03 14:47:00 85.60400 29.38563
# 8 2014-02-03 14:49:00 83.57215 69.01886
# 9 2014-02-03 14:51:00 26.82735 12.52657
Edit:
With regards to your example data:
df <- read.table(sep=";", header=TRUE, stringsAsFactors=FALSE, text="
Date;Times;Pos;Date and time;pressure;temp
01.01.2013;02:20:01;A;2013-01-01 02:20:25;.335;140.741
01.01.2013;02:20:02;A;2013-01-01 02:20:26;.091;140.741
01.01.2013;02:20:03;A;2013-01-01 02:20:26;.091;140.741")
df$dates <- as.POSIXct(paste(df$Date, df$Times),
                       format="%d.%m.%Y %H:%M:%S")
df$groups <- cut.POSIXt(df$dates, breaks="2 sec")
require(plyr)
ddply(df, "groups", summarise,
mean_pressure=mean(pressure), sd_pressure=sd(pressure),
mean_temp=mean(temp), sd_temp=sd(temp))
# groups mean_pressure sd_pressure mean_temp sd_temp
# 1 2013-01-01 02:20:01 0.213 0.1725341 140.741 0
# 2 2013-01-01 02:20:03 0.091 NA 140.741 NA
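For the full dataset, which the question says is sampled every 2 seconds, the same recipe applies at the 2-minute scale originally asked about; only the break width changes (a sketch reusing the df parsed above):
df$groups <- cut.POSIXt(df$dates, breaks = "2 min")
ddply(df, "groups", summarise,
      mean_pressure = mean(pressure), sd_pressure = sd(pressure),
      mean_temp = mean(temp), sd_temp = sd(temp))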
I have this function to generate monthly ranges; it should handle years where February has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But gives error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
  seq(date, by = paste(n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
  starts <- seq(from = start,
                to = Sys.Date() - 1,
                by = "1 month")
  ends <- c((seq(from = add_months(start, 1),
                 to = end,
                 by = "1 month")) - 1,
            (Sys.Date() - 1))
  data.frame(starts, ends)
}
## usage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start-of-month (som) and end-of-month (eom) functions which take a Date object, a date string in standard Date format, or a yearmon object, and produce a Date object giving the start or end of its month.
Using those, create a monthly Date series s running from the start of the month of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case the function assumes the user wants the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
  if (inherits(to, "yearmon")) to <- eom(to)
  s <- seq(som(from), eom(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
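Since the question specifically mentions February, a quick check that eom() respects leap years:
eom("2023-02-15")
## [1] "2023-02-28"
eom("2024-02-15")
## [1] "2024-02-29"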
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
date_ranges3 <- function(from, to) {
  s <- seq(som(from), as.Date(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
You don't need to use seq twice: you can subtract one day from the first of each month to get the ends, so generate one start too many, then shift and subset:
make_date_ranges = function(start, end) {
  # format(end, "%Y-%m-01") truncates end to the first day of its month;
  # 32 days later is guaranteed to fall in the following month
  starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
  data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31
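A leap-year spot check (the concern raised in the question) also works out; February 2020 gets its 29th day:
make_date_ranges(as.Date("2020-01-01"), as.Date("2020-12-31"))[2, ]
#       starts       ends
# 2 2020-02-01 2020-02-29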
I am dealing with a huge dataset (years of 1-minute-interval observations of energy usage). I want to convert it from 1-minute intervals to 15-minute intervals.
I have written a for loop which does this successfully (tested on a small subset of the data); however, when I tried running it on the main data, it executed very slowly: it would have taken over 175 hours to run the full loop (I stopped it mid-execution).
The value to be converted to 15-minute intervals is the kWh usage, so the conversion simply requires taking the average of the first 15 observations, then the next 15, and so on. This is the loop that works:
# Opening the file
data <- read.csv("1.csv",colClasses="character",na.strings="?")
# Adding an index to each row
total <- nrow(data)
data$obsnum <- seq.int(nrow(data))
# Calculating 15 min kwH usage
data$use_15_min <- data$use
for (i in 1:total) {
  int_used <- floor((i - 1) / 15)   # index of the 15-minute block
  obsNum <- 15 * int_used           # offset of the block's first row
  sum <- 0
  for (j in 1:15) {
    usedIndex <- as.numeric(obsNum + j)
    sum <- as.numeric(data$use[usedIndex]) + sum
  }
  data$use_15_min[i] <- sum / 15    # block mean, repeated for each row
}
I have been searching for a function that can do the same without loops, as I imagine this would save much time, yet I haven't been able to find one. How can I achieve the same functionality without a loop?
Try data.table:
library(data.table)
DT <- data.table(data)
n <- nrow(DT)
DT[, use_15_min := mean(use), by = gl(n, 15, n)]
Note
The question is missing the input data so we used this:
data <- data.frame(use = 1:100)
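A base R equivalent of the same grouping idea, if you prefer to avoid the data.table dependency, is ave() over the same gl() groups (a sketch, assuming as above that the row count is a multiple of 15):
n <- nrow(data)
data$use_15_min <- ave(as.numeric(data$use), gl(n, 15, n)) # ave() defaults to mean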
A potential solution is to calculate the running mean (e.g. using TTR::runMean) and then select every 15th observation. Note that runMean(..., n = 15) at row i averages rows i-14 through i, so the rows to keep are 15, 30, 45, and so on. Here is an example:
df = data.frame(x = 1:100, y = runif(100))
df$runmean = TTR::runMean(df$y, n = 15)
df_15 = df[seq(15, nrow(df), 15), ]
I cannot test it, as I do not have your data, but perhaps:
total <- nrow(data)
data$use_15_min = TTR::runMean(as.numeric(data$use), n=15)
data_15_min = data[seq(15, nrow(data), 15), ]
I would use lubridate::floor_date to create the 15-minute groupings.
library(tidyverse)
library(lubridate)
df <- tibble(
  date = seq(ymd_hm("2019-01-01 00:00"), by = "min", length.out = 60 * 24 * 7),
  value = rnorm(n = 60 * 24 * 7)
)
df
#> # A tibble: 10,080 x 2
#> date value
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 0.182
#> 2 2019-01-01 00:01:00 0.616
#> 3 2019-01-01 00:02:00 -0.252
#> 4 2019-01-01 00:03:00 0.0726
#> 5 2019-01-01 00:04:00 -0.917
#> 6 2019-01-01 00:05:00 -1.78
#> 7 2019-01-01 00:06:00 -1.49
#> 8 2019-01-01 00:07:00 -0.818
#> 9 2019-01-01 00:08:00 0.275
#> 10 2019-01-01 00:09:00 1.26
#> # ... with 10,070 more rows
df %>%
  mutate(
    nearest_15_mins = floor_date(date, "15 mins")
  ) %>%
  group_by(nearest_15_mins) %>%
  summarise(
    avg_value_at_15_mins_int = mean(value)
  )
#> # A tibble: 672 x 2
#> nearest_15_mins avg_value_at_15_mins_int
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 -0.272
#> 2 2019-01-01 00:15:00 -0.129
#> 3 2019-01-01 00:30:00 0.173
#> 4 2019-01-01 00:45:00 -0.186
#> 5 2019-01-01 01:00:00 -0.188
#> 6 2019-01-01 01:15:00 0.104
#> 7 2019-01-01 01:30:00 -0.310
#> 8 2019-01-01 01:45:00 -0.173
#> 9 2019-01-01 02:00:00 0.0137
#> 10 2019-01-01 02:15:00 0.419
#> # ... with 662 more rows
I have the output of a water distribution model: inflow and discharge values of a river for every hour. I have done 5 model runs.
Reproducible example:
df <- data.frame(
  time      = rep(seq(from = as.POSIXct("2012-1-1 0:00", tz = "UTC"),
                      to   = as.POSIXct("2012-1-1 23:00", tz = "UTC"),
                      by   = "hour"), 5),
  run       = as.factor(rep(1:5, each = 24)),
  inflow    = rep(seq(1, 300, length.out = 24), 5),
  discharge = rep(seq(1, 180, length.out = 24), 5)
)
In reality, of course, the values vary between runs. (And I have a lot more data: 100 runs and hourly values spanning 35 years.)
First, I would like to calculate a water scarcity factor for every run, which means something like (1 - (discharge / inflow 6 hours before)), since the water needs 6 hours to run through the catchment:
scarcityfactor <- 1 - (discharge / lag(inflow, 6))
Then I want to calculate the mean, max and min of the scarcity factors over all runs (to find the highest, lowest and mean scarcity that could occur at every time step, according to the different model runs). So I would calculate a mean, max and min for every time step:
f1 <- function(x) c(Mean = (mean(x)), Max = (max(x)), Min = (min(x)))
results <- do.call(data.frame, aggregate(scarcityfactor ~ time,
                                         data = df,
                                         FUN = f1))
Can anybody help me with the code?
I believe this is what you want, if I understand the problem description correctly.
I'll use data.table:
library(data.table)
setDT(df)
# add scarcity_factor (group by run)
df[ , scarcity_factor := 1 - discharge/shift(inflow, 6L), by = run]
# group by time, excluding times for which the
# scarcity factor is missing
df[!is.na(scarcity_factor), by = time,
   .(min_scarcity = min(scarcity_factor),
     mean_scarcity = mean(scarcity_factor),
     max_scarcity = max(scarcity_factor))]
# time min_scarcity mean_scarcity max_scarcity
# 1: 2012-01-01 06:00:00 -46.695652174 -46.695652174 -46.695652174
# 2: 2012-01-01 07:00:00 -2.962732919 -2.962732919 -2.962732919
# 3: 2012-01-01 08:00:00 -1.342995169 -1.342995169 -1.342995169
# 4: 2012-01-01 09:00:00 -0.776086957 -0.776086957 -0.776086957
# 5: 2012-01-01 10:00:00 -0.487284660 -0.487284660 -0.487284660
# 6: 2012-01-01 11:00:00 -0.312252964 -0.312252964 -0.312252964
# 7: 2012-01-01 12:00:00 -0.194826637 -0.194826637 -0.194826637
# 8: 2012-01-01 13:00:00 -0.110586011 -0.110586011 -0.110586011
# 9: 2012-01-01 14:00:00 -0.047204969 -0.047204969 -0.047204969
# 10: 2012-01-01 15:00:00 0.002210759 0.002210759 0.002210759
# 11: 2012-01-01 16:00:00 0.041818785 0.041818785 0.041818785
# 12: 2012-01-01 17:00:00 0.074275362 0.074275362 0.074275362
# 13: 2012-01-01 18:00:00 0.101356965 0.101356965 0.101356965
# 14: 2012-01-01 19:00:00 0.124296675 0.124296675 0.124296675
# 15: 2012-01-01 20:00:00 0.143977192 0.143977192 0.143977192
# 16: 2012-01-01 21:00:00 0.161047028 0.161047028 0.161047028
# 17: 2012-01-01 22:00:00 0.175993343 0.175993343 0.175993343
# 18: 2012-01-01 23:00:00 0.189189189 0.189189189 0.189189189
You can be a tad more concise by lapplying over different aggregators:
df[!is.na(scarcity_factor), by = time,
   lapply(list(min, mean, max), function(f) f(scarcity_factor))]
Lastly you could think of this as reshaping with aggregation and use dcast:
dcast(df, time ~ ., value.var = 'scarcity_factor',
      fun.aggregate = list(min, mean, max))
(use df[!is.na(scarcity_factor)] in the first argument of dcast if you want to exclude the meaningless rows)
library(tidyverse)
df %>%
  group_by(run) %>%
  mutate(scarcityfactor = 1 - discharge / lag(inflow, 6)) %>%
  group_by(time) %>%
  summarise(Mean = mean(scarcityfactor),
            Max = max(scarcityfactor),
            Min = min(scarcityfactor))
# # A tibble: 24 x 4
# time Mean Max Min
# <dttm> <dbl> <dbl> <dbl>
# 1 2012-01-01 00:00:00 NA NA NA
# 2 2012-01-01 01:00:00 NA NA NA
# 3 2012-01-01 02:00:00 NA NA NA
# 4 2012-01-01 03:00:00 NA NA NA
# 5 2012-01-01 04:00:00 NA NA NA
# 6 2012-01-01 05:00:00 NA NA NA
# 7 2012-01-01 06:00:00 -46.7 -46.7 -46.7
# 8 2012-01-01 07:00:00 -2.96 -2.96 -2.96
# 9 2012-01-01 08:00:00 -1.34 -1.34 -1.34
#10 2012-01-01 09:00:00 -0.776 -0.776 -0.776
# # ... with 14 more rows
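If the leading NA rows (the first six hours, where no 6-hour-lagged inflow exists) should be dropped here as well, add a filter() before the second grouping, mirroring the !is.na() step in the data.table answer:
df %>%
  group_by(run) %>%
  mutate(scarcityfactor = 1 - discharge / lag(inflow, 6)) %>%
  filter(!is.na(scarcityfactor)) %>%
  group_by(time) %>%
  summarise(Mean = mean(scarcityfactor),
            Max = max(scarcityfactor),
            Min = min(scarcityfactor))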
I have a data.table, allData, containing data on roughly every (POSIXct) second from different nights. Some nights, however, fall on the same date, since data is collected from different people, so I have a column nightNo as an id for each distinct night.
timestamp nightNo data1 data2
2018-10-19 19:15:00 1 1 7
2018-10-19 19:15:01 1 2 8
2018-10-19 19:15:02 1 3 9
2018-10-19 18:10:22 2 4 10
2018-10-19 18:10:23 2 5 11
2018-10-19 18:10:24 2 6 12
I'd like to aggregate the data to minutes (per night) and using this question I've come up with the following code:
aggregate_minute <- function(df){
  df %>%
    group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
    summarise(data1 = mean(data1), data2 = mean(data2)) %>%
    as.data.table()
}
allData <- allData[, aggregate_minute(allData), by=nightNo]
However my data.table is quite large and this code isn't fast enough. Is there a more efficient way to solve this problem?
allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)),
                      nightNo = rep(1:2, c(3, 3)),
                      data1 = 1:6,
                      data2 = 7:12)
timestamp nightNo data1 data2
1: 2018-06-14 10:43:11 1 1 7
2: 2018-06-14 10:43:11 1 2 8
3: 2018-06-14 10:43:11 1 3 9
4: 2018-06-14 10:48:31 2 4 10
5: 2018-06-14 10:48:31 2 5 11
6: 2018-06-14 10:48:31 2 6 12
allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]
nightNo timestamp data1 data2
1: 1 2018-06-14 10:43:00 2 8
2: 2 2018-06-14 10:48:00 5 11
> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
user system elapsed
3.25 0.02 3.31
> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
user system elapsed
1.02 0.04 1.06
You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.
library(data.table)
library(lubridate)
Reproducible data:
text <- "timestamp nightNo data1 data2
'2018-10-19 19:15:00' 1 1 7
'2018-10-19 19:15:01' 1 2 8
'2018-10-19 19:15:02' 1 3 9
'2018-10-19 18:10:22' 2 4 10
'2018-10-19 18:10:23' 2 5 11
'2018-10-19 18:10:24' 2 6 12"
allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)
Create data.table:
setDT(allData)
Parse the timestamp and floor it to the minute:
allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
Change the type of the integer columns to numeric:
allData[, ':='(data1 = as.numeric(data1),
               data2 = as.numeric(data2))]
Replace the data columns with their means, grouped by night and by the floored minute (grouping by nightNo alone would average across a whole night if it spanned more than one minute):
allData[, ':='(data1 = mean(data1),
               data2 = mean(data2)),
        by = .(nightNo, timestamp)]
The result is:
timestamp nightNo data1 data2
1: 2018-10-19 19:15:00 1 2 8
2: 2018-10-19 19:15:00 1 2 8
3: 2018-10-19 19:15:00 1 2 8
4: 2018-10-19 18:10:00 2 5 11
5: 2018-10-19 18:10:00 2 5 11
6: 2018-10-19 18:10:00 2 5 11
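If you would rather get one aggregated row per night and minute than overwrite the columns in place, replace the last step with an aggregation (as in the earlier answer):
allData[, .(data1 = mean(data1), data2 = mean(data2)),
        by = .(nightNo, timestamp)]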
Let's say I have several years' worth of data which looks like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
                 wday = wday(date),
                 wday.name = wday(date, label = TRUE, abbr = TRUE),
                 income = round(runif(21, 0, 100)),
                 week = format(date, format="%Y-%U"),
                 stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
     week income
1 2010-52    413
2 2011-00     74
3 2011-01    387
4 2011-02    443
For more information on format.Date, see ?strptime, particularly the bit that defines %U as the week number.
EDIT:
Given the modified requirement, one way is to divide the date by 7 to get a number indicating the week. More precisely, a Date stores the number of days since the epoch (1970-01-01 by default), so integer division of that count by 7 gives the number of whole weeks since the epoch.
In code:
df$week <- as.Date("1970-01-01") + 7*(as.numeric(df$date) %/% 7)
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
        week income
1 2010-12-23    297
2 2010-12-30    386
3 2011-01-06    441
4 2011-01-13    193
Note that these week boundaries fall on Thursdays, because 1970-01-01 was a Thursday. To make them fall on Sundays, you will have to insert an appropriate offset into the formula.
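For instance, since 1970-01-04 was a Sunday, anchoring the division there makes each week run Sunday through Saturday (a sketch of that offset):
df$week <- as.Date("1970-01-04") + 7*((as.numeric(df$date) - 3) %/% 7)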
This is now simple using dplyr. I would also suggest cut(breaks = "week") rather than format() to cut the dates into weeks; note that cut() starts weeks on Monday by default (pass start.on.monday = FALSE for Sunday-based weeks).
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there to use format(date, format = "%y%U").
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
  weeknum suminc
1    1052    413
2    1100     74
3    1101    387
4    1102    443
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
                               which(df$wday %in% 7)),
           week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by #Andrie and #Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
The pandas (Python) equivalent is to make the date the index and resample:
df = df.set_index('date')         # use the date column as the DatetimeIndex
df['income'].resample('W').sum()  # weekly sums ('W-SAT' ends the weeks on Saturday)
With dplyr:
df %>%
  arrange(date) %>%
  mutate(week = as.numeric(date - date[1]) %/% 7) %>%
  group_by(week) %>%
  summarise(weekincome = sum(income))
Instead of date[1] you can use any date from which you want to start counting weeks.
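With this data, date[1] is 2010-12-26, which happens to be a Sunday, so the groups already run Sunday through Saturday and the result reproduces the hand-computed sums from the question:
# # A tibble: 3 x 2
#    week weekincome
#   <dbl>      <dbl>
# 1     0        487
# 2     1        387
# 3     2        443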