Aggregate daily level data to weekly level in R - r

I have a huge dataset similar to the following reproducible sample data.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
I want to aggregate this data to weekly level to get the output similar to the following:
Interval value
1 Week 2, June 2012 *aggregate value for day 10 to day 14 of June 2012*
2 Week 3, June 2012 *aggregate value for day 15 to day 21 of June 2012*
3 Week 4, June 2012 *aggregate value for day 22 to day 28 of June 2012*
4 Week 5, June 2012 *aggregate value for day 29 to day 30 of June 2012*
5 Week 1, July 2012 *aggregate value for day 1 to day 7 of July 2012*
6 Week 2, July 2012 *aggregate value for day 8 to day 10 of July 2012*
How can I achieve this easily, without writing a lot of code?

If you mean the sum of ‘value’ by week, I think the easiest way is to convert the data into an xts object, as GSee suggested:
library(xts)
data <- as.xts(data$value, order.by = as.Date(data$Interval))
weekly <- apply.weekly(data, sum)
[,1]
2012-06-10 552
2012-06-17 23629
2012-06-24 23872
2012-07-01 23667
2012-07-08 23552
2012-07-10 10902
I leave the formatting of the output as an exercise for you :-)
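For readers without xts, the same week-ending-Sunday sums can be sketched in base R. This is only a rough equivalent under stated assumptions: the data frame name d and the helper column week_ending are made up for illustration, and every group is labelled with the Sunday that closes its week, so the last partial week is labelled 2012-07-15 rather than the last observed date (2012-07-10) that apply.weekly reports.

```r
# Rebuild the sample data from the question (31 consecutive days).
d <- data.frame(
  Interval = seq(as.Date("2012-06-10"), as.Date("2012-07-10"), by = "day"),
  value = c(552, 4850, 4642, 4132, 4190, 4186, 1139,
            490, 5156, 4430, 4447, 4256, 3856, 1163,
            564, 4866, 4421, 4206, 4272, 3993, 1211,
            698, 5770, 5103, 775, 5140, 4868, 1225,
            671, 5726, 5176)
)

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, independent of locale.
wday <- as.POSIXlt(d$Interval)$wday

# Shift every date forward to the Sunday that closes its week
# (a Sunday stays where it is). week_ending is a hypothetical column name.
d$week_ending <- d$Interval + (7 - wday) %% 7

# Sum the values within each week.
weekly <- aggregate(value ~ week_ending, data = d, FUN = sum)
weekly
```

The sums should agree with the apply.weekly output above; only the label of the final, incomplete week differs.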

If you were to use week from lubridate, you would only get five weeks to pass to by. Assume dat is your data,
> library(lubridate)
> do.call(rbind, by(dat$value, week(dat$Interval), summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 24 552 4146 4188 3759 4529 4850
# 25 490 2498 4256 3396 4438 5156
# 26 564 2578 4206 3355 4346 4866
# 27 698 993 4868 3366 5122 5770
# 28 671 1086 3200 3200 5314 5726
This shows a summary for the 24th through 28th week of the year. Similarly, we can get the weekly means with aggregate:
> aggregate(value~week(Interval), data = dat, mean)
# week(Interval) value
# 1 24 3758.667
# 2 25 3396.286
# 3 26 3355.000
# 4 27 3366.429
# 5 28 3199.500

I just came across this old question because it was used as a dupe target.
Unfortunately, all the upvoted answers (except the one by konvas and a now deleted one) present solutions for aggregating the data by week of the year while the OP has requested to aggregate by week of the month.
The definition of week of the year and week of the month is ambiguous as discussed here, here, and here.
However, the OP has indicated that he wants to count the days 1 to 7 of each month as week 1 of the month, days 8 to 14 as week 2 of the month, etc. Note that week 5 is a stub for most of the months consisting of only 2 or 3 days (except for the month of February if no leap year).
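The rule just described (days 1 to 7 are week 1, days 8 to 14 are week 2, and so on) boils down to an integer division of the day of the month. A minimal base R sketch, where week_of_month is a hypothetical helper name and not part of any package:

```r
# Week of the month: days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-28 -> 4, 29-31 -> 5.
# week_of_month is an illustrative name, not a library function.
week_of_month <- function(date) {
  day <- as.integer(format(date, "%d"))  # day of the month, 1..31
  (day - 1L) %/% 7L + 1L
}

dates <- as.Date(c("2012-06-10", "2012-06-14", "2012-06-15",
                   "2012-06-30", "2012-07-01"))
wom <- week_of_month(dates)
wom
# [1] 2 2 3 5 1
```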
Having prepared the ground, here is a data.table solution for this kind of aggregation:
library(data.table)
DT[, .(value = sum(value)),
by = .(Interval = sprintf("Week %i, %s",
(mday(Interval) - 1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Interval value
1: Week 2, Jun 2012 18366
2: Week 3, Jun 2012 24104
3: Week 4, Jun 2012 23348
4: Week 5, Jun 2012 5204
5: Week 1, Jul 2012 23579
6: Week 2, Jul 2012 11573
We can verify that we have picked the correct intervals by
DT[, .(value = sum(value),
date_range = toString(range(Interval))),
by = .(Week = sprintf("Week %i, %s",
(mday(Interval) -1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Week value date_range
1: Week 2, Jun 2012 18366 2012-06-10, 2012-06-14
2: Week 3, Jun 2012 24104 2012-06-15, 2012-06-21
3: Week 4, Jun 2012 23348 2012-06-22, 2012-06-28
4: Week 5, Jun 2012 5204 2012-06-29, 2012-06-30
5: Week 1, Jul 2012 23579 2012-07-01, 2012-07-07
6: Week 2, Jul 2012 11573 2012-07-08, 2012-07-10
which is in line with OP's specification.
Data
library(data.table)
DT <- fread(
"rn Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176", drop = 1L)
DT[, Interval := as.Date(Interval)]

If you are using a data frame, you can easily do this with the tidyquant package. Use the tq_transmute function, which applies a mutation and returns a new data frame. Select the "value" column and apply the xts function apply.weekly. The additional argument FUN = sum will get the aggregate by week.
library(tidyquant)
df
#> # A tibble: 31 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-11 4850
#> 3 2012-06-12 4642
#> 4 2012-06-13 4132
#> 5 2012-06-14 4190
#> 6 2012-06-15 4186
#> 7 2012-06-16 1139
#> 8 2012-06-17 490
#> 9 2012-06-18 5156
#> 10 2012-06-19 4430
#> # ... with 21 more rows
df %>%
tq_transmute(select = value,
mutate_fun = apply.weekly,
FUN = sum)
#> # A tibble: 6 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-17 23629
#> 3 2012-06-24 23872
#> 4 2012-07-01 23667
#> 5 2012-07-08 23552
#> 6 2012-07-10 10902

When you say "aggregate" the values, do you mean take their sum? Say your data frame is d; assuming d$Interval is of class Date, you can try
# if d$Interval is not of class Date, convert it first:
# d$Interval <- as.Date(d$Interval)
formatdate <- function(date)
paste0("Week ", (as.numeric(format(date, "%d")) - 1) %/% 7 + 1,
", ", format(date, "%b %Y"))
# change "sum" to your required function
aggregate(d$value, by = list(formatdate(d$Interval)), sum)
# Group.1 x
# 1 Week 1, Jul 2012 23579
# 2 Week 2, Jul 2012 11573
# 3 Week 2, Jun 2012 18366
# 4 Week 3, Jun 2012 24104
# 5 Week 4, Jun 2012 23348
# 6 Week 5, Jun 2012 5204

Related

Mutate Year with Month Column for a Time Series Data Input in R Using Lubridate Package

I have this time series data frame as follows:
df <- read.table(text =
"Year Month Value
2021 1 4
2021 2 11
2021 3 18
2021 4 6
2021 5 20
2021 6 5
2021 7 12
2021 8 4
2021 9 11
2021 10 18
2021 11 6
2021 12 20
2022 1 14
2022 2 11
2022 3 18
2022 4 9
2022 5 22
2022 6 19
2022 7 22
2022 8 24
2022 9 17
2022 10 28
2022 11 16
2022 12 26",
header = TRUE)
I want to turn this data frame into a time series object with only a date column and a value column, so that I can use the ts function to set the start and end points, like ts(ts, start = starts, frequency = 12). R should know that 2022 is a year and that the corresponding 1:12 are its months; the same should apply to 2021. I would prefer the lubridate package.
pacman::p_load(
dplyr,
lubridate)
UPDATE
I now use the unite function from the tidyr package.
df|>
unite(col='date', c('Year', 'Month'), sep='')
Perhaps this?
df |>
tidyr::unite(col='date', c('Year', 'Month'), sep='-') |>
mutate(date = lubridate::ym(date))
# date Value
# 1 2021-01-01 4
# 2 2021-02-01 11
# 3 2021-03-01 18
# 4 2021-04-01 6
# 5 2021-05-01 20
# 6 2021-06-01 5
# 7 2021-07-01 12
# 8 2021-08-01 4
# 9 2021-09-01 11
# 10 2021-10-01 18
# 11 2021-11-01 6
# 12 2021-12-01 20
# 13 2022-01-01 14
# 14 2022-02-01 11
# 15 2022-03-01 18
# 16 2022-04-01 9
# 17 2022-05-01 22
# 18 2022-06-01 19
# 19 2022-07-01 22
# 20 2022-08-01 24
# 21 2022-09-01 17
# 22 2022-10-01 28
# 23 2022-11-01 16
# 24 2022-12-01 26
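Since the question ultimately asks for something usable with ts, here is a minimal base R sketch that skips the date column altogether: ts only needs the values, the starting (year, month) pair and frequency = 12. The names value_ts and first_half_2022 are illustrative assumptions, and the rows are assumed to be one per month with no gaps.

```r
# Rebuild the question's data frame.
df <- data.frame(
  Year  = rep(2021:2022, each = 12),
  Month = rep(1:12, times = 2),
  Value = c(4, 11, 18, 6, 20, 5, 12, 4, 11, 18, 6, 20,
            14, 11, 18, 9, 22, 19, 22, 24, 17, 28, 16, 26)
)

# ts() relies on row order, so sort chronologically first.
df <- df[order(df$Year, df$Month), ]

# Monthly series starting at the first (year, month) in the data.
value_ts <- ts(df$Value, start = c(df$Year[1], df$Month[1]), frequency = 12)

# window() then selects a sub-period by (year, month) start and end points.
first_half_2022 <- window(value_ts, start = c(2022, 1), end = c(2022, 6))
first_half_2022
```

If a Date column is still wanted for plotting or joining, the unite/ym approach above provides it; the two are complementary.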

Aggregate on a daily basis in R

I'm borrowing the reproducible example given here:
Aggregate daily level data to weekly level in R
since it's pretty close to what I want to do.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
That question asks how to aggregate by weekly intervals; what I'd like to do is aggregate on a "day of the week" basis.
So I'd like to have a table similar to this one, with the values for each day of the week added up:
Day of the week value
1 "Sunday" 60000
2 "Monday" 50000
3 "Tuesday" 60000
4 "Wednesday" 50000
5 "Thursday" 60000
6 "Friday" 50000
7 "Saturday" 60000
You can try:
aggregate(d$value, list(weekdays(as.Date(d$Interval))), sum)
We can group them by day of the week using weekdays:
library(dplyr)
df %>%
group_by(Day_Of_The_Week = weekdays(as.Date(Interval))) %>%
summarise(value = sum(value))
# Day_Of_The_Week value
# <chr> <int>
#1 Friday 16903
#2 Monday 26368
#3 Saturday 4738
#4 Sunday 2975
#5 Thursday 17858
#6 Tuesday 23772
#7 Wednesday 13560
We can do this with data.table
library(data.table)
setDT(df1)[, .(value = sum(value)), .(Dayofweek = weekdays(as.Date(Interval)))]
# Dayofweek value
#1: Sunday 2975
#2: Monday 26368
#3: Tuesday 23772
#4: Wednesday 13560
#5: Thursday 17858
#6: Friday 16903
#7: Saturday 4738
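Using lubridate (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html):
library(lubridate)
library(data.table)
# data.table also exports a wday() without a label argument, so call lubridate's explicitly
df1$Weekday <- lubridate::wday(as.Date(df1$Interval), label = TRUE)
df1 <- data.table(df1)
df1[, sum(value), Weekday]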
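One caveat with weekdays() is that the names depend on the locale and, as plain character values, sort alphabetically. A base R sketch that avoids both is given below; it assumes English day names are wanted, and d, day_names and by_day are illustrative names only.

```r
# Rebuild the sample data (31 consecutive days starting on a Sunday).
d <- data.frame(
  Interval = seq(as.Date("2012-06-10"), as.Date("2012-07-10"), by = "day"),
  value = c(552, 4850, 4642, 4132, 4190, 4186, 1139,
            490, 5156, 4430, 4447, 4256, 3856, 1163,
            564, 4866, 4421, 4206, 4272, 3993, 1211,
            698, 5770, 5103, 775, 5140, 4868, 1225,
            671, 5726, 5176)
)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, whatever the locale.
# Storing the result as a factor with explicit levels fixes the row order.
d$day <- factor(day_names[as.POSIXlt(d$Interval)$wday + 1L], levels = day_names)

by_day <- aggregate(value ~ day, data = d, FUN = sum)
by_day
```

The rows come out Sunday through Saturday, with the same totals as the dplyr and data.table answers above.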

insert new rows to the time series data, with date added automatically

I have a time-series data frame that looks like:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add the corresponding date to each row? The dates simply continue sequentially from the last row.
Is there a package that can do this automatically, so that all I have to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
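If this needs to be done repeatedly, the two steps can be wrapped in a small helper. The following is a sketch only: append_daily is a made-up name, and it assumes the columns are called date and x, as in the play data, with one row per day.

```r
# Append new observations, dating them on the days that follow the last row.
# append_daily is a hypothetical helper; the column names date and x are assumed.
append_daily <- function(df, new_x) {
  if (length(new_x) == 0L) return(df)
  new_dates <- seq(max(df$date) + 1, by = "day", length.out = length(new_x))
  rbind(df, data.frame(date = new_dates, x = new_x))
}

# The series from the question, then the two new points.
df <- data.frame(
  date = seq(as.Date("2015-09-01"), as.Date("2015-09-12"), by = "day"),
  x = c(361656.7, 370086.4, 346571.2, 316616.9, 342271.8, 361548.2,
        342609.2, 281868.8, 297011.1, 295160.5, 287926.9, 323365.8)
)
df <- append_daily(df, c(320123.5, 323521.7))
tail(df, 3)
```

The new rows are dated 2015-09-13 and 2015-09-14, continuing from the last existing date.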

Reshape and mean calculation

I have climatic data which have been collected during a whole year along an altitude gradient. Shaped like that:
clim <- read.table(text="alti year month week day meanTemp maxTemp minTemp
350 2011 aug. 31 213 10 14 6
350 2011 aug. 31 214 12 18 6
350 2011 aug. 31 215 10 11 9
550 2011 aug. 31 213 8 10 6
550 2011 aug. 31 214 10 12 8
550 2011 aug. 31 215 8 9 7
350 2011 sep. 31 244 9 10 8
350 2011 sep. 31 245 11 12 10
350 2011 sep. 31 246 10 11 9
550 2011 sep. 31 244 7.5 9 6
550 2011 sep. 31 245 8 10 6
550 2011 sep. 31 246 8.5 9 8", header=TRUE)
and I am trying to reshape this data in order to have only one row per altitude and to calculate the mean for each month and for the whole year. It would be great if it could be shaped like this:
alti mean_year(meanTemp) mean_year(maxTemp) mean_aug.(meanTemp) mean_aug.(maxTemp) mean_sep.(meanTemp) [...]
350 10.333 12.667 10.667 14.3 10 ...
550 8.333 9.833 8.667 10.333 7.766 ...
Any idea to perform this reshaping & calculation?
You can use data.table and dcast:
library(data.table)
setDT(clim)
merge(
clim[, list("mean_temp_mean_year" = mean(meanTemp), "max_temp_mean_year" = mean(maxTemp)), by = alti]
,
dcast(clim[, list("mean_temp_mean" = mean(meanTemp), "max_temp_mean" = mean(maxTemp)), by = c("alti","month")], alti ~ month, value.var = c("mean_temp_mean","max_temp_mean"))
,
by = "alti")
I've switched the names of some of the variables, and your column order is not perfect, but they can be reordered/renamed afterwards.
To get the means of the months or years, you can use aggregate followed by reshape.
The two aggregates can be computed separately, and then merge puts them together:
mon <- aggregate(cbind(meanTemp, maxTemp) ~ month + alti, data=clim, FUN=mean)
mon.wide <- reshape(mon, direction='wide', timevar='month', idvar='alti')
yr <- aggregate(cbind(meanTemp, maxTemp) ~ year + alti, data=clim, FUN=mean)
yr.wide <- reshape(yr, direction='wide', timevar='year', idvar='alti')
Each of these .wide sets has the data that you want. The only common column is alti, so we take the merge defaults:
merge(mon.wide, yr.wide)
## alti meanTemp.aug. maxTemp.aug. meanTemp.sep. maxTemp.sep. meanTemp.2011 maxTemp.2011
## 1 350 10.666667 14.33333 10 11.000000 10.333333 12.666667
## 2 550 8.666667 10.33333 8 9.333333 8.333333 9.833333
Here's another variation of data.table solution, but this requires the current devel version, v1.9.5:
require(data.table) # v1.9.5+
setDT(clim)
form = paste("alti", c("year", "month"), sep=" ~ ")
val = c("meanTemp", "maxTemp")
ans = lapply(form, function(x) dcast(clim, x, mean, value.var = val))
Reduce(function(x, y) x[y, on="alti"], ans)
# alti meanTemp_mean_2011 maxTemp_mean_2011 meanTemp_mean_aug. meanTemp_mean_sep. maxTemp_mean_aug. maxTemp_mean_sep.
# 1: 350 10.333333 12.666667 10.666667 10 14.33333 11.000000
# 2: 550 8.333333 9.833333 8.666667 8 10.33333 9.333333

How to make a bar graph in ggplot that groups months of different years

My dataframe, df:
df
EffYr EffMo count dts
2 2012 1 1 2012-01-01
3 2012 2 3 2012-02-01
4 2012 3 1 2012-03-01
5 2012 5 1 2012-05-01
6 2012 6 1 2012-06-01
7 2012 7 2 2012-07-01
8 2012 8 11 2012-08-01
9 2012 9 84 2012-09-01
10 2012 10 184 2012-10-01
11 2012 11 165 2012-11-01
12 2012 12 246 2012-12-01
13 2013 1 414 2013-01-01
14 2013 2 130 2013-02-01
15 2013 3 182 2013-03-01
16 2013 4 261 2013-04-01
17 2013 5 229 2013-05-01
18 2013 6 249 2013-06-01
19 2013 7 330 2013-07-01
20 2013 8 135 2013-08-01
Each row of df represents a "month-year", the earliest being Jan 2012 and the latest being Aug 2013. I want to plot a bar graph (using ggplot2) where each bar represents a row of df with the bar height equal to the row's count. So, I should have 24 bars in total.
I want my x axis to be divided into 12 intervals: Jan-Dec, and bars that represent the same calendar month should lie in the same "month interval". For example, if df has a row for Jan 2011, Jan 2012, Jan 2013, then the Jan portion of my graph should have 3 bars so that I can compare my business's performance in the month of January for subsequent years.
Thanks
Edit: I want something that looks like
ggplot(diamonds, aes(cut, fill=cut)) + geom_bar() +
facet_grid(. ~ clarity)
But broken down by month. I tried to modify that code to fit my data, but never could get it right.
@Ben, you're asking a number of ggplot2 questions. I would recommend you sit down with some good ggplot2 resources and try the examples to become more skilled. Here are 2 excellent resources I use often:
http://docs.ggplot2.org/current/
http://www.cookbook-r.com/Graphs/
Now the solution I think you're after:
## dat <- read.table(text=" EffYr EffMo count dts
## 2 2012 1 1 2012-01-01
## 3 2012 2 3 2012-02-01
## 4 2012 3 1 2012-03-01
## 5 2012 5 1 2012-05-01
## 6 2012 6 1 2012-06-01
## 7 2012 7 2 2012-07-01
## 8 2012 8 11 2012-08-01
## 9 2012 9 84 2012-09-01
## 10 2012 10 184 2012-10-01
## 11 2012 11 165 2012-11-01
## 12 2012 12 246 2012-12-01
## 13 2013 1 414 2013-01-01
## 14 2013 2 130 2013-02-01
## 15 2013 3 182 2013-03-01
## 16 2013 4 261 2013-04-01
## 17 2013 5 229 2013-05-01
## 18 2013 6 249 2013-06-01
## 19 2013 7 330 2013-07-01
## 20 2013 8 135 2013-08-01", header=TRUE)
library(ggplot2)
# dat is the question's data frame (see the commented-out read.table call above)
dat$month <- factor(month.name[dat$EffMo], levels = month.name)
dat$year <- as.factor(dat$EffYr)
ggplot(dat, aes(month, fill=year)) + geom_bar(aes(weight=count), position="dodge")

Resources