So I have values like
Mon 162 Tue 123 Wed 29
and so on. I need to find the average across all weekdays in R. I have tried filter() and group_by() but cannot get an answer.
Time Day Count Speed
1 00:00 Sun 169 60.2
2 00:00 Mon 71 58.5
3 00:00 Tue 70 57.2
4 00:00 Wed 68 58.5
5 00:00 Thu 91 58.8
6 00:00 Fri 94 58.7
7 00:00 Sat 135 58.5
8 01:00 Sun 111 60.0
9 01:00 Mon 45 59.2
10 01:00 Tue 50 57.6
I need the outcome to be Weekday Average = ####
Let's say your df is
> df
# A tibble: 14 x 2
Day Count
<chr> <dbl>
1 Sun 31
2 Mon 51
3 Tue 21
4 Wed 61
5 Thu 31
6 Fri 51
7 Sat 65
8 Sun 31
9 Mon 13
10 Tue 61
11 Wed 72
12 Thu 46
13 Fri 62
14 Sat 13
You can use
df %>%
filter(!Day %in% c('Sun', 'Sat')) %>%
group_by(Day) %>%
summarize(mean(Count))
To get
# A tibble: 5 x 2
Day `mean(Count)`
<chr> <dbl>
1 Fri 56.5
2 Mon 32
3 Thu 38.5
4 Tue 41
5 Wed 66.5
For the average of all filtered values
df %>%
filter(!Day %in% c("Sun", "Sat")) %>%
summarize("Average of all Weekday counts" = mean(Count))
Output
# A tibble: 1 x 1
`Average of all Weekday counts`
<dbl>
1 46.9
To get a numeric value instead of a tibble
df %>%
filter(!Day %in% c("Sun", "Sat")) %>%
summarize("Average of all Weekday counts" = mean(Count)) %>%
as.numeric()
Output
[1] 46.9
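If you prefer ending the pipe with a bare number, dplyr::pull() is a slightly more explicit alternative to as.numeric() (a minor variation, not part of the answer above):
df %>%
  filter(!Day %in% c("Sun", "Sat")) %>%
  summarize(avg = mean(Count)) %>%
  pull(avg)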
This might do the trick
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d.f <- data.frame(Day = rep(days, 3), Speed = rnorm(21))
# split the data frame by day, then take the mean of Speed within each piece
lapply(split(d.f, f = d.f$Day), function(d) mean(d$Speed))
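As a small follow-up sketch (same d.f as above): sapply() returns a named vector instead of a list, and dropping the weekend first gives weekday-only means:
wk <- droplevels(d.f[!d.f$Day %in% c("Sat", "Sun"), ])  # droplevels() in case Day is a factor
sapply(split(wk, wk$Day), function(d) mean(d$Speed))    # per-day means as a named vector
mean(wk$Speed)                                          # single mean over all weekday speeds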
If you're looking for the single mean for just the weekdays, you could do something like this:
dat = data.frame(Time = rep(c("00:00","01:00"),c(7,3)),
Day = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat","Sun","Mon","Tue"),
Count = c(169,71,70,68,91,94,135,111,45,50),
Speed = c(60.2,58.5,57.2,58.5,58.8,58.7,58.5,60.0,59.2,57.6))
mean(dat$Count[dat$Day %in% c("Mon","Tue","Wed","Thu","Fri")])
# [1] 69.85714
If, on the other hand, you're looking for the mean across each individual day then you could do this using base R:
aggregate(dat$Count, by=list(dat$Day), FUN = mean)
# Group.1 x
# 1 Fri 94
# 2 Mon 58
# 3 Sat 135
# 4 Sun 140
# 5 Thu 91
# 6 Tue 60
# 7 Wed 68
It looks like you've tried dplyr, so the syntax for that same operation in dplyr would be:
library(dplyr)
dat %>% group_by(Day) %>% summarize(mean_count = mean(Count))
# Day mean_count
# <chr> <dbl>
# 1 Fri 94
# 2 Mon 58
# 3 Sat 135
# 4 Sun 140
# 5 Thu 91
# 6 Tue 60
# 7 Wed 68
And if you want to do the same thing in data.table you would do this:
library(data.table)
as.data.table(dat)[,.(mean_count = mean(Count)), by = Day]
# Day mean_count
# 1: Sun 140
# 2: Mon 58
# 3: Tue 60
# 4: Wed 68
# 5: Thu 91
# 6: Fri 94
# 7: Sat 135
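If you also want the single weekday-only mean in data.table form, a minimal sketch (same dat as above):
as.data.table(dat)[!Day %in% c("Sat", "Sun"), .(weekday_mean = mean(Count))]
#    weekday_mean
# 1:     69.85714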
Say we have a dataframe looking like this one below:
month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89
My goal is to tweak this in order to get a data frame giving me the percentage of each issue value within each month. Plainly speaking, my desired output should look as follows:
month issue rate
Jan withdrawal 57.47
Jan delay 27.59
Jan other 14.94
Feb withdrawal 47.25
Feb delay 30.50
Feb other 22.25
I've tried helping myself with dplyr but my attempts have all been unsuccessful so far.
library(tidyverse)
df <- read.table(text = "month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89", header = T)
df
#> month issue amount
#> 1 Jan withdrawal 250
#> 2 Jan delay 120
#> 3 Jan other 65
#> 4 Feb withdrawal 189
#> 5 Feb delay 122
#> 6 Feb other 89
df %>%
group_by(month) %>%
mutate(rate = amount / sum(amount, na.rm = T) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
df %>%
group_by(month) %>%
mutate(rate = prop.table(amount) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
Created on 2021-01-26 by the reprex package (v0.3.0)
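If you also want rate rounded to two decimals, exactly as in the desired output, a small addition (a sketch, not part of the reprex above) would be:
df %>%
  group_by(month) %>%
  mutate(rate = round(prop.table(amount) * 100, 2)) %>%
  ungroup()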
Using data.table:
library(data.table)
setDT(df)
df[, rate := prop.table(amount) * 100, by = list(month)]
df
#> month issue amount rate
#> 1: Jan withdrawal 250 57.47126
#> 2: Jan delay 120 27.58621
#> 3: Jan other 65 14.94253
#> 4: Feb withdrawal 189 47.25000
#> 5: Feb delay 122 30.50000
#> 6: Feb other 89 22.25000
Created on 2021-01-26 by the reprex package (v0.3.0)
Try the answer here; it might be what you're looking for. The code is pasted below for your convenience.
library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
In your case it would probably be something like:
library(dplyr)
group_by(df, month) %>% mutate(rate= amount/sum(amount))
Or alternatively:
group_by(df, month) %>% transmute(issue, rate= amount/sum(amount))
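Note that both snippets return proportions rather than percentages; to match the desired output you would scale by 100 (a small sketch, with dplyr loaded as above):
group_by(df, month) %>% transmute(issue, rate = round(amount / sum(amount) * 100, 2))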
With format() I can extract year, month and day as follows:
date day month year
<date> <fctr> <fctr> <fctr>
2005-01-01 01 01 2005
2005-01-01 01 01 2005
2005-01-02 02 01 2005
2005-01-02 02 01 2005
2005-01-03 03 01 2005
2005-01-03 03 01 2005
...
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
However, I also want to count how many days, weeks, and months there are from the start to the end. That is, I want to create day, week, and month numbers as follows:
date day month year day_num week_num month_num
<date> <fctr> <fctr> <fctr> <double> <double> <double>
2005-01-01 01 01 2005 1 1 1
2005-01-01 01 01 2005 1 1 1
2005-01-02 02 01 2005 2 1 1
2005-01-02 02 01 2005 2 1 1
2005-01-03 03 01 2005 3 1 1
2005-01-03 03 01 2005 3 1 1
...
2005-02-28 28 02 2005 59 9 2
2005-03-01 01 03 2005 60 9 3
2005-03-02 02 03 2005 61 9 3
...
How can I do that without miscounting?
You can use difftime to get the number of days and weeks but you need a workaround for the number of months. This will do the trick:
library(lubridate)
library(dplyr)
df %>%
  mutate(
    day_num = as.numeric(difftime(date, min(date), units = "days")) + 1,          # + 1 so counting starts at 1, as in the desired output
    week_num = floor(as.numeric(difftime(date, min(date), units = "weeks"))) + 1, # + 1 for the same reason
    tmp = year(date) * 12 + month(date),
    month_num = tmp - min(tmp) + 1
  ) %>%
  select(-tmp)
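A self-contained sketch of the same idea on a concrete date sequence (the tibble below is illustrative and not part of the answer):
library(lubridate)
library(dplyr)
df <- tibble(date = seq(as.Date("2005-01-01"), as.Date("2005-03-02"), by = "day"))
df %>%
  mutate(
    day_num = as.numeric(difftime(date, min(date), units = "days")) + 1,
    week_num = floor(as.numeric(difftime(date, min(date), units = "weeks"))) + 1,
    tmp = year(date) * 12 + month(date),
    month_num = tmp - min(tmp) + 1
  ) %>%
  select(-tmp) %>%
  slice(c(1:3, 59:61))  # first rows plus the Feb/Mar boundary
Rows 59-61 reproduce the 59/9/2, 60/9/3 and 61/9/3 combinations from the desired output above.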
Use format() with the following codes:
date = strptime('2005-02-28', format='%Y-%m-%d')
format(date, '%j') # Decimal day of the year
format(date, '%U') # Decimal week of the year (starting on Sunday)
format(date, '%W') # Decimal week of the year (starting on Monday)
format(date, '%m') # Decimal month
Output:
[1] "059"
[1] "09"
[1] "09"
[1] "02"
Source
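A minimal follow-up sketch (assuming a data frame df with a Date column date, which is not part of the answer above): the same codes can be turned into numeric columns, keeping in mind that they reset each calendar year rather than counting from the start of the series.
df$day_of_year <- as.numeric(format(df$date, "%j"))
df$week_of_year <- as.numeric(format(df$date, "%U"))  # Sunday-based weeks; use "%W" for Monday-based
df$month_of_year <- as.numeric(format(df$date, "%m"))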
I am trying to summarise this daily time series of rainfall by groups of 10-day periods within each month and calculate the accumulated rainfall.
library(tidyverse)
(dat <- tibble(
date = seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by=1),
rainfall = rgamma(length(date), shape=2, scale=2)))
Therefore, the third group will vary in length over the year; for instance, in January the third period has 11 days, in February 9 days, and so on. This is my try:
library(lubridate)
dat %>%
group_by(decade=floor_date(date, "10 days")) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
This is the resulting output:
# A tibble: 43 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 36.1 10
4 2016-01-31 1.87 1
5 2016-02-01 50.6 10
6 2016-02-11 32.1 10
7 2016-02-21 22.1 9
8 2016-03-01 45.9 10
9 2016-03-11 30.0 10
10 2016-03-21 42.4 10
# ... with 33 more rows
Can someone help me fold the leftover days into the third period so that there are always 3 periods within each month? This would be the desired output (pay attention to row 3):
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 37.97 11
4 2016-02-01 50.6 10
5 2016-02-11 32.1 10
6 2016-02-21 22.1 9
One way to do this is to use if_else to apply floor_date with different arguments depending on the day value of date. If day(date) is < 30, use the normal way; if it's >= 30, use '20 days' so the date gets floored to day 21:
dat %>%
group_by(decade=if_else(day(date) >= 30,
floor_date(date, "20 days"),
floor_date(date, "10 days"))) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
# A tibble: 36 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 38.8 10
2 2016-01-11 38.4 10
3 2016-01-21 43.4 11
4 2016-02-01 34.4 10
5 2016-02-11 34.8 10
6 2016-02-21 25.3 9
7 2016-03-01 39.6 10
8 2016-03-11 53.9 10
9 2016-03-21 38.1 11
10 2016-04-01 36.6 10
# … with 26 more rows
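An alternative sketch that avoids the if_else(): compute each date's decade start directly, capping the start day at 21 so the leftover days fall into the third period (same dat as above; this is a variation, not the answer's code):
library(dplyr)
library(lubridate)
dat %>%
  group_by(decade = date - (day(date) - pmin((day(date) - 1) %/% 10 * 10 + 1, 21))) %>%
  summarize(acum_rainfall = sum(rainfall),
            days = n())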
Let's say I have the following data.frame:
library(dplyr)
library(lubridate)
Dates <- seq(as.Date('2017/01/01'), by = 'day', length.out = 365)
A <- data.frame(date = Dates, month = month(Dates), week = week(Dates))
B <- A %>% dplyr::mutate(day = lubridate::wday(date, label = TRUE))
B[350:365,]
date month week day
350 2017-12-16 12 50 Sat
351 2017-12-17 12 51 Sun
352 2017-12-18 12 51 Mon
353 2017-12-19 12 51 Tue
354 2017-12-20 12 51 Wed
355 2017-12-21 12 51 Thu
356 2017-12-22 12 51 Fri
357 2017-12-23 12 51 Sat
358 2017-12-24 12 52 Sun
359 2017-12-25 12 52 Mon
360 2017-12-26 12 52 Tue
361 2017-12-27 12 52 Wed
362 2017-12-28 12 52 Thu
363 2017-12-29 12 52 Fri
364 2017-12-30 12 52 Sat
365 2017-12-31 12 53 Sun
I need to add another ten dates after the end date, i.e. 2018-01-01 to 2018-01-10. The week sequence should continue without resetting. For example:
date month week day
365 2017-12-31 12 53 Sun
366 2018-01-01 1 53 Mon
367 2018-01-02 1 53 Tue
368 2018-01-03 1 53 Wed
369 2018-01-04 1 53 Thu
370 2018-01-05 1 53 Fri
371 2018-01-06 1 53 Sat
372 2018-01-07 1 54 Sun
373 2018-01-08 1 54 Mon
374 2018-01-09 1 54 Tue
375 2018-01-10 1 54 Wed
library(dplyr)
library(lubridate)
Dates<-seq(as.Date('2017/01/01'), by = 'day', length.out = 365)
A <- data.frame(date=(Dates), month=month(Dates), week=week(Dates))
B <- A %>% dplyr::mutate(day = lubridate::wday(date, label = TRUE))
B[350:365,]
B %>%
rbind( # bind rows with the following dataset
data.frame(date = seq(max(B$date)+1, by = 'day', length.out = 10)) %>% # create sequence of new dates
mutate(month = month(date), # add month
day = wday(date, label = TRUE), # add day
week = cumsum(day=="Sun") + max(A$week)) ) %>% # add week: continuous from last week of A and gets updated every Sunday
tbl_df() # only for visualisation purposes
# # A tibble: 375 x 4
# date month week day
# <date> <dbl> <dbl> <ord>
# 1 2017-01-01 1 1 Sun
# 2 2017-01-02 1 1 Mon
# 3 2017-01-03 1 1 Tue
# 4 2017-01-04 1 1 Wed
# 5 2017-01-05 1 1 Thu
# 6 2017-01-06 1 1 Fri
# 7 2017-01-07 1 1 Sat
# 8 2017-01-08 1 2 Sun
# 9 2017-01-09 1 2 Mon
#10 2017-01-10 1 2 Tue
# # ... with 365 more rows
A little tweak to @antoniosk's code: just add the max week from the original data frame to get continuous week numbers, as desired.
library(dplyr)
library(lubridate)
Dates<-seq(as.Date('2017/01/01'), by = 'day', length.out = 365)
A <- data.frame(date=(Dates), month=month(Dates), week=week(Dates))
B <- A %>% dplyr::mutate(day = lubridate::wday(date, label = TRUE))
B[350:365,]
c <- B %>%
  rbind( # bind rows with the following dataset
    data.frame(date = seq(max(B$date) + 1, by = 'day', length.out = 10)) %>% # 10 extra sequential dates after the last date in B
      mutate(month = month(date),
             week = as.numeric(strftime(date, format = "%U")) + max(A$week), # %U restarts at 0 in the new year; adding max(A$week) keeps it continuous
             day = wday(date, label = TRUE))) %>% tbl_df()
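A quick usage check (a sketch) that the week numbers really do continue across the year boundary:
tail(as.data.frame(c), 12)  # the last rows should move from week 53 into week 54 at the first Sunday of 2018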
I have a huge dataset similar to the following reproducible sample data.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
I want to aggregate this data to the weekly level to get output similar to the following:
Interval value
1 Week 2, June 2012 *aggregate value for day 10 to day 14 of June 2012*
2 Week 3, June 2012 *aggregate value for day 15 to day 21 of June 2012*
3 Week 4, June 2012 *aggregate value for day 22 to day 28 of June 2012*
4 Week 5, June 2012 *aggregate value for day 29 to day 30 of June 2012*
5 Week 1, July 2012 *aggregate value for day 1 to day 7 of July 2012*
6 Week 2, July 2012 *aggregate value for day 8 to day 10 of July 2012*
How do I achieve this easily without writing a lot of code?
If you mean the sum of ‘value’ by week, I think the easiest way to do it is to convert the data into an xts object, as GSee suggested:
library(xts)
data <- as.xts(data$value, order.by = as.Date(data$Interval))
weekly <- apply.weekly(data, sum)
[,1]
2012-06-10 552
2012-06-17 23629
2012-06-24 23872
2012-07-01 23667
2012-07-08 23552
2012-07-10 10902
I leave the formatting of the output as an exercise for you :-)
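If you then need a plain data frame rather than an xts object, a minimal sketch (using the weekly object from above):
weekly_df <- data.frame(week_ending = index(weekly), value = as.numeric(weekly))
weekly_df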
If you were to use week from lubridate, you would only get five weeks to pass to by(). Assuming dat is your data:
> library(lubridate)
> do.call(rbind, by(dat$value, week(dat$Interval), summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 24 552 4146 4188 3759 4529 4850
# 25 490 2498 4256 3396 4438 5156
# 26 564 2578 4206 3355 4346 4866
# 27 698 993 4868 3366 5122 5770
# 28 671 1086 3200 3200 5314 5726
This shows a summary for the 24th through 28th week of the year. Similarly, we can get the means with aggregate:
> aggregate(value~week(Interval), data = dat, mean)
# week(Interval) value
# 1 24 3758.667
# 2 25 3396.286
# 3 26 3355.000
# 4 27 3366.429
# 5 28 3199.500
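If the data spanned more than one calendar year, week() alone would merge, for example, week 24 of 2012 with week 24 of 2013. A hedged variant that keeps the years apart:
aggregate(value ~ year(Interval) + week(Interval), data = dat, mean)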
I just came across this old question because it was used as a dupe target.
Unfortunately, all the upvoted answers (except the one by konvas and a now deleted one) present solutions for aggregating the data by week of the year while the OP has requested to aggregate by week of the month.
The definition of week of the year and week of the month is ambiguous as discussed here, here, and here.
However, the OP has indicated that he wants to count days 1 to 7 of each month as week 1 of the month, days 8 to 14 as week 2 of the month, etc. Note that week 5 is a stub of only 2 or 3 days for most months (1 day for February in a leap year, and none at all in a non-leap year).
Having prepared the ground, here is a data.table solution for this kind of aggregation:
library(data.table)
DT[, .(value = sum(value)),
by = .(Interval = sprintf("Week %i, %s",
(mday(Interval) - 1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Interval value
1: Week 2, Jun 2012 18366
2: Week 3, Jun 2012 24104
3: Week 4, Jun 2012 23348
4: Week 5, Jun 2012 5204
5: Week 1, Jul 2012 23579
6: Week 2, Jul 2012 11573
We can verify that we have picked the correct intervals by
DT[, .(value = sum(value),
date_range = toString(range(Interval))),
by = .(Week = sprintf("Week %i, %s",
(mday(Interval) -1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Week value date_range
1: Week 2, Jun 2012 18366 2012-06-10, 2012-06-14
2: Week 3, Jun 2012 24104 2012-06-15, 2012-06-21
3: Week 4, Jun 2012 23348 2012-06-22, 2012-06-28
4: Week 5, Jun 2012 5204 2012-06-29, 2012-06-30
5: Week 1, Jul 2012 23579 2012-07-01, 2012-07-07
6: Week 2, Jul 2012 11573 2012-07-08, 2012-07-10
which is in line with the OP's specification.
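For completeness, a dplyr sketch of the same week-of-month aggregation (assuming the data as a data frame dat with a Date column Interval; not part of the original answer):
library(dplyr)
dat %>%
  group_by(Interval = sprintf("Week %i, %s",
                              (as.integer(format(Interval, "%d")) - 1L) %/% 7L + 1L,
                              format(Interval, "%b %Y"))) %>%
  summarize(value = sum(value))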
Data
library(data.table)
DT <- fread(
"rn Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176", drop = 1L)
DT[, Interval := as.Date(Interval)]
If you are using a data frame, you can easily do this with the tidyquant package. Use the tq_transmute function, which applies a mutation and returns a new data frame. Select the "value" column and apply the xts function apply.weekly. The additional argument FUN = sum will get the aggregate by week.
library(tidyquant)
df
#> # A tibble: 31 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-11 4850
#> 3 2012-06-12 4642
#> 4 2012-06-13 4132
#> 5 2012-06-14 4190
#> 6 2012-06-15 4186
#> 7 2012-06-16 1139
#> 8 2012-06-17 490
#> 9 2012-06-18 5156
#> 10 2012-06-19 4430
#> # ... with 21 more rows
df %>%
tq_transmute(select = value,
mutate_fun = apply.weekly,
FUN = sum)
#> # A tibble: 6 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-17 23629
#> 3 2012-06-24 23872
#> 4 2012-07-01 23667
#> 5 2012-07-08 23552
#> 6 2012-07-10 10902
When you say "aggregate" the values, do you mean take their sum? Let's say your data frame is d; assuming d$Interval is of class Date, you can try
# if d$Interval is not of class Date: d$Interval <- as.Date(d$Interval)
formatdate <- function(date)
  paste0("Week ", (as.numeric(format(date, "%d")) - 1) %/% 7 + 1,  # integer division gives the week of the month
         ", ", format(date, "%b %Y"))
# change "sum" to your required function
aggregate(d$value, by = list(formatdate(d$Interval)), sum)
# Group.1 x
# 1 Week 1, Jul 2012 3725.667
# 2 Week 2, Jul 2012 3199.500
# 3 Week 2, Jun 2012 3544.000
# 4 Week 3, Jun 2012 3434.000
# 5 Week 4, Jun 2012 3333.143
# 6 Week 5, Jun 2012 3158.667
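A quick sanity check on the grouping itself (a sketch): counting how many days fall into each label should show full 7-day weeks plus the short stubs at the month edges.
table(formatdate(d$Interval))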