How do I group my date variable into month/year in R?

I have a "date" vector, that contains dates in mm/dd/yyyy format:
head(Entered_Date,5)
[1] 1/5/1998 1/5/1998 1/5/1998 1/5/1998 1/5/1998
I am trying to plot a frequency variable against the date, but I want to group the dates by month or year. As it is now, there is a frequency per day, but I want to plot the frequency by month or year. So instead of having a frequency of 1 for 1/5/1998, 1 for 1/7/1998, and 3 for 1/8/1998, I would like to display it as 5 for 1/1998. It is a relatively large data set, with dates from 1998 to present, and I would like to find some automated way to accomplish this.
> dput(head(Entered_Date))
structure(c(260L, 260L, 260L, 260L, 260L, 260L), .Label = c("1/1/1998",
"1/1/1999", "1/1/2001", "1/1/2002", "1/10/2000", "1/10/2001",
"1/10/2002", "1/10/2003", "1/10/2005", "1/10/2006", "1/10/2007",
"1/10/2008", "1/10/2011", "1/10/2012", "1/10/2013", "1/11/1999",
"1/11/2000", "1/11/2001", "1/11/2002", "1/11/2005", "1/11/2006",
"1/11/2008", "1/11/2010", "1/11/2011", "1/11/2012", "1/11/2013",
"1/12/1998", "1/12/1999", "1/12/2001", "1/12/2004", "1/12/2005", ...

The floor_date() function from the lubridate package does this nicely.
data %>%
group_by(month = lubridate::floor_date(date, "month")) %>%
summarize(summary_variable = sum(value))
Thanks to Roman Cheplyaka
https://ro-che.info/articles/2017-02-22-group_by_month_r
See more on how to use the function: https://lubridate.tidyverse.org/reference/round_date.html
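For readers without lubridate, the same "floor to the first day of the month, then summarise" idea can be sketched in base R. This is only an illustration under assumed data: the data frame df and its columns date and value are made up here, and format() plus as.Date() stand in for floor_date().

```r
# Hypothetical example data (not from the question)
df <- data.frame(
  date  = as.Date(c("1998-01-05", "1998-01-07", "1998-02-10")),
  value = c(1, 2, 4)
)

# Floor each date to the first day of its month
df$month <- as.Date(format(df$date, "%Y-%m-01"))

# Sum the values within each month
monthly <- aggregate(value ~ month, data = df, FUN = sum)
monthly
```

Keeping the month as a Date (rather than a string) means the result still sorts and plots chronologically.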

Here is an example using dplyr. You simply use the corresponding date format string for month %m or year %Y in the format statement.
set.seed(123)
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
head(df)
date value
1 1998-01-01 2
2 1998-01-02 4
3 1998-01-03 3
4 1998-01-04 5
5 1998-01-05 5
6 1998-01-06 1
library(dplyr)
df %>%
mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Source: local data frame [25 x 3]
Groups: month [?]
month year total
(chr) (chr) (int)
1 01 1998 105
2 01 1999 91
3 01 2000 3
4 02 1998 74
5 02 1999 77
6 03 1998 96
7 03 1999 86
8 04 1998 91
9 04 1999 95
10 05 1998 93
.. ... ... ...

Just to add to @cdeterman's answer, you can use lubridate along with dplyr to make this even easier:
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
library(dplyr)
library(lubridate)
df %>%
mutate(month = month(date), year = year(date)) %>%
group_by(month, year) %>%
summarise(total = sum(value))

Maybe you just add a column in your data like this:
Year <- format(as.Date(Entered_Date, "%m/%d/%Y"), "%Y")
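A minimal sketch of how such a column could be turned into the per-month frequencies the question asks for. It assumes the dates are stored as a factor in mm/dd/yyyy format, as in the dput() output; the six sample values below are invented for illustration.

```r
# Hypothetical sample in the question's mm/dd/yyyy format, stored as a factor
entered <- factor(c("1/5/1998", "1/7/1998", "1/8/1998",
                    "1/8/1998", "1/8/1998", "2/3/1998"))

# Convert to Date, then build a sortable year-month key
dates     <- as.Date(as.character(entered), format = "%m/%d/%Y")
month_key <- format(dates, "%Y-%m")

# Frequency of entries per month
counts <- table(month_key)
counts
```

Using "%Y" instead of "%Y-%m" in format() would give yearly counts in the same way.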

Don't need dplyr. Look at ?as.POSIXlt
df$date<-as.POSIXlt(df$date)
mon<-df$date$mon
yr<-df$date$year
monyr<-as.factor(paste(mon,yr,sep="/"))
df$date<-monyr
Don't need to use ggplot2 but it's nice for this kind of thing.
library(ggplot2)
c <- ggplot(df, aes(factor(date)))
c + geom_bar()
If you want to see the actual numbers
aggregate(. ~ date,data = df,FUN=length )
df2<-aggregate(. ~ date,data = df,FUN=length )
df2
date value
1 0/98 31
2 0/99 31
3 1/98 28
4 1/99 28
5 10/98 30
6 10/99 30
7 11/97 1
8 11/98 31
9 11/99 31
10 2/98 31
11 2/99 31
12 3/98 30
13 3/99 30
14 4/98 31
15 4/99 31
16 5/98 30
17 5/99 30
18 6/98 31
19 6/99 31
20 7/98 31
21 7/99 31
22 8/98 30
23 8/99 30
24 9/98 31
25 9/99 31

There is a super easy way using the cut() function:
list = as.Date(c("1998-5-2", "1993-4-16", "1998-5-10"))
cut(list, breaks = "month")
and you will get this:
[1] 1998-05-01 1993-04-01 1998-05-01
62 Levels: 1993-04-01 1993-05-01 1993-06-01 1993-07-01 1993-08-01 ... 1998-05-01
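As a small follow-up sketch, the factor returned by cut() can be passed to table() to get counts per month. Because cut() creates a level for every month in the range, droplevels() is used here to drop the empty ones; the variable names are illustrative only.

```r
dates <- as.Date(c("1998-05-02", "1993-04-16", "1998-05-10"))

# Bin the dates by calendar month; levels are the first day of each month
month_bin <- cut(dates, breaks = "month")

# Count observations per month, ignoring months with no data
counts <- table(droplevels(month_bin))
counts
```

Leaving out droplevels() keeps the empty months as zero counts, which can be useful when plotting a continuous time axis.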

Another solution is slider::slide_period:
library(slider)
library(dplyr)
monthly_summary <- function(data) summarise(data, date = format(max(date), "%Y-%m"), value = sum(value))
slide_period_dfr(df, df$date, "month", monthly_summary)
date value
1 1998-01 92
2 1998-02 82
3 1998-03 113
4 1998-04 94
5 1998-05 92
6 1998-06 74
7 1998-07 89
8 1998-08 92
9 1998-09 91
10 1998-10 100
...

There is also group_by(month_yr = cut(date, breaks = "1 month")), where cut() is base R, so lubridate or other date packages are not needed.

Related

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from start to end month (determined from floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence). So, unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can exist), you can subtract 1 from the floored completed date, which steps back one day into the previous month, so the monthly sequence stops one month earlier. In this case, you can use pmax so that if both start and end months are the same, the month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
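A different reading of "pending" is a snapshot taken on the first day of each month: an event counts if it has started on or before that day and has not yet been completed. The base R sketch below illustrates that interpretation only; the three events and the month range are invented, and this is not the method used in the answer above.

```r
# Hypothetical events (not the question's random data)
events <- data.frame(
  start_date     = as.Date(c("2011-01-14", "2011-01-21", "2011-03-22")),
  completed_date = as.Date(c("2011-04-07", "2011-03-03", "2011-07-20"))
)

# Snapshot dates: the first day of each month
months <- seq(as.Date("2011-01-01"), as.Date("2011-05-01"), by = "month")

# For each snapshot, count events already started and not yet completed
pending <- sapply(months, function(m) {
  sum(events$start_date <= m & events$completed_date > m)
})

data.frame(month = months, count = pending)
```

With this definition January shows 0, since no event has started by 2011-01-01, which matches the shape of the expected result in the question.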

How to apply for loop with ddply function?

I want to calculate the number of days in each month with rainfall >= 2.5 mm for every column. I was able to calculate it for a single column after taking help from this post like
require(seas)
library (zoo)
data(mscdata)
dat.int <- (mksub(mscdata, id=1108447))
dat.int$yearmon <- as.yearmon(dat.int$date, "%b %y")
require(plyr)
rainydays_by_yearmon <- ddply(dat.int, .(yearmon), summarize, rainy_days=sum(rain >= 1.0) )
print.data.frame(rainydays_by_yearmon)
Now I want to apply it for all the columns. I have tried the following code
for(i in 1:length(dat.int)){
y1 <- dat.int[[i]]
rainydays <- ddply(dat.int, .(yearmon), summarize, rainy_days=sum(y1 >= 2.5))
if(i==1){
m1 <- rainydays
}
else{
m1 <- cbind(rainydays, m1)
}
print(i)
}
m1
But I am unable to get the desired results. Please help me out!!!
I would use dplyr and tidyr from tidyverse instead. pivot_longer puts the data into long form, which is easier to manipulate. pivot_wider makes it wide again (probably unnecessary depending on your next step).
library(seas)
library(tidyverse)
library(zoo)
data(mscdata)
dat.int <- (mksub(mscdata, id=1108447))
dat.int %>%
as_tibble() %>% # for easier viewing
mutate(yearmon = as.yearmon(dat.int$date, "%b %y")) %>%
select(-date, -year, -yday) %>%
pivot_longer(cols = -yearmon, names_to = "variable", values_to = "value") %>%
group_by(yearmon, variable) %>%
summarise(rainy_days = sum(value >= 2.5)) %>%
pivot_wider(names_from = "variable", values_from = "rainy_days")
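The same "count days at or above a threshold for several columns at once" idea can also be sketched in base R with aggregate() and cbind(). The data frame wx and its values are hypothetical stand-ins for the seas data, and the 2.5 threshold is taken from the question.

```r
# Hypothetical daily weather data
wx <- data.frame(
  date = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03",
                   "2000-02-01", "2000-02-02")),
  rain = c(0, 3, 5, 2.5, 1),
  snow = c(4, 0, 2.5, 0, 0)
)

# Year-month key for grouping
wx$yearmon <- format(wx$date, "%Y-%m")

# Count days with a value >= 2.5, for each column and each month
wet <- aggregate(cbind(rain, snow) ~ yearmon, data = wx,
                 FUN = function(x) sum(x >= 2.5))
wet
```

Note that the formula interface of aggregate() drops rows containing NA in any of the listed columns by default, so real data with missing values may need na.action set explicitly.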
If you don't mind using the data.table library, see the solution below.
library('data.table')
library('seas')
setDT(mscdata)
mscdata[id == 1108447 & rain >= 2.5, .(rain_ge_2.5mm = .N),
by = .(year, month = format(date, "%m"))]
Output
# year month rain_ge_2.5mm
# 1: 1975 01 12
# 2: 1975 02 8
# 3: 1975 03 10
# 4: 1975 04 2
# 5: 1975 05 4
# ---
# 350: 2004 07 2
# 351: 2004 08 5
# 352: 2004 10 10
# 353: 2004 11 14
# 354: 2004 12 14
If you want to process all ids, then you can group data by id as below.
For rain only:
mscdata[, .(rain_ge_2.5mm = sum(rain >= 2.5)),
by = .(id, year, month = format(date, "%m"))]
For rain, snow, and precip
mscdata[, .(rain_ge_2.5mm = sum(rain >= 2.5),
snow_ge_2 = sum(snow >= 2.0),
precip_ge_2 = sum(precip >= 2.0)),
by = .(id, year, month = format(date, "%m"))]
# id year month rain_ge_2.5mm snow_ge_2 precip_ge_2
# 1: 1096450 1975 01 1 10 9
# 2: 1096450 1975 02 0 5 3
# 3: 1096450 1975 03 1 9 9
# 4: 1096450 1975 04 1 2 3
# 5: 1096450 1975 05 5 1 6
# ---
# 862: 2100630 2000 07 NA NA 3
# 863: 2100630 2000 08 NA NA 8
# 864: 2100630 2000 09 NA NA 6
# 865: 2100630 2000 11 NA NA NA
# 866: 2100630 2001 01 NA NA NA

Finding means after grouping by day [duplicate]

I have a dataset of observations by day for several months and need to find the average of the observations for each day. The data is from a tab delimited text file with the following column names: Day, Date, Views, Engagement, Sales. I'm trying to find the average Views, Engage., and Sales for all 7 days of the week. In SAS I would have just used proc tabulate with Day as the class and Views, Engagements, and Sales as the variables but I'm unsure of how to translate this into R code.
Monday 21JUL03 7206 32 $6.73
Tuesday 22JUL03 9333 51 $4.99
Wednesday 23JUL03 8321 61 $8.87
Thursday 24JUL03 8378 35 $3.69
Friday 25JUL03 12202 45 $4.34
Saturday 26JUL03 6161 34 $3.12
Sunday 27JUL03 9115 29 $2.77
Monday 28JUL03 17112 51 $10.36
Tuesday 29JUL03 12690 51 $10.24
Wednesday 30JUL03 10822 30 $3.96
Thursday 31JUL03 10395 41 $5.45
Friday 01AUG03 6979 31 $2.95
Saturday 02AUG03 3810 19 $1.78
Sunday 03AUG03 4554 30 $5.71
OP wants to calculate mean for 3 columns of his data.frame. Hence, dplyr::summarise_at should be a good option to go for.
The solution is a two-step process:
Read from tab separated file
Process data using dplyr
Solution:
# Read from file. "sales.txt" has been created using OP's data.
df <- read.delim("sales.txt", header = FALSE, stringsAsFactors = FALSE)
names(df) <- c("Day", "Date", "Views", "Engagement", "Sales")
library(dplyr)
df %>% mutate(Sales = as.numeric(sub("\\$","", Sales))) %>%
group_by(Day) %>%
summarise_at(vars(c("Views", "Engagement", "Sales")), list(Mean = mean))
# Result
# # A tibble: 7 x 4
# Day Views_Mean Engagement_Mean Sales_Mean
# <chr> <dbl> <dbl> <dbl>
# 1 Friday 9590 38.0 3.64
# 2 Monday 12159 41.5 8.54
# 3 Saturday 4986 26.5 2.45
# 4 Sunday 6834 29.5 4.24
# 5 Thursday 9386 38.0 4.57
# 6 Tuesday 11012 51.0 7.62
# 7 Wednesday 9572 45.5 6.41
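For comparison, a base R sketch of the same group-wise mean using aggregate(). It uses four rows copied from the question's sample (the two Mondays and two Tuesdays) and only the Views and Sales columns; the object names are illustrative.

```r
# Four rows from the question's sample data
sales <- data.frame(
  Day   = c("Monday", "Tuesday", "Monday", "Tuesday"),
  Views = c(7206, 9333, 17112, 12690),
  Sales = c("$6.73", "$4.99", "$10.36", "$10.24")
)

# Strip the dollar sign so Sales can be averaged
sales$Sales <- as.numeric(sub("$", "", sales$Sales, fixed = TRUE))

# Mean of each numeric column per weekday
means <- aggregate(cbind(Views, Sales) ~ Day, data = sales, FUN = mean)
means
```

The Monday mean for Views comes out as 12159, in line with the dplyr result above.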
Maybe something like this?
library(tidyverse)
Date <- seq(lubridate::ymd('2012-07-03'),lubridate::ymd('2012-07-20'),by='days')
Day <- lubridate::wday(Date, label = TRUE)
Views <- sample(c(4000:20000), length(Date))
Engagement <- sample(c(20:50), length(Date))
Sales <- sample(300:1000, length(Date))/100
df <- data.frame(Day, Date, Views, Engagement, Sales) %>%
group_by(Day) %>%
summarise(mean_engagement = mean(Engagement),
mean_views = mean(Views),
mean_sales = mean(Sales))
df

R aggregate data in 6-monthly periods by group

Looking to aggregate data (mean) in half-year periods by group.
Here is a snapshot of the data:
Date Score Group Score2
01/01/2015 15 A 11
02/01/2015 34 A 33
03/01/2015 16 A 1
04/01/2015 29 A 36
05/01/2015 4 A 28
06/01/2015 10 B 33
07/01/2015 21 B 19
08/01/2015 6 B 47
09/01/2015 40 B 15
10/01/2015 34 B 13
11/01/2015 16 B 7
12/01/2015 8 B 4
I have dfd$mon<-as.yearmon(dfd$Date) then
r<-as.data.frame(dfd %>%
mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
group_by(Group,mon) %>%
summarise(total = mean(Score), total1 = mean(Score2)))
for monthly aggregation, but how would you do this for every 6 months, grouped by Group?
I sense I am overcomplicating a simple issue here!
Add another mutate after the current one:
mutate(yearhalf = (as.integer(month) - 1) %/% 6 + 1) %>%
The output is 1 for months 1 to 6 and 2 for months 7 to 12. You then have to adapt the following functions to the new name (for example, group_by(Group, year, yearhalf)), but that should do the trick.
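A self-contained base R sketch of the half-year idea: derive the month number from the date, map months 1 to 6 to half 1 and months 7 to 12 to half 2 with integer division, then average by group and half. The data frame below is invented and assumes the Date column is already of class Date.

```r
# Hypothetical data; Date is assumed to be a proper Date column
dfd <- data.frame(
  Date  = as.Date(c("2015-01-01", "2015-03-01", "2015-08-01", "2015-11-01")),
  Score = c(15, 35, 6, 16),
  Group = c("A", "A", "B", "B")
)

# Months 1-6 -> 1, months 7-12 -> 2
month_num    <- as.integer(format(dfd$Date, "%m"))
dfd$yearhalf <- (month_num - 1) %/% 6 + 1

# Mean score per group and half-year
res <- aggregate(Score ~ Group + yearhalf, data = dfd, FUN = mean)
res
```

For data spanning several years, the year would need to be added to the grouping as well so that halves from different years are not pooled.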

How to subset data.frame by weeks and then sum?

Let's say I have several years worth of data which look like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the bit that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week (or, more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default).
In code:
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
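One caveat worth sketching: cut() with "week" starts weeks on Monday by default, while the question wants Sunday to Saturday. Base R's cut() for dates has a start.on.monday argument for this. The example below uses the question's date range with made-up, deterministic income values so the sums are easy to check.

```r
# Same date range as the question; income values are invented
date   <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- rep(c(10, 20, 30), each = 7)

# Weeks labelled by their Sunday start date
week_start <- cut(date, breaks = "week", start.on.monday = FALSE)

# Sum income within each Sunday-to-Saturday week
weekly <- tapply(income, week_start, sum)
weekly
```

The names of the result are the Sunday on which each week begins (2010-12-26, 2011-01-02, 2011-01-09), so no year/week-number bookkeeping is needed across the new year.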
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there like so: format(date, format = "%y%U")
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
df.index = df['week'] #set the dt variable as index
df.resample('W').sum() #sum using resample
With dplyr:
df %>%
arrange(date) %>%
mutate(week = as.numeric(date - date[1])%/%7) %>%
group_by(week) %>%
summarise(weekincome= sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.

Resources