Can I count dates grouped by year? - r

I've got some data that looks about like so:
demo <- read.table(text = "
date num
'12/31/2010' 35
'04/01/2013' 34
'06/02/2015' 34
'06/15/2015' 34
'01/30/2015' 33
'04/15/2014' 33
'05/28/2014' 33
'06/02/2014' 33
'06/17/2015' 33
'06/25/2015' 33
'06/24/2015' 32
'07/31/2013' 32
'08/31/2013' 32
'04/27/2015' 31
'05/07/2015' 31
'12/30/2013' 31
'11/21/2014' 30
'12/20/2013' 30
",header = TRUE, sep = "")
How do I group and count these by year?
2010 1
2013 5
etc.
I can use plyr to count each date: count(demo, vars = 'date'), but not group them.

I'd convert the dates to a date format first, rather than treating them as strings.
library(lubridate)
# Convert string to date format
demo$date <- as.Date(demo$date, "%m/%d/%Y")
# Table of counts by year
table(year(demo$date))
# 2010 2013 2014 2015
# 1 5 4 8

I like data.table for this. First we need to convert to "Date" class in the date column, then find the number of observations by year.
library(data.table)
demo$date <- as.Date(demo$date, "%m/%d/%Y")
as.data.table(demo)[, .N, keyby = year(date)]
# year N
# 1: 2010 1
# 2: 2013 5
# 3: 2014 4
# 4: 2015 8
We use keyby here so we get a nice ordered result. Alternatively, and to change your entire table to a data.table, you can use setDT() instead of as.data.table(). This is the preferred method.
setDT(demo)[, .N, keyby = year(date)]

table(substr(demo$date, 7,10))
2010 2013 2014 2015
1 5 4 8
substr allows you to isolate the year, and table tallies the counts.
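One caveat with the substr() approach is that it relies on every date string being zero-padded to exactly ten characters. As a hedged alternative, here is a minimal sketch that strips everything up to the last slash instead, so a string like '4/1/2013' is handled too. The vector x is made up for illustration and is not the asker's data.

```r
# Hypothetical input mixing padded and unpadded month/day values
x <- c("12/31/2010", "4/1/2013", "07/31/2013", "6/2/2015")

# The greedy ".*/" removes everything up to and including the last slash
yr <- sub(".*/", "", x)

# Tally the remaining year strings
counts <- table(yr)
counts
```

This still treats the dates as plain strings, so converting to Date first remains the more robust route when the column is used for anything else.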

demo$date <- as.Date(demo$date, format = "%m/%d/%Y")
demo$year <- format(demo$date, format = "%Y")
aggregate(num ~ year, demo, FUN = length)
## year num
## 1 2010 1
## 2 2013 5
## 3 2014 4
## 4 2015 8

Date formats can be modified using the Date and POSIXct classes. This allows you to handle dates that look like '1/1/2010'.
dates <- as.Date(demo$date, format = "%m/%d/%Y")
head(dates)
# [1] "2010-12-31" "2013-04-01" "2015-06-02" "2015-06-15" "2015-01-30"
# [6] "2014-04-15"
table(format(dates, format = "%Y"))
#
# 2010 2013 2014 2015
# 1 5 4 8
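To make the point about unpadded dates concrete, here is a small sketch in base R. It assumes the strings are month/day/year, as in the question; the vector raw is invented for illustration.

```r
# Hypothetical strings without leading zeros
raw <- c("1/1/2010", "12/31/2010", "4/15/2014")

# On input, %m and %d accept one- or two-digit values
dates <- as.Date(raw, format = "%m/%d/%Y")

# Year as an integer, then counts per year as a data frame
yr <- as.integer(format(dates, "%Y"))
by_year <- as.data.frame(table(year = yr))
by_year
```

Returning a data frame rather than a bare table can be handy if the counts are going to be merged or plotted later.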

Related

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices with different data frequencies (weekly, monthly, daily). I hope to calculate monthly returns by extracting the beginning-of-month value from each time series.
I have tried to use a loop to partition the time series month by month then use min() to get the earliest date in the month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
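If you would rather avoid extra packages, the same idea can be sketched in base R: sort by date, build a year-month key, and keep the first row for each key. This is only a sketch on a hand-typed subset of the question's rows, not a benchmarked solution.

```r
# A few rows from the question, typed in by hand
df <- data.frame(
  statistic_date = as.Date(c("2013-01-01", "2013-01-04", "2013-02-01",
                             "2013-02-08", "2013-03-01", "2013-04-03",
                             "2013-04-12")),
  index_value = c(1000.000, 996.096, 1150.479, 1150.288,
                  1148.826, 1101.897, 1093.344)
)

# Sort by date so the first row seen in each month is the earliest
df <- df[order(df$statistic_date), ]

# Keep the first row for each year-month key
ym <- format(df$statistic_date, "%Y-%m")
first_rows <- df[!duplicated(ym), ]
first_rows
```

Using the year-month key rather than the month alone keeps January 2013 and January 2014 apart when the series spans several years.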

Create a categorical variable from date data in R

I have data that includes dates (dd/mm/yyyy) and am wanting to summarise the data by year. I'm sure that there is an easier way to do it but the route that I've taken is to try to create a new categorical variable using the "cut" function.
For example:
# create sample dataframe
dates<-c("01/01/2013", "01/02/2013", "01/01/2014", "01/02/2014", "01/01/2015", "01/02/2015")
cases<-c(3,5,2,6,8,4)
df<-as.data.frame(cbind(dates, cases))
df$dates <- as.Date(df$dates,"%d/%m/%Y")
# categorise by year
df$year <- cut(df$dates, c(2013-01-01, 2013-12-31, 2014-12-31, 2015-12-31))
This gives an error:
invalid specification of 'breaks'
How do I tell R to cut at various "date" intervals? Is my approach to this all wrong? Still new to R (sorry about the basic question).
Greg
What should your output look like?
Your code works when you define your breaks with as.Date:
breaks <- as.Date(c("2013-01-01", "2013-12-31", "2014-12-31", "2015-12-31"))
# categorise by year
df$year <- cut(df$dates, breaks)
dates cases year
1 2013-01-01 3 2013-01-01
2 2013-02-01 5 2013-01-01
3 2014-01-01 2 2013-12-31
4 2014-02-01 6 2013-12-31
5 2015-01-01 8 2014-12-31
6 2015-02-01 4 2014-12-31
I'm guessing you want your variable year to look different, though? You can define labels when using cut:
# categorise by year
df$year <- cut(df$dates, breaks, labels = c(2013, 2014, 2015))
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
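As a side note on why the original call failed: unquoted, 2013-01-01 is just arithmetic, so cut() received a handful of small numbers rather than dates. The sketch below shows that, and also a variant that lets cut() build calendar-year breaks itself via breaks = "year". The variable names are only illustrative.

```r
# Unquoted, 2013-01-01 is plain subtraction: 2013 - 1 - 1
bad_break <- 2013-01-01
bad_break

dates <- as.Date(c("01/01/2013", "01/02/2013", "01/01/2014",
                   "01/02/2014", "01/01/2015", "01/02/2015"),
                 format = "%d/%m/%Y")

# Let cut() build one interval per calendar year
yr <- cut(dates, breaks = "year")

# The labels are the first day of each year; keep only the year part
yr_label <- substr(as.character(yr), 1, 4)
yr_label
```

With breaks = "year" there is no need to extend the break vector by hand when a new year of data arrives.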
If you are just looking for the year, maybe this helps:
df$year <- format(df$dates, format="%Y")
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
I think the solutions based on cut are a bit overkill. You can use the year function from the lubridate package to extract the year from the date:
library(dplyr)
library(lubridate)
df %>% mutate(year = year(dates))
# dates cases year
# 1 2013-01-01 3 2013
# 2 2013-02-01 5 2013
# 3 2014-01-01 2 2014
# 4 2014-02-01 6 2014
# 5 2015-01-01 8 2015
# 6 2015-02-01 4 2015
lubridate is such an awesome package when it comes to dealing with time data.
After the year column is constructed you can apply all kinds of summaries. I use the dplyr style here:
# Note that as.numeric(as.character()) is needed as `cbind` forces `cases` to be a factor
df %>% mutate(year = year(dates), cases = as.numeric(as.character(cases))) %>%
group_by(year) %>% summarise(tot_cases = sum(cases))
# # A tibble: 3 × 2
# year tot_cases
# <dbl> <dbl>
# 1 2013 8
# 2 2014 8
# 3 2015 12
Note that group_by ensures that all operations after that are done per unique category mentioned there, in this case per year.
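The as.numeric(as.character()) step mentioned in the comment above is easy to get wrong, so here is a tiny illustration of what happens without it. It assumes cases ended up as a factor, which is what R versions before 4.0.0 produced by default when converting the character matrix from cbind() to a data frame.

```r
# cases stored as a factor, built by hand for illustration
cases <- factor(c("3", "5", "2", "6", "8", "4"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(cases)
codes

# Going through character first recovers the original numbers
values <- as.numeric(as.character(cases))
values
```

Summing the level codes instead of the values gives plausible-looking but wrong totals, so it is worth checking the column class before aggregating.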
A simple solution would be using the dplyr package, with lubridate supplying as_date() and year(). Here is a simple example:
library(dplyr)
library(lubridate)
df_grouped <- df %>%
mutate(
dates = as_date(dates),
# as.character() first, so a factor column yields its values rather than its level codes
cases = as.numeric(as.character(cases))) %>%
group_by(year = year(dates)) %>%
summarise(tot_cases = sum(cases))
In the mutate statement we convert the variables to a more suitable format, in group_by we select which variable is going to do the grouping and in summarise we create any new variables that we want.
df_grouped looks like this:
# A tibble: 3 × 2
year tot_cases
<dbl> <dbl>
1 2013 8
2 2014 8
3 2015 12

Can I cross tab dates, grouped by year?

I cleared one hurdle, with some help from SO and thought the next hurdle would be easier. What I really have is start and end dates in a data frame:
require(lubridate)
demo <- read.table(text = "
start end num
2010-12-31 <NA> 35
2013-04-01 <NA> 34
2015-06-02 <NA> 34
2015-06-15 2012-12-31 34
2015-01-30 2011-12-31 33
2014-04-15 2013-12-31 33
2014-05-28 2013-12-31 33
2014-06-02 <NA> 33
2015-06-17 <NA> 33
2015-06-25 <NA> 33
2015-06-24 <NA> 32
2013-07-31 <NA> 32
2013-08-31 <NA> 32
2015-04-27 <NA> 31
2015-05-07 <NA> 31
2013-12-30 <NA> 31
2014-11-21 <NA> 30
2013-12-20 2013-06-30 30
",header = TRUE, sep = "")
demo$start <- as.Date(demo$start, '%Y-%m-%d')
demo$end <- as.Date(demo$end, '%Y-%m-%d')
I can get a table of start years, or a table of end years, with table(year(demo$end)) or table(year(demo$start)) which is a lovely start. But what I really want to know is something more like: for each year, how many entries that started have not yet ended? So count is.na() for each start year.
I thought I could use aggregate() for that, so I tried this:
aggregate(is.na(end) ~ year(start), demo, FUN = length)
But that seems to be counting every observation, not just the observations for which the end date is.na()
You can use table with multiple arguments to give you 2-way or multi-way tables:
> with(demo, table( year=format(demo$start, "%Y"), Not.missing = !is.na(end) ) )
Not.missing
year FALSE TRUE
2010 1 0
2013 4 1
2014 2 2
2015 6 2
You could also use lubridate::year instead of the format call.
If you need to find the number of NA values for each 'year', we can use sum, as is.na(end) is a logical vector. length gives the total length of the vector per year instead of the number of TRUE values.
aggregate(cbind(end=is.na(end)) ~ cbind(year=year(start)), demo, FUN = sum)
# year end
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
Or we can use data.table. We convert the 'data.frame' to 'data.table' (setDT(demo)), grouped by the year of the 'start' column and using i as is.na(end) as row index, we get the .N or the number of elements for each group.
library(data.table)
setDT(demo)[is.na(end), list(end = .N) , list(year=year(start))]
# year end
#1: 2010 1
#2: 2013 4
#3: 2015 6
#4: 2014 2
Here is another option:
library(dplyr)
library(lubridate)
demo %>% subset(is.na(end)) %>% group_by(year(start)) %>% summarise(n=length(end))
#Source: local data frame [4 x 2]
#
# year(start) n
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
This is pretty straightforward. With your original data (demo), subset to only get the NA in your end column. Afterwards (and using year() from the lubridate package), group by each year, and get the summary of the number of NAs present in the end column. This will return a data.frame object.
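The length-versus-sum point that tripped up the original aggregate() call can be seen on a toy example in base R. The data below is made up and much smaller than the question's; is_open is just an illustrative name.

```r
# Small made-up data set: start year and an end date that may be missing
start_year <- c(2010, 2013, 2013, 2014, 2014, 2015)
end <- as.Date(c(NA, NA, "2013-06-30", "2013-12-31", NA, NA))

# TRUE where the entry has not ended yet
is_open <- is.na(end)

# length() counts every row in the group, TRUE or FALSE
n_rows <- tapply(is_open, start_year, length)
n_rows

# sum() counts only the TRUE values, i.e. entries with no end date
n_open <- tapply(is_open, start_year, sum)
n_open
```

Because TRUE counts as 1 and FALSE as 0, sum() on a logical vector is the usual way to count matches within each group.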

Convert a date vector into Julian day in R

I have a column of dates in the format:
16Jun10
and I would like to extract the Julian day.
I have various years.
I have tried the functions julian and mdy.date and it doesn't seem to work.
Try the following to convert from class character (i.e. text) to class POSIXlt, and then extract the Julian day (yday):
tmp <- as.POSIXlt("16Jun10", format = "%d%b%y")
tmp$yday
# [1] 166
For more details on function settings:
?POSIXlt
?DateTimeClasses
Another option is to use the Date class, and then use format to extract the Julian day (notice that format reports the day of the year as 1:366, while the yday element of POSIXlt runs 0:365):
tmp <- as.Date("16Jun10", format = "%d%b%y")
format(tmp, "%j")
# [1] "167"
Similarly:
require(lubridate)
x = as.Date('2010-06-10')
yday(x)
[1] 161
Also note, using lubridate:
> dmy('16Jun10')
[1] "2010-06-16 UTC"
You can use R's insol package which has a JD(x, inverse=FALSE) function which converts POSIXct to Julian Day Number (JDN).
insol package also has JDymd(year,month,day,hour=12,minute=0,sec=0) for custom dates.
To display the whole Julian Date (JD) you possibly have to set options(digits=16).
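If installing a package is not an option, a Julian Day Number can also be sketched in base R. This relies on the fact that the Unix epoch, 1970-01-01 00:00 UTC, corresponds to Julian Date 2440587.5, so the Julian Day Number of 1970-01-01 is 2440588. The helper name to_jdn is hypothetical and not part of insol; it assumes Gregorian calendar dates.

```r
# Hypothetical helper: Julian Day Number for a Gregorian calendar date.
# A Date is stored as days since 1970-01-01, whose Julian Day Number is 2440588.
to_jdn <- function(d) as.numeric(as.Date(d)) + 2440588

# The two dates used as examples in the toJulian comments further down
jdn <- to_jdn(c("1959-05-24", "1992-12-16"))
jdn
```

For a full Julian Date with a time-of-day fraction, the same offset idea applies to POSIXct values (seconds since the epoch divided by 86400, plus 2440587.5).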
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1
1 3 1 2012
2 3 31 2012
3 4 1 2012
4 4 30 2012
5 5 1 2012
6 5 31 2012
7 6 1 2012
8 6 30 2012
9 7 1 2012
10 7 31 2012
", header = TRUE, stringsAsFactors = FALSE)
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1, my.data$DAY1, my.data$YEAR1))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$my.julian.date <- as.numeric(format(my.data$MY.DATE1, "%j"))
my.data
This returns the table below. Strictly speaking, these are not Julian dates, since Julian dates do not reset to 1 on the first day of each January:
http://en.wikipedia.org/wiki/Julian_day
The values below are ordinal dates (day of the year):
OBS MONTH1 DAY1 YEAR1 MY.DATE1 my.julian.date
1 1 3 1 2012 2012-03-01 61
2 2 3 31 2012 2012-03-31 91
3 3 4 1 2012 2012-04-01 92
4 4 4 30 2012 2012-04-30 121
5 5 5 1 2012 2012-05-01 122
6 6 5 31 2012 2012-05-31 152
7 7 6 1 2012 2012-06-01 153
8 8 6 30 2012 2012-06-30 182
9 9 7 1 2012 2012-07-01 183
10 10 7 31 2012 2012-07-31 213
Here are my R versions of code originally written in APL and converted to J. We call this pseudo-Julian because it is only intended for dates after October 15, 1582 which is when calendar reform, in some parts of the Western world, arbitrarily changed the date.
#* toJulian: convert 3-element c(Y,M,D) timestamp into pseudo-Julian day number.
toJulian<- function(TS3)
{ mm<- TS3[2]
xx<- 0
if( mm<=2) {xx<- 1}
mm<- (12*xx)+mm
yy<- TS3[1]-xx
nc<- floor(0.01*yy)
jd<- floor(365.25*yy)+floor(30.6001*(1+mm))+TS3[3]+1720995+(2-(nc-floor(0.25*nc)))
return(jd)
#EG toJulian c(1959,5,24) -> 2436713
#EG toJulian c(1992,12,16) -> 2448973
}
Here's the inverse function:
#* toGregorian: convert pseudo-Julian day number to timestamp in form c(Y,M,D)
# (>15 Oct 1582). Adapted from "Numerical Recipes in C" by Press,
# Teukolsky, et al.
toGregorian<- function(jdn)
{ igreg<- 2299161 # Gregorian calendar conversion day c(1582,10,15).
ja<- floor(jdn)
xx<- 0
if(igreg<=ja){xx<- 1}
jalpha<- floor((floor((xx*ja)-1867216)-0.25)/36524.25)
ja<- ((1-xx)*ja) + ((xx*ja)+1+jalpha-floor(0.25*jalpha))
jb<- ja+1524
jc<- floor(6680+((jb-2439870)-122.1)/365.25)
jd<- floor(365.25*jc)
je<- floor((jb-jd)/30.6001)
id<- floor((jb-jd)-floor(30.6001*je))
mm<- floor(je-1)
if(12<mm){mm<- mm-12}
iyyy<- floor(jc-4715)
if(mm>2){iyyy<- iyyy-1}
if(0>iyyy){iyyy<- iyyy-1}
gd<- c(iyyy, mm, id)
return(gd)
#EG toGregorian 2436713 -> c(1959,5,24)
#EG toGregorian 2448973 -> c(1992,12,16)
}

How to subset data.frame by weeks and then sum?

Let's say I have several years worth of data which look like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the bit that defines %U as the week number.
EDIT:
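Given the modified data and requirement, one way is to divide the date by the length of a week to get a number indicating the week. (More precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default.) This assumes df$date is a POSIXct date-time, which counts seconds; a Date column counts days, so there you would divide by 7 instead.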
In code:
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
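One way to get the Sunday offset without epoch arithmetic is to use the %w format code, which gives the weekday as a number with Sunday as 0. The sketch below is base R only, with the income values copied by hand from the table in the question; week_ending is just an illustrative column name.

```r
# Dates and incomes as shown in the question's table
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46,
            72, 93, 26, 46, 94, 98, 12, 47, 56, 90)
df <- data.frame(date = date, income = income)

# %w is the weekday as a number, 0 = Sunday ... 6 = Saturday
days_since_sunday <- as.integer(format(df$date, "%w"))

# Step back to the Sunday that opens each week, then forward to Saturday
df$week_ending <- df$date - days_since_sunday + 6

# Sum income within each Sunday-to-Saturday week
weekly <- aggregate(income ~ week_ending, data = df, FUN = sum)
weekly
```

Because the key is an actual date rather than a year-week string, weeks that straddle New Year stay in one group.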
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
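One thing to watch with cut() on dates: weeks start on Monday by default, while the question asks for Sunday through Saturday. A minimal base R sketch, assuming the start.on.monday argument of cut() for Date objects is set to FALSE, with the incomes again copied by hand from the question:

```r
# Dates and incomes as shown in the question's table
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46,
            72, 93, 26, 46, 94, 98, 12, 47, 56, 90)

# Weeks start on Sunday instead of the default Monday
week <- cut(date, breaks = "week", start.on.monday = FALSE)

# One total per week; the names are the first day of each week
weekly <- tapply(income, week, sum)
weekly
```

Using summarise() rather than mutate() in the dplyr version would likewise return one row per week instead of repeating the total on every row.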
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there to use format(date, format = "%y%U").
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
df.index = df['week'] # set the dt variable as index
df.resample('W').sum() #sum using resample
With dplyr:
df %>%
arrange(date) %>%
mutate(week = as.numeric(date - date[1])%/%7) %>%
group_by(week) %>%
summarise(weekincome= sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.
