I am using data.table for the first time.
I have a column of about 400,000 birth dates in my table. I need to convert them to ages.
What is the best way to do this?
I've been thinking about this and have been dissatisfied with the two answers so far. I like using lubridate, as @KFB did, but I also want things wrapped up nicely in a function, as in my answer using the eeptools package. So here's a wrapper function using the lubridate interval method with some nice options:
#' Calculate age
#'
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \code{\link{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = lubridate::today(), units = "years", floor = TRUE) {
  calc.age <- lubridate::interval(dob, age.day) / lubridate::duration(num = 1, units = units)
  if (floor) return(as.integer(floor(calc.age)))
  return(calc.age)
}
Usage examples:
> my.dob <- as.Date('1983-10-20')
> age(my.dob)
[1] 31
> age(my.dob, floor = FALSE)
[1] 31.15616
> age(my.dob, units = "minutes")
[1] 16375680
> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26
From the comments of this blog entry, I found the age_calc function in the eeptools package. It takes care of edge cases (leap years, etc.), checks inputs and looks quite robust.
library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
age_calc(x) # default is age in months; enddate defaults to Sys.Date()
[1] 46.73333 224.83118
age_calc(x, units = "years") # but you can set it to years
[1] 3.893151 18.731507
floor(age_calc(x, units = "years"))
[1] 3 18
For your data
yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))
assuming you want age in integer years.
Assuming you have a data.table, you could do the following:
library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()
Option 1: use "as.period" from the lubridate package
X[, age := as.period(Sys.Date() - birth)][]
birth age
1: 1970-01-01 44y 0m 327d 0H 0M 0S
2: 1971-01-01 43y 0m 327d 6H 0M 0S
3: 1972-01-01 42y 0m 327d 12H 0M 0S
4: 1973-01-01 41y 0m 326d 18H 0M 0S
5: 1974-01-01 40y 0m 327d 0H 0M 0S
6: 1975-01-01 39y 0m 327d 6H 0M 0S
7: 1976-01-01 38y 0m 327d 12H 0M 0S
8: 1977-01-01 37y 0m 326d 18H 0M 0S
9: 1978-01-01 36y 0m 327d 0H 0M 0S
10: 1979-01-01 35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S
Option 2: if you do not like the format of Option 1, you could do the following:
yr = duration(num = 1, units = "years")
# interval() replaces new_interval(), which is no longer available in current lubridate
X[, age := interval(birth, Sys.Date())/yr][]
# you get
birth age
1: 1970-01-01 44.92603
2: 1971-01-01 43.92603
3: 1972-01-01 42.92603
4: 1973-01-01 41.92329
5: 1974-01-01 40.92329
6: 1975-01-01 39.92329
7: 1976-01-01 38.92329
8: 1977-01-01 37.92055
9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055
I believe Option 2 is the more desirable.
I prefer to do this using the lubridate package, borrowing syntax I originally encountered in another post.
It's necessary to standardize your input dates in terms of R date objects, preferably with the lubridate::mdy() or lubridate::ymd() or similar functions, as applicable. You can use the interval() function to generate an interval describing the time elapsed between the two dates, and then use the duration() function to define how this interval should be "diced".
I've summarized the simplest case for calculating an age from two dates below, using the most current syntax in R.
df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start = df$DOB, end = df$EndDate) /
  duration(num = 1, units = "years")
Age may be rounded down to the nearest complete integer using the base R floor() function, like so:
df$Calc_AgeF <- floor(df$Calc_Age)
Alternately, the digits= argument in the base R round() function can be used to round up or down, and specify the exact number of decimals in the returned value, like so:
df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer
It's worth noting that once the input dates are passed through the calculation step described above (i.e., the interval() and duration() functions), the returned value will be numeric and no longer a date object in R. This matters because lubridate::floor_date() works only on date-time objects, so the numeric result must be rounded with floor() or round() instead.
The above syntax works regardless of whether the input dates occur in a data.table or data.frame object.
I wanted an implementation that didn't increase my dependencies beyond data.table, which is usually my only dependency. data.table is needed only for its date helpers: year, month, and mday (day of the month).
Development function
This function is logically how I would think about someone's age. I start with [current year] - [birth year] - 1, then add 1 if they've already had their birthday in the current year. To check for that offset I start by considering month, then (if necessary) day of month.
Here is that step by step implementation:
agecalc <- function(origin, current){
  require(data.table)
  y <- year(current) - year(origin) - 1
  offset <- 0
  if(month(current) > month(origin)) offset <- 1
  # && is the scalar operator intended inside if(); this version handles one pair of dates
  if(month(current) == month(origin) &&
     mday(current) >= mday(origin)) offset <- 1
  age <- y + offset
  return(age)
}
Production function
This is the same logic refactored and vectorized:
agecalc <- function(origin, current){
require(data.table)
age <- year(current) - year(origin) - 1
ii <- (month(current) > month(origin)) | (month(current) == month(origin) &
mday(current) >= mday(origin))
age[ii] <- age[ii] + 1
return(age)
}
Experimental function that uses strings
You could also do a string comparison on the month / day part. Perhaps there are times when this is more efficient, for example if you had the year as a number and the birth date as a string.
agecalc_strings <- function(origin, current){
origin <- as.character(origin)
current <- as.character(current)
age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
if(substr(current, 6, 10) >= substr(origin, 6, 10)){
age <- age + 1
}
return(age)
}
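The string version above uses if(), so it handles only one pair of dates at a time. A vectorized sketch of the same idea follows. It is not part of the original answer: the name agecalc_strings_vec is hypothetical, and it assumes both inputs are Date objects or strings in YYYY-MM-DD format.

```r
agecalc_strings_vec <- function(origin, current) {
  origin  <- as.character(origin)
  current <- as.character(current)
  # Difference in calendar years
  years <- as.integer(substr(current, 1, 4)) - as.integer(substr(origin, 1, 4))
  # TRUE where the birthday has not yet occurred in the current year,
  # decided by comparing the "MM-DD" parts as strings
  not_yet <- substr(current, 6, 10) < substr(origin, 6, 10)
  # Subtracting a logical subtracts 1 where TRUE and 0 where FALSE
  years - not_yet
}

agecalc_strings_vec(c("1985-08-13", "2000-02-29"),
                    c("1986-08-12", "2001-03-01"))
```

Comparing the zero-padded "MM-DD" substrings works because both parts have fixed width, so string order matches calendar order.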
Some tests on the vectorized "production" version:
## Examples for specific dates to test the calculation with things like
## beginning and end of months, and leap years:
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))
## Testing every age for every day over several years
## This test requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d
Below is a trivial plot of ages as numeric and integer. As you can see the
integer ages are a sort of stair step pattern that is tangent to (but below) the
straight line of numeric ages.
# Column names match the test table d built above (d, age1_num, age1)
plot(as.numeric(age1_num) ~ d, data = d, type = "l",
     ylab = "ages", main = "ages plotted")
lines(age1 ~ d, data = d, col = "blue")
I wasn't happy with any of the responses when it comes to calculating the age in months or years, when dealing with leap years, so this is my function using the lubridate package.
Basically, it slices the interval between from and to into (up to) yearly chunks, and then adjusts the interval for whether that chunk is leap year or not. The total interval is the sum of the age of each chunk.
library(lubridate)
#' Get Age of Date relative to Another Date
#'
#' @param from,to the date or dates to consider
#' @param units the units to consider
#' @param floor logical as to whether to floor the result
#' @param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' @author Nicholas Hamilton
#' @export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {
#Account for Leap Year if Working in Months and Years
if(!simple && length(grep("^(month|year)",units)) > 0){
df = data.frame(from,to)
calc = sapply(1:nrow(df),function(r){
#Start and Finish Points
st = df[r,1]; fn = df[r,2]
#If there is no difference, age is zero
if(st == fn){ return(0) }
#If there is a difference, age is not zero and needs to be calculated
sign = +1 #Age Direction
if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign
#Determine the slice-points
mid = ceiling_date(seq(st,fn,by='year'),'year')
#Build the sequence
dates = unique( c(st,mid,fn) )
dates = dates[which(dates >= st & dates <= fn)]
#Determine the age of the chunks
chunks = sapply(head(seq_along(dates),-1),function(ix){
k = 365/( 365 + leap_year(dates[ix]) )
k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
})
#Sum the Chunks, and account for direction
sign*sum(chunks)
})
#If Simple Calculation or Not Months or Not years
}else{
calc = interval(from,to) / duration(num = 1, units = units)
}
if (floor) calc = as.integer(floor(calc))
calc
}
(Sys.Date() - yourDate) / 365.25
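The one-liner above divides by an average year length, so it can be off by a day around birthdays. A small base R illustration (the dates are made up for the example):

```r
dob <- as.Date("2001-01-01")
on  <- as.Date("2002-01-01")      # the first birthday

days <- as.numeric(on - dob)      # 365, because 2001 is not a leap year
approx_age <- days / 365.25       # slightly less than 1

floor(approx_age)                 # 0, although the person turns 1 on this day
```

If exact whole-year ages matter, a calendar-based comparison of year, month, and day avoids this.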
Probably the simplest way to get the time elapsed between two dates without any additional packages is the following. Note that the result is a difference in days, not years:
df$age = with(df, as.Date(date_2, "%Y-%m-%d") - as.Date(date_1, "%Y-%m-%d"))
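For an exact whole-year age using base R alone, one option is to compare the calendar fields returned by as.POSIXlt(). This is a minimal sketch rather than code from any answer here: the name age_years_base is hypothetical, and it assumes the inputs are Date objects or strings that as.Date() can parse.

```r
age_years_base <- function(dob, on = Sys.Date()) {
  dob <- as.POSIXlt(as.Date(dob))
  on  <- as.POSIXlt(as.Date(on))
  # Difference in calendar years
  age <- on$year - dob$year
  # TRUE where the birthday has not yet been reached in the year of `on`
  before_bday <- (on$mon < dob$mon) |
    (on$mon == dob$mon & on$mday < dob$mday)
  # Subtract 1 where the birthday is still ahead
  age - before_bday
}

age_years_base(as.Date(c("1985-08-13", "2000-02-29")),
               as.Date(c("1986-08-13", "2001-02-28")))
```

The function is vectorized, so it can be applied to a whole column at once. Someone born on February 29 is treated as having their birthday on March 1 in non-leap years.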
Here is a (I think simpler) solution using lubridate:
library(lubridate)
age <- function(dob, on.day=today()) {
intvl <- interval(dob, on.day)
prd <- as.period(intvl)
return(prd@year)
}
Note that age_calc from the eeptools package in particular fails on cases with the year 2000 around birthdays.
Some examples that don't work in age_calc:
library(lubridate)
library(eeptools)
age_calc(ymd("1997-04-21"), ymd("2000-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2019-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2016-04-21"), units = "years")
Some of the other solutions also produce output that is not what I would intuitively want for decimal ages when leap years are involved. I like @James_D's solution, which is precise and concise, but I wanted something where the decimal age is calculated as complete years plus the fraction of the year completed from their last birthday to their next birthday (which would be out of 365 or 366 days depending on the year). In the case of leap years I use lubridate's rollback function to use March 1st for non-leap years following February 29th. I used some test cases from @geneorama and added some of my own, and the output aligns with what I would expect.
library(lubridate)
# Calculate precise age from birthdate in ymd format
age_calculation <- function(birth_date, later_year) {
if (birth_date > later_year)
{
stop("Birth date is after the desired date!")
}
# Calculate the most recent birthday of the person based on the desired year
latest_bday <- ymd(add_with_rollback(birth_date, years((year(later_year) - year(birth_date))), roll_to_first = TRUE))
# Get amount of days between the desired date and the latest birthday
days_between <- as.numeric(days(later_year - latest_bday), units = "days")
# Get how many days are in the year between their most recent and next bdays
year_length <- as.numeric(days((add_with_rollback(latest_bday, years(1), roll_to_first = TRUE)) - latest_bday), units = "days")
# Get the year fraction (amount of year completed before next birthday)
fraction_year <- days_between/year_length
# Sum the difference of years with the year fraction
age_sum <- (year(later_year) - year(birth_date)) + fraction_year
return(age_sum)
}
test_list <- list(c("1985-08-13", "1986-08-12"),
c("1985-08-13", "1985-08-13"),
c("1985-08-13", "1986-08-13"),
c("1985-08-13", "1986-09-12"),
c("2000-02-29", "2000-02-29"),
c("2000-02-29", "2000-03-01"),
c("2000-02-29", "2001-02-28"),
c("2000-02-29", "2004-02-29"),
c("2000-02-29", "2011-03-01"),
c("1997-04-21", "2000-04-21"),
c("2000-04-21", "2016-04-21"),
c("2000-04-21", "2019-04-21"),
c("2017-06-15", "2018-04-30"),
c("2019-04-20", "2019-08-24"),
c("2020-05-25", "2021-11-25"),
c("2020-11-25", "2021-11-24"),
c("2020-11-24", "2020-11-25"),
c("2020-02-28", "2020-02-29"),
c("2020-02-29", "2020-02-28"))
for (i in 1:length(test_list))
{
print(paste0("Dates from ", test_list[[i]][1], " to ", test_list[[i]][2]))
result <- age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2]))
print(result)
}
Output:
[1] "Dates from 1985-08-13 to 1986-08-12"
[1] 0.9972603
[1] "Dates from 1985-08-13 to 1985-08-13"
[1] 0
[1] "Dates from 1985-08-13 to 1986-08-13"
[1] 1
[1] "Dates from 1985-08-13 to 1986-09-12"
[1] 1.082192
[1] "Dates from 2000-02-29 to 2000-02-29"
[1] 0
[1] "Dates from 2000-02-29 to 2000-03-01"
[1] 0.00273224
[1] "Dates from 2000-02-29 to 2001-02-28"
[1] 0.9972603
[1] "Dates from 2000-02-29 to 2004-02-29"
[1] 4
[1] "Dates from 2000-02-29 to 2011-03-01"
[1] 11
[1] "Dates from 1997-04-21 to 2000-04-21"
[1] 3
[1] "Dates from 2000-04-21 to 2016-04-21"
[1] 16
[1] "Dates from 2000-04-21 to 2019-04-21"
[1] 19
[1] "Dates from 2017-06-15 to 2018-04-30"
[1] 0.8739726
[1] "Dates from 2019-04-20 to 2019-08-24"
[1] 0.3442623
[1] "Dates from 2020-05-25 to 2021-11-25"
[1] 1.50411
[1] "Dates from 2020-11-25 to 2021-11-24"
[1] 0.9972603
[1] "Dates from 2020-11-24 to 2020-11-25"
[1] 0.002739726
[1] "Dates from 2020-02-28 to 2020-02-29"
[1] 0.00273224
[1] "Dates from 2020-02-29 to 2020-02-28"
Error in age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2])) :
Birth date is after the desired date!
As others have been saying, the trunc function is excellent to get integer age.
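As a small illustration of that point: trunc() and floor() agree for non-negative values and differ only for negative ones, which can arise if a date falls before the birth date. The values below are made up for the example.

```r
x <- c(31.9, 0.4, -0.4)

floor(x)   # rounds toward minus infinity: 31  0 -1
trunc(x)   # rounds toward zero:           31  0  0
```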
I realise there are a lot of answers but since I can't help myself, I might as well add to the discussion.
I'm building a package that's focused on dates and datetimes and in it I use a function called time_diff(). Here is a simplified version.
time_diff <- function(x, y, units, num = 1,
type = c("duration", "period"),
as_period = FALSE){
type <- match.arg(type)
units <- match.arg(units, c("picoseconds", "nanoseconds", "microseconds",
"milliseconds", "seconds", "minutes", "hours", "days",
"weeks", "months", "years"))
int <- lubridate::interval(x, y)
if (as_period || type == "period"){
if (as_period) int <- lubridate::as.period(int, unit = units)
unit <- lubridate::period(num = num, units = units)
} else {
unit <- do.call(get(paste0("d", units),
asNamespace("lubridate")),
list(x = num))
}
out <- int / unit
out
}
# Wrapper around the more general time_diff
age_years <- function(x, y){
trunc(time_diff(x, y, units = "years", num = 1,
type = "period", as_period = TRUE))
}
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
bday <- dmy("01-01-2000")
time_diff(bday, today(), "years", type = "period")
#> [1] 23.11233
leap1 <- dmy("29-02-2020")
leap2 <- dmy("28-02-2021")
leap3 <- dmy("01-03-2021")
# Many people might say this is wrong so use the more exact age_years
time_diff(leap1, leap2, "years", type = "period")
#> [1] 1
# age in years, accounting for leap years properly
age_years(leap1, leap2)
#> [1] 0
age_years(leap1, leap3)
#> [1] 1
# So to add a column of ages in years, one can do this..
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_data <- tibble(dob = seq(bday, today(), by = "day"))
my_data <- my_data %>%
mutate(age_years = age_years(dob, today()))
slice_head(my_data, n = 10)
#> # A tibble: 10 x 2
#> dob age_years
#> <date> <dbl>
#> 1 2000-01-01 23
#> 2 2000-01-02 23
#> 3 2000-01-03 23
#> 4 2000-01-04 23
#> 5 2000-01-05 23
#> 6 2000-01-06 23
#> 7 2000-01-07 23
#> 8 2000-01-08 23
#> 9 2000-01-09 23
#> 10 2000-01-10 23
Created on 2023-02-11 with reprex v2.0.2
Related
I am using data.table for the first time.
I have a column of about 400,000 ages in my table. I need to convert them from birth dates to ages.
What is the best way to do this?
I've been thinking about this and have been dissatisfied with the two answers so far. I like using lubridate, as #KFB did, but I also want things wrapped up nicely in a function, as in my answer using the eeptools package. So here's a wrapper function using the lubridate interval method with some nice options:
#' Calculate age
#'
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' #param dob date-of-birth, the day to start calculating age.
#' #param age.day the date on which age is to be calculated.
#' #param units unit to measure age in. Defaults to \code{"years"}. Passed to \link{\code{duration}}.
#' #param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' #return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' #examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = today(), units = "years", floor = TRUE) {
calc.age = lubridate::interval(dob, age.day) / lubridate::duration(num = 1, units = units)
if (floor) return(as.integer(floor(calc.age)))
return(calc.age)
}
Usage examples:
> my.dob <- as.Date('1983-10-20')
> age(my.dob)
[1] 31
> age(my.dob, floor = FALSE)
[1] 31.15616
> age(my.dob, units = "minutes")
[1] 16375680
> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26
From the comments of this blog entry, I found the age_calc function in the eeptools package. It takes care of edge cases (leap years, etc.), checks inputs and looks quite robust.
library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
age_calc(x[1],x[2]) # default is age in months
[1] 46.73333 224.83118
age_calc(x[1],x[2], units = "years") # but you can set it to years
[1] 3.893151 18.731507
floor(age_calc(x[1],x[2], units = "years"))
[1] 3 18
For your data
yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))
assuming you want age in integer years.
Assume you have a data.table, you could do below:
library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()
Option 1 : use "as.period" from lubriate package
X[, age := as.period(Sys.Date() - birth)][]
birth age
1: 1970-01-01 44y 0m 327d 0H 0M 0S
2: 1971-01-01 43y 0m 327d 6H 0M 0S
3: 1972-01-01 42y 0m 327d 12H 0M 0S
4: 1973-01-01 41y 0m 326d 18H 0M 0S
5: 1974-01-01 40y 0m 327d 0H 0M 0S
6: 1975-01-01 39y 0m 327d 6H 0M 0S
7: 1976-01-01 38y 0m 327d 12H 0M 0S
8: 1977-01-01 37y 0m 326d 18H 0M 0S
9: 1978-01-01 36y 0m 327d 0H 0M 0S
10: 1979-01-01 35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S
Option 2 : if you do not like the format of Option 1, you could do below:
yr = duration(num = 1, units = "years")
X[, age := new_interval(birth, Sys.Date())/yr][]
# you get
birth age
1: 1970-01-01 44.92603
2: 1971-01-01 43.92603
3: 1972-01-01 42.92603
4: 1973-01-01 41.92329
5: 1974-01-01 40.92329
6: 1975-01-01 39.92329
7: 1976-01-01 38.92329
8: 1977-01-01 37.92055
9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055
Believe Option 2 should be the more desirable.
I prefer to do this using the lubridate package, borrowing syntax I originally encountered in another post.
It's necessary to standardize your input dates in terms of R date objects, preferably with the lubridate::mdy() or lubridate::ymd() or similar functions, as applicable. You can use the interval() function to generate an interval describing the time elapsed between the two dates, and then use the duration() function to define how this interval should be "diced".
I've summarized the simplest case for calculating an age from two dates below, using the most current syntax in R.
df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start= df$DOB, end=df$EndDate)/
duration(n=1, unit="years")
Age may be rounded down to the nearest complete integer using the base R 'floor()` function, like so:
df$Calc_AgeF <- floor(df$Calc_Age)
Alternately, the digits= argument in the base R round() function can be used to round up or down, and specify the exact number of decimals in the returned value, like so:
df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer
It's worth noting that once the input dates are passed through the calculation step described above (i.e., interval() and duration() functions) , the returned value will be numeric and no longer a date object in R. This is significant whereas the lubridate::floor_date() is limited strictly to date-time objects.
The above syntax works regardless whether the input dates occur in a data.table or data.frame object.
I wanted an implementation that didn't increase my dependencies beyond data.table, which is usually my only dependency. The data.table is only needed for mday, which means day of the month.
Development function
This function is logically how I would think about someone's age. I start with [current year] - [brith year] - 1, then add 1 if they've already had their birthday in the current year. To check for that offset I start by considering month, then (if necessary) day of month.
Here is that step by step implementation:
agecalc <- function(origin, current){
require(data.table)
y <- year(current) - year(origin) - 1
offset <- 0
if(month(current) > month(origin)) offset <- 1
if(month(current) == month(origin) &
mday(current) >= mday(origin)) offset <- 1
age <- y + offset
return(age)
}
Production function
This is the same logic refactored and vectorized:
agecalc <- function(origin, current){
require(data.table)
age <- year(current) - year(origin) - 1
ii <- (month(current) > month(origin)) | (month(current) == month(origin) &
mday(current) >= mday(origin))
age[ii] <- age[ii] + 1
return(age)
}
Experimental function that uses strings
You could also do a string comparison on the month / day part. Perhaps there are times when this is more efficient, for example if you had the year as a number and the birth date as a string.
agecalc_strings <- function(origin, current){
origin <- as.character(origin)
current <- as.character(current)
age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
if(substr(current, 6, 10) >= substr(origin, 6, 10)){
age <- age + 1
}
return(age)
}
Some tests on the vectorized "production" version:
## Examples for specific dates to test the calculation with things like
## beginning and end of months, and leap years:
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))
## Testing every age for every day over several years
## This test requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d
Below is a trivial plot of ages as numeric and integer. As you can see the
integer ages are a sort of stair step pattern that is tangent to (but below) the
straight line of numeric ages.
plot(numeric_age1 ~ today, dt, type = "l",
ylab = "ages", main = "ages plotted")
lines(integer_age1 ~ today, dt, col = "blue")
I wasn't happy with any of the responses when it comes to calculating the age in months or years, when dealing with leap years, so this is my function using the lubridate package.
Basically, it slices the interval between from and to into (up to) yearly chunks, and then adjusts the interval for whether that chunk is leap year or not. The total interval is the sum of the age of each chunk.
library(lubridate)
#' Get Age of Date relative to Another Date
#'
#' #param from,to the date or dates to consider
#' #param units the units to consider
#' #param floor logical as to whether to floor the result
#' #param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' #author Nicholas Hamilton
#' #export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {
#Account for Leap Year if Working in Months and Years
if(!simple && length(grep("^(month|year)",units)) > 0){
df = data.frame(from,to)
calc = sapply(1:nrow(df),function(r){
#Start and Finish Points
st = df[r,1]; fn = df[r,2]
#If there is no difference, age is zero
if(st == fn){ return(0) }
#If there is a difference, age is not zero and needs to be calculated
sign = +1 #Age Direction
if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign
#Determine the slice-points
mid = ceiling_date(seq(st,fn,by='year'),'year')
#Build the sequence
dates = unique( c(st,mid,fn) )
dates = dates[which(dates >= st & dates <= fn)]
#Determine the age of the chunks
chunks = sapply(head(seq_along(dates),-1),function(ix){
k = 365/( 365 + leap_year(dates[ix]) )
k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
})
#Sum the Chunks, and account for direction
sign*sum(chunks)
})
#If Simple Calculation or Not Months or Not years
}else{
calc = interval(from,to) / duration(num = 1, units = units)
}
if (floor) calc = as.integer(floor(calc))
calc
}
(Sys.Date() - yourDate) / 365.25
A very simple way of calculating the age from two dates without using any additional packages probably is:
df$age = with(df, as.Date(date_2, "%Y-%m-%d") - as.Date(date_1, "%Y-%m-%d"))
Here is a (I think simpler) solution using lubridate:
library(lubridate)
age <- function(dob, on.day=today()) {
intvl <- interval(dob, on.day)
prd <- as.period(intvl)
return(prd#year)
}
Note that age_calc from the eeptools package in particular fails on cases with the year 2000 around birthdays.
Some examples that don't work in age_calc:
library(lubridate)
library(eeptools)
age_calc(ymd("1997-04-21"), ymd("2000-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2019-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2016-04-21"), units = "years")
Some of the other solutions also have some output that is not intuitive to what I would want for decimal ages when leap years are involved. I like #James_D 's solution and it is precise and concise, but I wanted something where the decimal age is calculated as complete years plus the fraction of the year completed from their last birthday to their next birthday (which would be out of 365 or 366 days depending on year). In the case of leap years I use lubridate's rollback function to use March 1st for non-leap years following February 29th. I used some test cases from #geneorama and added some of my own, and the output aligns with what I would expect.
library(lubridate)
# Calculate precise age from birthdate in ymd format
age_calculation <- function(birth_date, later_year) {
if (birth_date > later_year)
{
stop("Birth date is after the desired date!")
}
# Calculate the most recent birthday of the person based on the desired year
latest_bday <- ymd(add_with_rollback(birth_date, years((year(later_year) - year(birth_date))), roll_to_first = TRUE))
# Get amount of days between the desired date and the latest birthday
days_between <- as.numeric(days(later_year - latest_bday), units = "days")
# Get how many days are in the year between their most recent and next bdays
year_length <- as.numeric(days((add_with_rollback(latest_bday, years(1), roll_to_first = TRUE)) - latest_bday), units = "days")
# Get the year fraction (amount of year completed before next birthday)
fraction_year <- days_between/year_length
# Sum the difference of years with the year fraction
age_sum <- (year(later_year) - year(birth_date)) + fraction_year
return(age_sum)
}
test_list <- list(c("1985-08-13", "1986-08-12"),
c("1985-08-13", "1985-08-13"),
c("1985-08-13", "1986-08-13"),
c("1985-08-13", "1986-09-12"),
c("2000-02-29", "2000-02-29"),
c("2000-02-29", "2000-03-01"),
c("2000-02-29", "2001-02-28"),
c("2000-02-29", "2004-02-29"),
c("2000-02-29", "2011-03-01"),
c("1997-04-21", "2000-04-21"),
c("2000-04-21", "2016-04-21"),
c("2000-04-21", "2019-04-21"),
c("2017-06-15", "2018-04-30"),
c("2019-04-20", "2019-08-24"),
c("2020-05-25", "2021-11-25"),
c("2020-11-25", "2021-11-24"),
c("2020-11-24", "2020-11-25"),
c("2020-02-28", "2020-02-29"),
c("2020-02-29", "2020-02-28"))
for (i in 1:length(test_list))
{
print(paste0("Dates from ", test_list[[i]][1], " to ", test_list[[i]][2]))
result <- age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2]))
print(result)
}
Output:
[1] "Dates from 1985-08-13 to 1986-08-12"
[1] 0.9972603
[1] "Dates from 1985-08-13 to 1985-08-13"
[1] 0
[1] "Dates from 1985-08-13 to 1986-08-13"
[1] 1
[1] "Dates from 1985-08-13 to 1986-09-12"
[1] 1.082192
[1] "Dates from 2000-02-29 to 2000-02-29"
[1] 0
[1] "Dates from 2000-02-29 to 2000-03-01"
[1] 0.00273224
[1] "Dates from 2000-02-29 to 2001-02-28"
[1] 0.9972603
[1] "Dates from 2000-02-29 to 2004-02-29"
[1] 4
[1] "Dates from 2000-02-29 to 2011-03-01"
[1] 11
[1] "Dates from 1997-04-21 to 2000-04-21"
[1] 3
[1] "Dates from 2000-04-21 to 2016-04-21"
[1] 16
[1] "Dates from 2000-04-21 to 2019-04-21"
[1] 19
[1] "Dates from 2017-06-15 to 2018-04-30"
[1] 0.8739726
[1] "Dates from 2019-04-20 to 2019-08-24"
[1] 0.3442623
[1] "Dates from 2020-05-25 to 2021-11-25"
[1] 1.50411
[1] "Dates from 2020-11-25 to 2021-11-24"
[1] 0.9972603
[1] "Dates from 2020-11-24 to 2020-11-25"
[1] 0.002739726
[1] "Dates from 2020-02-28 to 2020-02-29"
[1] 0.00273224
[1] "Dates from 2020-02-29 to 2020-02-28"
Error in age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2])) :
Birth date is after the desired date!
As others have been saying, the trunc function is excellent to get integer age.
I realise there are a lot of answers but since I can't help myself, I might as well add to the discussion.
I'm building a package that's focused on dates and datetimes and in it I use a function called time_diff(). Here is a simplified version.
time_diff <- function(x, y, units, num = 1,
type = c("duration", "period"),
as_period = FALSE){
type <- match.arg(type)
units <- match.arg(units, c("picoseconds", "nanoseconds", "microseconds",
"milliseconds", "seconds", "minutes", "hours", "days",
"weeks", "months", "years"))
int <- lubridate::interval(x, y)
if (as_period || type == "period"){
if (as_period) int <- lubridate::as.period(int, unit = units)
unit <- lubridate::period(num = num, units = units)
} else {
unit <- do.call(get(paste0("d", units),
asNamespace("lubridate")),
list(x = num))
}
out <- int / unit
out
}
# Wrapper around the more general time_diff
age_years <- function(x, y){
trunc(time_diff(x, y, units = "years", num = 1,
type = "period", as_period = TRUE))
}
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
bday <- dmy("01-01-2000")
time_diff(bday, today(), "years", type = "period")
#> [1] 23.11233
leap1 <- dmy("29-02-2020")
leap2 <- dmy("28-02-2021")
leap3 <- dmy("01-03-2021")
# Many people might say this is wrong so use the more exact age_years
time_diff(leap1, leap2, "years", type = "period")
#> [1] 1
# age in years, accounting for leap years properly
age_years(leap1, leap2)
#> [1] 0
age_years(leap1, leap3)
#> [1] 1
# So to add a column of ages in years, one can do this..
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_data <- tibble(dob = seq(bday, today(), by = "day"))
my_data <- my_data %>%
mutate(age_years = age_years(dob, today()))
slice_head(my_data, n = 10)
#> # A tibble: 10 x 2
#> dob age_years
#> <date> <dbl>
#> 1 2000-01-01 23
#> 2 2000-01-02 23
#> 3 2000-01-03 23
#> 4 2000-01-04 23
#> 5 2000-01-05 23
#> 6 2000-01-06 23
#> 7 2000-01-07 23
#> 8 2000-01-08 23
#> 9 2000-01-09 23
#> 10 2000-01-10 23
Created on 2023-02-11 with reprex v2.0.2
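For comparison, here is a minimal base R sketch of the same "completed years" rule, with no lubridate dependency. The function name age_completed_years is hypothetical and not part of the answer above. It compares calendar fields only, so it assumes that a 29 February birthday is reached on 1 March in non-leap years, which agrees with the age_years() results shown above.

```r
# Completed years between a birth date and a reference date (base R only).
# `age_completed_years` is an illustrative name, not an existing function.
age_completed_years <- function(dob, on) {
  dob <- as.POSIXlt(dob)
  on  <- as.POSIXlt(on)
  years <- on$year - dob$year
  # Subtract one year if the birthday has not yet occurred in the `on` year.
  before_birthday <- (on$mon < dob$mon) |
    (on$mon == dob$mon & on$mday < dob$mday)
  years - as.integer(before_birthday)
}

age_completed_years(as.Date("2020-02-29"), as.Date("2021-02-28"))
age_completed_years(as.Date("2020-02-29"), as.Date("2021-03-01"))
```

Because it only uses vectorised arithmetic, the same call should work unchanged on a whole column of birth dates.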
Problem description
I work with thrice-monthly data a lot. Thrice monthly (roughly every 10 days, also referred to as a dekad) is the typical reporting interval for water-related data in the former Soviet Union and for many other climate and water data sets around the world. Below is an example data set with 2 variables:
> date = unique(floor_date(seq.Date(as.Date("2019-01-01"), as.Date("2019-12-31"),
by="day"), "10days"))
> example_data <- tibble(
date = date[day(date)!=31],
value = seq(1,36,1),
var = "A") %>%
add_row(tibble(
date = date[day(date)!=31],
value = seq(10,360,10),
var = "B"))
> example_data
# A tibble: 72 x 3
# Groups: var [2]
date value var
<ord> <dbl> <chr>
1 2019-01-01 1 A
2 2019-01-01 10 B
3 2019-01-11 2 A
4 2019-01-11 20 B
5 2019-01-21 3 A
6 2019-01-21 30 B
7 2019-02-01 4 A
8 2019-02-01 40 B
9 2019-02-11 5 A
10 2019-02-11 50 B
# … with 62 more rows
In the example I chose the 1st, 11th, and 21st to date the dekads, but it would actually be more appropriate to index them as dekad 1 to 3 per month (analogous to months 1 to 12 per year) or as dekad 1 to 36 per year (analogous to day of the year). The most elegant solution would be to have a proper date format for dekadal data like yearmonth in lubridate. However, lubridate may not plan to support dekadal data in the near future (github conversation).
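The two indexing schemes described above can be sketched in base R as follows. The helper names dekad_of_month and dekad_of_year are hypothetical, and the sketch assumes that day 31 belongs to the third dekad of its month.

```r
# Dekad within the month (1 to 3); day 31 is folded into dekad 3 (assumption).
dekad_of_month <- function(x) {
  mday <- as.POSIXlt(x)$mday
  pmin((mday - 1L) %/% 10L + 1L, 3L)
}

# Dekad within the year (1 to 36): three dekads per month.
dekad_of_year <- function(x) {
  as.POSIXlt(x)$mon * 3L + dekad_of_month(x)
}

d <- as.Date(c("2019-01-01", "2019-01-11", "2019-01-31", "2019-12-21"))
dekad_of_month(d)  # 1 2 3 3
dekad_of_year(d)   # 1 2 3 36
```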
I have workflows using tsibble and timetk which work well with monthly data but it would really be more appropriate to work with the original dekadal time steps and I'm looking for a way to be able to use the tidyverse functions with dekadal data with as few cumbersome workarounds as possible.
The problem with using daily dates for dekadal data in tsibble is that it identifies the time interval as daily, and you get a lot of data gaps between your 3 values per month:
> example_data_tsbl <- as_tsibble(example_data, index = date, key = var)
> count_gaps(example_data_tsbl, .full = FALSE)
# A tibble: 70 x 4
var .from .to .n
<chr> <date> <date> <int>
1 A 2019-01-02 2019-01-10 9
2 A 2019-01-12 2019-01-20 9
3 A 2019-01-22 2019-01-31 10
# …
Here's what I did so far:
1. I saw here the possibility to define ordered factors as indices in tsibble, but timetk does not recognise factors as indices. timetk suggests defining custom indices (see 2.).
2. There is the possibility to add custom indices to tsibble, but I haven't found examples of this and I don't understand how I have to use these functions (a vignette is still planned). I have started reading the code to try to understand how to use the functions to get support for dekadal data, but I'm a bit overwhelmed.
Questions
Will dekadal custom indices in tsibble behave similarly to yearmonth or yearweek?
Would anyone here have an example to share on how to add custom indices to tsibble?
Or does anyone know of another way to elegantly handle dekadal data in the tidyverse?
This doesn't discuss tsibbles but it was too long for a comment and does provide an alternative.
zoo can do this either (1) with the code below, which does not require creating a new class, or (2) by creating a new class and methods. For the second alternative, implementing the same methods that the yearmon class has would be sufficient. See here. zoo itself does not have to be modified.
As we see below, for the first approach dates will be shown as year(cycle) where cycle is 1, 2, ..., 36. Internally the dates are stored as year + (cycle-1)/36 .
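As a small worked example of that internal representation (a sketch only; the variable names are illustrative), the dekad starting on 2019-02-11 is cycle 5 of 2019 and is stored as 2019 + 4/36:

```r
# Encode: dekad 5 of 2019 as year + (cycle - 1) / 36
year  <- 2019
cycle <- 5
value <- year + (cycle - 1) / 36

# Decode: the integer part is the year, the fractional part gives the cycle.
# The small offset and round() guard against floating-point error.
decoded_year  <- floor(value + 0.001)
decoded_cycle <- round(36 * (value %% 1)) + 1

c(decoded_year, decoded_cycle)
```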
It would also be possible to use the ts class if the dates are consecutive month thirds (or, if they are not, if you don't mind NAs being inserted to make them so). For that, use as.ts(z).
Start a fresh session with no packages loaded and then copy and paste the input DF shown in the Note at the end and then this code. Date2dek will convert a Date vector or a character vector representing dates in standard yyyy-mm-dd format to a dek format which is described above. dek2Date performs the inverse transformation. It is not actually used below but might be useful.
library(zoo)
# convert Date or yyyy-mm-dd char vector
Date2dek <- function(x, ...) with(as.POSIXlt(x, tz="GMT"),
1900 + year + (mon + ((mday >= 11) + (mday >= 21)) / 3) / 12)
dek2Date <- function(x, ...) { # not used below but shows inverse
cyc <- round(36 * (as.numeric(x) %% 1)) + 1
if(all(is.na(x))) return(as.Date(x))
month <- (cyc - 1) %/% 3 + 1
day <- 10 * ((cyc - 1) %% 3) + 1
year <- floor(x + .001)
ix <- !is.na(year)
as.Date(paste(year[ix], month[ix], day[ix], sep = "-"))
}
# DF given in Note below
z <- read.zoo(DF, split = "var", FUN = Date2dek, regular = TRUE, freq = 36)
z
The result is the following zooreg object:
A B
2019(1) 1 10
2019(2) 2 20
2019(3) 3 30
2019(4) 4 40
2019(5) 5 50
Note
DF <- data.frame(
date = as.Date(ISOdate(2019, rep(1:2, 3:2), c(1, 11, 21))),
value = c(1:5, 10*(1:5)),
var = rep(c("A", "B"), each = 5))
Extending tsibble to support a new index requires defining methods for these generics:
index_valid() - This method should return TRUE if the class is acceptable as an index
interval_pull() - This method accepts your index values and computes the interval of the data. The interval can be created using tsibble::new_interval(). You may find tsibble::gcd_interval() useful for computing the smallest interval.
seq() and + - These methods are used to produce future time values using the new_data() function.
A minimal example of a new tsibble index class for 'year' is as follows:
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, union
library(vctrs)
# Object creation function
my_year <- function(x = integer()) {
x <- vec_cast(x, integer())
vctrs::new_vctr(x, class = "year")
}
# Declare this class as a valid index
index_valid.year <- function(x) TRUE
# Compute the interval of a year input
interval_pull.year <- function(x) {
tsibble::new_interval(
year = tsibble::gcd_interval(vec_data(x))
)
}
# Specify how sequences are generated from years
seq.year <- function(from, to, by, length.out = NULL, along.with = NULL, ...) {
from <- vec_data(from)
if (!rlang::is_missing(to)) {
vec_assert(to, my_year())
to <- vec_data(to)
}
my_year(NextMethod())
}
# Define `+` operation as needed for `new_data()`
vec_arith.year <- function(op, x, y, ...) {
my_year(vec_arith(op, vec_data(x), vec_data(y), ...))
}
# Use the new index class
x <- tsibble::tsibble(
year = my_year(c(2018, 2020, 2024)),
y = rnorm(3),
index = "year"
)
x
#> # A tsibble: 3 x 2 [2Y]
#> year y
#> <year> <dbl>
#> 1 2018 0.211
#> 2 2020 -0.410
#> 3 2024 0.333
interval(x)
#> <interval[1]>
#> [1] 2Y
new_data(x, 3)
#> # A tsibble: 3 x 1 [2Y]
#> year
#> <year>
#> 1 2026
#> 2 2028
#> 3 2030
Created on 2021-02-08 by the reprex package (v0.3.0)
I have estimated a variable age.first.union as a time difference using lubridate by subtracting the date of birth wdob from the date of wedding wdow. I got the following numeric vector
head(wm$age.first.union, 3)
[1] 15.43014 12.67123 17.34247
I would like to have the decimals converted into months (and possibly also into days, but that's a minor detail), so the first value would be 15 years and 5 months. What I did was to create a series of new variables and then perform some calculations. To get the number of months, first, I duplicated and truncated the age.first.union variable. Then I estimated the difference between the two to get only the decimal part and then used proportions (e.g. 0.43 : 10 = x : 12 ) to get the months.
I looked into the lubridate documentation but I could not find much on this. I tried the following
years(floor(dseconds(15.43014)))
but I got only the years
[1] "15y 0m 0d 0H 0M 0S"
One idea would be to get the durations in seconds
seconds(floor(dyears(15.43014)))
[1] "486604895S"
but then the challenge would be that months have different lengths. Even an approximation of years = 365 days and months = 30 days would be more than good enough, but I do not know how to do it apart from lengthy calculations.
One final idea would be to get years and months using the calculation described at the beginning of this post, and then merge the two variables into the final one using something similar to make_date (but a make_duration does not seem to exist yet).
The whole process looks quite cumbersome to me. Does anyone have a different take?
Many thanks
Manolo
While lubridate provides a function date_decimal to convert a decimal (fractional-year) date into a date-time, you seem to be dealing with durations rather than dates. So this won't work.
However, you can quite easily define a custom function to extract the integer year, month and fractional day (based on an average 30.42 days per month in a regular year):
age <- c(15.43014, 12.67123, 17.34247)
f <- function(x) {
year <- floor(x);
month <- floor((x - year) * 12);
day <- ((x - year) * 12 - month) * 30.42;
return(sprintf("%i years, %i months, %3.2f days", year, month, day))
}
lapply(age, f);
#[[1]]
#[1] "15 years, 5 months, 4.92 days"
#
#[[2]]
#[1] "12 years, 8 months, 1.67 days"
#
#[[3]]
#[1] "17 years, 4 months, 3.34 days"
Update
If you want to return the integer year, month and fractional day you can define f as
f <- function(x) {
year <- floor(x);
month <- floor((x - year) * 12);
day <- ((x - year) * 12 - month) * 30.42;
return(list(year = year, month = month, day = day))
}
which gives you e.g.
sapply(age, f);
# [,1] [,2] [,3]
#year 15 12 17
#month 5 8 4
#day 4.918306 1.665799 3.335249
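Since floor() and the arithmetic involved are vectorised, the same split can also be computed on the whole vector at once and collected in a data frame, without lapply or sapply. This is a sketch under the same assumptions as above (12 equal months per year, 30.42 days per month); the column names are illustrative.

```r
age <- c(15.43014, 12.67123, 17.34247)

year  <- floor(age)                            # whole years
month <- floor((age - year) * 12)              # whole months within the year
day   <- ((age - year) * 12 - month) * 30.42   # remaining fractional days

parts <- data.frame(age, year, month, day)
parts
```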
We can define our own ym S3 class to represent year/month objects. Here we define several ym methods as well as extractor functions for years and months. The as.data.frame.ym method is a partial implementation. We have defined a month to be 1/12th of a year.
as.ym <- function(x, ...) structure(x, class = "ym")
as.data.frame.ym <- function(x, ...)
structure(list(x), row.names = seq_along(x), class = "data.frame")
years.ym <- as.integer
months.ym <- function(x) 12 * as.numeric(x) %% 1
format.ym <- function(x, ...) paste0(years.ym(x), "Y ", round(months.ym(x)), "M")
print.ym <- function(x, ...) print(format(x), ...)
# test
x <- c(15.43014, 12.67123, 17.34247) # test input
xx <- as.ym(x)
xx
## [1] "15Y 5M" "12Y 8M" "17Y 4M"
DF <- data.frame(x, xx)
DF
x xx
1 15.43014 15Y 5M
2 12.67123 12Y 8M
3 17.34247 17Y 4M
years.ym(xx)
## [1] 15 12 17
months.ym(xx)
## [1] 5.16168 8.05476 4.10964
class(xx)
## [1] "ym"
Days
To extend this to include days, as well, we assume that there are 365.25 days in a year and, again, we use 12 months in a year. We create a ymd S3 class for this.
as.ymd <- function(x, ...) structure(x, class = "ymd")
as.data.frame.ymd <- function(x, ...)
structure(list(x), row.names = seq_along(x), class = "data.frame")
years.ymd <- as.integer
months.ymd <- function(x) as.integer(12 * as.numeric(x) %% 1)
days.ymd <- function(x) (365.25 * as.numeric(x)) %% (365.25 / 12)
format.ymd <- function(x, ...)
paste0(years.ymd(x), "Y ", as.integer(months.ymd(x)), "M ", round(days.ymd(x), 1), "D")
print.ymd <- function(x, ...) print(format(x), ...)
xx <- as.ymd(x)
xx
## [1] "15Y 5M 4.9D" "12Y 8M 1.7D" "17Y 4M 3.3D"
DF <- data.frame(x, xx)
DF
x xx
1 15.43014 15Y 5M 4.9D
2 12.67123 12Y 8M 1.7D
3 17.34247 17Y 4M 3.3D
years.ymd(xx)
## [1] 15 12 17
months.ymd(xx)
## [1] 5 8 4
days.ymd(xx)
## [1] 4.921135 1.666758 3.337167
class(xx)
## [1] "ymd"
I have two dates, let's say 14.01.2013 and 26.03.2014.
I would like to get the difference between those two dates in terms of weeks(?), months(in the example 14), quarters(4) and years(1).
Do you know the best way to get this?
What about this:
# get difference between dates `"14.01.2013"` and `"26.03.2014"`
# weeks
difftime(strptime("26.03.2014", format = "%d.%m.%Y"),
strptime("14.01.2013", format = "%d.%m.%Y"),units="weeks")
Time difference of 62.28571 weeks
# months
(as.yearmon(strptime("26.03.2014", format = "%d.%m.%Y"))-
as.yearmon(strptime("14.01.2013", format = "%d.%m.%Y")))*12
[1] 14
# quarters
(as.yearqtr(strptime("26.03.2014", format = "%d.%m.%Y"))-
as.yearqtr(strptime("14.01.2013", format = "%d.%m.%Y")))*4
[1] 4
# years
year(strptime("26.03.2014", format = "%d.%m.%Y"))-
year(strptime("14.01.2013", format = "%d.%m.%Y"))
[1] 1
as.yearmon() and as.yearqtr() are in package zoo. year() is in package lubridate.
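If installing zoo and lubridate is not an option, the same calendar-based counts can be sketched in base R from the POSIXlt fields. This is an illustration rather than the answer's method; the variable names are made up, and like the yearmon approach it counts calendar boundaries crossed and ignores the day of the month.

```r
d1 <- as.POSIXlt(as.Date("14.01.2013", format = "%d.%m.%Y"))
d2 <- as.POSIXlt(as.Date("26.03.2014", format = "%d.%m.%Y"))

# $year is years since 1900 and $mon is 0-based, but only differences matter.
year_diff    <- d2$year - d1$year
month_diff   <- 12 * year_diff + (d2$mon - d1$mon)
quarter_diff <- 4 * year_diff + (d2$mon %/% 3 - d1$mon %/% 3)

c(years = year_diff, months = month_diff, quarters = quarter_diff)
```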
What do you think?
All the existing answers are imperfect (IMO) and either make assumptions about the desired output or don't provide flexibility for the desired output.
Based on the examples from the OP, and the OP's stated expected answers, I think these are the answers you are looking for (plus some additional examples that make it easy to extrapolate).
(This only requires base R and doesn't require zoo or lubridate)
Convert to Datetime Objects
date_strings = c("14.01.2013", "26.03.2014")
datetimes = strptime(date_strings, format = "%d.%m.%Y") # convert to datetime objects
Difference in Days
You can use the diff in days to get some of our later answers
diff_in_days = difftime(datetimes[2], datetimes[1], units = "days") # days
diff_in_days
#Time difference of 435.9583 days
Difference in Weeks
Difference in weeks is a special case of units = "weeks" in difftime()
diff_in_weeks = difftime(datetimes[2], datetimes[1], units = "weeks") # weeks
diff_in_weeks
#Time difference of 62.27976 weeks
Note that this is the same as dividing our diff_in_days by 7 (7 days in a week)
as.double(diff_in_days)/7
#[1] 62.27976
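The fractional day count above (435.9583 rather than 436) is most likely a daylight-saving shift: strptime() returns date-times in the session's local time zone, so the result depends on where the code is run (this explanation is an assumption about the answerer's setup). A sketch that avoids the issue by working with Date objects, which carry no time of day:

```r
date_strings <- c("14.01.2013", "26.03.2014")
d <- as.Date(date_strings, format = "%d.%m.%Y")

# Whole calendar days between the two dates, independent of time zone
day_diff <- as.numeric(diff(d))

# Weeks, via difftime on Date objects
week_diff <- as.numeric(difftime(d[2], d[1], units = "weeks"))

day_diff   # 436
week_diff  # 436 / 7
```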
Difference in Years
With similar logic, we can derive years from diff_in_days
diff_in_years = as.double(diff_in_days)/365 # absolute years
diff_in_years
#[1] 1.194406
You seem to be expecting the diff in years to be "1", so I assume you just want to count absolute calendar years or something, which you can easily do by using floor()
# get desired output, given your definition of 'years'
floor(diff_in_years)
#[1] 1
Difference in Quarters
# get desired output for quarters, given your definition of 'quarters'
floor(diff_in_years * 4)
#[1] 4
Difference in Months
This can be calculated as a conversion from diff_in_years
# months, defined as absolute calendar months (this might be what you want, given your question details)
months_diff = diff_in_years*12
floor(months_diff)
#[1] 14
I know this question is old, but given that I still had to solve this problem just now, I thought I would add my answers. Hope it helps.
For weeks, you can use function difftime:
date1 <- strptime("14.01.2013", format="%d.%m.%Y")
date2 <- strptime("26.03.2014", format="%d.%m.%Y")
difftime(date2,date1,units="weeks")
Time difference of 62.28571 weeks
But difftime doesn't support units larger than weeks.
The following is a very suboptimal solution using cut.POSIXt for those durations but you can work around it:
seq1 <- seq(date1,date2, by="days")
nlevels(cut(seq1,"months"))
15
nlevels(cut(seq1,"quarters"))
5
nlevels(cut(seq1,"years"))
2
This is, however, the number of months, quarters or years spanned by your time interval and not the duration of your time interval expressed in months, quarters or years (since those do not have a constant duration). Considering the comment you made on @SvenHohenstein's answer, I would think you can use nlevels(cut(seq1,"months")) - 1 for what you're trying to achieve.
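A minimal sketch of that "levels minus one" idea, written with Date objects instead of POSIXlt so that the result does not depend on the local time zone (the variable names here are illustrative):

```r
date1 <- as.Date("2013-01-14")
date2 <- as.Date("2014-03-26")
days  <- seq(date1, date2, by = "day")

# Number of calendar boundaries crossed = number of bins spanned minus one
months_crossed   <- nlevels(cut(days, "month")) - 1
quarters_crossed <- nlevels(cut(days, "quarter")) - 1
years_crossed    <- nlevels(cut(days, "year")) - 1

c(months_crossed, quarters_crossed, years_crossed)  # 14 4 1
```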
I just wrote this for another question, then stumbled here.
library(lubridate)
#' Calculate age
#'
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \link{\code{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = today(), units = "years", floor = TRUE) {
calc.age = interval(dob, age.day) / duration(num = 1, units = units)
if (floor) return(as.integer(floor(calc.age)))
return(calc.age)
}
Usage examples:
my.dob <- as.Date('1983-10-20')
age(my.dob)
# [1] 31
age(my.dob, floor = FALSE)
# [1] 31.15616
age(my.dob, units = "minutes")
# [1] 16375680
age(seq(my.dob, length.out = 6, by = "years"))
# [1] 31 30 29 28 27 26
Here is the still-missing lubridate answer (although Gregor's function is built on this package).
The lubridate timespan documentation is very helpful for understanding the difference between periods and duration. I also like the lubridate cheatsheet and this very useful thread
library(lubridate)
dates <- c(dmy('14.01.2013'), dmy('26.03.2014'))
span <- dates[1] %--% dates[2] #creating an interval object
#creating period objects
as.period(span, unit = 'year')
#> [1] "1y 2m 12d 0H 0M 0S"
as.period(span, unit = 'month')
#> [1] "14m 12d 0H 0M 0S"
as.period(span, unit = 'day')
#> [1] "436d 0H 0M 0S"
Periods do not accept weeks as units. But you can convert durations to weeks:
as.duration(span)/ dweeks(1)
#makes duration object (in seconds) and divides by duration of a week (in seconds)
#> [1] 62.28571
Created on 2019-11-04 by the reprex package (v0.3.0)
Here's a solution:
dates <- c("14.01.2013", "26.03.2014")
# Date format:
dates2 <- strptime(dates, format = "%d.%m.%Y")
dif <- diff(as.numeric(dates2)) # difference in seconds
dif/(60 * 60 * 24 * 7) # weeks
[1] 62.28571
dif/(60 * 60 * 24 * 30) # months
[1] 14.53333
dif/(60 * 60 * 24 * 30 * 3) # quarters
[1] 4.844444
dif/(60 * 60 * 24 * 365) # years
[1] 1.194521
This is a simple way to find out the difference in years with the lubridate package:
as.numeric(as.Date("14-03-2013", format = "%d-%m-%Y") %--% as.Date("23-03-2014", format = "%d-%m-%Y"), "years")
This returns 1.023956
You can use floor() if you don't want the decimals.
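Try this as a starting point for a months solution. Note that difftime has no month unit, so with the default units this span is reported in days: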
StartDate <- strptime("14 January 2013", "%d %B %Y")
EventDates <- strptime(c("26 March 2014"), "%d %B %Y")
difftime(EventDates, StartDate)
A more "precise" calculation. That is, the number of week/month/quarter/year for a non-complete week/month/quarter/year is the fraction of calendar days in that week/month/quarter/year. For example, the number of months between 2016-02-22 and 2016-03-31 is 8/29 + 31/31 = 1.27586
explanation inline with code
#' Calculate precise number of periods between 2 dates
#'
#' @details The number of week/month/quarter/year for a non-complete week/month/quarter/year
#' is the fraction of calendar days in that week/month/quarter/year.
#' For example, the number of months between 2016-02-22 and 2016-03-31
#' is 8/29 + 31/31 = 1.27586
#'
#' @param startdate start Date of the interval
#' @param enddate end Date of the interval
#' @param period character. It must be one of 'day', 'week', 'month', 'quarter' and 'year'
#'
#' @examples
#' identical(numPeriods(as.Date("2016-02-15"), as.Date("2016-03-31"), "month"), 15/29 + 1)
#' identical(numPeriods(as.Date("2016-02-15"), as.Date("2016-03-31"), "quarter"), (15 + 31)/(31 + 29 + 31))
#' identical(numPeriods(as.Date("2016-02-15"), as.Date("2016-03-31"), "year"), (15 + 31)/366)
#'
#' @return exact number of periods between
#'
numPeriods <- function(startdate, enddate, period) {
numdays <- as.numeric(enddate - startdate) + 1
if (grepl("day", period, ignore.case=TRUE)) {
return(numdays)
} else if (grepl("week", period, ignore.case=TRUE)) {
return(numdays / 7)
}
#create a sequence of dates between start and end dates
effDaysinBins <- cut(seq(startdate, enddate, by="1 day"), period)
#use the earliest start date of the previous bins and create a breaks of periodic dates with
#user's period interval
intervals <- seq(from=as.Date(min(levels(effDaysinBins)), "%Y-%m-%d"),
by=paste("1",period),
length.out=length(levels(effDaysinBins))+1)
#create a sequence of dates between the earliest interval date and last date of the interval
#that contains the enddate
allDays <- seq(from=intervals[1],
to=intervals[intervals > enddate][1] - 1,
by="1 day")
#bin all days in the whole period using previous breaks
allDaysInBins <- cut(allDays, intervals)
#calculate ratio of effective days to all days in whole period
sum( tabulate(effDaysinBins) / tabulate(allDaysInBins) )
} #numPeriods
Please let me know if you find more boundary cases where the above solution does not work.
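As an independent cross-check of the month example quoted above (2016-02-22 to 2016-03-31 giving 8/29 + 31/31), the fraction can also be computed directly by counting covered days per calendar month. This is only a sketch for the monthly case; days_in_month is a hypothetical helper, not part of the function above.

```r
start <- as.Date("2016-02-22")
end   <- as.Date("2016-03-31")
days  <- seq(start, end, by = "day")

# Days of the interval falling in each calendar month (names are "YYYY-MM")
covered <- table(format(days, "%Y-%m"))

# Length of a month, given its first day (hypothetical helper)
days_in_month <- function(first_of_month) {
  next_month <- seq(first_of_month, by = "month", length.out = 2)[2]
  as.integer(next_month - first_of_month)
}

firsts <- as.Date(paste0(names(covered), "-01"))
totals <- vapply(seq_along(firsts),
                 function(i) days_in_month(firsts[i]),
                 integer(1))

# Sum of covered/total per month: 8/29 + 31/31
month_fraction <- sum(as.integer(covered) / totals)
month_fraction
```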