running by date in R [closed] - r

I need to perform walk-forward optimization on a time series. The attached image shows a diagram of how this should be done. I have to run my data-processing function on each period, and the number of periods should depend on a variable (for example, I assign a start and end date, and each test period should be 1 month long). My problem is this: I do not know how to shift the dates by the length of the out-of-sample period and have the function return a table with the calculation results for each period. The out-of-sample period will be 30% of the total length of the selected period. What tools in R can I use to solve this?
start date: 2019-01-01, end date: 2019-12-31
first period: from 2019-01-01 to 2019-03-31
second period: from 2019-02-01 to 2019-04-30 etc...

Assuming that the question is asking how to form the sequence of start, end and start-of-testing (oos) dates given st and en shown below, first form the sequence of months and then transform it to append the start-of-test date. To do that, seq can generate a sequence of Dates at the beginning of each month. Also, adding an integer to (or subtracting one from) a Date object shifts it by that number of days, so we can get the end of a month by subtracting one day from the start of the next month.
We have allocated 70% of the three month period to training and 30% to the test making use of the fact that the difference between two Date objects is the number of days between them. 70/30 is what the question asks for; however, that means that there will be a few days not in any test in each period whereas the diagram has no days that are not in any test except at the beginning. If all days are to be in a test then we might instead use the third month in the period as the test period and the first two months as the training period. In that case uncomment the commented out transform line. We also show this variation at the end.
Finally define a function f (we have shown a dummy calculation to make it possible to run the code) with arguments start, end and test to perform whatever calculation is needed. It can produce any sort of output object for one train/test instance. We can use either Map or by as shown below. The output list of results will have one component per row of d.
# input
st <- as.Date("2019-01-01")
en <- as.Date("2019-12-31")
months <- seq(st, en, by = "month")
d <- data.frame(start = head(months, -2), end = c(tail(months, -3) - 1, en))
# append date that test starts -- d shown at end
d <- transform(d, test = start + .7 * (end - start + 1))
# d <- transform(d, test = tail(months, -2))
# replace this with your function. Can be many lines.
f <- function(start, end, test) {
data.frame(start, end, test) # dummy calc - just show dates
}
# use `Map` or `by` to run f nrow(d) times giving a list of results,
# one component per row of d
with(d, Map(f, start, end, test))
# or
by(d, 1:nrow(d), with, f(start, end, test))
The data frame d above is:
> d
start end test
1 2019-01-01 2019-03-31 2019-03-05
2 2019-02-01 2019-04-30 2019-04-04
3 2019-03-01 2019-05-31 2019-05-04
4 2019-04-01 2019-06-30 2019-06-04
5 2019-05-01 2019-07-31 2019-07-04
6 2019-06-01 2019-08-31 2019-08-04
7 2019-07-01 2019-09-30 2019-09-03
8 2019-08-01 2019-10-31 2019-10-04
9 2019-09-01 2019-11-30 2019-11-04
10 2019-10-01 2019-12-31 2019-12-04
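As a check on the 70/30 arithmetic behind the test column, here is a minimal base R sketch. The helper name test_start is hypothetical (it is not part of the answer's code), and it assumes the training share is rounded to a whole number of days, which reproduces the rows shown above.

```r
# Hypothetical helper: first out-of-sample day for a window [start, end],
# assuming the training share is rounded to a whole number of days.
test_start <- function(start, end, train_frac = 0.7) {
  n_days <- as.numeric(end - start) + 1   # inclusive window length in days
  start + round(train_frac * n_days)
}

row1 <- test_start(as.Date("2019-01-01"), as.Date("2019-03-31"))  # 90-day window
row4 <- test_start(as.Date("2019-04-01"), as.Date("2019-06-30"))  # 91-day window
```

The two calls give 2019-03-05 and 2019-06-04, matching rows 1 and 4 of the table. Rounding (rather than flooring) matters here because 0.7 * 90 is slightly below 63 in floating point.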
If we had used the commented out version of d then it would look like this (same except last column):
start end test
1 2019-01-01 2019-03-31 2019-03-01
2 2019-02-01 2019-04-30 2019-04-01
3 2019-03-01 2019-05-31 2019-05-01
4 2019-04-01 2019-06-30 2019-06-01
5 2019-05-01 2019-07-31 2019-07-01
6 2019-06-01 2019-08-31 2019-08-01
7 2019-07-01 2019-09-30 2019-09-01
8 2019-08-01 2019-10-31 2019-10-01
9 2019-09-01 2019-11-30 2019-11-01
10 2019-10-01 2019-12-31 2019-12-01
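To show how the dummy f might be replaced and the per-period results collected into one table, here is a minimal sketch using the "2 months/1 month" variant. The daily series, the column names and the function name walk_forward are all made up for illustration; the summary statistics stand in for whatever calculation is actually needed.

```r
# Illustrative daily series (random walk); replace with real data.
set.seed(1)
days <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
series <- data.frame(date = days, value = cumsum(rnorm(length(days))))

# Same windows as in the answer, with the third month as the test period.
months <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "month")
d <- data.frame(start = head(months, -2),
                end   = c(tail(months, -3) - 1, as.Date("2019-12-31")),
                test  = tail(months, -2))

# Hypothetical per-window calculation: split into train/test and summarise.
walk_forward <- function(start, end, test) {
  train <- series[series$date >= start & series$date < test, ]
  oos   <- series[series$date >= test & series$date <= end, ]
  data.frame(start, end, test,
             n_train = nrow(train), n_test = nrow(oos),
             train_mean = mean(train$value), test_mean = mean(oos$value))
}

# One row of results per window, bound into a single data frame.
results <- do.call(rbind, Map(walk_forward, d$start, d$end, d$test))
```

Because each call returns a one-row data frame, do.call(rbind, ...) turns the list produced by Map into the kind of results table the question asks for.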
Graphics
We can display these as gantt charts using ggplot2.
library(ggplot2)
library(gridExtra)
library(scales)
n <- nrow(d)
Plot <- function(x, main) {
ggplot(x, aes(size = I(15))) +
geom_segment(aes(x = start, xend = test, y = n:1, yend = n:1), col = "green") +
geom_segment(aes(x = test, xend = end, y = n:1, yend = n:1), col = "blue") +
scale_x_date(labels = date_format("%b\n%Y"), breaks = date_breaks("month")) +
ggtitle(main) +
theme(legend.position = "none",
axis.title = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_line(colour = "#808080"))
}
d <- transform(d, test = start + .7 * (end - start + 1))
g1 <- Plot(d, "70/30")
d <- transform(d, test = tail(months, -2))
g2 <- Plot(d, "2 months/1 month")
grid.arrange(g1, g2, ncol = 2)

Thanks everyone for the help. I found a way to solve this by writing a small function.
library(data.table)
dates <- function(startDate, endDate, periodLength, lag){
start <- as.Date(startDate)
end <- as.Date(endDate)
data <- start
while(data[length(data)] < end){
x <- as.Date(data[length(data)] + lag)
data <- as.Date(rbind(data, x))
}
end <- data + periodLength
data <- data.table(data, end)
colnames(data) <- c('start', 'end')
data$start <- as.Date(data$start)
data$end <- as.Date(data$end)
data <- as.list(as.data.table(t(data)))
return(data)
}
where
startDate - this is the start date of the testing period,
endDate - this is the end date of the testing period,
periodLength - this is the length of one period in days,
lag - this is the offset (the length of the OOS period)
dates(startDate = '2019-01-01', endDate = '2019-06-30', periodLength = 30, lag = 10)
$V1
[1] "2019-01-01" "2019-01-31"
$V2
[1] "2019-01-11" "2019-02-10"
$V3
[1] "2019-01-21" "2019-02-20"
$V4
[1] "2019-01-31" "2019-03-02"
$V5
[1] "2019-02-10" "2019-03-12"
$V6
[1] "2019-02-20" "2019-03-22"
$V7
[1] "2019-03-02" "2019-04-01"
$V8
[1] "2019-03-12" "2019-04-11"
$V9
[1] "2019-03-22" "2019-04-21"
$V10
[1] "2019-04-01" "2019-05-01"
$V11
[1] "2019-04-11" "2019-05-11"
$V12
[1] "2019-04-21" "2019-05-21"
$V13
[1] "2019-05-01" "2019-05-31"
$V14
[1] "2019-05-11" "2019-06-10"
$V15
[1] "2019-05-21" "2019-06-20"
$V16
[1] "2019-05-31" "2019-06-30"
$V17
[1] "2019-06-10" "2019-07-10"
$V18
[1] "2019-06-20" "2019-07-20"
$V19
[1] "2019-06-30" "2019-07-30"
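The same windows can probably be built without a loop or data.table. The sketch below is a base R alternative under the same argument meanings; the name rolling_windows is hypothetical, and it returns a data frame of Dates rather than a list of character vectors.

```r
# Hypothetical vectorised alternative: window starts every `lag` days,
# each window ending `period_length` days after its start.
rolling_windows <- function(start_date, end_date, period_length, lag) {
  starts <- seq(as.Date(start_date), as.Date(end_date), by = lag)
  data.frame(start = starts, end = starts + period_length)
}

w <- rolling_windows("2019-01-01", "2019-06-30", period_length = 30, lag = 10)
```

One difference to be aware of: when the span between the two dates is not a multiple of lag, the while loop above appends one final start date past endDate, whereas seq stops at or before it.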

Related

Finding age in R [duplicate]

I am using data.table for the first time.
I have a column of about 400,000 birth dates in my table, and I need to convert them to ages.
What is the best way to do this?
I've been thinking about this and have been dissatisfied with the two answers so far. I like using lubridate, as @KFB did, but I also want things wrapped up nicely in a function, as in my answer using the eeptools package. So here's a wrapper function using the lubridate interval method with some nice options:
#' Calculate age
#'
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \code{\link{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = lubridate::today(), units = "years", floor = TRUE) {
calc.age = lubridate::interval(dob, age.day) / lubridate::duration(num = 1, units = units)
if (floor) return(as.integer(floor(calc.age)))
return(calc.age)
}
Usage examples:
> my.dob <- as.Date('1983-10-20')
> age(my.dob)
[1] 31
> age(my.dob, floor = FALSE)
[1] 31.15616
> age(my.dob, units = "minutes")
[1] 16375680
> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26
From the comments of this blog entry, I found the age_calc function in the eeptools package. It takes care of edge cases (leap years, etc.), checks inputs and looks quite robust.
library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
age_calc(x) # end date defaults to Sys.Date(); default is age in months
[1] 46.73333 224.83118
age_calc(x, units = "years") # but you can set it to years
[1] 3.893151 18.731507
floor(age_calc(x, units = "years"))
[1] 3 18
For your data
yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))
assuming you want age in integer years.
Assuming you have a data.table, you could do the following:
library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()
Option 1: use as.period from the lubridate package
X[, age := as.period(Sys.Date() - birth)][]
birth age
1: 1970-01-01 44y 0m 327d 0H 0M 0S
2: 1971-01-01 43y 0m 327d 6H 0M 0S
3: 1972-01-01 42y 0m 327d 12H 0M 0S
4: 1973-01-01 41y 0m 326d 18H 0M 0S
5: 1974-01-01 40y 0m 327d 0H 0M 0S
6: 1975-01-01 39y 0m 327d 6H 0M 0S
7: 1976-01-01 38y 0m 327d 12H 0M 0S
8: 1977-01-01 37y 0m 326d 18H 0M 0S
9: 1978-01-01 36y 0m 327d 0H 0M 0S
10: 1979-01-01 35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S
Option 2: if you do not like the format of Option 1, you could do the following (interval replaces the older new_interval, which has been removed from lubridate):
yr = duration(num = 1, units = "years")
X[, age := interval(birth, Sys.Date())/yr][]
# you get
birth age
1: 1970-01-01 44.92603
2: 1971-01-01 43.92603
3: 1972-01-01 42.92603
4: 1973-01-01 41.92329
5: 1974-01-01 40.92329
6: 1975-01-01 39.92329
7: 1976-01-01 38.92329
8: 1977-01-01 37.92055
9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055
I believe Option 2 is the more desirable.
I prefer to do this using the lubridate package, borrowing syntax I originally encountered in another post.
It's necessary to standardize your input dates in terms of R date objects, preferably with the lubridate::mdy() or lubridate::ymd() or similar functions, as applicable. You can use the interval() function to generate an interval describing the time elapsed between the two dates, and then use the duration() function to define how this interval should be "diced".
I've summarized the simplest case for calculating an age from two dates below, using the most current syntax in R.
df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start = df$DOB, end = df$EndDate) /
duration(num = 1, units = "years")
Age may be rounded down to the nearest complete integer using the base R floor() function, like so:
df$Calc_AgeF <- floor(df$Calc_Age)
Alternately, the digits= argument in the base R round() function can be used to round up or down, and specify the exact number of decimals in the returned value, like so:
df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer
It's worth noting that once the input dates are passed through the calculation step described above (i.e., the interval() and duration() functions), the returned value is numeric and no longer a date object in R. This matters because lubridate::floor_date() works only on date-time objects, so the base R floor() is the function to use here.
The above syntax works regardless whether the input dates occur in a data.table or data.frame object.
I wanted an implementation that didn't increase my dependencies beyond data.table, which is usually my only dependency. The data.table is only needed for mday, which means day of the month.
Development function
This function follows the way I would logically think about someone's age. I start with [current year] - [birth year] - 1, then add 1 if they've already had their birthday in the current year. To check for that offset I start by considering month, then (if necessary) day of month.
Here is that step by step implementation:
agecalc <- function(origin, current){
require(data.table)
y <- year(current) - year(origin) - 1
offset <- 0
if(month(current) > month(origin)) offset <- 1
if(month(current) == month(origin) &&
mday(current) >= mday(origin)) offset <- 1
age <- y + offset
return(age)
}
Production function
This is the same logic refactored and vectorized:
agecalc <- function(origin, current){
require(data.table)
age <- year(current) - year(origin) - 1
ii <- (month(current) > month(origin)) | (month(current) == month(origin) &
mday(current) >= mday(origin))
age[ii] <- age[ii] + 1
return(age)
}
Experimental function that uses strings
You could also do a string comparison on the month / day part. Perhaps there are times when this is more efficient, for example if you had the year as a number and the birth date as a string.
agecalc_strings <- function(origin, current){
origin <- as.character(origin)
current <- as.character(current)
age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
if(substr(current, 6, 10) >= substr(origin, 6, 10)){
age <- age + 1
}
return(age)
}
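Because the if above inspects a single comparison, agecalc_strings handles one pair of dates at a time. A vectorized variant might look like the sketch below; the name agecalc_strings_vec is hypothetical, and it assumes both inputs are Dates or ISO "YYYY-MM-DD" strings.

```r
# Hypothetical vectorised version of the string comparison approach.
# Assumes inputs are Dates or "YYYY-MM-DD" strings.
agecalc_strings_vec <- function(origin, current) {
  origin  <- as.character(origin)
  current <- as.character(current)
  years <- as.integer(substr(current, 1, 4)) - as.integer(substr(origin, 1, 4))
  # "MM-DD" compares correctly as text because the fields are zero padded
  had_birthday <- substr(current, 6, 10) >= substr(origin, 6, 10)
  years - !had_birthday   # subtract 1 where the birthday has not yet occurred
}

ages <- agecalc_strings_vec("1985-08-13", c("1986-08-12", "1986-08-13", "1986-09-12"))
```

With this input, ages is 0, 1, 1, in line with the month/day logic of the production function.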
Some tests on the vectorized "production" version:
## Examples for specific dates to test the calculation with things like
## beginning and end of months, and leap years:
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))
## Testing every age for every day over several years
## This test requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d
Below is a trivial plot of ages as numeric and integer. As you can see, the integer ages form a stair-step pattern that touches, but stays below, the straight line of numeric ages.
plot(as.numeric(age1_num) ~ d, data = d, type = "l",
ylab = "ages", main = "ages plotted")
lines(age1 ~ d, data = d, col = "blue")
I wasn't happy with any of the responses when it comes to calculating the age in months or years, when dealing with leap years, so this is my function using the lubridate package.
Basically, it slices the interval between from and to into (up to) yearly chunks, and then adjusts the interval for whether that chunk is leap year or not. The total interval is the sum of the age of each chunk.
library(lubridate)
#' Get Age of Date relative to Another Date
#'
#' @param from,to the date or dates to consider
#' @param units the units to consider
#' @param floor logical as to whether to floor the result
#' @param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' @author Nicholas Hamilton
#' @export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {
#Account for Leap Year if Working in Months and Years
if(!simple && length(grep("^(month|year)",units)) > 0){
df = data.frame(from,to)
calc = sapply(1:nrow(df),function(r){
#Start and Finish Points
st = df[r,1]; fn = df[r,2]
#If there is no difference, age is zero
if(st == fn){ return(0) }
#If there is a difference, age is not zero and needs to be calculated
sign = +1 #Age Direction
if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign
#Determine the slice-points
mid = ceiling_date(seq(st,fn,by='year'),'year')
#Build the sequence
dates = unique( c(st,mid,fn) )
dates = dates[which(dates >= st & dates <= fn)]
#Determine the age of the chunks
chunks = sapply(head(seq_along(dates),-1),function(ix){
k = 365/( 365 + leap_year(dates[ix]) )
k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
})
#Sum the Chunks, and account for direction
sign*sum(chunks)
})
#If Simple Calculation or Not Months or Not years
}else{
calc = interval(from,to) / duration(num = 1, units = units)
}
if (floor) calc = as.integer(floor(calc))
calc
}
(Sys.Date() - yourDate) / 365.25
A very simple way of calculating the age from two dates without using any additional packages probably is:
df$age = with(df, as.Date(date_2, "%Y-%m-%d") - as.Date(date_1, "%Y-%m-%d"))
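Note that subtracting two Dates yields a difftime in days, not an age in years. The sketch below, with made-up dates and column names, shows one way to convert it and also where dividing by 365.25 can be off by one right on a birthday.

```r
# Made-up example data; date_1 is the birth date, date_2 the reference date.
df <- data.frame(date_1 = c("1990-05-17", "2000-02-29"),
                 date_2 = c("2020-05-17", "2020-02-28"))

age_days  <- with(df, as.Date(date_2) - as.Date(date_1))   # difftime in days
age_num   <- as.numeric(age_days, units = "days")
age_years <- floor(age_num / 365.25)                       # approximate whole years

# Caveat: exactly one year later, 365 / 365.25 is below 1, so floor() gives 0.
first_birthday <- floor(as.numeric(as.Date("2002-03-01") - as.Date("2001-03-01")) / 365.25)
```

So the 365.25 shortcut is fine for rough ages, but a month/day comparison is safer when the exact birthday matters.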
Here is a (I think simpler) solution using lubridate:
library(lubridate)
age <- function(dob, on.day=today()) {
intvl <- interval(dob, on.day)
prd <- as.period(intvl)
return(prd@year)
}
Note that age_calc from the eeptools package in particular fails on cases with the year 2000 around birthdays.
Some examples that don't work in age_calc:
library(lubridate)
library(eeptools)
age_calc(ymd("1997-04-21"), ymd("2000-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2019-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2016-04-21"), units = "years")
Some of the other solutions also produce decimal ages that are not what I would intuitively expect when leap years are involved. I like @James_D's solution, which is precise and concise, but I wanted the decimal age to be calculated as complete years plus the fraction of the year completed between the last birthday and the next one (out of 365 or 366 days, depending on the year). For birthdays on February 29th I use lubridate's add_with_rollback function so that March 1st is used in non-leap years. I used some test cases from @geneorama and added some of my own, and the output aligns with what I would expect.
library(lubridate)
# Calculate precise age from birthdate in ymd format
age_calculation <- function(birth_date, later_year) {
if (birth_date > later_year)
{
stop("Birth date is after the desired date!")
}
# Calculate the most recent birthday of the person based on the desired year
latest_bday <- ymd(add_with_rollback(birth_date, years((year(later_year) - year(birth_date))), roll_to_first = TRUE))
# Get amount of days between the desired date and the latest birthday
days_between <- as.numeric(days(later_year - latest_bday), units = "days")
# Get how many days are in the year between their most recent and next bdays
year_length <- as.numeric(days((add_with_rollback(latest_bday, years(1), roll_to_first = TRUE)) - latest_bday), units = "days")
# Get the year fraction (amount of year completed before next birthday)
fraction_year <- days_between/year_length
# Sum the difference of years with the year fraction
age_sum <- (year(later_year) - year(birth_date)) + fraction_year
return(age_sum)
}
test_list <- list(c("1985-08-13", "1986-08-12"),
c("1985-08-13", "1985-08-13"),
c("1985-08-13", "1986-08-13"),
c("1985-08-13", "1986-09-12"),
c("2000-02-29", "2000-02-29"),
c("2000-02-29", "2000-03-01"),
c("2000-02-29", "2001-02-28"),
c("2000-02-29", "2004-02-29"),
c("2000-02-29", "2011-03-01"),
c("1997-04-21", "2000-04-21"),
c("2000-04-21", "2016-04-21"),
c("2000-04-21", "2019-04-21"),
c("2017-06-15", "2018-04-30"),
c("2019-04-20", "2019-08-24"),
c("2020-05-25", "2021-11-25"),
c("2020-11-25", "2021-11-24"),
c("2020-11-24", "2020-11-25"),
c("2020-02-28", "2020-02-29"),
c("2020-02-29", "2020-02-28"))
for (i in 1:length(test_list))
{
print(paste0("Dates from ", test_list[[i]][1], " to ", test_list[[i]][2]))
result <- age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2]))
print(result)
}
Output:
[1] "Dates from 1985-08-13 to 1986-08-12"
[1] 0.9972603
[1] "Dates from 1985-08-13 to 1985-08-13"
[1] 0
[1] "Dates from 1985-08-13 to 1986-08-13"
[1] 1
[1] "Dates from 1985-08-13 to 1986-09-12"
[1] 1.082192
[1] "Dates from 2000-02-29 to 2000-02-29"
[1] 0
[1] "Dates from 2000-02-29 to 2000-03-01"
[1] 0.00273224
[1] "Dates from 2000-02-29 to 2001-02-28"
[1] 0.9972603
[1] "Dates from 2000-02-29 to 2004-02-29"
[1] 4
[1] "Dates from 2000-02-29 to 2011-03-01"
[1] 11
[1] "Dates from 1997-04-21 to 2000-04-21"
[1] 3
[1] "Dates from 2000-04-21 to 2016-04-21"
[1] 16
[1] "Dates from 2000-04-21 to 2019-04-21"
[1] 19
[1] "Dates from 2017-06-15 to 2018-04-30"
[1] 0.8739726
[1] "Dates from 2019-04-20 to 2019-08-24"
[1] 0.3442623
[1] "Dates from 2020-05-25 to 2021-11-25"
[1] 1.50411
[1] "Dates from 2020-11-25 to 2021-11-24"
[1] 0.9972603
[1] "Dates from 2020-11-24 to 2020-11-25"
[1] 0.002739726
[1] "Dates from 2020-02-28 to 2020-02-29"
[1] 0.00273224
[1] "Dates from 2020-02-29 to 2020-02-28"
Error in age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2])) :
Birth date is after the desired date!
As others have been saying, the trunc function is excellent to get integer age.
I realise there are a lot of answers but since I can't help myself, I might as well add to the discussion.
I'm building a package that's focused on dates and datetimes and in it I use a function called time_diff(). Here is a simplified version.
time_diff <- function(x, y, units, num = 1,
type = c("duration", "period"),
as_period = FALSE){
type <- match.arg(type)
units <- match.arg(units, c("picoseconds", "nanoseconds", "microseconds",
"milliseconds", "seconds", "minutes", "hours", "days",
"weeks", "months", "years"))
int <- lubridate::interval(x, y)
if (as_period || type == "period"){
if (as_period) int <- lubridate::as.period(int, unit = units)
unit <- lubridate::period(num = num, units = units)
} else {
unit <- do.call(get(paste0("d", units),
asNamespace("lubridate")),
list(x = num))
}
out <- int / unit
out
}
# Wrapper around the more general time_diff
age_years <- function(x, y){
trunc(time_diff(x, y, units = "years", num = 1,
type = "period", as_period = TRUE))
}
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
bday <- dmy("01-01-2000")
time_diff(bday, today(), "years", type = "period")
#> [1] 23.11233
leap1 <- dmy("29-02-2020")
leap2 <- dmy("28-02-2021")
leap3 <- dmy("01-03-2021")
# Many people might say this is wrong so use the more exact age_years
time_diff(leap1, leap2, "years", type = "period")
#> [1] 1
# age in years, accounting for leap years properly
age_years(leap1, leap2)
#> [1] 0
age_years(leap1, leap3)
#> [1] 1
# So to add a column of ages in years, one can do this..
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_data <- tibble(dob = seq(bday, today(), by = "day"))
my_data <- my_data %>%
mutate(age_years = age_years(dob, today()))
slice_head(my_data, n = 10)
#> # A tibble: 10 x 2
#> dob age_years
#> <date> <dbl>
#> 1 2000-01-01 23
#> 2 2000-01-02 23
#> 3 2000-01-03 23
#> 4 2000-01-04 23
#> 5 2000-01-05 23
#> 6 2000-01-06 23
#> 7 2000-01-07 23
#> 8 2000-01-08 23
#> 9 2000-01-09 23
#> 10 2000-01-10 23
Created on 2023-02-11 with reprex v2.0.2

Merge overlapping time periods with milliseconds in R

I'm trying to find a way of merging overlapping time intervals that can deal with milliseconds.
Three potential options have been posted here:
How to flatten / merge overlapping time periods
However, I don't need to group by ID, and so am finding the dplyr and data.table methods confusing (I'm not sure whether they can deal with milliseconds, as I can't get them to work).
I have managed to get the IRanges solution working, but it converts POSIXct objects to as.numeric integers to calculate the overlaps. So, I'm assuming this is why milliseconds are absent from the output?
The lack of milliseconds doesn't seem to be a display issue, as when I subtract the resulting start and end times, I get integer results in seconds.
Here's a sample of my data:
start <- c("2019-07-15 21:32:43.565",
"2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301",
"2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957")
end <- c("2019-07-15 21:32:48.445",
"2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801",
"2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236")
df <- data.frame(start, end)
The output I get from the IRanges solution:
start end
1 2019-07-15 21:32:43 2019-07-15 21:32:49
2 2019-07-15 21:32:54 2019-07-15 21:32:54
3 2019-07-15 21:34:08 2019-07-15 21:34:10
And the desired result:
start end
1 2019-07-15 21:32:43.565 2019-07-15 21:32:49.045
2 2019-07-15 21:32:54.301 2019-07-15 21:32:54.801
3 2019-07-15 21:34:08.506 2019-07-15 21:34:10.236
Suggestions would be very much appreciated!
I've found it is quite easy to preserve milliseconds if you use POSIXlt format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.
Here's a reproducible example.
start <- c("2019-07-15 21:32:43.565",
"2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301",
"2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957")
end <- c("2019-07-15 21:32:48.445",
"2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801",
"2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236")
df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))
i <- 1
while(i < nrow(df))
{
overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
if(length(overlaps) > 1)
{
df$end[i] <- max(df$end[overlaps])
df <- df[-overlaps[-which(overlaps == i)], ]
i <- i - 1
}
i <- i + 1
}
So now our data frame doesn't have overlaps:
df
#> start end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10
Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:
df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73
as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73
Created on 2020-02-20 by the reprex package (v0.3.0)
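To make the "display issue" point concrete, here is a small base R sketch showing that the fractional seconds are stored even when the default print hides them. The UTC time zone is an assumption made only to keep the example reproducible.

```r
x <- as.POSIXct(c("2019-07-15 21:32:43.565", "2019-07-15 21:32:49.045"), tz = "UTC")

plain <- format(x)                          # whole seconds unless options(digits.secs) is set
milli <- format(x, "%Y-%m-%d %H:%M:%OS3")   # ask for three decimal places explicitly

# The stored values keep the sub-second part, so differences are fractional.
elapsed <- as.numeric(difftime(x[2], x[1], units = "secs"))
```

Here elapsed is 5.48 seconds up to floating-point error. Be aware that %OS3 truncates rather than rounds, so the last printed digit can come out one lower than the value typed in; options(digits.secs = 3) changes the default print in the same way.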
I think the best thing to do here is to use the clock package (for a true sub-second precision date-time type) along with the ivs package (for merging overlapping intervals).
Using POSIXct for sub-second date-times can be a bit challenging for various reasons, which I've talked about here.
The key here is iv_groups(), which merges all overlapping intervals and returns the intervals that remain after all of the overlaps have been merged. It is also backed by a C implementation that is very fast.
library(clock)
library(ivs)
library(dplyr)
df <- tibble(
start = c(
"2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301", "2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957"
),
end = c(
"2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801", "2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236"
)
)
# Parse into "naive time" (i.e. with a yet-to-be-defined time zone)
# using a millisecond precision
df <- df %>%
mutate(
start = naive_time_parse(start, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
end = naive_time_parse(end, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
)
df
#> # A tibble: 5 × 2
#> start end
#> <tp<naive><milli>> <tp<naive><milli>>
#> 1 2019-07-15T21:32:43.565 2019-07-15T21:32:48.445
#> 2 2019-07-15T21:32:43.634 2019-07-15T21:32:49.045
#> 3 2019-07-15T21:32:54.301 2019-07-15T21:32:54.801
#> 4 2019-07-15T21:34:08.506 2019-07-15T21:34:10.111
#> 5 2019-07-15T21:34:09.957 2019-07-15T21:34:10.236
# Now combine these start/end boundaries into a single interval vector
df <- df %>%
mutate(interval = iv(start, end), .keep = "unused")
df
#> # A tibble: 5 × 1
#> interval
#> <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:48.445)
#> 2 [2019-07-15T21:32:43.634, 2019-07-15T21:32:49.045)
#> 3 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 4 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.111)
#> 5 [2019-07-15T21:34:09.957, 2019-07-15T21:34:10.236)
# And use `iv_groups()` to merge all overlapping intervals.
# It returns the remaining intervals after all overlaps have been removed.
df %>%
summarise(interval = iv_groups(interval))
#> # A tibble: 3 × 1
#> interval
#> <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:49.045)
#> 2 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 3 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.236)
Created on 2022-04-05 by the reprex package (v2.0.1)
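If adding packages is not an option, the merge can also be sketched in base R with a sort-and-sweep pass over the numeric values of POSIXct, which keeps the milliseconds. The function name merge_intervals is hypothetical, the UTC time zone is assumed, and intervals that merely touch are kept separate, as in the strict comparison used by the loop answer.

```r
start <- as.POSIXct(c("2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634",
                      "2019-07-15 21:32:54.301", "2019-07-15 21:34:08.506",
                      "2019-07-15 21:34:09.957"), tz = "UTC")
end   <- as.POSIXct(c("2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045",
                      "2019-07-15 21:32:54.801", "2019-07-15 21:34:10.111",
                      "2019-07-15 21:34:10.236"), tz = "UTC")

# Hypothetical helper: merge overlapping [start, end] intervals.
merge_intervals <- function(start, end) {
  o <- order(start)
  s <- as.numeric(start)[o]
  e <- as.numeric(end)[o]
  run_end <- cummax(e)                       # latest end seen so far
  # a new group starts when an interval begins at or after every earlier end
  new_group <- c(TRUE, s[-1] >= run_end[-length(run_end)])
  grp <- cumsum(new_group)
  data.frame(
    start = as.POSIXct(s[new_group], origin = "1970-01-01", tz = "UTC"),
    end   = as.POSIXct(as.vector(tapply(e, grp, max)), origin = "1970-01-01", tz = "UTC")
  )
}

merged <- merge_intervals(start, end)
gaps <- as.numeric(difftime(merged$end, merged$start, units = "secs"))
```

On the sample data this yields three intervals whose lengths are about 5.48, 0.50 and 1.73 seconds, the same durations reported by the loop-based answer.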

rnoaa air pressure looping

The rnoaa package only allows you to gather 30 days' worth of air pressure information at a time (https://cran.r-project.org/web/packages/rnoaa/rnoaa.pdf). I'm looking to create a function or for loop to pull data from the package a month at a time. The required date format is specific: YYYYMMDD, with no - or /. I started with a function, but lapply doesn't seem to apply the function that calls the air pressure data.
I have tried loops in many ways, and I can't seem to get it. Here's an example.
for (i in dates)) {
air_pressure[i] <- coops_search(begin_date = start[i], end_date = end[i],
station_name = 8727520, product= "air_pressure", units = "metric", time_zone = "gmt")
print(air_pressure[i])
}
start<-seq(as.Date("2015/01/01"), by = "month", length.out = 100)
start <- as.numeric(gsub("-","",start))
end<-seq(as.Date("2015/02/01"), by = "month", length.out = 100)
end <- as.numeric(gsub("-","",end))
pressure_function<- function(air_pressure) {
coops_search(station_name = 8727520, begin_date = starting,
end_date = ending, product = "air_pressure")
}
lapply(pressure_function, starting= start, ending= end, FUN= sum)
There are no real error messages; the results just don't populate, and the function doesn't run.
There are some pretty fundamental things wrong here. First, your for loop has too many closing parentheses. Second, your lapply call passes a function as the first parameter; that does not work, so pass it in the second slot. And there is more.
Anyway, try this:
library(rnoaa)
fun <- function(begin, end) {
coops_search(station_name = 8727520, begin_date = gsub("-", "", begin),
end_date = gsub("-", "", end), product = "air_pressure")
}
start_dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "month")
end_dates <- seq(as.Date("2015-02-01"), as.Date("2016-01-01"), by = "month") - 1
res <- Map(fun, start_dates, end_dates)
df <- dplyr::bind_rows(lapply(res, "[[", "data"))
head(df)
#> t v f
#> 1 2015-01-01 00:00:00 1025.3 0,0,0
#> 2 2015-01-01 00:06:00 1025.4 0,0,0
#> 3 2015-01-01 00:12:00 1025.5 0,0,0
#> 4 2015-01-01 00:18:00 1025.6 0,0,0
#> 5 2015-01-01 00:24:00 1025.6 0,0,0
#> 6 2015-01-01 00:30:00 1025.6 0,0,0
NROW(df)
#> [1] 87600
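The date handling can be checked on its own, without calling the API. The sketch below builds the monthly windows and formats them as YYYYMMDD strings with format(); the data frame name windows is just for illustration, and no rnoaa function is called.

```r
# Monthly windows for 2015: first day of each month to the last day of that month.
start_dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-01"), by = "month")
end_dates   <- seq(as.Date("2015-02-01"), as.Date("2016-01-01"), by = "month") - 1

# format() produces the compact YYYYMMDD form directly, with no separators.
windows <- data.frame(begin_date = format(start_dates, "%Y%m%d"),
                      end_date   = format(end_dates, "%Y%m%d"))
```

format(x, "%Y%m%d") is an alternative to stripping the dashes with gsub, and inspecting windows before making any requests makes it easy to confirm that each pair covers exactly one calendar month.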

how to break a time range into monthly queries?

Consider this simple example
bogus <- function(start_time, end_time){
print(paste('hey this starts on', start_time, 'until', end_time))
}
start_time <- ymd('2018-01-01')
end_time <- ymd('2018-05-01')
> bogus(start_time, end_time)
[1] "hey this starts on 2018-01-01 until 2018-05-01"
Unfortunately, doing so with a long time range does not work with my real-life bogus function, so I need to break my original time range into monthly pieces.
In other words the first call would be bogus(ymd('2018-01-01'), ymd('2018-01-31')), the second one bogus(ymd('2018-02-01'), ymd('2018-02-28')), etc.
Is there a simple way to do this using purrr and lubridate?
Thanks
Are you looking for something like:
library(lubridate)
seq_dates <- seq(start_time, end_time - 1, by = "month")
lapply(seq_dates, function(x) print(paste('hey this starts on', x, 'until', ceiling_date(x, unit = "month") - 1)))
You could also do a short bogus function like:
bogus <- function(start_var, end_var) {
require(lubridate)
seq_dates <- seq(as.Date(start_var), as.Date(end_var) - 1, by = "month")
printed_statement <- lapply(seq_dates, function(x) paste('hey this starts on', x, 'until', ceiling_date(x, unit = "month") - 1))
for (i in printed_statement) { print(i) }
}
And call it like:
bogus("2018-01-01", "2018-05-01")
Output:
[1] "hey this starts on 2018-01-01 until 2018-01-31"
[1] "hey this starts on 2018-02-01 until 2018-02-28"
[1] "hey this starts on 2018-03-01 until 2018-03-31"
[1] "hey this starts on 2018-04-01 until 2018-04-30"
This way you can just give minimum start and maximum end date and get everything in-between.
With base:
seqdate <- seq.Date(start_time, end_time, by = "1 month")
# Negative indexing avoids the precedence trap in 1:length(seqdate)-1, which parses as (1:n) - 1
dateranges <- data.frame(start.dates = seqdate[-length(seqdate)],
end.dates = seqdate[-1] - 1)
start.dates end.dates
1 2018-01-01 2018-01-31
2 2018-02-01 2018-02-28
3 2018-03-01 2018-03-31
4 2018-04-01 2018-04-30
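To go from the table of ranges to one function call per month, the pieces can be combined as follows. This is a minimal base-R sketch: `month_ranges` is a hypothetical helper name, and it assumes both dates fall on the first day of a month.

```r
# Hypothetical helper: split [start, end) into calendar-month windows.
# Assumes both dates fall on the first day of a month.
month_ranges <- function(start, end) {
  starts <- seq(as.Date(start), as.Date(end), by = "month")
  data.frame(start = starts[-length(starts)],
             end   = starts[-1] - 1)
}

ranges <- month_ranges("2018-01-01", "2018-05-01")

# Call a function once per window; Map() pairs up the two columns
calls <- Map(function(s, e) paste("hey this starts on", s, "until", e),
             ranges$start, ranges$end)
unlist(calls)
```

The anonymous function stands in for the real-life function from the question; any function of two dates can be dropped in its place.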

change a column from birth date to age in r

I am using data.table for the first time.
I have a column of about 400,000 birth dates in my table. I need to convert them to ages.
What is the best way to do this?
I've been thinking about this and have been dissatisfied with the two answers so far. I like using lubridate, as @KFB did, but I also want things wrapped up nicely in a function, as in my answer using the eeptools package. So here's a wrapper function using the lubridate interval method with some nice options:
#' Calculate age
#'
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \code{\link{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = lubridate::today(), units = "years", floor = TRUE) {
calc.age = lubridate::interval(dob, age.day) / lubridate::duration(num = 1, units = units)
if (floor) return(as.integer(floor(calc.age)))
return(calc.age)
}
Usage examples:
> my.dob <- as.Date('1983-10-20')
> age(my.dob)
[1] 31
> age(my.dob, floor = FALSE)
[1] 31.15616
> age(my.dob, units = "minutes")
[1] 16375680
> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26
From the comments of this blog entry, I found the age_calc function in the eeptools package. It takes care of edge cases (leap years, etc.), checks inputs and looks quite robust.
library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
age_calc(x) # default is age in months, measured up to today
[1] 46.73333 224.83118
age_calc(x, units = "years") # but you can set it to years
[1] 3.893151 18.731507
floor(age_calc(x, units = "years"))
[1] 3 18
For your data
yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))
assuming you want age in integer years.
Assume you have a data.table, you could do below:
library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()
Option 1 : use "as.period" from lubridate package
X[, age := as.period(Sys.Date() - birth)][]
birth age
1: 1970-01-01 44y 0m 327d 0H 0M 0S
2: 1971-01-01 43y 0m 327d 6H 0M 0S
3: 1972-01-01 42y 0m 327d 12H 0M 0S
4: 1973-01-01 41y 0m 326d 18H 0M 0S
5: 1974-01-01 40y 0m 327d 0H 0M 0S
6: 1975-01-01 39y 0m 327d 6H 0M 0S
7: 1976-01-01 38y 0m 327d 12H 0M 0S
8: 1977-01-01 37y 0m 326d 18H 0M 0S
9: 1978-01-01 36y 0m 327d 0H 0M 0S
10: 1979-01-01 35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S
Option 2 : if you do not like the format of Option 1, you could do below:
yr = duration(num = 1, units = "years")
X[, age := interval(birth, Sys.Date())/yr][] # interval() replaces the older new_interval()
# you get
birth age
1: 1970-01-01 44.92603
2: 1971-01-01 43.92603
3: 1972-01-01 42.92603
4: 1973-01-01 41.92329
5: 1974-01-01 40.92329
6: 1975-01-01 39.92329
7: 1976-01-01 38.92329
8: 1977-01-01 37.92055
9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055
I believe Option 2 is the more desirable of the two.
I prefer to do this using the lubridate package, borrowing syntax I originally encountered in another post.
It's necessary to standardize your input dates in terms of R date objects, preferably with the lubridate::mdy() or lubridate::ymd() or similar functions, as applicable. You can use the interval() function to generate an interval describing the time elapsed between the two dates, and then use the duration() function to define how this interval should be "diced".
I've summarized the simplest case for calculating an age from two dates below, using the most current syntax in R.
df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start = df$DOB, end = df$EndDate) /
duration(num = 1, units = "years")
Age may be rounded down to the nearest complete integer using the base R floor() function, like so:
df$Calc_AgeF <- floor(df$Calc_Age)
Alternately, the digits= argument in the base R round() function can be used to round up or down, and specify the exact number of decimals in the returned value, like so:
df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer
It's worth noting that once the input dates have passed through the calculation step described above (i.e., the interval() and duration() functions), the returned value is numeric and no longer a date object in R. This matters because lubridate::floor_date() works only on date-time objects, so the base R floor() is the right tool here.
The above syntax works regardless whether the input dates occur in a data.table or data.frame object.
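One caveat with dividing elapsed time by a fixed-length year is that the floored result can be off by one right on a birthday. A minimal base-R sketch of that effect, assuming a year of 365.25 days for the fixed-length case (the divisor is an assumption for illustration, not necessarily what a given lubridate version uses); `calendar_age` is a hypothetical helper added for comparison:

```r
dob <- as.Date("2001-03-01")
on  <- as.Date("2002-03-01")   # the first birthday

# Fixed-length year: 365 elapsed days / 365.25 is just under 1
fixed_age <- as.numeric(on - dob) / 365.25
floor(fixed_age)

# Hypothetical calendar-based helper: subtract one year if the
# month-day of `on` is still before the month-day of `dob`
calendar_age <- function(dob, on) {
  years <- as.integer(format(on, "%Y")) - as.integer(format(dob, "%Y"))
  years - as.integer(format(on, "%m%d") < format(dob, "%m%d"))
}
calendar_age(dob, on)
```

Whether this edge case matters depends on how the ages are used; for integer ages that must flip exactly on the birthday, a calendar-based comparison is the safer choice.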
I wanted an implementation that didn't increase my dependencies beyond data.table, which is usually my only dependency. data.table is only needed for mday, which returns the day of the month.
Development function
This function is logically how I would think about someone's age. I start with [current year] - [birth year] - 1, then add 1 if they've already had their birthday in the current year. To check for that offset I start by considering month, then (if necessary) day of month.
Here is that step by step implementation:
agecalc <- function(origin, current){
require(data.table)
y <- year(current) - year(origin) - 1
offset <- 0
if(month(current) > month(origin)) offset <- 1
if(month(current) == month(origin) &&
mday(current) >= mday(origin)) offset <- 1
age <- y + offset
return(age)
}
Production function
This is the same logic refactored and vectorized:
agecalc <- function(origin, current){
require(data.table)
age <- year(current) - year(origin) - 1
ii <- (month(current) > month(origin)) | (month(current) == month(origin) &
mday(current) >= mday(origin))
age[ii] <- age[ii] + 1
return(age)
}
Experimental function that uses strings
You could also do a string comparison on the month / day part. Perhaps there are times when this is more efficient, for example if you had the year as a number and the birth date as a string.
agecalc_strings <- function(origin, current){
origin <- as.character(origin)
current <- as.character(current)
age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
if(substr(current, 6, 10) >= substr(origin, 6, 10)){
age <- age + 1
}
return(age)
}
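The string version above uses `if`, so it handles only one pair of dates at a time. A vectorized variant of the same comparison might look like the sketch below; `agecalc_strings_vec` is a hypothetical name, and the inputs are assumed to be Dates or strings in YYYY-MM-DD form.

```r
# Hypothetical vectorized variant of the string comparison.
# Assumes inputs are Dates or "YYYY-MM-DD" strings.
agecalc_strings_vec <- function(origin, current) {
  origin  <- as.character(origin)
  current <- as.character(current)
  years <- as.integer(substr(current, 1, 4)) - as.integer(substr(origin, 1, 4))
  # TRUE where the birthday has already happened in the current year
  had_birthday <- substr(current, 6, 10) >= substr(origin, 6, 10)
  years - 1L + as.integer(had_birthday)
}

agecalc_strings_vec(c("1985-08-13", "2000-02-29", "2000-02-29"),
                    c("1986-08-12", "2001-02-28", "2001-03-01"))
```

The logical vector is converted to 0/1 and added in one step, which replaces the scalar `if` branch.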
Some tests on the vectorized "production" version:
## Examples for specific dates to test the calculation with things like
## beginning and end of months, and leap years:
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))
## Testing every age for every day over several years
## This test requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d
Below is a trivial plot of ages as numeric and integer. As you can see, the integer ages form a stair-step pattern that is tangent to (but below) the straight line of numeric ages.
plot(as.numeric(age1_num) ~ d, d, type = "l",
ylab = "ages", main = "ages plotted")
lines(age1 ~ d, d, col = "blue")
I wasn't happy with any of the responses when it comes to calculating the age in months or years, when dealing with leap years, so this is my function using the lubridate package.
Basically, it slices the interval between from and to into (up to) yearly chunks, and then adjusts the interval for whether that chunk is leap year or not. The total interval is the sum of the age of each chunk.
library(lubridate)
#' Get Age of Date relative to Another Date
#'
#' @param from,to the date or dates to consider
#' @param units the units to consider
#' @param floor logical as to whether to floor the result
#' @param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' @author Nicholas Hamilton
#' @export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {
#Account for Leap Year if Working in Months and Years
if(!simple && length(grep("^(month|year)",units)) > 0){
df = data.frame(from,to)
calc = sapply(1:nrow(df),function(r){
#Start and Finish Points
st = df[r,1]; fn = df[r,2]
#If there is no difference, age is zero
if(st == fn){ return(0) }
#If there is a difference, age is not zero and needs to be calculated
sign = +1 #Age Direction
if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign
#Determine the slice-points
mid = ceiling_date(seq(st,fn,by='year'),'year')
#Build the sequence
dates = unique( c(st,mid,fn) )
dates = dates[which(dates >= st & dates <= fn)]
#Determine the age of the chunks
chunks = sapply(head(seq_along(dates),-1),function(ix){
k = 365/( 365 + leap_year(dates[ix]) )
k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
})
#Sum the Chunks, and account for direction
sign*sum(chunks)
})
#If Simple Calculation or Not Months or Not years
}else{
calc = interval(from,to) / duration(num = 1, units = units)
}
if (floor) calc = as.integer(floor(calc))
calc
}
(Sys.Date() - yourDate) / 365.25
Probably the simplest way to calculate an age from two dates without any additional packages is the following, which returns the difference in days:
df$age = with(df, as.Date(date_2, "%Y-%m-%d") - as.Date(date_1, "%Y-%m-%d"))
Here is a (I think simpler) solution using lubridate:
library(lubridate)
age <- function(dob, on.day=today()) {
intvl <- interval(dob, on.day)
prd <- as.period(intvl)
return(prd@year)
}
Note that age_calc from the eeptools package in particular fails on cases with the year 2000 around birthdays.
Some examples that don't work in age_calc:
library(lubridate)
library(eeptools)
age_calc(ymd("1997-04-21"), ymd("2000-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2019-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2016-04-21"), units = "years")
Some of the other solutions also have some output that is not intuitive to what I would want for decimal ages when leap years are involved. I like @James_D's solution and it is precise and concise, but I wanted something where the decimal age is calculated as complete years plus the fraction of the year completed from their last birthday to their next birthday (which would be out of 365 or 366 days depending on year). In the case of leap years I use lubridate's rollback function to use March 1st for non-leap years following February 29th. I used some test cases from @geneorama and added some of my own, and the output aligns with what I would expect.
library(lubridate)
# Calculate precise age from birthdate in ymd format
age_calculation <- function(birth_date, later_year) {
if (birth_date > later_year)
{
stop("Birth date is after the desired date!")
}
# Birthday falling in the calendar year of later_year (it may still lie ahead of later_year, in which case the fraction below is negative)
latest_bday <- ymd(add_with_rollback(birth_date, years((year(later_year) - year(birth_date))), roll_to_first = TRUE))
# Get amount of days between the desired date and the latest birthday
days_between <- as.numeric(days(later_year - latest_bday), units = "days")
# Get how many days are in the year between their most recent and next bdays
year_length <- as.numeric(days((add_with_rollback(latest_bday, years(1), roll_to_first = TRUE)) - latest_bday), units = "days")
# Get the year fraction (amount of year completed before next birthday)
fraction_year <- days_between/year_length
# Sum the difference of years with the year fraction
age_sum <- (year(later_year) - year(birth_date)) + fraction_year
return(age_sum)
}
test_list <- list(c("1985-08-13", "1986-08-12"),
c("1985-08-13", "1985-08-13"),
c("1985-08-13", "1986-08-13"),
c("1985-08-13", "1986-09-12"),
c("2000-02-29", "2000-02-29"),
c("2000-02-29", "2000-03-01"),
c("2000-02-29", "2001-02-28"),
c("2000-02-29", "2004-02-29"),
c("2000-02-29", "2011-03-01"),
c("1997-04-21", "2000-04-21"),
c("2000-04-21", "2016-04-21"),
c("2000-04-21", "2019-04-21"),
c("2017-06-15", "2018-04-30"),
c("2019-04-20", "2019-08-24"),
c("2020-05-25", "2021-11-25"),
c("2020-11-25", "2021-11-24"),
c("2020-11-24", "2020-11-25"),
c("2020-02-28", "2020-02-29"),
c("2020-02-29", "2020-02-28"))
for (i in 1:length(test_list))
{
print(paste0("Dates from ", test_list[[i]][1], " to ", test_list[[i]][2]))
result <- age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2]))
print(result)
}
Output:
[1] "Dates from 1985-08-13 to 1986-08-12"
[1] 0.9972603
[1] "Dates from 1985-08-13 to 1985-08-13"
[1] 0
[1] "Dates from 1985-08-13 to 1986-08-13"
[1] 1
[1] "Dates from 1985-08-13 to 1986-09-12"
[1] 1.082192
[1] "Dates from 2000-02-29 to 2000-02-29"
[1] 0
[1] "Dates from 2000-02-29 to 2000-03-01"
[1] 0.00273224
[1] "Dates from 2000-02-29 to 2001-02-28"
[1] 0.9972603
[1] "Dates from 2000-02-29 to 2004-02-29"
[1] 4
[1] "Dates from 2000-02-29 to 2011-03-01"
[1] 11
[1] "Dates from 1997-04-21 to 2000-04-21"
[1] 3
[1] "Dates from 2000-04-21 to 2016-04-21"
[1] 16
[1] "Dates from 2000-04-21 to 2019-04-21"
[1] 19
[1] "Dates from 2017-06-15 to 2018-04-30"
[1] 0.8739726
[1] "Dates from 2019-04-20 to 2019-08-24"
[1] 0.3442623
[1] "Dates from 2020-05-25 to 2021-11-25"
[1] 1.50411
[1] "Dates from 2020-11-25 to 2021-11-24"
[1] 0.9972603
[1] "Dates from 2020-11-24 to 2020-11-25"
[1] 0.002739726
[1] "Dates from 2020-02-28 to 2020-02-29"
[1] 0.00273224
[1] "Dates from 2020-02-29 to 2020-02-28"
Error in age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2])) :
Birth date is after the desired date!
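The idea described in this answer (completed years plus the fraction of the current birthday-to-birthday year) can also be sketched in base R. This is a scalar-only illustration under stated assumptions, not the answer's lubridate code: `birthday_in` and `decimal_age` are hypothetical names, and a 29 February birthday is assumed to fall on 1 March in non-leap years. Because it always measures from the most recent birthday, its results may differ slightly from the output listed above.

```r
# Hypothetical helper: the birthday in a given calendar year.
# Assumption: a Feb 29 birthday is observed on Mar 1 in non-leap years.
birthday_in <- function(birth, year) {
  md <- format(birth, "%m-%d")
  is_leap <- (year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0
  if (md == "02-29" && !is_leap) {
    as.Date(sprintf("%d-03-01", year))
  } else {
    as.Date(sprintf("%d-%s", year, md))
  }
}

# Hypothetical scalar function: completed years + fraction of current year
decimal_age <- function(birth, on) {
  stopifnot(on >= birth)
  y <- as.integer(format(on, "%Y"))
  last <- birthday_in(birth, y)
  if (last > on) {               # birthday not reached yet this year
    y <- y - 1L
    last <- birthday_in(birth, y)
  }
  nxt <- birthday_in(birth, y + 1L)
  completed <- y - as.integer(format(birth, "%Y"))
  completed + as.numeric(on - last) / as.numeric(nxt - last)
}

decimal_age(as.Date("1985-08-13"), as.Date("1986-08-13"))
decimal_age(as.Date("2000-02-29"), as.Date("2004-02-29"))
```

Stepping back one year when the birthday has not yet occurred keeps the fraction between 0 and 1, at the cost of one extra branch.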
As others have been saying, the trunc function is excellent to get integer age.
I realise there are a lot of answers but since I can't help myself, I might as well add to the discussion.
I'm building a package that's focused on dates and datetimes and in it I use a function called time_diff(). Here is a simplified version.
time_diff <- function(x, y, units, num = 1,
type = c("duration", "period"),
as_period = FALSE){
type <- match.arg(type)
units <- match.arg(units, c("picoseconds", "nanoseconds", "microseconds",
"milliseconds", "seconds", "minutes", "hours", "days",
"weeks", "months", "years"))
int <- lubridate::interval(x, y)
if (as_period || type == "period"){
if (as_period) int <- lubridate::as.period(int, unit = units)
unit <- lubridate::period(num = num, units = units)
} else {
unit <- do.call(get(paste0("d", units),
asNamespace("lubridate")),
list(x = num))
}
out <- int / unit
out
}
# Wrapper around the more general time_diff
age_years <- function(x, y){
trunc(time_diff(x, y, units = "years", num = 1,
type = "period", as_period = TRUE))
}
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
bday <- dmy("01-01-2000")
time_diff(bday, today(), "years", type = "period")
#> [1] 23.11233
leap1 <- dmy("29-02-2020")
leap2 <- dmy("28-02-2021")
leap3 <- dmy("01-03-2021")
# Many people might say this is wrong so use the more exact age_years
time_diff(leap1, leap2, "years", type = "period")
#> [1] 1
# age in years, accounting for leap years properly
age_years(leap1, leap2)
#> [1] 0
age_years(leap1, leap3)
#> [1] 1
# So to add a column of ages in years, one can do this..
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_data <- tibble(dob = seq(bday, today(), by = "day"))
my_data <- my_data %>%
mutate(age_years = age_years(dob, today()))
slice_head(my_data, n = 10)
#> # A tibble: 10 x 2
#> dob age_years
#> <date> <dbl>
#> 1 2000-01-01 23
#> 2 2000-01-02 23
#> 3 2000-01-03 23
#> 4 2000-01-04 23
#> 5 2000-01-05 23
#> 6 2000-01-06 23
#> 7 2000-01-07 23
#> 8 2000-01-08 23
#> 9 2000-01-09 23
#> 10 2000-01-10 23
Created on 2023-02-11 with reprex v2.0.2

Resources