Computing age today from date of birth variable (R)

I have a dataframe with information on date of birth by individual id.
mydf <- data.frame(id=c(1,2),
dtbirth=as.Date(c("2012-01-01","2013-04-01")))
I would like to compute the age of the individuals as of today. The code below seems to work, but the new variable age is output in "days":
library(dplyr)
library(lubridate)
mydf %>%
mutate(age=floor((today()-dtbirth)/365))

We can wrap the result in as.integer/as.numeric to drop the difftime class attribute:
mydf %>%
mutate(age= as.integer(floor((today()-dtbirth)/365)))
Output:
# id dtbirth age
#1 1 2012-01-01 9
#2 2 2013-04-01 8
By default, when we subtract dates with -, the resulting difftime picks its units automatically (units = "auto"):
mydf %>%
mutate(age = today() - dtbirth)
# id dtbirth age
#1 1 2012-01-01 3430 days
#2 2 2013-04-01 2974 days
If we need finer control, we can call difftime directly and specify the units:
mydf %>%
mutate(age = difftime(today(), dtbirth, units = 'weeks'))
# id dtbirth age
#1 1 2012-01-01 490.0000 weeks
#2 2 2013-04-01 424.8571 weeks
We cannot have units greater than 'weeks' as the available options are
difftime(time1, time2, tz,
units = c("auto", "secs", "mins", "hours",
"days", "weeks"))
and the documentation (?difftime) notes:
Units such as "months" are not possible as they are not of constant length. To create intervals of months, quarters or years use seq.Date or seq.POSIXt.
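Dividing by 365 ignores leap days, so the result can be off by one close to a birthday. Below is a minimal base-R sketch that compares calendar fields instead. age_in_years is a hypothetical helper name (not part of any package), and the reference date is fixed only to keep the example reproducible:

```r
# Hypothetical helper (not from a package): whole years between dob and ref.
age_in_years <- function(dob, ref = Sys.Date()) {
  dob_lt <- as.POSIXlt(dob)
  ref_lt <- as.POSIXlt(ref)
  years <- ref_lt$year - dob_lt$year
  # TRUE when the birthday has not yet occurred in the reference year
  not_yet <- ref_lt$mon < dob_lt$mon |
    (ref_lt$mon == dob_lt$mon & ref_lt$mday < dob_lt$mday)
  as.integer(years - not_yet)
}

mydf <- data.frame(id = c(1, 2),
                   dtbirth = as.Date(c("2012-01-01", "2013-04-01")))
# Fixed reference date (an assumption for reproducibility); use Sys.Date() in practice
mydf$age <- age_in_years(mydf$dtbirth, ref = as.Date("2021-05-23"))
```

For someone born on 2012-01-01, the /365 approach already reports 9 on 2020-12-30, two days before the ninth birthday, whereas the field comparison above still gives 8.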

Related

Filter data by last 12 Months of the total data available in R

I have a data set with sales values for N products, running from some yyyy-mm-dd to some yyyy-mm-dd. I want to filter the data for the last 12 months for each product in the data set.
E.g.:
Say I have values from 2016-01-01 to 2020-02-01.
So now I want to filter the sales values for the last 12 months, that is, from 2019-02-01 to 2020-02-01.
I cannot simply hard-code "filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01"))" because the end date keeps changing as each month passes, so I need to automate this.
You can use:
library(dplyr)
library(lubridate)
data %>%
group_by(Product) %>%
filter(between(date, max(date) - years(1), max(date)))
#filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximum date per product minus 365 days:
library(dplyr)
df %>%
group_by(Products) %>%
filter(Date >= max(Date)-365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
# POSIXlt stores the year as a plain field, so it can be decremented directly
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.POSIXct(start.date.lt)
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$day = 1, and so forth if you need to adjust in different ways.
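The "today.last.year" idea described above could be sketched as follows. This is only an illustration: the function name today_last_year and its feb29 argument are hypothetical, and the function handles a single date at a time.

```r
# Hypothetical helper: the same calendar day one year earlier (scalar input).
# feb29 chooses what "February 29th last year" should mean.
today_last_year <- function(d = Sys.Date(), feb29 = c("feb28", "mar1")) {
  feb29 <- match.arg(feb29)
  year <- as.integer(format(d, "%Y")) - 1L
  month_day <- format(d, "%m-%d")
  if (month_day == "02-29") {
    # The year before a leap year is never a leap year, so remap the day
    month_day <- if (feb29 == "feb28") "02-28" else "03-01"
  }
  as.Date(paste0(year, "-", month_day), format = "%Y-%m-%d")
}

end.date <- as.Date("2020-02-29")
start.date <- today_last_year(end.date)               # falls back to Feb 28
start.date.alt <- today_last_year(end.date, "mar1")   # or rolls to Mar 1
```

Because the result is a Date, it can be compared directly against a Date column in a filter.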
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
product = rep(c("a", "b"), 6),
date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
year(x) <- year(x) - 1
x
}
date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch=0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?

Assigning values to all rows within a specific hour range using monthly data

I have a dataframe in the following format:
temp:
id time date
1 06:22:30 2018-01-01
2 08:58:00 2018-01-15
3 09:30:21 2018-01-30
The actual data set continues on for 9000 rows with obs for times throughout the month of January. I want to write a code that will assign each row a new value depending on which hour range the time variable belongs to.
A couple of example hour ranges would be:
Morning peak: 06:00:00 - 08:59:00
Morning: 09:00:00 - 11:59:00
The desired output would look like this:
id time date time_of_day
1 06:22:30 2018-01-01 MorningPeak
2 08:58:00 2018-01-15 MorningPeak
3 09:30:21 2018-01-30 Morning
I have tried playing around with time objects using the chron package using the following code to specify different time ranges:
MorningPeak <- temp[temp$time >= "06:00:00" & temp$time <= "08:59:59",]
MorningPeak$time_of_day <- "MorningPeak"
Morning <- temp[temp$time >= "09:00:00" & temp$time <= "11:59:59",]
Morning$time_of_day <- "Morning"
The results could then be merged and then manipulated to get everything in the same column. Is there a way to do this such that the desired result is generated and no extra data manipulation is required? I am interested in learning how to make my code more efficient.
You are comparing character strings, not time/datetime objects; convert the values to date-time before comparing. Here it is enough to compare the hour of the day to get the appropriate labels.
library(dplyr)
df %>%
mutate(hour = as.integer(format(as.POSIXct(time, format = "%T"), "%H")),
time_of_day = case_when(hour >= 6 & hour < 9 ~ "MorningPeak",
hour >= 9 & hour < 12 ~ "Morning",
TRUE ~ "Rest of the day"))
# id time date hour time_of_day
#1 1 06:22:30 2018-01-01 6 MorningPeak
#2 2 08:58:00 2018-01-15 8 MorningPeak
#3 3 09:30:21 2018-01-30 9 Morning
You can add more hourly criteria if needed.
We can also use cut
cut(as.integer(format(as.POSIXct(df$time, format = "%T"), "%H")),
breaks = c(-Inf, 6, 9, 12, Inf), right = FALSE,
labels = c("Rest of the day", "MorningPeak", "Morning", "Rest of the day"))
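The hour-based comparison works because the example ranges start and end on whole hours. If a range boundary did not fall on the hour, one option would be to convert the times to seconds since midnight first. A minimal base-R sketch, where to_seconds is a hypothetical helper and temp is rebuilt from the sample rows in the question:

```r
# Hypothetical helper: "HH:MM:SS" strings to seconds since midnight.
to_seconds <- function(hms) {
  parts <- do.call(rbind, strsplit(hms, ":", fixed = TRUE))
  storage.mode(parts) <- "double"   # character matrix -> numeric matrix
  as.vector(parts %*% c(3600, 60, 1))
}

temp <- data.frame(id = 1:3,
                   time = c("06:22:30", "08:58:00", "09:30:21"),
                   date = as.Date(c("2018-01-01", "2018-01-15", "2018-01-30")),
                   stringsAsFactors = FALSE)

# Breaks at 06:00, 09:00 and 12:00; right = FALSE makes each range [start, end)
temp$time_of_day <- cut(to_seconds(temp$time),
                        breaks = c(6, 9, 12) * 3600,
                        labels = c("MorningPeak", "Morning"),
                        right = FALSE)
```

Times outside the listed breaks come back as NA in this sketch; add further breaks and labels to cover the rest of the day.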

Match Days of the Week within an Interval to Create Specific Dates

I am working with a data set that has the following structure for its dates:
Week DateStart DateEnd Day
1 5-Aug-16 11-Aug-16 Monday
2 12-Aug-16 18-Aug-16 Thursday
Where "Week" corresponds to a study week number, "DateStart" and "DateEnd" are the first and last days of that week, and "Day" represents the specific day from within that week. I would like to use the "DateStart", "DateEnd", and "Day" fields to create a new field, "Date", that assigns a specific date to each "Day" that falls within the "DateStart" and "DateEnd" interval.
I've used the %--% operator to turn DateStart and DateEnd into an interval:
Week_Interval <- DateStart %--% DateEnd
but then I haven't had much luck on figuring out how to match the Day field to a date within the resulting interval. I've tried reading through the lubridate documentation, but it didn't seem like there was anything in there that could specifically solve my problem. I'm hoping someone here might have some experience with this and could help point me in the right direction.
My ideal output would be something like:
Week DateStart DateEnd Day Date
1 5-Aug-16 11-Aug-16 Monday 08-08-2016
2 12-Aug-16 18-Aug-16 Thursday 18-08-2016
Where the date follows the standard dd-mm-yyyy format.
Take the difference between the day of the week of Day and DateStart modulo 7 and add that to the DateStart.
No packages are used.
dow <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
transform(DF, Date =
DateStart + (match(Day, dow) - 1 - as.POSIXlt(DateStart)$wday) %% 7)
giving:
Week DateStart DateEnd Day Date
1 1 2016-08-05 2016-08-11 Monday 2016-08-08
2 2 2016-08-12 2016-08-18 Thursday 2016-08-18
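To make the modulo arithmetic concrete, here is the first row worked through step by step. The variable names are only for illustration:

```r
dow <- c("Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday")

start <- as.Date("2016-08-05")
start_wday <- as.POSIXlt(start)$wday        # 0 = Sunday ... 6 = Saturday; Friday is 5
target_wday <- match("Monday", dow) - 1     # Monday is 1 on the same scale
offset <- (target_wday - start_wday) %% 7   # (1 - 5) %% 7 is 3 days forward
result <- start + offset                    # 2016-08-08
```

R's %% returns a non-negative remainder for a positive divisor, which is why a target day "earlier in the week" than the start day still yields a forward offset.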
Note 1
An alternative to writing out the days of the week, provided you are in an English locale, is:
dow <- weekdays(as.Date("1950-01-01") + 0:6)
Note 2
In the example the Start Date is Friday on both rows. If it were known that that is always the case we could shorten the code by hard coding it as 5:
transform(DF, Date = DateStart + (match(Day, dow) - 1 - 5) %% 7)
Note 3
The input, in reproducible form, is:
Lines <- "Week DateStart DateEnd Day
1 5-Aug-16 11-Aug-16 Monday
2 12-Aug-16 18-Aug-16 Thursday"
DF <- read.table(text = Lines, header = TRUE)
fmt <- "%d-%b-%y"
DF <- transform(DF, DateStart = as.Date(DateStart, fmt),
DateEnd = as.Date(DateEnd, fmt))
# example data
df = read.table(text = "
Week DateStart DateEnd Day
1 5-Aug-16 11-Aug-16 Monday
2 12-Aug-16 18-Aug-16 Thursday
", header=T, stringsAsFactors=F)
library(tidyverse)
library(lubridate)
df %>%
group_by(Week, Day) %>% # for each week and day
mutate(Date = list(seq(dmy(DateStart), dmy(DateEnd), "1 day")), # get sequence of dates between start and end
Day2 = map(Date, weekdays)) %>% # get name of days for each date in the sequence
unnest(cols = c(Date, Day2)) %>% # unnest dates (recent tidyr versions expect cols to be named)
ungroup() %>% # forget the grouping
filter(Day == Day2) %>% # keep days that match
select(-Day2) # remove unnecessary column
# # A tibble: 2 x 5
# Week DateStart DateEnd Day Date
# <int> <chr> <chr> <chr> <date>
# 1 1 5-Aug-16 11-Aug-16 Monday 2016-08-08
# 2 2 12-Aug-16 18-Aug-16 Thursday 2016-08-18

R check if date is 2 years apart

I have a dataset with two columns Id and Date as shown below using a toy dataset.
Id Date
5373283 2010-11-05
5373283 2014-11-05
5373283 2001-07-13
5373283 2007-12-01
5373283 2015-07-07
3475684 2015-05-19
3475684 2010-06-24
I want to check if any of the dates for each id are within 2 years range. If they are then a column will show yes, if not, No. The final output would look like this
Id Status
5373283 Yes
3475684 No
Yes for Id 5373283 because the two dates 2014-11-05 and 2015-07-07 are within two years of each other. No for Id 3475684 because the two dates are more than 2 years apart. Any help on accomplishing this much appreciated.
Hypothetical data.
DF <- data.frame(id = c(1, 1, 1, 2, 2),
date = c("2010-10-9", "2012-10-8", "2008-10-5",
"2007-7-5", "2009-7-5"), stringsAsFactors = FALSE)
The code below gets the minimum interval by ID, in days.
What is happening:
mutate converts date to the Date class,
arrange sorts the data by date,
group_by tells dplyr that the following computation is done for each ID,
summarize computes the minimum difference.
library(dplyr)
DF %>% mutate(date = as.Date(date)) %>%
arrange(date) %>%
group_by(id) %>%
summarize(diffmin = as.numeric(min(diff(date)), units = "days"))
# id diffmin
# (dbl) (dbl)
#1 1 730
#2 2 731
If you can ignore leap years, a value less than or equal to 730 means within 2 years. Note that the difference between 2007-7-5 and 2009-7-5 is 731 days, so that pair is judged as more than 2 years apart.
If that is not acceptable, a simple day difference is not enough; you need a custom checker function.
check2years <- function(a, b) {
# check if b - a <= 2 years
# assumes a and b are Date, with a <= b
yr_a <- format(a, "%Y") %>% as.integer()
yr_b <- format(b, "%Y") %>% as.integer()
dy_a <- format(a, "%m-%d")
dy_b <- format(b, "%m-%d")
# exactly two calendar years apart only counts if b's month-day is not past a's
(yr_b - yr_a < 2) | ((yr_b - yr_a == 2) & (dy_b <= dy_a))
}
Then, you can check if any combination is within 2 years by the following.
DF %>% mutate(date = as.Date(date)) %>%
arrange(date) %>%
group_by(id) %>%
summarize(within2yr = any(check2years(head(date, length(date)-1),
tail(date, length(date)-1))))
# id within2yr
# (dbl) (lgl)
#1 1 TRUE
#2 2 TRUE
You can also solve this without any library:
Using your example:
Id = c(5373283,5373283,5373283,5373283,5373283,3475684,3475684)
Date = as.Date(c("2010-11-05","2014-11-05","2001-07-13","2007-12-01","2015-07-07","2015-05-19","2010-06-24"))
df = data.frame(Id,Date)
> df
Id Date
7 3475684 2010-06-24
6 3475684 2015-05-19
3 5373283 2001-07-13
4 5373283 2007-12-01
1 5373283 2010-11-05
2 5373283 2014-11-05
5 5373283 2015-07-07
Do the following:
First, order your data by Id and then by Date
df = df[order(df$Id,df$Date),]
Do an aggregate by Id using the function min(diff(x)), where x are the dates for each Id.
z = aggregate(df$Date,by = list(Id = df$Id),FUN = function(x){min(diff(x))})
What this function does is it returns the lowest difference between adjacent dates. This is why you need to order the data frame first.
This returns:
> z
Id x
1 3475684 1790
2 5373283 244
Where column x is the minimum difference in days.
Here, you only need to check whether column x is less than or equal to 2*365
z$result = z$x<=2*365
Giving:
Id x result
1 3475684 1790 FALSE
2 5373283 244 TRUE
Final code
df = df[order(df$Id,df$Date),]
z = aggregate(df$Date,by = list(Id = df$Id),FUN = function(x){min(diff(x))})
z$result = z$x<=2*365
You can also use dplyr: take the two most recent dates for each Id and check whether they are less than two years (730 days) apart. Note that this compares only the two latest dates, not every pair:
library(dplyr)
df$Date <- as.Date(df$Date)
df %>%
group_by(Id) %>%
summarise(Status = as.numeric(difftime(max(Date), Date[order(Date, decreasing = TRUE)][2], units = 'days')) < 730)
Output will be as follows:
Source: local data frame [2 x 2]
Id Status
(int) (lgl)
1 3475684 FALSE
2 5373283 TRUE

Calculate mean date across years

I am trying to calculate the mean date independent of year for each level of a factor.
DF <- data.frame(Date = seq(as.Date("2013-2-15"), by = "day", length.out = 730))
DF$ID = rep(c("AAA", "BBB", "CCC"), length.out = 730)
head(DF)
Date ID
1 2013-02-15 AAA
2 2013-02-16 BBB
3 2013-02-17 CCC
4 2013-02-18 AAA
5 2013-02-19 BBB
6 2013-02-20 CCC
With the data above and the code below, I can calculate the mean date for each factor, but this includes the year.
I want a mean month and day across years. The preferred result would be a POSIXct time class formatted as month-day (eg. 12-31 for Dec 31st) representing the mean month and day across multiple years.
library(dplyr)
DF2 <- DF %>% group_by(ID) %>% mutate(
Col = mean(Date, na.rm = T))
DF2
Addition
I am looking for the mean day of the year with a month and day component, for each factor level. If the date represents, for example, the date an animal reproduced, I am not interested in the yearly differences between years, but instead want a single mean day.
The end result would look like DF2 but with the new value calculated as previously described (mean day of the year with a month and day component).
Sorry this was not more clear.
If I understand your question correctly, here's how to get a mean date column. I first extract the day of the year with yday from POSIXlt. I then calculate the mean. To get a date back, I have to add those days to an actual year, hence the creation of the Year object. As requested, I put the results in the same format as DF2 in your example.
library(dplyr)
DF2 <- DF %>%
mutate(Year=format(Date,"%Y"),
Date_day=as.POSIXlt(Date, origin = "1960-01-01")$yday)%>%
group_by(ID) %>%
mutate(Col = mean(Date_day, na.rm = T),Mean_date=format(as.Date(paste0(Year,"-01-01"))+Col,"%m-%d"))%>%
select(Date,ID,Mean_date)
DF2
> DF2
Source: local data frame [730 x 3]
Groups: ID [3]
Date ID Mean_date
(date) (chr) (chr)
1 2013-02-15 AAA 07-02
2 2013-02-16 BBB 07-02
3 2013-02-17 CCC 07-01
4 2013-02-18 AAA 07-02
5 2013-02-19 BBB 07-02
6 2013-02-20 CCC 07-01
7 2013-02-21 AAA 07-02
8 2013-02-22 BBB 07-02
9 2013-02-23 CCC 07-01
10 2013-02-24 AAA 07-02
.. ... ... ...
You can take the mean of dates by using the mean function. However, note that the mean implementation (and result) will be different depending on the data type. For POSIXct, the mean will be calculated and return the date and time - think of taking the mean of a bunch of integers and you will likely get a float or numeric. For Date, it will essentially 'round' the date to the nearest date.
For example, I recently took a mean of dates. Look at the output when different data types are used.
> mean(as.Date(stationPointDf$knockInDate))
[1] "2018-06-04"
> mean(as.POSIXct(stationPointDf$knockInDate))
[1] "2018-06-03 21:19:21 CDT"
If I am looking for a mean Month and Day across years, I convert all the dates to have the current year using lubridate package.
library(lubridate)
year(myVectorOfDates) <- 2018
Then, I compute the mean and drop the year.
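Those last two steps could look like the following base-R sketch, which avoids the lubridate dependency. The input vector is hypothetical, and 2018 is used as the common year only because the text above does:

```r
# Hypothetical input; any Date vector works
myVectorOfDates <- as.Date(c("2013-03-10", "2014-03-20", "2015-03-15"))

# Rewrite every date into the same reference year
same_year <- as.Date(format(myVectorOfDates, "2018-%m-%d"), format = "%Y-%m-%d")

# Average the dates, then keep only the month-day part
mean_month_day <- format(mean(same_year, na.rm = TRUE), "%m-%d")
```

Two caveats apply to this approach: a February 29th has no counterpart in a non-leap reference year (it is dropped by na.rm = TRUE here), and dates that straddle the New Year will average to somewhere mid-year rather than to late December or early January.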
