Calculate "age at first record" for each ID - r

Background
I've got a dataset d on use of services for members (shown as ID) of an organization. Here's a toy example:
d <- data.frame(ID = c("a","a","b","b"),
                dob = as.Date(c("2004-04-17","2004-04-17","2009-04-24","2009-04-24")),
                service_date = as.Date(c("2018-01-01","2019-07-12","2014-12-23","2016-04-27")),
                stringsAsFactors = FALSE)
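It looks like this:
  ID        dob service_date
1  a 2004-04-17   2018-01-01
2  a 2004-04-17   2019-07-12
3  b 2009-04-24   2014-12-23
4  b 2009-04-24   2016-04-27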
Besides ID, it's got a date of birth dob and dates of service service_date for each time the member has used the organization's service.
Problem
I'd like to add a new column age_first_record that represents -- you guessed it -- a member's age at their first service_date. Ideally, this figure should be in years, not days. So if you're 13 years and one month old, the figure would be 13.08. The number should repeat within ID, too, so whatever that ID's age at their first service_date was, that's the number that goes in every row for that ID.
What I'm looking for is something that looks like this:
What I've Tried
So far, I'm messing with the MIN function a bit, like so:
d <- d %>%
mutate(age_first_rec = min(d$service_date-d$dob))
But I can't seem to (a) get it to work "within" each ID and (b) express it in years, not days. I'm not too familiar with working with datetime objects, so forgive the clumsiness here.
Any help is much appreciated!

We can use difftime to get the difference in days and divide by 365 (this ignores leap years, so the result is approximate):
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(age_first_record = as.numeric(difftime(min(service_date),
                                                dob, units = "days")) / 365) %>%
  ungroup()
-output
# A tibble: 4 x 4
  ID    dob        service_date age_first_record
  <chr> <date>     <date>                  <dbl>
1 a     2004-04-17 2018-01-01              13.7
2 a     2004-04-17 2019-07-12              13.7
3 b     2009-04-24 2014-12-23               5.67
4 b     2009-04-24 2016-04-27               5.67
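For comparison, here is a minimal base R sketch of the same idea that needs no extra packages. It assumes an average year length of 365.25 days (my choice, not part of the answer above, which divides by 365) and rounds to two decimals as the question asks; first_service and age_days are just illustrative names.

```r
d <- data.frame(
  ID = c("a", "a", "b", "b"),
  dob = as.Date(c("2004-04-17", "2004-04-17", "2009-04-24", "2009-04-24")),
  service_date = as.Date(c("2018-01-01", "2019-07-12", "2014-12-23", "2016-04-27")),
  stringsAsFactors = FALSE
)

# Earliest service date per ID, repeated on every row of that ID.
# ave() is applied to the underlying day counts, then converted back to Date.
first_service <- as.Date(
  ave(as.numeric(d$service_date), d$ID, FUN = min),
  origin = "1970-01-01"
)

# Age in days at the first service date.
age_days <- as.numeric(difftime(first_service, d$dob, units = "days"))

# Assumption: 365.25 days per year on average, rounded to two decimals.
d$age_first_record <- round(age_days / 365.25, 2)
print(d)
```

Dividing by a fixed number of days is always an approximation; if exact calendar ages matter, a calendar-aware approach would be needed instead.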

Related

How to mutate a complex variable involving dates?

I have a tibble in which each row represents an image of an eye and contains the following relevant variables: patientId, laterality (left or right), date, imageId.
I would like to manipulate this to create another tibble showing the number of followUpYears for each eye (patientId, laterality). followUpYears is defined in a somewhat unusual way:
1. In order to meet the requirements for follow-up in a particular year, there must be two different imaging dates during that year, i.e. between days 0-365 for year 1, days 366-730 for year 2, etc. The first image date is always the baseline and followUpYears is always an integer.
2. Only one image per date is considered.
3. Follow-up ceases as soon as the requirement for 2 imaging dates in a year is not met, i.e. if there is only 1 imaging date in the first year, followUpYears is 0 regardless of how many images are taken subsequently.
4. There is no requirement for there to be at least n years between the first and last image date for an eye to have n followUpYears.
The following dummy data demonstrates these points:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
expected_output <- tibble(patientId = c('A','B','B'),
laterality = c('L','R','L'),
followUpYears = c(0, 2, 1))
Patient A's left eye has 0 followUpYears because of points 2 and 3. Patient B's right eye has 2 followUpYears because of point 4 (despite the fact that there is only slightly more than 1 year between the first and last image date). Patient B's left eye only has 1 year of follow up since it doesn't meet the requirement for 2 image dates in year 2.
I am familiar with the basic dplyr verbs but I can't think of how to frame this type of variable. Note that patients might have one or both eyes included and some might have 10+ years of follow up. Finally, a solution that considers 1 year to be 365 days regardless of leap years is fine.
Thank you!
Here's a way with ifelse. diff_year is a helper function that computes the difference between two dates in years, rounded up to the next integer.
library(dplyr)
diff_year <- function(date1, date2) ceiling(as.numeric(difftime(date1, date2)) / 365)
data %>%
  group_by(patientId) %>%
  summarise(followUpYears = ifelse(diff_year(date[date != first(date)][1], first(date)) <= 1,
                                   diff_year(max(date), min(date)), 0))
#A tibble: 2 × 2
# patientId followUpYears
# <chr> <dbl>
#1 A 0
#2 B 2
Update with OP's comment. This should work with all conditions:
diff_year <- function(date1, date2) as.numeric((date1 - date2) / 365)
data %>%
  distinct(patientId, laterality, date, .keep_all = TRUE) %>%
  group_by(patientId, laterality) %>%
  mutate(diffYear = floor(diff_year(date, min(date)))) %>%
  add_count(count = diffYear) %>%
  filter(!cumany(lag(n == 1, default = 0)) | row_number() == 1) %>%
  summarise(followUpYears = ifelse(any(n > 1), ceiling(diff_year(max(date[n != 1]), min(date))), 0))
# patientId laterality followUpYears
#1 A L 0
#2 B L 1
#3 B R 2
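To make the four rules from the question concrete, here is a rough base R sketch that applies them literally to each eye. The helper follow_up_years() and the name eye_data are my own illustrative choices, not from the posts above, and a year is assumed to be exactly 365 days, as the question allows.

```r
eye_data <- data.frame(
  patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
  laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
  date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07',
                   '2002-05-06','2002-05-07','2000-09-08','2001-09-07',
                   '2001-09-09','2001-09-10','2000-09-08','2001-09-07',
                   '2001-09-10'))
)

# Hypothetical helper: number of consecutive follow-up years for one eye.
follow_up_years <- function(dates) {
  x <- as.numeric(dates)
  # Rule 2: only one image per date counts; days are measured from baseline.
  days <- sort(unique(x - min(x)))
  # Rule 1: days 0-365 fall in year 1, days 366-730 in year 2, and so on.
  band <- pmax(1, ceiling(days / 365))
  years <- 0
  # Rule 3: stop at the first year that has fewer than two imaging dates.
  while (sum(band == years + 1) >= 2) {
    years <- years + 1
  }
  years
}

# One element per eye; names have the form "patientId.laterality".
eyes <- split(eye_data$date,
              list(eye_data$patientId, eye_data$laterality),
              drop = TRUE)
followUpYears <- sapply(eyes, follow_up_years)
print(followUpYears)
```

On the dummy data this gives 0 for A.L, 1 for B.L and 2 for B.R, which agrees with expected_output in the question.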
Below is my approach, which should cover all four conditions. I'm not sure, however, how you get:
#> # A tibble: 1 x 3
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 B L 1
since, according to your logic, it should fall into the two-year band: from 2000-09-08 to 2001-09-10 is 367 days, which equals two years.
The idea is that we first calculate a followup_flag, which checks whether each date is within 365 days of the previous date, and then take the cummin() so that the series breaks as soon as there is no direct follow-up year.
Then we can keep only the rows where followup_flag == 1.
For the remaining rows we check how many years lie between the first and the last date, and since we want to count 367 days as 2 years, we have to take the ceiling().
library(dplyr)
library(lubridate)
data %>%
  group_by(patientId, laterality) %>%
  mutate(followup_flag = cummin(date - dplyr::lag(date, default = first(date)) <= 365)) %>%
  filter(followup_flag == 1) %>%
  summarise(followUpYears = as.numeric(
    difftime(last(date), first(date), units = "days") / 365) %>%
    ceiling()
  )
#> `summarise()` has grouped output by 'patientId'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: patientId [2]
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 A L 0
#> 2 B L 2
#> 3 B R 2
Data used:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
Created on 2023-02-08 by the reprex package (v2.0.1)

How to use group_by without ordering alphabetically?

I'm trying to visualize some bird data, however after grouping by month, the resulting output is out of order from the original data. It is in order for December, January, February, and March in the original, but after manipulating it results in December, February, January, March.
Any ideas how I can fix this or sort the rows?
This is the code:
BirdDataTimeClean <- BirdDataTimes %>%
group_by(Date) %>%
summarise(Gulls=sum(Gulls), Terns=sum(Terns), Sandpipers=sum(Sandpipers),
Plovers=sum(Plovers), Pelicans=sum(Pelicans), Oystercatchers=sum(Oystercatchers),
Egrets=sum(Egrets), PeregrineFalcon=sum(Peregrine_Falcon), BlackPhoebe=sum(Black_Phoebe),
Raven=sum(Common_Raven))
BirdDataTimeClean2 <- BirdDataTimeClean %>%
pivot_longer(!Date, names_to = "Species", values_to = "Count")
You haven't shared any workable data, but I run into this often when reading from CSV, because all dates then come in as character.
As suggested, convert the date column to Date class using the lubridate package or base as.Date(); after that, arrange() in dplyr (or even group_by()) will put the rows in chronological order.
Example with toy data:
library(data.table)
library(dplyr)
library(lubridate)
birds <- data.table(dates = c("2020-Feb-20","2020-Jan-20","2020-Dec-20","2020-Apr-20"),
                    species = c('Gulls','Terns','Gulls','Sandpiper'),
                    Counts = c(20,30,40,50))
str(birds) shows that dates is character (and I have deliberately not kept the rows in order).
Use lubridate to convert the dates: birds$dates %>% lubridate::ymd() changes them to the Date data type.
birds$dates %>% ymd() %>% str()
Date[1:4], format: "2020-02-20" "2020-01-20" "2020-12-20" "2020-04-20"
Save the result with birds$dates <- ymd(birds$dates), or do the conversion in your pipeline as follows.
Now simply do the dplyr analysis:
birds %>%
  group_by(Months = ymd(dates)) %>%
  summarise(N = n(),
            Species_Count = sum(Counts)) %>%
  arrange(Months)
will give
# A tibble: 4 x 3
  Months         N Species_Count
  <date>     <int>         <dbl>
1 2020-01-20     1            30
2 2020-02-20     1            20
3 2020-04-20     1            50
4 2020-12-20     1            40
However, if you want month names such as Apr or Jan instead of numbers and format the dates accordingly (for example with format()), the dates become character again. I would suggest you keep your data as Date and format it only when presenting output to others; if you are using DT or other datatables, check the output formatting options. That way your original data remains intact and users see what they want.
This will make it character:
birds %>%
  group_by(Months = as.character.Date(dates)) %>%
  summarise(N = n(),
            Species_Count = sum(Counts)) %>%
  arrange(Months)
# A tibble: 4 x 3
Months N Species_Count
<chr> <int> <dbl>
1 2020-Apr-20 1 50
2 2020-Dec-20 1 40
3 2020-Feb-20 1 20
4 2020-Jan-20 1 30
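If the grouping column has to stay as text (month names, say), another option is to turn it into a factor whose levels follow the order of first appearance, so that grouping no longer sorts alphabetically. The sketch below uses base R with made-up data; months and counts are hypothetical stand-ins for the question's columns.

```r
# Hypothetical data: months appear in the order Dec, Jan, Feb, Mar.
months <- c("Dec", "Dec", "Jan", "Feb", "Jan", "Mar")
counts <- c(5, 3, 2, 7, 1, 4)

# Levels are taken in order of first appearance rather than alphabetically.
month_f <- factor(months, levels = unique(months))

# Grouped sums come back in level order.
totals <- tapply(counts, month_f, sum)
print(totals)
```

The same factor can be used inside group_by(), since dplyr orders groups by factor level rather than by the label text.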

R, how to compute an operation between dates on selected rows?

I am working in R on a dataframe which has the date of the first visit and/or of the last visit of a patient, this way:
patient_ID  date        date_number  max_date_number
3           2017-09-25  1            7
3           2019-03-05  7            7
5           2015-10-01  1            1
6           2010-04-15  1            7
6           2011-04-15  5            5
This table is contained in the visits_dataframe variable, computed this way:
visits_dataframe <- data.frame(patient_ID=integer(), date=character(), date_number=character(), max_date_number=character())
patients <- c(3,3,5,6,6)
dates <- c("2017-09-25", "2019-03-05", "2015-10-01", "2010-04-15", "2011-04-15")
date_numbers <- c("1","7","1","1","5")
max_date_numbers <- c("7","7","1","7","5")
visits_dataframe <- data.frame(patients, dates, date_numbers, max_date_numbers, stringsAsFactors=FALSE)
I need to compute the average date distance between the first visit and the last visit, when available, for all the patients. That would be the total duration of the therapy for each patient.
In this example, I would like to compute the distance between 2019-03-05 and 2017-09-25 for the 3 patient, and between 2011-04-15 and 2010-04-15 for the 6 patient.
In this example, I would not be able to compute it for the 5 patient, because the max_date_number is unavailable for her/him.
I tried this piece of code but did not work:
visits_dataframe_durations <- ave(visits_dataframe$date_number, visits_dataframe$patient_ID, FUN = (visits_dataframe[(visits_dataframe$date_number==1),] - visits_dataframe[(visits_dataframe$date_number==max_date_number),]))
Basically, I have to use a command that says:
for each patient ID:
find the last visit date (date_number == max_date_number)
find the first visit date (date_number == 1)
compute the distance between last visit and first visit (thisDuration)
save this duration into a general duration variable (generalDuration += thisDuration)
end for
compute average duration = general duration / number of patients
Can someone help me with this problem? Thanks
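We could do this in dplyr. Convert dates to Date class, then, within each patient, subtract the first visit date from the last visit date so that the duration is positive:
library(dplyr)
visits_dataframe %>%
  mutate(dates = as.Date(dates)) %>%
  group_by(patients) %>%
  mutate(durations = dates[date_numbers == max_date_numbers] -
                     dates[date_numbers == 1])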
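The question also asks for the average duration across patients. One possible base R sketch is below; it assumes that the earliest and latest date recorded for a patient are the first and last visit (instead of relying on the date_numbers flags) and that patients with a single visit are left out of the average. The names visits, span_days and n_visits are illustrative.

```r
visits <- data.frame(
  patients = c(3, 3, 5, 6, 6),
  dates = as.Date(c("2017-09-25", "2019-03-05", "2015-10-01",
                    "2010-04-15", "2011-04-15"))
)

# Days between the earliest and latest visit of each patient.
span_days <- tapply(as.numeric(visits$dates), visits$patients,
                    function(x) max(x) - min(x))

# Number of recorded visits per patient (same ordering as span_days).
n_visits <- tapply(as.numeric(visits$dates), visits$patients, length)

# Assumption: a duration is only defined when there are at least two visits.
durations <- span_days[n_visits > 1]
average_duration <- mean(durations)

print(durations)
print(average_duration)
```

With the example data this keeps patients 3 and 6 and drops patient 5, which matches what the question describes.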

Trying to shift entries in a column

I am trying to move entries in my column around to set them up as start/stop times. The first row for each id is set up fine but I need to shift the rest down and over for this to work as I am trying.
I have tried using dplyr and mutating the entries into new columns but the problem is the time entry is in another column so I am trying to work around that.
#This is what my data looks like
mydata <- data.frame(id = c(rep(1,3), rep(2,2)),
                     baseline = c(rep("2018-07-14",3), rep("2018-06-16",2)),
                     date = c("2018-08-23","2018-09-20","2018-10-05","2018-07-04","2018-08-08"))
head(mydata)
expecteddata <- data.frame(id = c(rep(1,3), rep(2,2)),
                           start = c("2018-07-14","2018-08-23","2018-09-20","2018-06-16","2018-07-04"),
                           end = c("2018-08-23","2018-09-20","2018-10-05","2018-07-04","2018-08-08"))
head(expecteddata)
This is what I am hoping to get. It also might be nice to increment start times since different rows would belong to different risk sets but that is a different issue. Any help or pointers would be greatly appreciated on how I can proceed.
Ensure that the date variables are the correct class and try:
library(dplyr)
mydata %>%
  group_by(id) %>%
  mutate(baseline = lag(date, default = first(baseline))) %>%
  rename(start = baseline, end = date)
# A tibble: 5 x 3
# Groups:   id [2]
     id start      end
  <dbl> <date>     <date>
1     1 2018-07-14 2018-08-23
2     1 2018-08-23 2018-09-20
3     1 2018-09-20 2018-10-05
4     2 2018-06-16 2018-07-04
5     2 2018-07-04 2018-08-08
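For readers who want to see the class conversion and the shift spelled out, here is a small base R sketch of the same start/stop construction. It assumes the rows are already sorted by date within each id; prev_date and result are illustrative names.

```r
mydata <- data.frame(
  id = c(rep(1, 3), rep(2, 2)),
  baseline = as.Date(c(rep("2018-07-14", 3), rep("2018-06-16", 2))),
  date = as.Date(c("2018-08-23", "2018-09-20", "2018-10-05",
                   "2018-07-04", "2018-08-08"))
)

# Previous date within each id (as a day count); NA on the first row of an id.
# Assumption: rows are sorted by date within id.
prev_date <- ave(as.numeric(mydata$date), mydata$id,
                 FUN = function(x) c(NA, head(x, -1)))

# The first interval of each id starts at baseline, later ones at the previous date.
start_num <- ifelse(is.na(prev_date), as.numeric(mydata$baseline), prev_date)

result <- data.frame(
  id = mydata$id,
  start = as.Date(start_num, origin = "1970-01-01"),
  end = mydata$date
)
print(result)
```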

How to split a panel data record in R based on a threshold value for a variable?

I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people do have a record in the next year, and I would like to add the "surplus" days to those when this happens.
So far I have come up with this solution:
library(lubridate)
ndays_new <- ifelse((as.Date(paste(year(data$date), "12-31", sep = "-"),
                             format = "%Y-%m-%d") - data$date) < data$ndays,
                    as.Date(paste(year(data$date), "12-31", sep = "-"),
                            format = "%Y-%m-%d") - data$date,
                    data$ndays)
However, I can't think of a way to get those "surplus" days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
library(dplyr)
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
  mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
         date.2 = date.1 + (60 * 60 * 24) * ndays.1,
         ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
                          date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")), "-01-01"), format = "%Y-%m-%d")),
         ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
  do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
  filter(complete.cases(ndays)) %>%
  mutate(date = as.numeric(format(date, "%Y"))) %>%
  rename(year = date) %>%
  group_by(ID, year) %>%
  summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
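A more brute-force alternative, sketched here in base R, is to expand every stay into one row per hospital day and then count days per ID and calendar year. It assumes the admission date counts as the first of the ndays days; stays, day_rows and per_year are illustrative names. This also copes with stays that span more than two calendar years, at the cost of building one row per day.

```r
stays <- data.frame(
  ID = c(1, 2, 3, 4, 4),
  date = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                   "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15)
)

# One row per hospital day.
# Assumption: the admission date itself is day 1 of the stay.
day_rows <- do.call(rbind, lapply(seq_len(nrow(stays)), function(i) {
  data.frame(ID = stays$ID[i],
             day = stays$date[i] + seq_len(stays$ndays[i]) - 1)
}))

# Calendar year of each hospital day, plus a counter column to sum up.
day_rows$year <- as.integer(format(day_rows$day, "%Y"))
day_rows$ndays <- 1L

# Total hospital days per ID and year.
per_year <- aggregate(ndays ~ ID + year, data = day_rows, FUN = sum)
print(per_year)
```

On the example data the per-year totals agree with the output shown above (for instance 7 and 13 days for ID 3, and 365 and 50 days for ID 4).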

Resources