Trying to shift entries in a column - r

I am trying to move entries in my column around to set them up as start/stop times. The first row for each id is set up fine, but I need to shift the remaining entries down and over to get the layout I want.
I have tried using dplyr and mutating the entries into new columns but the problem is the time entry is in another column so I am trying to work around that.
#This is what my data looks like
mydata<-data.frame(id=c(rep(1,3),rep(2,2)),baseline=c(rep("2018-07-14",3),
rep("2018-06-16",2)),
date=c("2018-08-23","2018-09-20","2018-10-05","2018-07-04","2018-08-08"))
head(mydata)
expecteddata<-data.frame(id=c(rep(1,3),rep(2,2)),
start=c("2018-07-14","2018-08-23","2018-09-20","2018-06-16","2018-07-04"),
end=c("2018-08-23","2018-09-20","2018-10-05","2018-07-04","2018-08-08"))
head(expecteddata)
This is what I am hoping to get. It also might be nice to increment start times since different rows would belong to different risk sets but that is a different issue. Any help or pointers would be greatly appreciated on how I can proceed.

Ensure that the date variables are the correct class and try:
library(dplyr)
mydata %>%
group_by(id) %>%
mutate(baseline = lag(date, default = first(baseline))) %>%
rename(start = baseline, end = date)
# A tibble: 5 x 3
# Groups: id [2]
id start end
<dbl> <date> <date>
1 1 2018-07-14 2018-08-23
2 1 2018-08-23 2018-09-20
3 1 2018-09-20 2018-10-05
4 2 2018-06-16 2018-07-04
5 2 2018-07-04 2018-08-08
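The "correct class" step can be sketched as follows. This is a minimal example under the assumption that both columns start out as character strings, as in the question's toy data; the name result is only illustrative.

```r
library(dplyr)

# Toy data from the question; the date columns start out as character strings
mydata <- data.frame(
  id = c(rep(1, 3), rep(2, 2)),
  baseline = c(rep("2018-07-14", 3), rep("2018-06-16", 2)),
  date = c("2018-08-23", "2018-09-20", "2018-10-05", "2018-07-04", "2018-08-08"),
  stringsAsFactors = FALSE
)

result <- mydata %>%
  # Convert both columns to Date before shifting anything
  mutate(across(c(baseline, date), as.Date)) %>%
  group_by(id) %>%
  # Each row starts where the previous row of the same id ended;
  # the first row of each id falls back to that id's baseline
  mutate(start = lag(date, default = first(baseline))) %>%
  ungroup() %>%
  select(id, start, end = date)
```

With real Date columns, the length of each interval is then available as as.numeric(result$end - result$start).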

Related

How to mutate a complex variable involving dates?

I have a tibble in which each row represents an image of an eye and contains the following relevant variables: patientId, laterality (left or right), date, imageId.
I would like to manipulate this to create another tibble showing the number of followUpYears for each eye (patientId, laterality). followUpYears is defined in a somewhat unusual way:
In order to meet the requirements for follow-up in a particular year, there must be two different imaging dates during that year i.e. between days 0-365 for year 1, days 366-730 for year 2 etc. The first image date is always the baseline and followUpYears is always an integer.
Only one image per date is considered.
Follow-up ceases as soon as the requirement for 2 imaging dates in a year is not met i.e. if there is only 1 imaging date in the first year, followUpYears is 0 regardless of how many images are taken subsequently.
There is no requirement for there to be at least n years between the first and last image date for an eye to have n followUpYears.
The following dummy data demonstrates these points:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
expected_output <- tibble(patientId = c('A','B','B'),
laterality = c('L','R','L'),
followUpYears = c(0, 2, 1))
Patient A's left eye has 0 followUpYears because of points 2 and 3. Patient B's right eye has 2 followUpYears because of point 4 (despite the fact that there is only slightly more than 1 year between the first and last image date). Patient B's left eye only has 1 year of follow up since it doesn't meet the requirement for 2 image dates in year 2.
I am familiar with the basic dplyr verbs but I can't think of how to frame this type of variable. Note that patients might have one or both eyes included and some might have 10+ years of follow up. Finally, a solution that considers 1 year to be 365 days regardless of leap years is fine.
Thank you!
Here's a way with ifelse. diff_year is a helper function that computes the difference between two dates in years, rounded up to the next whole number.
library(dplyr)
diff_year <- function(date1, date2) ceiling(as.numeric(difftime(date1, date2)) / 365)
data %>%
group_by(patientId) %>%
summarise(followUpYears = ifelse(diff_year(date[date != first(date)][1], first(date)) <= 1,
diff_year(max(date), min(date)), 0))
#A tibble: 2 × 2
# patientId followUpYears
# <chr> <dbl>
#1 A 0
#2 B 2
Update with OP's comment. This should work with all conditions:
diff_year <- function(date1, date2) as.numeric((date1 - date2) / 365)
data %>%
distinct(patientId, laterality, date, .keep_all = TRUE) %>%
group_by(patientId, laterality) %>%
mutate(diffYear = floor(diff_year(date, min(date)))) %>%
add_count(count = diffYear) %>%
filter(!cumany(lag(n == 1, default = 0)) | row_number() == 1) %>%
summarise(followUpYears = ifelse(any(n > 1), ceiling(diff_year(max(date[n != 1]), min(date))), 0))
# patientId laterality followUpYears
#1 A L 0
#2 B L 1
#3 B R 2
Below is my approach, which should cover all four conditions. I'm not sure, however, how you get:
#> # A tibble: 1 x 3
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 B L 1
According to your logic it should fall into the two-year band, since 2000-09-08 to 2001-09-10 is 367 days, which counts as two years.
The idea is that we first calculate a followup_flag which checks whether each date is within 365 days of the previous date, and then take the cummin() so that the series breaks as soon as there is no direct follow-up year.
Then we can filter all rows which meet the followup_flag == 1.
And for this data set we check how many years are between the first and the last date, and since we want to count 367 as 2 years we have to take the ceiling().
library(dplyr)
library(lubridate)
data %>%
group_by(patientId, laterality) %>%
mutate(followup_flag = cummin(date - dplyr::lag(date, default = first(date)) <= 365)) %>%
filter(followup_flag == 1) %>%
summarise(followUpYears = as.numeric(
difftime(last(date), first(date), units = "days") / 365) %>%
ceiling()
)
#> `summarise()` has grouped output by 'patientId'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: patientId [2]
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 A L 0
#> 2 B L 2
#> 3 B R 2
Data used:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
Created on 2023-02-08 by the reprex package (v2.0.1)
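The disagreement over patient B's left eye comes down to day arithmetic, which can be checked directly. The sketch below assumes the question's definition (days 0-365 are year 1, days 366-730 are year 2); the names dates, days_since_baseline and year_band are only illustrative.

```r
# Image dates for patient B, left eye, taken from the dummy data
dates <- as.Date(c("2000-09-08", "2001-09-07", "2001-09-10"))

# Days elapsed since the baseline (first) image
days_since_baseline <- as.numeric(dates - min(dates))

# Year band per the question: days 0-365 -> year 1, days 366-730 -> year 2
year_band <- pmax(ceiling(days_since_baseline / 365), 1)

# Number of distinct imaging dates in each year band
table(year_band)
```

Under that definition the last date lands in year 2 on its own, so year 2 has only one imaging date, which is why the question expects 1 for this eye even though 367 days have elapsed.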

How to use group_by without ordering alphabetically?

I'm trying to visualize some bird data, however after grouping by month, the resulting output is out of order from the original data. It is in order for December, January, February, and March in the original, but after manipulating it results in December, February, January, March.
Any ideas how I can fix this or sort the rows?
This is the code:
BirdDataTimeClean <- BirdDataTimes %>%
group_by(Date) %>%
summarise(Gulls=sum(Gulls), Terns=sum(Terns), Sandpipers=sum(Sandpipers),
Plovers=sum(Plovers), Pelicans=sum(Pelicans), Oystercatchers=sum(Oystercatchers),
Egrets=sum(Egrets), PeregrineFalcon=sum(Peregrine_Falcon), BlackPhoebe=sum(Black_Phoebe),
Raven=sum(Common_Raven))
BirdDataTimeClean2 <- BirdDataTimeClean %>%
pivot_longer(!Date, names_to = "Species", values_to = "Count")
You haven't shared any workable data, but I run into this often when reading from CSV, because all dates and data come in as character.
As suggested, convert the date column to the Date class using the lubridate package or base as.Date(); then arrange() in dplyr, or even group_by, will order it correctly.
Example with toy data:
library(data.table)
library(dplyr)
library(lubridate)
birds <- data.table(dates = c("2020-Feb-20","2020-Jan-20","2020-Dec-20","2020-Apr-20"),
species = c('Gulls','Turns','Gulls','Sandpiper'),
Counts = c(20,30,40,50))
str(birds) will show that dates is character (and I have deliberately not kept the rows in order).
Use lubridate to convert the dates:
birds$dates%>%lubridate::ymd() converts them to the Date type
birds$dates%>%ymd()%>%str()
Date[1:4], format: "2020-02-20" "2020-01-20" "2020-12-20" "2020-04-20"
Save it with birds$dates <- ymd(birds$dates), or do it in your pipeline as follows.
Now simply do the dplyr analysis:
birds%>%group_by(Months= ymd(dates))%>%
summarise(N=n()
,Species_Count = sum(Counts)
)%>%arrange(Months)
will give
# A tibble: 4 x 3
Months N Species_Count
<date> <int> <dbl>
1 2020-01-20 1 30
2 2020-02-20 1 20
3 2020-04-20 1 50
4 2020-12-20 1 40
However, if you want month names such as Apr or Jan instead of numbers and convert the dates back to text with format(), the values become character again and sort alphabetically. I would suggest keeping your data as Date and formatting it only when presenting output to others, either with format() or, if you use DT or another table package, through its output formatting options. That way your original data stays intact and users see what they want.
This will make it character:
birds%>%group_by(Months= as.character(dates))%>%
summarise(N=n()
,Species_Count = sum(Counts)
)%>%arrange(Months)
# A tibble: 4 x 3
Months N Species_Count
<chr> <int> <dbl>
1 2020-Apr-20 1 50
2 2020-Dec-20 1 40
3 2020-Feb-20 1 20
4 2020-Jan-20 1 30
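The "sort on Date, format only for display" advice above can be sketched like this. It is a minimal example that assumes ISO-formatted date strings (so base as.Date() parses them without a format argument); the column name label is hypothetical, and the month abbreviations produced by "%b" depend on the session locale.

```r
library(dplyr)

# Toy data with dates stored as character, deliberately out of order
birds <- data.frame(
  dates = c("2020-02-20", "2020-01-20", "2020-12-20", "2020-04-20"),
  Counts = c(20, 30, 40, 50),
  stringsAsFactors = FALSE
)

monthly <- birds %>%
  mutate(dates = as.Date(dates)) %>%        # real Date, so ordering is chronological
  group_by(dates) %>%
  summarise(Species_Count = sum(Counts)) %>%
  arrange(dates) %>%
  mutate(label = format(dates, "%b"))       # text label added last, for display only
```

Because the rows are ordered before the label is created, the labels follow calendar order rather than alphabetical order.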

Calculate "age at first record" for each ID

Background
I've got a dataset d on use of services for members (shown as ID) of an organization. Here's a toy example:
d <- data.frame(ID = c("a","a","b","b"),
dob = as.Date(c("2004-04-17","2004-04-17","2009-04-24","2009-04-24")),
service_date = as.Date(c("2018-01-01","2019-07-12","2014-12-23","2016-04-27")),stringsAsFactors=FALSE)
It looks like this:
Besides ID, it's got a date of birth dob and dates of service service_date for each time the member has used the organization's service.
Problem
I'd like to add a new column age_first_record that represents -- you guessed it -- a member's age at their first service_date. Ideally, this figure should be in years, not days. So if you're 13 years and one month old, the figure would be 13.08. The number should repeat within ID, too, so whatever that ID's age at their first service_date was, that's the number that goes in every row for that ID.
What I'm looking for is something that looks like this:
What I've Tried
So far, I'm messing with the min() function a bit, like so:
d <- d %>%
mutate(age_first_rec = min(d$service_date-d$dob))
But I can't seem to (a) get it work "within" each ID and (b) express it in years, not days. I'm not too familiar with working with datetime objects, so forgive the clumsiness here.
Any help is much appreciated!
We can use difftime to get the difference in days and divide by 365
library(dplyr)
d %>%
group_by(ID) %>%
mutate(age_first_record = as.numeric(difftime(min(service_date),
dob, units = 'days')/365)) %>%
ungroup()
-output
# A tibble: 4 x 4
ID dob service_date age_first_record
<chr> <date> <date> <dbl>
1 a 2004-04-17 2018-01-01 13.7
2 a 2004-04-17 2019-07-12 13.7
3 b 2009-04-24 2014-12-23 5.67
4 b 2009-04-24 2016-04-27 5.67
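Dividing by 365 ignores leap years. If that matters, one alternative (a different technique from the answer above, shown only as a sketch) is to let lubridate compute the elapsed years from an interval. The name d_age is hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy data from the question
d <- data.frame(
  ID = c("a", "a", "b", "b"),
  dob = as.Date(c("2004-04-17", "2004-04-17", "2009-04-24", "2009-04-24")),
  service_date = as.Date(c("2018-01-01", "2019-07-12", "2014-12-23", "2016-04-27")),
  stringsAsFactors = FALSE
)

d_age <- d %>%
  group_by(ID) %>%
  # One value per ID: years between date of birth and the earliest service date,
  # recycled to every row of that ID
  mutate(age_first_record = time_length(interval(first(dob), min(service_date)), "years")) %>%
  ungroup()
```

The result is a fractional number of years that repeats within each ID, as the question asks.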

Choosing forward propagating observation in R

I am working with a dataset like this:
There is the participant ID and then there is the date of the exam and the date of biopsy. There are multiple observations per participant.
The dataset looks like this:
df <- data.frame(ID = c("A", "A", "B", "B", "B", "C", "C", "C", "C"),
date_of_exam = c("2020-05-03", "2020-07-08", "2020-04-12", "2020-04-01", "2020-03-12", "2020-08-12", "2020-08-10", "2020-09-12", "2020-10-01"),
date_of_biopsy = c("2020-05-01", "2020-07-06", "2020-04-15", "2020-04-03", "2020-03-09", "2020-08-15", "2020-08-09", "2020-09-14", "2020-10-05"))
Whenever the date of exam is before the date of biopsy (date_of_exam minus date_of_biopsy < 0), I want to use the next higher date of the exam.
How can I create a forloop or else that checks the condition (date_of_exam minus date_of_biopsy <0) and if this is true it chooses the next higher value of the exam, checks this again for the condition (date_of_exam minus date_of_biopsy <0) and if it is not met, takes this value or else chooses the next higher value...?
I was thinking about creating multiple
ifelse conditions in dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(ID) %>%
mutate(exam_2nd_value = ifelse(test=(df$data_of_exam-df_date_of_biopsy<0)==TRUE, yes=df$date_of_exam[min(n(), 2)],no=df$date_of_exam[min(n(),1)]))%>%
ungroup()
And then some more if else, but I feel like this is not a good way and is probably not going to work.
Also, when I use the code mentioned above, I get this error:
Error: Problem with mutate() input new. x Input new can't be recycled to size 3
Can you tell me why I get this error and what I can do to solve my problem?
Thanks a lot,
Phil
If I am understanding the problem correctly, something like this should work:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date_of_exam = as.Date(date_of_exam),
date_of_biopsy = as.Date(date_of_biopsy)) %>%
arrange(date_of_exam) %>%
mutate(lead_exam_date = lead(date_of_exam),
exam_2nd_value = if_else(date_of_exam>date_of_biopsy,
date_of_exam,
lead_exam_date)) %>%
ungroup()
Explanation
group by ID
convert all strings to Date objects (which are stored internally as the number of days since 1970-01-01 and thus can be added, subtracted and compared like numbers)
arrange by date in each group (lowest first)
use lead to shift the date_of_exam column up by one for each group
do the comparison you want: if the exam date is greater than the biopsy date (i.e. the exam is later than the biopsy), use the current exam date, otherwise use the next exam date (the one created using lead)
This will give you:
ID date_of_exam date_of_biopsy lead_exam_date exam_2nd_value
<chr> <date> <date> <date> <date>
1 B 2020-03-12 2020-03-09 2020-04-01 2020-03-12
2 B 2020-04-01 2020-04-03 2020-04-12 2020-04-12
3 B 2020-04-12 2020-04-15 NA NA
4 A 2020-05-03 2020-05-01 2020-07-08 2020-05-03
5 A 2020-07-08 2020-07-06 NA 2020-07-08
6 C 2020-08-10 2020-08-09 2020-08-12 2020-08-10
7 C 2020-08-12 2020-08-15 2020-09-12 2020-09-12
8 C 2020-09-12 2020-09-14 2020-10-01 2020-10-01
9 C 2020-10-01 2020-10-05 NA NA
I'm not sure based on the way the original problem is framed what to do if there is no next exam date (hence the NAs). To deal with those, lead has a default argument which will be supplied when there is no next value.
As far as the errors you encountered, I couldn't reproduce the error you got but instead I got an error for trying to subtract two strings (the two dates in the mutate call). Yours might be the same thing just expressed a different way due to version differences or something.
Additionally, you shouldn't use base ifelse here. It is vectorized, but it strips attributes, so Date values come back as plain numbers; dplyr's if_else preserves the Date class and checks that both branches have compatible types (the similar names are confusing).
Finally, I don't think the calls to the original dataframe with df$ would have worked since those aren't aware of the grouping you imposed earlier in the chain.
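The difference between ifelse and if_else on dates, and the default argument of lead, can be seen in a small sketch. The vectors exam and biopsy are made-up stand-ins for one group's columns, and falling back to the last exam date is only one possible choice for the missing "next" value.

```r
library(dplyr)

# Two exam/biopsy pairs, both with the exam before the biopsy
exam   <- as.Date(c("2020-04-01", "2020-04-12"))
biopsy <- as.Date(c("2020-04-03", "2020-04-15"))

# Base ifelse() drops the Date class and returns plain numbers
base_result <- ifelse(exam > biopsy, exam, lead(exam))

# dplyr::if_else() keeps the Date class
dplyr_result <- if_else(exam > biopsy, exam, lead(exam))

class(base_result)
class(dplyr_result)

# Supplying a default to lead() avoids the trailing NA; here the last
# exam date is reused when there is no later exam (an assumption)
next_exam <- lead(exam, default = last(exam))
```

Whether reusing the last exam date is appropriate depends on how rows without a later exam should be treated, which the question does not specify.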

Consolidate multiple time intervals within a group unit (dplyr or data.table solution preferred)

I have a tricky little coding issue I was hoping someone might have a solution for.
I essentially have a very large dataset of stays (person ids, admits, discharges), >10 million.
library(dplyr)
library(lubridate)
dat <- read.csv(text="
personid, start, end
1, 2017-09-01, 2017-10-01
1, 2017-10-05, 2017-10-07
2, 2017-10-21, 2017-11-01
3, 2017-12-01, 2017-12-15
3, 2017-12-27, 2017-12-31") %>%
transmute(
personid,
start = ymd(start),
end = ymd(end))
Each stay is non-overlapping, but we have a logic rule where if two stays are within 10 days of each other we want to consolidate them into one (i.e. keep the earlier admit and the later discharge), so that the final dataset contains unique stays that are more than 10 days apart.
e.g.:
1, 2017-09-01, 2017-10-07
2, 2017-10-21, 2017-11-01
3, 2017-12-01, 2017-12-15
3, 2017-12-27, 2017-12-31
There are a few posts about overlapping intervals, but this is a little different: Consolidating set of time intervals, chains of intervals to single interval
I think that approach is overly complex compared to what I need.
I was also hoping for a dplyr or data.table solution, though the group_by statement takes quite a while.
One tidyverse possibility. We first group_by personid and create a new variable (diffe) holding the difference in days between the current start and the previous (lag) end. Within each personid, a new group starts whenever diffe exceeds 10, so stays within 10 days of each other share a group; we then take the first start and the last end from each group.
library(tidyverse)
dat %>%
group_by(personid) %>%
mutate(diffe = as.numeric(start - lag(end))) %>%
replace_na(list(diffe = 0)) %>%
group_by(personid, group = cumsum(diffe > 10)) %>%
summarise(start = first(start),
end = last(end)) %>%
select(-group)
# personid start end
# <int> <date> <date>
#1 1 2017-09-01 2017-10-07
#2 2 2017-10-21 2017-11-01
#3 3 2017-12-01 2017-12-15
#4 3 2017-12-27 2017-12-31
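Since the question also asks for data.table, the same gap-and-cumsum idea can be sketched there. This is an untested-at-scale sketch, not a benchmarked solution for 10 million rows; the column names gap and grp and the object names dt and res are hypothetical.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  personid = c(1L, 1L, 2L, 3L, 3L),
  start = as.Date(c("2017-09-01", "2017-10-05", "2017-10-21", "2017-12-01", "2017-12-27")),
  end   = as.Date(c("2017-10-01", "2017-10-07", "2017-11-01", "2017-12-15", "2017-12-31"))
)

# Stays must be in chronological order within each person
setorder(dt, personid, start)

# Days between this admit and the previous discharge (NA for a person's first stay)
dt[, gap := as.numeric(start - shift(end)), by = personid]

# Start a new group whenever the gap exceeds 10 days
dt[, grp := cumsum(!is.na(gap) & gap > 10), by = personid]

# Keep the earliest admit and the latest discharge of each group
res <- dt[, .(start = start[1L], end = end[.N]), by = .(personid, grp)]
res[, grp := NULL]
```

The logic mirrors the tidyverse answer: a running count of "large gaps" labels each consolidated stay, and one row per label is kept.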
