I am working on a data set which is similar to
data <-tribble(
~id, ~ dates, ~days_prior,
1,20190101, NA,
1,NA, 15,
1,NA, 20,
2, 20190103, NA,
2,NA, 3,
2,NA, 4)
I have the first date for each ID and I am trying to calculate the next date by adding days_prior to the previous date. I am using the lag function to refer to the previous date.
df <- data %>% mutate(dates = as.Date(ymd(dates)), days_prior = as.integer(days_prior))
df<-df %>% mutate(dates =
as.Date(ifelse(is.na(days_prior),dates,days_prior+lag(dates)),
origin="1970-01-01"))
This works, but only for the row immediately after the known date, as you can see in the attached data.
What am I doing wrong? I would like all the dates to be calculated by mutate(). What different approach should I take to calculate this?
I don't really see how lag would help here. Unless I misunderstood, here is an option using tidyr::fill
data %>%
group_by(id) %>%
mutate(dates = as.Date(ymd(dates))) %>%
fill(dates) %>%
mutate(dates = dates + if_else(is.na(days_prior), 0L, as.integer(days_prior))) %>%
ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 NA
#2 1 2019-01-16 15
#3 1 2019-01-21 20
#4 2 2019-01-03 NA
#5 2 2019-01-06 3
#6 2 2019-01-07 4
Or a slight variation, replacing the NA entries in days_prior with 0
data %>%
group_by(id) %>%
mutate(
dates = as.Date(ymd(dates)),
days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
fill(dates) %>%
mutate(dates = dates + as.integer(days_prior)) %>%
ungroup()
Update
In response to your clarifications in the comments, here is what you can do
data %>%
group_by(id) %>%
mutate(
dates = as.Date(ymd(dates)),
days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
fill(dates) %>%
mutate(dates = dates + cumsum(days_prior)) %>%
ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 0
#2 1 2019-01-16 15
#3 1 2019-02-05 20
#4 2 2019-01-03 0
#5 2 2019-01-06 3
#6 2 2019-01-10 4
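For readers who want to see the same fill-then-accumulate idea without tidyr, here is a minimal base R sketch. It assumes the question's column names and that the first row of each id holds the known date. The names `start` and `offset` are illustrative only and do not come from the answer above.

```r
# Toy data mirroring the question
data <- data.frame(
  id         = c(1, 1, 1, 2, 2, 2),
  dates      = c(20190101, NA, NA, 20190103, NA, NA),
  days_prior = c(NA, 15, 20, NA, 3, 4)
)

# Parse the yyyymmdd numbers; NA stays NA
d <- as.Date(as.character(data$dates), format = "%Y%m%d")

# Carry the first (known) date of each id to every row of that id
start <- ave(as.numeric(d), data$id, FUN = function(x) x[1])

# Running total of the offsets within each id, treating NA as 0
offset <- ave(ifelse(is.na(data$days_prior), 0, data$days_prior),
              data$id, FUN = cumsum)

data$dates <- as.Date(start + offset, origin = "1970-01-01")
data
```

The running total is what makes each row build on the previous computed date rather than on the original one.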
You can use the na.locf from the zoo package to fill in the last observed date before adding the prior days.
library("tidyverse")
library("zoo")
data %>%
# Fill in NA dates with the previous non-NA date
# The `locf` stands for "last observation carried forward"
# Fill in NA days_prior with 0
mutate(dates = zoo::na.locf(dates),
days_prior = replace_na(days_prior, 0)) %>%
mutate(dates = lubridate::ymd(dates) + days_prior)
This solution makes two assumptions:
The rows are sorted by id. You can get around this assumption with a group_by(id) followed by an ungroup() statement, as shown in the solution by Maurits Evers.
For each id, the row with the observed date is first in the group. This needs to be true in any case with either na.locf and fill because both functions fill in NAs using the previous non-NA entry.
If you don't want to make any assumptions about the ordering, you can sort the rows at the start with data %>% arrange(id, dates).
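A small base R sketch of that sorting step, assuming the question's column names; `first_known` is a hypothetical helper name. It relies on `order()` placing NA values last by default, so the known date ends up first within each id.

```r
# Shuffled toy data: the known date is not the first row of its id
data <- data.frame(
  id    = c(2, 1, 1, 2, 1, 2),
  dates = c(NA, NA, 20190101, 20190103, NA, NA)
)

# Sort by id, then by date; NA dates are placed after the known date
sorted <- data[order(data$id, data$dates), ]

# Check that every id now starts with a non-missing date
first_known <- tapply(sorted$dates, sorted$id, function(x) !is.na(x[1]))
all(first_known)
```

Note that sorting only fixes the position of the known date; the relative order of the rows with missing dates is kept as it was in the input.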
Related
I have data structured like below, but with many more columns.
I need to filter the data to include only instances where a person has a date of X and X+1.
In this example only person B and C should remain, and only the rows with directly adjacent dates. So rows 2,3,5,6 should be the only remaining ones.
Once it is filtered I need to count how many times this occurred as well as do calculations on the other values, likely summing up the Values column for the X+1 date.
Person <- c("A","B","B","B","C","C","D","D")
Date <- c("2021-01-01","2021-01-01","2021-01-02","2021-01-04","2021-01-09","2021-01-10","2021-01-26","2021-01-29")
Values <- c(10,15,6,48,71,3,1,3)
df <- data.frame(Person, Date, Values)
df
How would I accomplish this?
end_points <- df %>%
mutate(Date = as.Date(Date)) %>%
group_by(Person) %>%
filter(Date - lag(Date) == 1 | lead(Date) - Date == 1) %>%
ungroup()
Result
end_points
# A tibble: 4 x 3
Person Date Values
<chr> <date> <dbl>
1 B 2021-01-01 15
2 B 2021-01-02 6
3 C 2021-01-09 71
4 C 2021-01-10 3
2nd part:
end_points %>%
group_by(Person) %>%
slice_max(Date) %>%
ungroup() %>%
summarize(total = sum(Values))
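The count and the sum of the X+1 values can also be read off directly from the gap to the previous row. This is a base R sketch using the question's data, not the code from the answer above; `gap` and `is_next_day` are illustrative names.

```r
df <- data.frame(
  Person = c("A", "B", "B", "B", "C", "C", "D", "D"),
  Date   = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-04",
                     "2021-01-09", "2021-01-10", "2021-01-26", "2021-01-29")),
  Values = c(10, 15, 6, 48, 71, 3, 1, 3)
)
df <- df[order(df$Person, df$Date), ]

# Days since the previous row of the same person (NA for the first row)
gap <- ave(as.numeric(df$Date), df$Person, FUN = function(x) c(NA, diff(x)))

# TRUE on the X+1 row of every adjacent pair
is_next_day <- !is.na(gap) & gap == 1

n_pairs <- sum(is_next_day)             # number of X / X+1 occurrences
total   <- sum(df$Values[is_next_day])  # sum of Values on the X+1 dates
```

With this data there are two adjacent pairs (persons B and C), and the X+1 values are 6 and 3.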
I'm trying to calculate the number of days that a patient spent during a given state in R.
An image of example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking that if I can create a date column in column 4 holding the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE), but the problem is that it groups the second set of 1's with the first set of 1's, and likewise for the 2's, which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between OBS_DTM and the minimum value in the group.
If your data is called data :
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
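To make the run-length idea concrete, here is a base R sketch that builds the same kind of run id with `rle()` instead of data.table::rleid. The timestamps and states are made up for illustration, and a single patient is assumed; with several patients the run id would be combined with MRN as in the answer above.

```r
# Hypothetical observations for one patient; state 1 occurs in two separate runs
obs   <- as.POSIXct(c("2020-07-27 08:44", "2020-07-28 08:44",
                      "2020-08-08 20:12", "2020-08-09 08:12",
                      "2020-08-15 13:32", "2020-08-16 13:32"), tz = "UTC")
state <- c(1, 1, 2, 2, 1, 1)

# One id per run of consecutive identical states
r      <- rle(state)
run_id <- rep(seq_along(r$lengths), r$lengths)

# First timestamp of each run, repeated for every row of that run
run_start <- ave(as.numeric(obs), run_id, FUN = min)

# Days elapsed since the run started
days_in_state <- (as.numeric(obs) - run_start) / 86400
```

Because the second run of state 1 gets its own id, its elapsed time restarts from zero instead of continuing from the first run.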
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
I have a data frame like the one below, and I want to get the sum of VALUE for each rolling 4-month window.
Edit: In the output I have "2018-12". But it's not shown in the input. It's a typo, my actual data contain "2018-12".
I prefer to use dplyr:
group <- c("red","green","red","red","red","green","green","green","red","green","green","green")
Month <- c("2019-01","2019-02","2019-03","2019-03","2019-05","2019-07","2019-07","2019-08","2019-09","2019-10","2019-10","2019-10")
VALUE <- c(10,20,30,40,50,60,70,80,90,100,110,120)
d_f <- data.frame(group,Month,VALUE)
d_f %>%
group_by(group) %>%
summarise(value = sum(VALUE))
Can anyone please help me with how to handle the 4 rolling month? Thanks a lot for your valuable time.
Using lubridate you can use floor_date and group your dates by 4 month intervals.
library(tidyverse)
library(lubridate)
d_f %>%
mutate(date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")) %>%
arrange(date) %>%
group_by(group, startdategroup = floor_date(date, "4 months")) %>%
summarise(value = sum(VALUE)) %>%
mutate(enddategroup = startdategroup %m+% months(4) - 1)
Output
# A tibble: 6 x 4
# Groups: group [2]
group startdategroup value enddategroup
<fct> <date> <dbl> <date>
1 green 2019-01-01 20 2019-04-30
2 green 2019-05-01 210 2019-08-31
3 green 2019-09-01 330 2019-12-31
4 red 2019-01-01 80 2019-04-30
5 red 2019-05-01 50 2019-08-31
6 red 2019-09-01 90 2019-12-31
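The output above shows bins starting in January, May and September. As a rough illustration of that binning arithmetic, here is a base R sketch on the question's data that maps each month to the first month of its 4-month block. It is not how lubridate implements floor_date; `bin_month` and `bin_start` are illustrative names.

```r
group <- c("red", "green", "red", "red", "red", "green",
           "green", "green", "red", "green", "green", "green")
Month <- c("2019-01", "2019-02", "2019-03", "2019-03", "2019-05", "2019-07",
           "2019-07", "2019-08", "2019-09", "2019-10", "2019-10", "2019-10")
VALUE <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120)

# Month number 1..12
m <- as.integer(substr(Month, 6, 7))

# First month of the 4-month block: 1-4 -> 1, 5-8 -> 5, 9-12 -> 9
bin_month <- ((m - 1L) %/% 4L) * 4L + 1L
bin_start <- sprintf("%s-%02d-01", substr(Month, 1, 4), bin_month)

# Sum VALUE per group and block (rows: group, columns: block start)
totals <- tapply(VALUE, list(group, bin_start), sum)
totals
```

These are fixed calendar blocks rather than windows that slide month by month, which is worth keeping in mind when reading "rolling" in the question.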
Edit: To allow for an "overlap month" (months on the edge of two sequential date intervals), I might take a different approach.
First, I might create a sequence of start and end dates for the intervals (based on minimum and maximum dates in your data frame). The sequence would have date intervals every 4 months.
Then, I would do a fuzzy_left_join (using >= and <= logic) and merge this new data frame with yours. Then a row of data for a single month could be counted twice (once for each of two different intervals).
library(fuzzyjoin)
d_f$date = as.Date(paste0(d_f$Month, '-01'), format = "%Y-%m-%d")
d_f2 <- data.frame(date_start = seq.Date(min(d_f$date), max(d_f$date), "4 months"))
d_f2$date_end = d_f2$date_start %m+% months(4)
d_f %>%
fuzzy_left_join(d_f2,
by = c("date" = "date_start", "date" = "date_end"),
match_fun = list(`>=`, `<=`)) %>%
group_by(group, date_start, date_end) %>%
summarise(value = sum(VALUE))
Output
# A tibble: 6 x 4
# Groups: group, date_start [6]
group date_start date_end value
<fct> <date> <date> <dbl>
1 green 2019-01-01 2019-05-01 20
2 green 2019-05-01 2019-09-01 210
3 green 2019-09-01 2020-01-01 330
4 red 2019-01-01 2019-05-01 130
5 red 2019-05-01 2019-09-01 140
6 red 2019-09-01 2020-01-01 90
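The interval matching can be sketched in base R as a cross join followed by a range filter. This is only an illustration of the matching logic on a made-up three-row example, not a replacement for fuzzyjoin; `obs`, `intervals` and `matched` are hypothetical names.

```r
# Hypothetical observations; 2019-05-01 sits on the edge of two intervals
obs <- data.frame(
  date  = as.Date(c("2019-01-01", "2019-05-01", "2019-07-01")),
  VALUE = c(10, 50, 60)
)
intervals <- data.frame(
  date_start = as.Date(c("2019-01-01", "2019-05-01")),
  date_end   = as.Date(c("2019-05-01", "2019-09-01"))
)

# by = NULL gives the Cartesian product of the two data frames
pairs <- merge(obs, intervals, by = NULL)

# Keep only pairs where the date falls inside the interval (both ends inclusive)
matched <- pairs[pairs$date >= pairs$date_start &
                 pairs$date <= pairs$date_end, ]

# Sum per interval; the edge month contributes to both intervals
totals <- tapply(matched$VALUE, format(matched$date_start), sum)
totals
```

The row dated 2019-05-01 appears twice in `matched`, which is why its value is counted in both totals.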
One approach is to use the lag/lead functions in dplyr. Something like:
df2 = d_f %>%
group_by(group) %>%
# assumes one row per group and month, with no missing months
mutate(prev_value = lag(VALUE, 1, order_by = Month),
prev_value2 = lag(VALUE, 2, order_by = Month),
prev_value3 = lag(VALUE, 3, order_by = Month)) %>%
mutate(avg = (VALUE + prev_value + prev_value2 + prev_value3) / 4)
And then filter away the intervals you are not interested in.
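If a sum over a sliding 4-month window is what is needed, one base R option is a one-sided moving sum. The sketch below assumes the data has already been reduced to one total per month with missing months filled in as 0; the vector holds the monthly totals of the "red" group for January to May 2019 from the question. `stats::filter` is called with its namespace because dplyr masks `filter`.

```r
# Monthly totals for "red", Jan..May 2019 (Feb and Apr have no rows, so 0)
monthly_total <- c(10, 0, 70, 0, 50)

# Sum of the current month and the three months before it
roll4 <- as.numeric(stats::filter(monthly_total, rep(1, 4), sides = 1))
roll4
```

The first three entries are NA because a full 4-month window is not yet available for them.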
I have a dataset that I'd now like to split at 12:00pm (midday) into two, i.e. if variable goes from 08:00-13:00 it becomes 08:00-12:00 and 12:00-13:00 across two rows. The variable duration and cumulative sum would need to be changed accordingly, but the other variables should be as in the original (unchanged).
This should be applicable across different id variables.
id = unchanged from row 1, just repeated
start = changed in both rows
end = changed in both rows
day = unchanged from row 1, just repeated
duration = changed in both rows
cumulative time = changed in both rows
ORIGINAL DATAFILE
#Current dataframe
id<-c("m1","m1")
x<-c("2020-01-03 10:00:00","2020-01-03 19:20:00")
start<-strptime(x,"%Y-%m-%d %H:%M:%S")
y<-c("2020-01-03 16:00:00","2020-01-03 20:50:00")
end<-strptime(y,"%Y-%m-%d %H:%M:%S")
day<-c(1,1)
mydf<-data.frame(id,start,end,day)
# calculate duration and time
mydf$duration<-as.numeric(difftime(mydf$end,mydf$start,units = "hours"))
mydf$time<-c(cumsum(mydf$duration))
REQUIRED DATAFILE
#Required dataframe
id2<-c("m1","m1","m1")
x2<-c("2020-01-03 10:00:00","2020-01-03 12:00:00","2020-01-03 19:20:00")
start2<-strptime(x2,"%Y-%m-%d %H:%M:%S")
y2<-c("2020-01-03 12:00:00","2020-01-03 16:00:00","2020-01-03 20:50:00")
end2<-strptime(y2,"%Y-%m-%d %H:%M:%S")
day2<-c(1,1,1)
mydf2<-data.frame(id2,start2,end2,day2)
# calculate duration and time
mydf2$duration<-c(2,4,1.5)
mydf2$time<-c(2,6,7.5)
Good question. So, each line implicitly contains either one or two intervals, so you should be able to just define those interval(s) on each line and then pivot to long, but you can't pivot with interval values (yet?). So, here's my approach, which computes up to two shift start times for each line, and then infers the shift end from the start of the next shift after pivoting. Comments inline.
library(lubridate, warn.conflicts = FALSE)
library(tidyverse)
library(magrittr, warn.conflicts = FALSE)
library(hablar, warn.conflicts = FALSE)
(mydf <- tibble(
id = "m1",
start = as_datetime(c("2020-01-03 10:00:00", "2020-01-03 19:20:00")),
end = as_datetime(c("2020-01-03 16:00:00", "2020-01-03 20:50:00")),
day = 1
))
#> # A tibble: 2 x 4
#> id start end day
#> <chr> <dttm> <dttm> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 16:00:00 1
#> 2 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1
(mydf2 <-
mydf %>%
# Assume the relevant noontime cutoff is on the same day as the start
mutate(midday =
start %>% as_date() %>%
add(12 %>% hours()) %>%
fit_to_timeline() %>%
# No relevant midday if the shift doesn't include noon
na_if(not(. %within% interval(start, end)))) %>%
# Make an original row ID since there doesn't seem to be one, and we will need
# to build intervals within the data stemming from each original row
rownames_to_column("orig_shift") %>%
pivot_longer(cols = c(start, midday, end),
# The timestamps we have here will be treated as start times
values_to = "start",
# Drop rows that would exist due to irrelevant middays
values_drop_na = TRUE) %>%
select(-name) %>%
# Infer shift end times as the start of the next shift, within lines defined
# by the original shifts
group_by(orig_shift) %>%
arrange(start) %>%
mutate(end = lead(start)) %>%
ungroup() %>%
# Drop lines that represent the end of the last shift and not a full one
drop_na() %>%
# Compute those durations and times (should times really be globally
# cumulative? Also, your specified mydf2 seems to have an incorrect first time
# value)
mutate(duration = start %--% end %>% as.numeric("hours"),
time = cumsum(duration)) %>%
select(id, start, end, day, duration, time))
#> # A tibble: 3 x 6
#> id start end day duration time
#> <chr> <dttm> <dttm> <dbl> <dbl> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2 2
#> 2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4 6
#> 3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
Created on 2019-10-23 by the reprex package (v0.3.0)
Here is my solution for the more general case where you have many observations with different dates. The logic is as follows.
First, I create a data frame of 12:00pm (midday) split points.
Next, I identify the rows that should be split by joining this data frame to the original one, and save them in a separate data frame.
Next, I duplicate those rows to create split_rows.
Finally, I delete the split rows from the original dataset and join in the corrected, doubled rows.
library(dplyr)
split_time_data =
tibble(split_time = as.POSIXct(seq(0, 365*60*60*24, 60*60*24),
origin="2020-01-01 17:00:00")) %>%
mutate(key = TRUE)# I use 17:00 to make it 12:00 EST, adjust for your purposes
data_to_split =
mydf %>%
mutate(key = TRUE) %>%
left_join(split_time_data) %>%
filter(between(split_time, start, end)) %>%
select(-key)
library(lubridate)
split_rows =
data_to_split %>%
rbind(data_to_split) %>%
arrange(start) %>%
group_by(start) %>%
mutate(row_number = row_number() ) %>%
ungroup() %>%
mutate(start = if_else(row_number == 1, start, split_time ),
end = if_else(row_number == 1, split_time, end )) %>%
select(-row_number, -split_time) %>%
mutate(duration = as.numeric(difftime(end, start, units = "hours"))) # difftime keeps the minutes that hour() would drop
mydf %>%
anti_join(data_to_split) %>%
full_join(split_rows) %>%
arrange(start) %>%
mutate(time = cumsum(duration) )
The output
id start end day duration time
1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2.0 2.0
2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4.0 6.0
3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
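For comparison, here is a compact base R sketch of the same split. It assumes each interval starts and ends on the same calendar day (so it contains at most one midday) and uses UTC throughout to avoid time zone surprises. The names `noon`, `needs_split`, `first_half` and `second_half` are illustrative.

```r
mydf <- data.frame(
  id    = c("m1", "m1"),
  start = as.POSIXct(c("2020-01-03 10:00:00", "2020-01-03 19:20:00"), tz = "UTC"),
  end   = as.POSIXct(c("2020-01-03 16:00:00", "2020-01-03 20:50:00"), tz = "UTC"),
  day   = c(1, 1)
)

# Midday on the start date of each row
noon <- as.POSIXct(paste(format(mydf$start, "%Y-%m-%d"), "12:00:00"), tz = "UTC")

# Only rows that strictly contain midday are split
needs_split <- mydf$start < noon & noon < mydf$end

first_half  <- mydf[needs_split, ]
first_half$end <- noon[needs_split]
second_half <- mydf[needs_split, ]
second_half$start <- noon[needs_split]

# Unsplit rows plus both halves of the split rows, back in time order
out <- rbind(mydf[!needs_split, ], first_half, second_half)
out <- out[order(out$id, out$start), ]
rownames(out) <- NULL

# Recompute duration (hours) and the cumulative time within each id
out$duration <- as.numeric(difftime(out$end, out$start, units = "hours"))
out$time     <- ave(out$duration, out$id, FUN = cumsum)
out
```

Here the cumulative time restarts for each id, which is a design choice; a single global `cumsum(out$duration)` would be the alternative if the total should run across ids.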
I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a maximum of 3 different types of medication administered within 365 days of each other. I am not sure whether it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
One option is to convert 'date' to Date class, group by 'ID', and take the absolute difference between 'date' and its lag. Checking whether that difference is greater than 365 and applying cumsum creates a grouping index; summarise then counts the distinct 'meds' within each group, and a second summarise takes the maximum per 'ID'.
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), .add = TRUE) %>% # .add replaces the deprecated add argument in dplyr >= 1.0.0
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
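Another way to read the question is as a sliding window: for every administration, count the distinct meds given in the following 365 days, then take the maximum per ID. The base R sketch below follows that reading, using the dayoflife column from the question's sample. It is a different technique from the two answers above, and `max_meds_in_window` is a hypothetical helper name.

```r
df <- data.frame(
  ID        = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  dayoflife = c(16361, 16361, 16407, 20015, 20016, 20016,
                24133, 24074, 8458, 8485),
  meds      = c("lasiks", "vigab", "lacos", "pheno", "vigab", "lasiks",
                "pheno", "vigab", "pheno", "pheno")
)

# For each row, count distinct meds in [day, day + width]; return the largest count
max_meds_in_window <- function(days, meds, width = 365) {
  counts <- vapply(days, function(d) {
    in_window <- days >= d & days <= d + width
    length(unique(meds[in_window]))
  }, integer(1))
  max(counts)
}

# Apply per ID
res <- sapply(split(df, df$ID),
              function(g) max_meds_in_window(g$dayoflife, g$meds))
res
```

On this sample the sketch gives 3, 2 and 1 for IDs 1, 2 and 3, matching the expected outcome stated in the question. The comparison is quadratic in the number of rows per ID, which should be fine for typical per-patient record counts.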