Background
I've got an R dataframe d:
d <- data.frame(ID = c("a","a","b","b","c","c","c"),
                event = c(1,1,0,0,1,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                stringsAsFactors = FALSE)
As you can see, it's got 3 distinct people in the ID column, and they've either had or not had an event, along with a date their event status was recorded (event_date).
The Problem
I'd like to create a new variable / column, event_within_interval, which assigns 1 to all rows of a given ID if that ID has 2 or more event=1 records within 180 days of their first event=1.
Let me explain further: both ID=a and ID=c have 2 or more events each, but only ID=c has their second event within 180 days of their first (here, 2013-04-07 - 2013-03-14 = 24 days).
The problem is that I'm not sure how to tell R this idea of "if the second happens within 180 days of the first event=1".
What I'd like
Here's what I'm looking for:
want <- data.frame(ID = c("a","a","b","b","c","c","c"),
                   event = c(1,1,0,0,1,1,1),
                   event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                   event_within_interval = c(0,0,0,0,1,1,1),
                   stringsAsFactors = FALSE)
What I've tried
I've only got the beginnings of an attempt thus far:
d <- d %>%
  mutate(event_within_interval = ID %in% if_else(d$event == 1, 1, 0))
But this doesn't give me what I'd like, as you can tell if you run the code.
I've set the thing up as an if_else, but I'm not sure where to go from here.
UPDATE: I've edited both reproducible examples (what I've got and what I want) to emphasize the fact that the desired date interval needs to be between the first event and the second event, not the first event and the last event. (A couple of users submitted examples using last, which worked for the previous iteration of the reproducible example but wouldn't have worked on the real dataset.)
What about using the lubridate and data.table packages?
library(data.table)
library(lubridate)
d <- data.frame(ID = c("a","a","b","b","c","c"),
                event = c(1,1,0,0,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07")),
                stringsAsFactors = FALSE)
d <- data.table(d)
d <- d[, event_within_interval := 0]
timeInterval <- interval(start = "2013-03-14", end = "2013-04-07")
d <- d[event == 1 & event_date %within% timeInterval, event_within_interval := 1]
d
# ID event event_date event_within_interval
# 1: a 1 2011-01-01 0
# 2: a 1 2012-08-21 0
# 3: b 0 2011-12-23 0
# 4: b 0 2011-12-31 0
# 5: c 1 2013-03-14 1
# 6: c 1 2013-04-07 1
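The interval above is hardcoded to ID c's two dates, so it only flags events we already know about. A sketch (my addition, not part of the original answer) that instead derives a 180-day window per ID from each ID's first event=1:
d[, event_within_interval := {
    ed <- sort(event_date[event == 1])  # event=1 dates, earliest first
    as.integer(length(ed) >= 2 &&
                 ed[2] %within% interval(start = ed[1], end = ed[1] + days(180)))
  }, by = ID]
d
Since d is already a data.table here, this overwrites the flag per group; IDs with fewer than two event=1 rows fall through to 0.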
This is good fun.
Scenario 1
My approach would be to:
group events by ID
check the span of days between the current date and the initial date
check if the sum of events is greater than or equal to two: sum(event) >= 2
return one for the event only if both conditions are met
For readability, I've returned values of conditions in the data as test_* variables.
d %>%
  group_by(ID) %>%
  mutate(test_interval = event_date - min(event_date) < 180,
         test_sum_events = sum(event) >= 2,
         event_within_interval = if_else(test_interval & test_sum_events,
                                         1, 0)) %>%
  ungroup()
Scenario 2
In this scenario, the data is sorted by event_date within ID, and the difference between the first event and the second event has to be at most 180 days. The rest of the events are ignored.
d %>%
  group_by(ID) %>%
  arrange(event_date) %>%
  mutate(
    # Check the difference between the first event, min(event_date), and
    # the second event, event_date[2]
    test_interval_first_two = event_date[2] - min(event_date) <= 180,
    test_sum_events = sum(event) >= 2,
    event_within_interval = if_else(
      test_interval_first_two & test_sum_events, 1, 0)
  ) %>%
  ungroup()
You can first group_by the ID column so that the day difference is calculated within the same ID. Then, in the if_else condition, require both sum(event) > 1 AND a day difference <= 180.
Here I assume there are only two "events" (i.e. rows) per ID; a sketch for the general case follows the output below.
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(event_within_interval = if_else(sum(event) > 1 &
                                           last(event_date) - first(event_date) <= 180,
                                         1L, 0L))
# A tibble: 6 x 4
# Groups: ID [3]
ID event event_date event_within_interval
<chr> <dbl> <date> <int>
1 a 1 2011-01-01 0
2 a 1 2012-08-21 0
3 b 0 2011-12-23 0
4 b 0 2011-12-31 0
5 c 1 2013-03-14 1
6 c 1 2013-04-07 1
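If an ID can have more than two rows, a sketch that honors the update in the question (compare only the first two event=1 dates and ignore later ones); this is my addition, not part of the answer above:
d %>%
  group_by(ID) %>%
  mutate(event_within_interval = {
    ed <- sort(event_date[event == 1])  # event=1 dates, earliest first
    if_else(length(ed) >= 2 && ed[2] - ed[1] <= 180, 1, 0)
  }) %>%
  ungroup()
The && short-circuits, so IDs with fewer than two event=1 rows get 0 without indexing past the end of ed.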
Here is how we could do it: add a helper column, interval, to show the gap between the first and last event_date, and then use an ifelse statement.
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(interval = last(event_date) - first(event_date),
         event_within_interval = ifelse(event == 1 & interval < 180, 1, 0))
ID event event_date interval event_within_interval
<chr> <dbl> <date> <drtn> <dbl>
1 a 1 2011-01-01 598 days 0
2 a 1 2012-08-21 598 days 0
3 b 0 2011-12-23 8 days 0
4 b 0 2011-12-31 8 days 0
5 c 1 2013-03-14 24 days 1
6 c 1 2013-04-07 24 days 1
I have clinical data that records a patient at three time points, with a disease outcome indicated by a binary variable. It looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
Data <- data.frame(patientid = patientid, time = time, outcome = outcome)
Data
I want to create an onset variable, so that for each patient it codes a 1 at the time at which the patient first got the disease, and a 0 for any time period before or after (even if that patient still had the disease). For the example data it should then look like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
outcome_onset <- c(0,1,0,0,0,1,1,0,0)
Data <- data.frame(patientid = patientid, time = time, outcome = outcome,
                   outcome_onset = outcome_onset)
Data
I would therefore like some help automating the creation of the outcome_onset variable.
Here is an option with cumsum to create a logical vector after grouping by 'patientid':
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(outcome_onset = +(cumsum(outcome) == 1))
Or use match and %in%
Data %>%
  group_by(patientid) %>%
  mutate(outcome_onset = +(row_number() %in% match(1, outcome)))
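For completeness, a base R sketch of the same idea with ave (my addition); the extra x == 1 guard keeps the flag strictly on the first 1 even if the running total lingers at 1 after the outcome drops back to 0:
# cumsum(x) == 1 is TRUE from the first 1 until a second 1 appears;
# x == 1 restricts the flag to the first 1 itself
Data$outcome_onset <- with(Data,
  ave(outcome, patientid, FUN = function(x) +(cumsum(x) == 1 & x == 1)))
Data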
We can use which.max to get the index of the first 1 in the outcome variable, and make that row 1 and the rest 0.
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(outcome_onset = as.integer(row_number() %in% which.max(outcome)),
         outcome_onset = replace(outcome_onset, is.na(outcome), NA))
# patientid time outcome outcome_onset
# <dbl> <dbl> <dbl> <int>
#1 100 1 0 0
#2 100 2 1 1
#3 100 3 1 0
#4 101 1 0 0
#5 101 2 0 0
#6 101 3 1 1
#7 102 1 1 1
#8 102 2 1 0
#9 102 3 0 0
I have a dataset in long format (i.e. multiple observations per ID). Each ID contains multiple visits at which the individual was diagnosed for disease (in the toy example I show 3 diseases, but in my real data I have as many as 30), coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease at the time of diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had each disease across all visits where diagnosis_dt falls between start_dt and end_dt. Some IDs don't have diagnosis information and are consequently coded as NA in the respective columns. I'd still like to keep this information.
A toy example of my dataset is below:
library(dplyr)
library(data.table)
ex_dat <- data.frame(ID = c(rep("a",3),
                            rep("b",4),
                            rep("c",5)),
                     start_dt = as.Date(c(rep("2009-01-01",3),
                                          rep("2009-04-01",4),
                                          rep("2009-02-01",5))),
                     end_dt = as.Date(c(rep("2010-12-31",3),
                                        rep("2011-03-31",4),
                                        rep("2011-01-31",5))),
                     diagnosis_dt = c(as.Date(c("2011-01-03","2010-11-01","2009-12-01")),
                                      as.Date(c("2011-04-03","2010-11-01","2009-12-01","2011-12-01")),
                                      rep(NA,5)),
                     disease1 = c(c(1,0,0),
                                  c(1,1,0,1),
                                  rep(NA,5)),
                     disease2 = c(c(1,1,0),
                                  c(0,0,0,1),
                                  rep(NA,5)),
                     disease3 = c(c(0,0,0),
                                  c(0,0,1,0),
                                  rep(NA,5))
)
The desired output is:
ID disease1 disease2 disease3
1 a 0 1 0
2 b 1 0 1
3 c NA NA NA
I've been trying this for hours now and my latest attempt is:
out <- ex_dat %>%
  group_by(ID) %>%
  mutate_at(vars(disease1:disease3),
            function(x) ifelse(!is.na(.$diagnosis_dt) &
                                 between(.$diagnosis_dt, .$start_dt, .$end_dt) &
                                 sum(x) > 0,
                               1, 0)) %>%
  slice(1) %>%
  select(ID, disease1:disease3)
Here is a tidyverse solution: filter eliminates the rows that do not meet the desired condition, and complete then fills the missing groups with NA.
library(tidyverse)
ex_dat %>%
  # Group by ID
  group_by(ID) %>%
  # Keep the rows for which diagnosis_dt is between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # Summarize all variables that start with "disease" by taking their max value
  summarize_at(vars(starts_with("disease")), max) %>%
  # Complete the missing IDs: those that only had NA or did not meet
  # the criteria in the filter
  complete(ID)
# A tibble: 3 x 4
# ID disease1 disease2 disease3
# <fct> <dbl> <dbl> <dbl>
# 1 a 0 1 0
# 2 b 1 0 1
# 3 c NA NA NA
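Since the question also loads data.table, an equivalent sketch in that idiom (my addition, same filter-then-rejoin logic):
library(data.table)

setDT(ex_dat)
out <- ex_dat[diagnosis_dt >= start_dt & diagnosis_dt <= end_dt,  # NA dates drop out here too
              lapply(.SD, max),
              by = ID, .SDcols = patterns("^disease")]
# join back so IDs with no qualifying rows (like c) reappear as NA
out[unique(ex_dat[, .(ID)]), on = "ID"]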
Here's an approach with the dplyr across functionality (version >= 1.0.0):
library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt","end_dt","diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt > start_dt & diagnosis_dt < end_dt & .),
                             1, 0)))
## A tibble: 3 x 4
# ID disease1 disease2 disease3
# <fct> <dbl> <dbl> <dbl>
#1 a 0 1 0
#2 b 1 0 1
#3 c NA NA NA
Note that using the & operator on the integer column . converts to logical. I'm using the -one_of tidyselect verb because then we don't even need to know how many diseases there are. The columns that are actively being group_by-ed are automatically excluded.
Your version isn't working because 1) you need to summarize, not mutate, and 2) inside the function call, . refers to the column currently being worked on, not to the data coming through the pipe. Instead, you need to access those columns by bare name from the calling environment.
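Putting those two fixes together, a corrected sketch of the attempt (my rewrite, keeping the inclusive between-style bounds from the question):
library(dplyr)

# summarize (not mutate) per ID; inside the lambda, . is the disease column
# and the date columns are referenced by bare name
ex_dat %>%
  group_by(ID) %>%
  summarize_at(vars(disease1:disease3),
               ~ if_else(any(diagnosis_dt >= start_dt &
                               diagnosis_dt <= end_dt & . > 0),
                         1, 0))
All-NA groups propagate NA through any(), which matches the desired NA row for ID c.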
I'm trying to modify a solution posted here Create cohort dropout rate table from raw data
I'd like to create a CUMULATIVE dropout rate table using these data.
DT <- data.table(
  id = c(1,2,3,4,5,6,7,8,9,10,
         11,12,13,14,15,16,17,18,19,20,
         21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
  year = c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
           2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
           2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
  cohort = c(1,1,1,1,1,1,1,1,1,1,
             2,2,2,1,1,2,1,2,1,2,
             1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
So far, I've been able to get to this point:
library(tidyverse)
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = (`2014` - `2015`),
         y2015_2016_dropouts = (`2015` - `2016`)) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts/`2014`,
         y2015_2016_cumulative = y2015_2016_dropouts/`2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)
A cumulative dropout rate table reflects the proportion of students within a class who dropped out of school across years.
# A tibble: 3 x 5
cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 2 0.6 0.8
2 2 0 2 NA NA
3 3 0 0 NA NA
The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.
I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.
Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:
Organize and order by cohort and year:
DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]
Assign the year range (this needs to happen per cohort, otherwise the label crosses cohort boundaries; shift is data.table's lag):
DT2[, year := paste(shift(year), year, sep = "_"), cohort]
Get dropouts per year:
DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), cohort]
Get the cumulative sum of proportion dropped out each year per cohort:
DT2[, cumul := cumsum(dropouts) / max(N), cohort]
Output:
> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000
Because you spread your data by year early in your pipe, and the 2014 column has NA values for everything related to cohort 2, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you replace the definition of that variable from the current
y2015_2016_cumulative = y2015_2016_dropouts/`2014` + y2014_2015_cumulative
to
y2015_2016_cumulative = y2015_2016_dropouts/coalesce(`2014`, `2015`) +
  coalesce(y2014_2015_cumulative, 0)
you should be good to go. The coalesce function tries the first argument, but inputs the second argument if the first is NA. That being said, this current method isn't extremely scalable. You would have to add additional coalesce statements for every year you added. If you keep your data in the tidy format, you can keep a running list at the year-cohort level using
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n,
         dropout_rate = dropouts / max(n)) %>%
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
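Circling back to coalesce, a one-line illustration of the fallback behavior described above:
library(dplyr)
coalesce(c(NA, 2, NA), c(10, 20, 30))
# [1] 10  2 30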
This is a sample of my data set:
day city count
1 1 A 50
2 2 A 100
3 2 B 110
4 2 C 90
Here is the code for reproducing it:
df <- data.frame(
  day = c(1,2,2,2),
  city = c("A","A","B","C"),
  count = c(50,100,110,90)
)
As you can see, the count data is missing for cities B and C on day 1. What I want to do is use city A's count as an estimate for the other two cities. So the desired output would be:
day city count
1 1 A 50
2 1 B 50
3 1 C 50
4 2 A 100
5 2 B 110
6 2 C 90
I could come up with a for loop to do it, but I feel like there should be an easier way. My idea is to count the number of observations for each day and then, for days where the number of observations is less than the number of cities in the data set, replicate a row to complete the data for that day. Any better ideas, or a more efficient for loop? Thanks.
With dplyr and tidyr, we can do:
library(dplyr)
library(tidyr)
df %>%
  expand(day, city) %>%
  left_join(df) %>%
  group_by(day) %>%
  fill(count, .direction = "up") %>%
  fill(count, .direction = "down")
Alternatively, we can avoid the left_join using thelatemail's solution:
df %>%
  complete(day, city) %>%
  group_by(day) %>%
  fill(count, .direction = "up") %>%
  fill(count, .direction = "down")
Both return:
# A tibble: 6 x 3
day city count
<dbl> <fct> <dbl>
1 1. A 50.
2 1. B 50.
3 1. C 50.
4 2. A 100.
5 2. B 110.
6 2. C 90.
Data (slightly modified to show .direction filling both directions):
df <- data.frame(
  day = c(1,2,2,2),
  city = c("B","A","B","C"),
  count = c(50,100,110,90)
)
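As an aside, tidyr 1.0.0 added combined fill directions, so the two fill calls above can likely be collapsed into one (assuming a recent tidyr):
df %>%
  complete(day, city) %>%
  group_by(day) %>%
  fill(count, .direction = "updown")  # fill up first, then down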