I have a series of admissions for patients (dataframe 'admissions' below) and a series of events (2nd dataframe called 'events').
I am interested in whether events occurred within 5 days after an admission. Obviously matches have to be made within patient ID ('id').
In real life, the admissions data frame contains >500k admissions for 100k patients. One patient might have multiple admissions and multiple events, and not all patients will have an event.
admissions <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), date = structure(c(16436,
16443, 16574, 16468, 16481, 16494), class = "Date")), .Names = c("id",
"date"), row.names = c(NA, 6L), class = "data.frame")
> admissions
id date
1 1 2015-01-01
2 1 2015-01-08
3 1 2015-05-19
4 2 2015-02-02
5 2 2015-02-15
6 2 2015-02-28
events <- structure(list(id = c(1L, 1L, 2L), date = structure(c(16453,
16578, 16467), class = "Date")), .Names = c("id", "date"), row.names = 7:9, class = "data.frame")
> events
id date
7 1 2015-01-18
8 1 2015-05-23
9 2 2015-02-01
I guess I just need the minimum difference in days (only positive values considered) for each event relative to the admissions, matched within patients.
Event 1 (id == 1): +10 days (10 days after 2015-01-08)
Event 2 (id == 1): +4 days
Event 3 (id == 2): -1 day
I can then select those events that fall within my window (which will probably be 5 days).
My guess would be that lapply() is involved, but for some reason the apply family is not very natural to me (yet!).
Using dplyr:
library(dplyr)
mutate(events, event_id=row_number()) %>% # Add event id
right_join(admissions, by="id") %>% # Join with admissions
rename(adm_date = date.y, ev_date = date.x) %>% # Clean names
mutate(diff = ev_date - adm_date) %>% # Compute difference
filter(diff >= 0) %>% # Filter
group_by(event_id) %>%
arrange(diff) %>% # Sort ascending by diff by event_id
summarise_each(funs(first), ev_date, adm_date, diff) # Get nearest
Source: local data frame [2 x 4]
event_id ev_date adm_date diff
1 1 2015-01-18 2015-01-08 10 days
2 2 2015-05-23 2015-05-19 4 days
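The question also asks to keep only the events inside the window. Assuming the 5-day cutoff mentioned there (my addition), one more filter on the pipeline's result does it:
# `nearest` is a hypothetical name for the result of the pipeline above
filter(nearest, diff <= 5) # only event 2 (+4 days) survives a 5-day window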
Using data.table rolling join:
keycols <- c("id", "date")
admissions_dt <- admissions %>% mutate(adm_date = date) %>% as.data.table()
setkeyv(admissions_dt, keycols)
events_dt <- mutate(events, event_id=row_number()) %>% as.data.table()
setkeyv(events_dt, keycols)
admissions_dt[events_dt, roll=10][order(event_id)]
id date adm_date event_id
1: 1 2015-01-18 2015-01-08 1
2: 1 2015-05-23 2015-05-19 2
3: 2 2015-02-01 <NA> 3
Using data.table 1.9.5 for its on= feature.
For each row in event, find the index corresponding to the closest date <= admissions$date.
idx = setDT(admissions)[events, which=TRUE, roll=TRUE, on=c("id", "date")]
idx
# [1] 2 3 NA
If you already know you'll only consider a 5-day window, then you can use roll=5 instead of roll=TRUE. roll=<positive number> performs a LOCF (last observation carried forward) rolling join.
The indices correspond to the matching rows in admissions for each row of events. So we can now extract the date as follows:
setDT(events)[, adm_date := admissions$date[idx]]
# id date adm_date
# 1: 1 2015-01-18 2015-01-08
# 2: 1 2015-05-23 2015-05-19
# 3: 2 2015-02-01 <NA>
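To finish with the window from the question (a minimal sketch of mine, assuming the 5-day cutoff): with roll = 5, events with no admission in the preceding 5 days get an NA index, so they can be dropped directly.
idx5 <- admissions[events, which = TRUE, roll = 5, on = c("id", "date")]
events[!is.na(idx5)] # only events with an admission within the 5 days before them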
Related
My data (df) looks similar to this:
date       address1 address2
2015-01-01        2        8
2015-01-02        3        7
2015-01-03        7        3
2015-01-04        3        1
2015-01-05        9        4
2015-01-06        3        4
I want to get the 3-day average of the values at each address, like this:
date       address1 address2
2015-01-03        4        6
2015-01-06        5        3
I have tried to extract every third date with d <- date[seq(1, length(date), by = 3)]. I calculated the sums with dat <- rowsum(df[,-1], rep(1:2, each = 3)), then divided the whole data frame by 3 and combined d and dat.
I have tried to find a rowmean function that works like rowsum, but did not manage to. Rolling means do not suit my case either, as they average overlapping windows (rows get used more than once).
Please help me improve my method. Thanks a lot.
You can create a group for every 3 rows and take the mean of all the "address" columns:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
group_by(grp = ceiling(row_number()/3)) %>%
summarise(date = last(date),
across(starts_with('address'), mean, na.rm = TRUE)) %>%
select(-grp)
# date address1 address2
# <date> <dbl> <dbl>
#1 2015-01-03 4 6
#2 2015-01-06 5 3
Another option is to cut the dates into 3-day bins, but this labels each group with its starting date rather than its ending date.
df %>%
mutate(date = as.Date(date)) %>%
group_by(date = cut(date, '3 days')) %>%
summarise(across(starts_with('address'), mean, na.rm = TRUE))
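If the ending date of each group is preferred here as well, one possible tweak (an untested sketch on my part, assuming complete 3-day groups) is to shift the labels forward:
df %>%
  mutate(date = as.Date(date)) %>%
  group_by(date = as.Date(cut(date, '3 days')) + 2) %>% # relabel with the group's last day
  summarise(across(starts_with('address'), mean, na.rm = TRUE))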
The solutions below use the input shown reproducibly in the Note at the end. The first two use only base R. The first requires that the number of rows be a multiple of 3, but the others do not have this restriction.
1) rowsum Create a grouping vector, date, that labels each block of 3 rows with the block's last date, and use it as the second argument to rowsum, giving the numeric matrix shown.
nr <- nrow(df)
date <- df$date[ 3 * col(matrix(0, 3, nr/3)) ]
rowsum(df[-1], date) / 3
## address1 address2
## 2015-01-03 4 6
## 2015-01-06 5 3
2) aggregate Alternatively, use aggregate, giving a 3-column data frame.
nr <- nrow(df)
date <- ave(df$date, seq(0, length = nr) %/% 3, FUN = max)
aggregate(df[-1], data.frame(date), mean)
## date address1 address2
## 1 2015-01-03 4 6
## 2 2015-01-06 5 3
3) collap collap from the collapse package can be used in place of aggregate. date is from (2).
library(collapse)
collap(df[-1], date)
## date address1 address2
## 1 2015-01-03 4 6
## 2 2015-01-06 5 3
4) data.table Using data.table and date from (2), this returns a data.table (which is also a data frame).
library(data.table)
as.data.table(df[, -1])[, lapply(.SD, mean), by = .(date)]
## date address1 address2
## 1: 2015-01-03 4 6
## 2: 2015-01-06 5 3
Note
The input in reproducible form is:
df <-
structure(list(date = c("2015-01-01", "2015-01-02", "2015-01-03",
"2015-01-04", "2015-01-05", "2015-01-06"), address1 = c(2L, 3L,
7L, 3L, 9L, 3L), address2 = c(8L, 7L, 3L, 1L, 4L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
Another base R option with aggregate + ave
aggregate(
. ~ date,
transform(
df,
date = ave(date, ceiling(seq_along(date) / 3), FUN = max)
),
mean
)
gives
date address1 address2
1 2015-01-03 4 6
2 2015-01-06 5 3
I have a dataset in Excel that is structured as follows:
A B C
ID Start_date End_date
1 01/01/2000 05/01/2000
1 06/01/2000 15/05/2000
1 16/05/2000 07/04/2018
2 06/07/2016 09/10/2019
2 10/10/2019 14/12/2019
3 02/08/2000 06/08/2006
3 07/08/2006 15/02/2020
4 05/09/2012 09/11/2017
I would like to create a time series of the number of unique values in the above dataset that occur more than 3 times in the 12 months prior to any month in the date range covered by the dataset (in this case 01/01/2000 - 15/02/2020). So, for example, the number of unique values appearing more than three times in the 12 months prior to January 2001 would be 1 (ID = 1).
I've tried this in Excel using the following formula:
{=SUM(--(FREQUENCY(IF(($B$2:$B$8<=EOMONTH('Time Series'!A2,0))*($C$2:$C$8>=EOMONTH('Time Series'!A2,-12)),$A$2:$A$8),$A$2:$A$8)>0))}
Where the value in 'Time Series'!A2 is January 2001.
However, this only returns the number of unique values that occur in the 12 months prior to January 2001, not how many unique values occur more than three times in the period.
Any help on this would be greatly appreciated - while I have been doing this in Excel so far, I would be open to performing the calculation in R if that would prove simpler.
I am not sure if I understood your question correctly:
1. Create a minimal reproducible example:
df <-structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L),
Start_date = c("01/01/2000", "06/01/2000", "16/05/2000", "06/07/2016", "10/10/2019", "02/08/2000", "07/08/2006", "05/09/2012"),
End_date = c("05/01/2000", "15/05/2000","07/04/2018", "09/10/2019", "14/12/2019", "06/08/2006", "15/02/2020", "09/11/2017")),
class = "data.frame", row.names = c(NA, -8L))
head(df)
Returns:
ID Start_date End_date
1 1 01/01/2000 05/01/2000
2 1 06/01/2000 15/05/2000
3 1 16/05/2000 07/04/2018
4 2 06/07/2016 09/10/2019
5 2 10/10/2019 14/12/2019
6 3 02/08/2000 06/08/2006
Suggested solution using dplyr
Format the date columns with as.Date:
library(dplyr)
df_formated <- df %>%
mutate(Start_date = as.Date(Start_date, "%d/%m/%Y"),
End_date = as.Date(End_date, "%d/%m/%Y"))
str(df_formated)
Returns:
'data.frame': 8 obs. of 3 variables:
 $ ID        : int 1 1 1 2 2 3 3 4
 $ Start_date: Date, format: "2000-01-01" "2000-01-06" "2000-05-16" "2016-07-06" ...
 $ End_date  : Date, format: "2000-01-05" "2000-05-15" "2018-04-07" "2019-10-09" ...
Filter by cutoff_date, count occurrences, and filter by min_number_of_occurences:
cutoff_date <- as.Date("01/01/2001", "%d/%m/%Y")
min_number_of_occurences <- 3
df_formated %>%
filter(Start_date < cutoff_date) %>%
group_by(ID) %>%
summarise(N = n()) %>%
filter(N >= min_number_of_occurences)
Returns:
# A tibble: 1 x 2
ID N
<int> <int>
1 1 3
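To extend this to the monthly time series asked about (a sketch of mine; the month endpoints and the 12-month look-back window are assumptions based on the question), the same count can be repeated over a sequence of cutoff dates:
# one cutoff per month over the period covered by the data (assumed endpoints)
cutoffs <- seq(as.Date("2001-01-01"), as.Date("2020-02-01"), by = "month")
n_ids <- sapply(cutoffs, function(cutoff) {
  window_start <- seq(cutoff, by = "-12 months", length.out = 2)[2]
  df_formated %>%
    filter(Start_date < cutoff, End_date >= window_start) %>%
    count(ID) %>%
    filter(n >= min_number_of_occurences) %>%
    nrow()
})
time_series <- data.frame(month = cutoffs, n_ids = n_ids)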
I have data like
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/1/01 3
I want to get R to work out the number of days since the first sample for each subject, e.g.:
Subject days
1 0
1 2
1 9
2 0
2 3
How can I do this? I have converted the dates using lubridate.
Something like:
for(i in 1:nrow(data)){
if(data$date[i] != data$date[i -1]) {
data$timeline <- data$date[i] - data$date[i-1]
}
}
I get the error:
argument is of length 0. I think the problem is the first iteration, where there is no preceding row?
I would use dplyr to do some grouping and data manipulation. Note that we first have to convert your date into something R will recognize as a date.
library(dplyr)
dat$Date <- as.Date(dat$date, '%d/%m/%y')
dat %>%
group_by(subject) %>%
mutate(days = Date - min(Date))
# subject date number Date days
# <int> <chr> <int> <date> <time>
# 1 1 1/2/01 4 2001-02-01 0
# 2 1 3/2/01 6 2001-02-03 2
# 3 1 10/2/01 7 2001-02-10 9
# 4 2 1/1/01 2 2001-01-01 0
# 5 2 4/3/01 3 2001-03-04 62
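If you'd rather have a plain numeric column than the difftime shown above (a small optional tweak of mine), wrap the difference in as.numeric():
dat %>%
  group_by(subject) %>%
  mutate(days = as.numeric(Date - min(Date), units = "days"))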
here's the data:
dat <- structure(list(subject = c(1L, 1L, 1L, 2L, 2L), date = c("1/2/01",
"3/2/01", "10/2/01", "1/1/01", "4/3/01"), number = c(4L, 6L,
7L, 2L, 3L), Date = structure(c(11354, 11356, 11363, 11323, 11385
), class = "Date")), .Names = c("subject", "date", "number",
"Date"), row.names = c(NA, -5L), class = "data.frame")
Using the input shown in the Note, convert the date column to Date class (assuming it is in the form dd/mm/yy) and then use ave to subtract the least date from all the dates for each subject. If the input is sorted as in the question, we could optionally use x[1] instead of min(x). No packages are used.
data$date <- as.Date(data$date, "%d/%m/%y")
diff1 <- function(x) x - min(x)
with(data, data.frame(subject, days = ave(as.numeric(date), subject, FUN = diff1)))
giving:
subject days
1 1 0
2 1 2
3 1 9
4 2 0
5 2 62
Note
The input used, in reproducible form, is:
Lines <- "
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/3/01 3"
data <- read.table(text = Lines, header = TRUE)
I have a dataframe as follows
structure(list(HospNum_Id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), VisitDate = c("13/02/03", "13/04/05", "13/05/12",
"13/12/06", "13/04/12", "13/05/13", "13/06/14", "13/04/15", "03/04/15",
"04/05/16", "04/06/16"), EVENT = c("EMR", "RFA", "nothing", "nothing",
"EMR", "nothing", "EMR", "EMR", "RFA", "EMR", "nothing")), .Names = c("HospNum_Id",
"VisitDate", "EVENT"), class = "data.frame", row.names = c(NA,
-11L))
I want to select, for each HospNum_Id, only the rows where the current row's EVENT is "EMR" and the row prior to it (in ascending date order) is "nothing".
My desired output is:
HospNum_Id VisitDate EVENT
2 13/12/06 nothing
2 13/04/12 EMR
2 13/05/13 nothing
2 13/06/14 EMR
but my current output is:
HospNum_Id VisitDate EVENT
(int) (chr) (chr)
1 2 13/04/12 EMR
2 2 13/06/14 EMR
3 2 13/04/15 EMR
Currently I have the following code, but it's letting me down, I think because I am using first() in the filter rather than an expression meaning "the row before the one that has EMR in EVENT":
Upstaging<-Therap %>%
arrange(HospNum_Id, as.Date(Therap$VisitDate, '%d/%m/%y')) %>%
group_by(HospNum_Id) %>%
filter(first(EVENT == "nothing") & EVENT == "EMR")
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'HospNum_Id', we get the index ('i1') where 'EVENT' is "EMR" and the previous value is "nothing". We combine that with the index of the previous element ('i1 - 1'), sort, and get the row index (.I). With that, we subset the rows.
library(data.table)
v1 <- setDT(df1)[, {i1 <- which(EVENT == "EMR" & shift(EVENT)=="nothing")
.I[sort(c(i1, i1-1))] } , by = HospNum_Id]$V1
df1[v1]
# HospNum_Id VisitDate EVENT
#1: 2 13/12/06 nothing
#2: 2 13/04/12 EMR
#3: 2 13/05/13 nothing
#4: 2 13/06/14 EMR
Or, with similar methodology, using dplyr:
library(dplyr)
df1 %>%
group_by(HospNum_Id) %>%
mutate(ind = EVENT=="nothing" & lead(EVENT)=="EMR") %>%
slice(sort(c(which(ind),which(ind)+1))) %>%
select(-ind)
# HospNum_Id VisitDate EVENT
# <int> <chr> <chr>
#1 2 13/12/06 nothing
#2 2 13/04/12 EMR
#3 2 13/05/13 nothing
#4 2 13/06/14 EMR
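Equivalently (a small variant sketch of mine), the condition can be phrased from the EMR row's point of view with lag(), which matches the wording "the one prior is nothing" more literally:
df1 %>%
  group_by(HospNum_Id) %>%
  mutate(ind = EVENT == "EMR" & lag(EVENT) == "nothing") %>%
  slice(sort(c(which(ind) - 1, which(ind)))) %>% # each match plus the row before it
  select(-ind)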
The desired result can be obtained using only elementary operations:
Step 1. Load the data.
Step 2. Arrange the data frame in ascending date order.
Step 3. Select the rows having EVENT == "EMR" into one data frame, and create another data frame containing the rows immediately before them.
Step 4. Remove duplicates and sort by date.
a <- df1 # the loaded data frame
a <- a[order(as.Date(a$VisitDate, format = "%d/%m/%Y")), , drop = FALSE]
revdf <- a[rev(rownames(a)), ]
b <- revdf[which(revdf$EVENT == "EMR"), ]
c <- revdf[which(revdf$EVENT == "EMR") - 1, ]
d <- rbind(b, c)
e <- d[!duplicated(d), ]
f <- e[order(as.Date(e$VisitDate, format = "%d/%m/%Y")), , drop = FALSE]
revdf1 <- f[rev(rownames(f)), ]
output:
>revdf1
HospNum_Id VisitDate EVENT
11 3 04/06/16 nothing
10 3 04/05/16 EMR
8 2 13/04/15 EMR
9 3 03/04/15 RFA
7 2 13/06/14 EMR
3 1 13/05/12 nothing
5 2 13/04/12 EMR
2 1 13/04/05 RFA
1 1 13/02/03 EMR
I have a large data set of time periods, defined by a 'start' and and an 'end' column. Some of the periods overlap.
I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.
Some example data:
ID start end
1 A 2013-01-01 2013-01-05
2 A 2013-01-01 2013-01-05
3 A 2013-01-02 2013-01-03
4 A 2013-01-04 2013-01-06
5 A 2013-01-07 2013-01-09
6 A 2013-01-08 2013-01-11
7 A 2013-01-12 2013-01-15
Desired result:
ID start end
1 A 2013-01-01 2013-01-06
2 A 2013-01-07 2013-01-11
3 A 2013-01-12 2013-01-15
What I have tried:
require(dplyr)
data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200,
1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA,
-7L), class = "data.frame")
remove.overlaps <- function(data){
data2 <- data
for ( i in 1:length(unique(data$start))) {
x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
x4 <- x3[1,]
x4$end <- max(x3$end)
data2 <- filter(data2, start<data$start[i] | start>data$end[i])
data2 <- rbind(data2,x4)
}
data2 <- na.omit(data2)
}
data <- remove.overlaps(data)
Here's a possible solution. The basic idea here is to compare the lagged start date with the maximum end date "until now" using the cummax function, and to create an index that separates the data into groups:
data %>%
arrange(ID, start) %>% # as suggested by #Jonno in case the data is unsorted
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
@David Arenburg's answer is great, but I ran into an issue where an earlier interval ended after a later interval, so using last in the summarise call resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end):
data %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end))
Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.
For the sake of completeness, the IRanges package on Bioconductor has some neat functions which can be used to deal with date or date-time ranges. One of them is the reduce() function, which merges overlapping or adjacent ranges.
However, there is a drawback because IRanges works on integer ranges (hence the name), so the convenience of using IRanges functions comes at the expense of converting Date or POSIXct objects to and fro.
Also, it seems that dplyr doesn't play well with IRanges (at least judged by my limited experience with dplyr) so I use data.table:
library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)
setDT(data)[, {
ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
.(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
ID start end
<fctr> <POSc> <POSc>
1: A 2013-01-01 2013-01-06
2: A 2013-01-07 2013-01-11
3: A 2013-01-12 2013-01-15
A code variant is
setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
, lapply(.SD, as_datetime), .SDcols = -"width"],
by = ID]
In both variants, as_datetime() from the lubridate package is used, which spares us from specifying the origin when converting numbers to POSIXct objects.
It would be interesting to see a benchmark comparison of the IRanges approaches vs David's answer.
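A rough sketch of what such a benchmark could look like (my addition, not run; it assumes the packages loaded above plus microbenchmark, the question's data, and hypothetical wrapper names):
library(microbenchmark)
dplyr_way <- function(d) {
  d %>%
    arrange(ID, start) %>%
    group_by(ID) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
    group_by(ID, indx) %>%
    summarise(start = min(start), end = max(end), .groups = "drop")
}
iranges_way <- function(d) {
  as.data.table(d)[, {
    ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
    .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
  }, by = ID]
}
microbenchmark(dplyr_way(data), iranges_way(data), times = 100)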
It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRanges method because the package was not yet available for R 3.5.1.)
Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2018-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.
library(data.table)
start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")
# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]
# set keys (also orders)
setkey(d2, ID, start_col, end_col)
# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col,
END_DT = end_col,
indx = c(0, cumsum(as.numeric(shift(start_col, type = "lead")) > cummax(as.numeric(end_col)))[-.N])),
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
]
I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).
The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.
It seems like you want to do this by ID, so I've also grouped by ID.
library(ivs)
library(dplyr)
data <- tribble(
~ID, ~start, ~end,
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-02", "2013-01-03",
"A", "2013-01-04", "2013-01-06",
"A", "2013-01-07", "2013-01-09",
"A", "2013-01-08", "2013-01-11",
"A", "2013-01-12", "2013-01-15"
) %>%
mutate(
start = as.Date(start),
end = as.Date(end)
)
data
#> # A tibble: 7 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 A 2013-01-01 2013-01-05
#> 2 A 2013-01-01 2013-01-05
#> 3 A 2013-01-02 2013-01-03
#> 4 A 2013-01-04 2013-01-06
#> 5 A 2013-01-07 2013-01-09
#> 6 A 2013-01-08 2013-01-11
#> 7 A 2013-01-12 2013-01-15
# Combine `start` and `end` into a single interval vector column
data <- data %>%
mutate(interval = iv(start, end), .keep = "unused")
# Note that this is a half-open interval!
data
#> # A tibble: 7 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-05)
#> 2 A [2013-01-01, 2013-01-05)
#> 3 A [2013-01-02, 2013-01-03)
#> 4 A [2013-01-04, 2013-01-06)
#> 5 A [2013-01-07, 2013-01-09)
#> 6 A [2013-01-08, 2013-01-11)
#> 7 A [2013-01-12, 2013-01-15)
# It seems like you'd want to group by ID, so let's do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
group_by(ID) %>%
summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-06)
#> 2 A [2013-01-07, 2013-01-11)
#> 3 A [2013-01-12, 2013-01-15)
Created on 2022-04-05 by the reprex package (v2.0.1)