I have a dataset of several thousand ICU patients covering several years. Some patients (each with a unique identifier, ID) have had multiple ICU admissions. Each row covers a single ICU admission, and therefore an individual patient may have multiple rows of data.
For each patient, I want to determine whether their ICU admission was:
A readmission during the same hospital stay. This could be identified by an icu_adm time occurring prior to their previous hosp_dis time, or by multiple rows with the same hosp_dis time.
A transfer to a different ICU for management of the same illness. I am defining this as an icu_adm time occurring within 24 hours of their previous hosp_dis time. These patients' icu_dis and hosp_dis times should be the same, as their hospital discharge occurred from ICU.
A new admission of the same patient
I am able to use lubridate to compare times without difficulty, but I am stuck on how to do the between-row comparisons, especially for patients with multiple ICU admissions (who have new admissions, readmissions, and transfers all in the time period of interest).
Some example data:
ID site icu_adm icu_dis hosp_adm hosp_dis
1 A 2016-02-02 15:38:00 2016-02-06 14:25:00 2016-02-02 15:17:00 2016-02-06 14:25:00
1 B 2016-02-06 16:17:00 2016-02-16 14:16:00 2016-02-06 16:16:00 2016-03-16 17:50:00
2 C 2009-08-09 14:27:00 2009-08-10 15:06:00 2009-08-03 02:51:00 2009-09-02 00:00:00
2 C 2009-08-18 20:32:00 2009-08-27 15:10:00 2009-08-03 02:51:00 2009-09-02 00:00:00
3 A 2010-02-20 21:00:00 2010-03-03 13:00:00 2010-02-18 03:00:00 2010-03-18 15:21:00
3 B 2010-05-05 17:00:00 2010-05-08 09:13:00 2010-05-03 11:21:00 2010-05-20 17:18:00
Desired output would be:
ID … readmission transferred new_adm
1 0 0 1
1 0 1 0
2 0 0 1
2 1 0 0
3 0 0 1
3 0 0 1
I'm not entirely sure this will work with all of your data, but thought this might be helpful.
Using tidyverse (or the dplyr package in this case), you can start by grouping by ID to look at transfers. Based on your definition, if the icu_adm time is within 24 hours of the previous row's hospital discharge time (hosp_dis), then it is considered an ICU transfer. You can use lag to compare with the previous row, assuming dates/times are in chronological order (if not, you can use arrange to order them first; see the sketch after the data below).
Next, you can group by ID, hosp_adm, and hosp_dis. This will help look at readmissions. After grouping, all rows of data after the first row (for the same hospital admission) will be considered ICU readmissions.
Then, everything left that is not a transfer or readmission could be considered a new ICU admission.
Let me know if this is what you had in mind.
library(dplyr)
df %>%
  group_by(ID) %>%
  # transfer: ICU admission within 24 hours of the previous row's hospital discharge
  mutate(transfer = ifelse(abs(difftime(icu_adm, lag(hosp_dis, default = first(hosp_dis)),
                                        units = "hours")) < 24, 1, 0)) %>%
  group_by(ID, hosp_adm, hosp_dis) %>%
  # readmission: any ICU row after the first within the same hospital stay
  mutate(readmission = ifelse(row_number() > 1, 1, 0),
         # new admission: anything that is neither a transfer nor a readmission
         new_adm = ifelse(transfer != 1 & readmission != 1, 1, 0))
Output
ID site icu_adm icu_dis hosp_adm hosp_dis transfer readmission new_adm
<int> <chr> <dttm> <dttm> <dttm> <dttm> <dbl> <dbl> <dbl>
1 1 A 2016-02-02 15:38:00 2016-02-06 14:25:00 2016-02-02 15:17:00 2016-02-06 14:25:00 0 0 1
2 1 B 2016-02-06 16:17:00 2016-02-16 14:16:00 2016-02-06 16:16:00 2016-03-16 17:50:00 1 0 0
3 2 C 2009-08-09 14:27:00 2009-08-10 15:06:00 2009-08-03 02:51:00 2009-09-02 00:00:00 0 0 1
4 2 C 2009-08-18 20:32:00 2009-08-27 15:10:00 2009-08-03 02:51:00 2009-09-02 00:00:00 0 1 0
5 3 A 2010-02-20 21:00:00 2010-03-03 13:00:00 2010-02-18 03:00:00 2010-03-18 15:21:00 0 0 1
6 3 B 2010-05-05 17:00:00 2010-05-08 09:13:00 2010-05-03 11:21:00 2010-05-20 17:18:00 0 0 1
Data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), site = c("A",
"B", "C", "C", "A", "B"), icu_adm = structure(c(1454427480, 1454775420,
1249828020, 1250627520, 1266699600, 1273078800), tzone = "", class = c("POSIXct",
"POSIXt")), icu_dis = structure(c(1454768700, 1455632160, 1249916760,
1251385800, 1267621200, 1273309980), tzone = "", class = c("POSIXct",
"POSIXt")), hosp_adm = structure(c(1454426220, 1454775360, 1249267860,
1249267860, 1266462000, 1272885660), tzone = "", class = c("POSIXct",
"POSIXt")), hosp_dis = structure(c(1454768700, 1458150600, 1251849600,
1251849600, 1268925660, 1274375880), tzone = "", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -6L))
I have the below-mentioned dataframe in R.
DF
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using the above-mentioned dataframe, I want to break the count and sum down by month and by time bucket, based on the Datetime column.
Required Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
You can bin the hours of the day by using hour from the lubridate package and then cut from base R, before summarizing with dplyr.
Here, I am assuming that your Datetime column is actually in a date-time format and not just a character string or factor. If it is a string or factor, first convert it with DF$Datetime <- as.POSIXct(as.character(DF$Datetime)).
library(tidyverse)
# bin the hour of day (0-23) into five buckets
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 5.99, 11.99, 14.99, 19.99, 24))
levels(DF$bins) <- c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
                     "15:00 to 19:59", "20:00 to 23:59")
newDF <- DF %>%
  group_by(bins, .drop = FALSE) %>%   # .drop = FALSE keeps empty bins (e.g. 06:00 to 11:59)
  summarise(Count = length(Value), Total = sum(Value))
This gives the following result:
newDF
#> # A tibble: 5 x 3
#> bins Count Total
#> <fct> <int> <dbl>
#> 1 00:00 to 05:59 2 45
#> 2 06:00 to 11:59 0 0
#> 3 12:00 to 14:59 1 20
#> 4 15:00 to 19:59 3 35
#> 5 20:00 to 23:59 1 15
And if you want to add January as a first row (though I'm not sure how much sense this makes in this context) you could do:
newDF %>%
summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
#> # A tibble: 6 x 3
#> bins Count Total
#> <chr> <int> <dbl>
#> 1 January 7 115
#> 2 00:00 to 05:59 2 45
#> 3 06:00 to 11:59 0 0
#> 4 12:00 to 14:59 1 20
#> 5 15:00 to 19:59 3 35
#> 6 20:00 to 23:59 1 15
Incidentally, the reproducible version of the data I used for this was:
structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = ""), Value = c(10,
20, 25, 20, 10, 15, 15)), class = "data.frame", row.names = c(NA,
-7L))
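If you want the bin labels to match the question's AM/PM buckets exactly, a small follow-up sketch (assuming the same five bins created above) would be to relabel the factor levels:
levels(DF$bins) <- c("12:00 AM to 05:00 AM", "06:00 AM to 12:00 PM",
                     "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                     "08:00 PM to 12:00 AM")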
I have some trips for each person in each household. The first 2 rows of data look like this:
Household person trip are.time depends.time duration
1 1 0 02:20:00 08:20:00 NA
1 1 1 08:50:00 17:00:00 30
This means that the person started the trip at 8:20 and arrived at the destination at 8:50; that's why the duration is 30 (the trip duration in minutes).
Now I want to put the start time of each trip in the same row as that trip,
like this
Household person trip start.time. are.time depends.time duration
1 1 0 NA. 02:20:00 08:20:00 NA
1 1 1 08:20:00 08:50:00 17:00:00 30
Notice that for trip zero of each person we do not have a start time, so I put NA.
A solution using dplyr with base R functions for time manipulation.
First create the data
df <- data.frame(
Household = c(1, 1),
person = c(1, 1),
trip = c(0, 1),
are.time = c("02:20:00", "08:50:00"),
depends.time = c("08:20:00", "17:00:00"),
duration = c(NA, 30)
)
Then subtract 30 minutes where duration is not NA. I multiply duration*60 here because as.numeric(strptime()) converts the string to a value in seconds, and we want to subtract duration*60 seconds from that value.
library(dplyr)
df %>%
  mutate(start.time = case_when(
    !is.na(duration) ~ as.character(as.POSIXct(as.numeric(strptime(are.time, "%H:%M:%S")) - duration*60,
                                               origin="1970-01-01"), format="%H:%M:%S"),
    TRUE ~ NA_character_
  ))
Output:
Household person trip are.time depends.time duration start.time
1 1 1 0 02:20:00 08:20:00 NA <NA>
2 1 1 1 08:50:00 17:00:00 30 08:20:00
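For comparison, a rough equivalent using lubridate's time parsing instead of strptime (a sketch, assuming all times fall within a single day):
library(dplyr)
library(lubridate)
df %>%
  mutate(start.time = if_else(
    !is.na(duration),
    # parse HH:MM:SS to seconds, subtract the duration (minutes), format back
    format(as.POSIXct(period_to_seconds(hms(are.time)) - duration * 60,
                      origin = "1970-01-01", tz = "UTC"),
           "%H:%M:%S"),
    NA_character_
  ))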
I have a dataframe (tibble) with multiple rows, each row contains an IDNR, a start date, an end date and an exposure status. The IDNR is a character variable, the start and end date are date variables and the exposure status is a numerical variable. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying Cox regression, I want to split the rows into 90-day parts while maintaining the start and end dates. Here is an example of what I would like to achieve. What happens is that the new end date is start + 90 days, and a new row is created. This row has a start date equal to the end date of the previous row. If the time between start and end is now less than 90 days, this is fine (as for IDNR 1 and 3); however, for IDNR 2 the time still exceeds 90 days, so a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
start = c("2018-02-15","2017-10-30","2016-02-11"),
end = c("2018-07-01","2018-07-01","2016-12-03"),
exposure = c(0,0,1))
Do:
library(lubridate)
newDF = apply(df, 1, function(x){
  # 90-day cut points from start to end (the last piece may be shorter than 90 days)
  newStart = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
  # each piece ends where the next begins; the final piece ends at the original end date
  newEnd = c(newStart[-1], ymd(x["end"]))
  data.frame(IDNR = rep(x["IDNR"], length(newStart)),
             start = newStart,
             end = newEnd,
             exposure = rep(x["exposure"], length(newStart)))
})
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
What this does is create a sequence of days from start to end by 90 days and build a smaller data frame from them along with the IDNR and exposure. The apply returns a list of data frames that you can join together using do.call. The last line removes rows that have the same start and end date.
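Since the question asked for dplyr, a rough tidyverse equivalent (a sketch, assuming tidyr >= 1.0, dplyr, and lubridate) could be:
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  mutate(start = ymd(start), end = ymd(end)) %>%
  rowwise() %>%
  # one list element per row: the 90-day cut points from start to end
  mutate(new_start = list(seq(start, end, by = 90))) %>%
  ungroup() %>%
  unnest(new_start) %>%
  group_by(IDNR, start) %>%   # one group per original row
  mutate(new_end = lead(new_start, default = first(end))) %>%
  ungroup() %>%
  filter(new_start != new_end) %>%   # drop zero-length pieces
  select(IDNR, start = new_start, end = new_end, exposure)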
I have a data frame of users and access times. Access times can be duplicated.
I am trying to create a list of users grouped and named by a given time interval, e.g. year.
timestamp user
1 2013-03-06 01:00:00 1
2 2014-07-06 21:00:00 1
3 2014-07-31 23:00:00 2
4 2014-08-09 17:00:00 2
5 2014-08-14 20:00:00 2
6 2014-08-14 22:00:00 3
7 2014-08-16 15:00:00 3
8 2014-08-19 02:00:00 1
9 2014-12-28 18:00:00 1
10 2015-01-17 17:00:00 1
11 2015-01-22 22:00:00 2
12 2015-01-22 22:00:00 3
13 2015-03-23 15:00:00 4
14 2015-04-05 18:00:00 1
15 2015-04-06 01:00:00 2
My code example already creates a list of users grouped by year.
My problem is that this approach requires modifying the table itself, which becomes a problem with my tables of a million entries.
test <- structure(list(timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
"2014-07-31 23:00:00", "2014-08-09 17:00:00", "2014-08-14 20:00:00",
"2014-08-14 22:00:00", "2014-08-16 15:00:00", "2014-08-19 02:00:00",
"2014-12-28 18:00:00", "2015-01-17 17:00:00", "2015-01-22 22:00:00",
"2015-01-22 22:00:00", "2015-03-23 15:00:00", "2015-04-05 18:00:00",
"2015-04-06 01:00:00"), user = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
1L, 1L, 1L, 2L, 3L, 4L, 1L, 2L)), .Names = c("timestamp", "user"
), class = "data.frame", row.names = c(NA, -15L))
require(lubridate)
#Creating "POSIXct" object from string timestamp
timestamp <- lapply(test$timestamp,
function(x)parse_date_time(x, "y-m-d H:M:S"))
test$timestamp <- do.call(c,timestamp)
print(class(test$timestamp))
#Adding column for year
test <- cbind(test,sapply(timestamp, function(x)year(x)))
colnames(test)[3]<- "year"
#Creating list of year time intervals for users
intervals <- names(table(test$year))
users <- lapply(intervals, function(x)test[test$year %in% x,"user"])
names(users) <- intervals
without timestamps
treat the timestamp as a character. This only works if, for every timestamp, the first 4 digits represent the year.
library(dplyr)
test %>%
group_by( user, substr(timestamp,1,4 ) ) %>%
summarise( )
# user `substr(timestamp, 1, 4)`
# <int> <chr>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
dplyr + lubridate
will extract the year from the timestamp
library( dplyr )
library( lubridate )
test %>%
mutate( timestamp = as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" ) ) %>%
group_by( user, lubridate::year( timestamp ) ) %>%
summarise( )
# # Groups: user [?]
# user `year(timestamp)`
# <int> <dbl>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
table
a frequency table is also quickly made
table( test$user, substr( test$timestamp, 1, 4 ) )
# 2013 2014 2015
# 1 1 3 2
# 2 0 3 2
# 3 0 2 1
# 4 0 0 1
there are many more alternatives... pick one
edit
if speed is an issue, try data.table
dcast(
  setDT( test )[, timestamp := as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" )
              ][, .N, by = list( user, year = data.table::year( timestamp ) )],
  user ~ year,
  value.var = "N" )
# user 2013 2014 2015
# 1: 1 1 3 2
# 2: 2 NA 3 2
# 3: 3 NA 2 1
# 4: 4 NA NA 1
Another option using the lightning fast data.table package:
library(data.table)
setDT(test) # make `test` a data.table 'by reference' (no copy is made at all)
test[, j=.(users=list(unique(user))),
     by=.(year=substr(timestamp,1,4))]
year users
1: 2013 1
2: 2014 1,2,3
3: 2015 1,2,3,4
Again assuming your test$timestamp column is a character vector - otherwise substitute lubridate::year() as needed.
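If timestamp were already POSIXct, a minimal variant of the same call (assuming data.table is loaded) would be:
test[, j=.(users=list(unique(user))),
     by=.(year=data.table::year(timestamp))]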
Update:
Simple change to show grouping instead by month (just as it was mentioned in a comment):
test[, j=.(users=list(unique(user))),
     by=.(ym=substr(timestamp,1,7))]
ym users
1: 2013-03 1
2: 2014-07 1,2
3: 2014-08 2,3,1
4: 2014-12 1
5: 2015-01 1,2,3
6: 2015-03 4
7: 2015-04 1,2
Or group by day, to help demonstrate how to subset with chaining:
test[, j=.(users=list(unique(user))),
     by=.(ymd=substr(timestamp,1,10))][ymd >= '2014-08-01' & ymd <= '2014-08-21']
ymd users
1: 2014-08-09 2
2: 2014-08-14 2,3
3: 2014-08-16 3
4: 2014-08-19 1
Note for filtering/subsetting: if you are only interested in a subset of dates for a "one off" calculation (and you are not saving the whole aggregated set for other purposes), it will likely be more efficient to do the subset in the i of DT[i, j, by], as shown below.
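For example, a one-off sketch that subsets in i first, so only the matching rows are ever aggregated:
test[substr(timestamp, 1, 10) >= '2014-08-01' & substr(timestamp, 1, 10) <= '2014-08-21',
     .(users = list(unique(user))),
     by = .(ymd = substr(timestamp, 1, 10))]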
You could also use base (stats) function aggregate() as follows:
aggregate( x = test$user,
by = list(year=substr(test$timestamp,1,4)),
FUN = unique )
Result:
year x
1 2013 1
2 2014 1, 2, 3
3 2015 1, 2, 3, 4
The above works on the assumption that your timestamp column is initially just a character vector, exactly as included in your structured example data. In that case you may directly substr out the year with substr(test$timestamp,1,4), avoiding the need to first convert to dates.
However, if you have the timestamp column already as a date, simply substitute the lubridate::year() function you demonstrated in your attempted solution.
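For instance, a sketch assuming timestamp has already been converted to POSIXct:
aggregate( x = test$user,
           by = list(year = lubridate::year(test$timestamp)),
           FUN = unique )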
This question already has answers here:
How to create a lag variable within each group?
(5 answers)
Closed 5 years ago.
I have repeat encounters with animals that have a unique IndIDII and a unique GPSSrl number. Each encounter has a FstCptr date.
dat <- structure(list(IndIDII = c("BHS_115", "BHS_115", "BHS_372", "BHS_372",
"BHS_372", "BHS_372"), GPSSrl = c("035665", "036052", "034818",
"035339", "036030", "036059"), FstCptr = structure(c(1481439600,
1450162800, 1426831200, 1481439600, 1457766000, 1489215600), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("IndIDII", "GPSSrl", "FstCptr"
), class = "data.frame", row.names = c(1L, 2L, 29L, 30L, 31L,
32L))
> dat
IndIDII GPSSrl FstCptr
1 BHS_115 035665 2016-12-11
2 BHS_115 036052 2015-12-15
29 BHS_372 034818 2015-03-20
30 BHS_372 035339 2016-12-11
31 BHS_372 036030 2016-03-12
32 BHS_372 036059 2017-03-11
For each IndID-GPSSrl grouping, I want to create a new field (NextCptr) that documents the date of the next encounter. For the last encounter, the new field would be NA, for example:
dat$NextCptr <- as.Date(c("2015-12-15", NA, "2016-12-11", "2016-03-12", "2017-03-11", NA))
> dat
IndIDII GPSSrl FstCptr NextCptr
1 BHS_115 035665 2016-12-11 2015-12-15
2 BHS_115 036052 2015-12-15 <NA>
29 BHS_372 034818 2015-03-20 2016-12-11
30 BHS_372 035339 2016-12-11 2016-03-12
31 BHS_372 036030 2016-03-12 2017-03-11
32 BHS_372 036059 2017-03-11 <NA>
I would like to work within dplyr and group_by(IndIDII, GPSSrl).
As always, many thanks!
Group by column IndIDII then use lead to shift FstCptr forward by one:
dat %>% group_by(IndIDII) %>% mutate(NextCptr = lead(FstCptr))
# A tibble: 6 x 4
# Groups: IndIDII [2]
# IndIDII GPSSrl FstCptr NextCptr
# <chr> <chr> <dttm> <dttm>
#1 BHS_115 035665 2016-12-11 02:00:00 2015-12-15 02:00:00
#2 BHS_115 036052 2015-12-15 02:00:00 NA
#3 BHS_372 034818 2015-03-20 02:00:00 2016-12-11 02:00:00
#4 BHS_372 035339 2016-12-11 02:00:00 2016-03-12 02:00:00
#5 BHS_372 036030 2016-03-12 02:00:00 2017-03-11 02:00:00
#6 BHS_372 036059 2017-03-11 02:00:00 NA
If you need to shift the column in the opposite direction, lag could also be useful: dat %>% group_by(IndIDII) %>% mutate(NextCptr = lag(FstCptr)).