I have some trips for each person in each household. The first two rows of the data look like this:
Household person trip are.time depends.time duration
1 1 0 02:20:00 08:20:00 NA
1 1 1 08:50:00 17:00:00 30
This means the person started the trip at 8:20 and reached the destination at 8:50; that's why the duration is 30 (the trip duration in minutes).
Now I want to put the start time of each trip in the same row as that trip, like this:
Household person trip start.time are.time depends.time duration
1 1 0 NA 02:20:00 08:20:00 NA
1 1 1 08:20:00 08:50:00 17:00:00 30
Notice that for trip zero of each person there is no start time, so I put NA.
A solution using dplyr with base R functions for the time manipulation.
First, create the data:
df <- data.frame(
  Household = c(1, 1),
  person = c(1, 1),
  trip = c(0, 1),
  are.time = c("02:20:00", "08:50:00"),
  depends.time = c("08:20:00", "17:00:00"),
  duration = c(NA, 30)
)
Then subtract the duration (here 30 minutes) wherever it is not NA. I multiply duration by 60 because as.numeric(strptime()) converts the time string to a value in seconds, so we need to subtract duration*60 seconds from it.
library(dplyr)

df %>%
  mutate(start.time = case_when(
    !is.na(duration) ~ as.character(
      as.POSIXct(as.numeric(strptime(are.time, "%H:%M:%S")) - duration*60,
                 origin = "1970-01-01"),
      format = "%H:%M:%S"),
    TRUE ~ NA_character_
  ))
Output:
Household person trip are.time depends.time duration start.time
1 1 1 0 02:20:00 08:20:00 NA <NA>
2 1 1 1 08:50:00 17:00:00 30 08:20:00
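An equivalent, slightly more compact variant (just a sketch, using the same df as above): subtracting a number from a POSIXct subtracts seconds, and format() of an NA date-time returns NA, so the case_when isn't strictly needed.
library(dplyr)
df %>%
  mutate(start.time = format(as.POSIXct(are.time, format = "%H:%M:%S") - duration * 60,
                             "%H:%M:%S"))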
Related
This seems like it should be straightforward, but I cannot find a way to do it.
I have a sales cycle that begins around August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. So far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
The solution needs to let me specify the start date and increment week numbers every 7 days from that start date. The actual start date doesn't always fall on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018",
            "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
  dates = as.POSIXct(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16",
                       "2017-08-03", "2017-08-14", "2016-08-05", "2016-08-29")),
  week_n = c("1", "2", "1", "3", "1", "2", "1", "4")
)
I'd like the result to look like what is above: take the min date for each cycle, use that as the starting point, and then count week numbers up based on a given date's distance from the cycle start date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
  %>% group_by(cycle)
  %>% mutate(nw = 1 + as.numeric(round((dates - min(dates))/secs_per_week)))
)
The results don't match for 2017 because there is an 11-day gap between the first and second observation: round(11/7) is 2, so nw comes out as 3, while the expected week number is 2 ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
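(A possible tweak, keeping the same approach: use floor() instead of round(), so a date only moves to the next week once a full 7 days have passed. Just a sketch; for this example it reproduces week_n for every row.)
(eg_data
  %>% group_by(cycle)
  %>% mutate(nw = 1 + as.numeric(floor((dates - min(dates))/secs_per_week)))
)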
If someone has a better answer please post it, but this works.
Take the data frame in the example, eg_data:
eg_data %>%
  group_by(cycle) %>%
  mutate(
    cycle_start = as.Date(min(dates)),
    days_diff = as.Date(dates) - cycle_start,
    week_n = days_diff / 7,
    week_n_whole = ceiling(days_diff / 7)
  ) -> eg_data_check
(First time I've answered my own question)
library("lubridate")
eg_data %>%
  as_tibble() %>%
  group_by(cycle) %>%
  mutate(new_week = week(dates) - 31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr)  # needed for str_sub()

eg_data %>%
  mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
         week_n2 = ceiling((dates - aug1) / ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)

cycle_starts <- data.frame(
  cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
  start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)

eg_data %>%
  left_join(cycle_starts) %>%
  mutate(week_n2 = ceiling((dates - start_date) / ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)

eg_data %>%
  group_by(cycle) %>%
  mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4
I have a dataset of several thousand ICU patients covering several years. Some patients (each with a unique identifier, ID) have had multiple ICU admissions. Each row covers a single ICU admission, and therefore an individual patient may have multiple rows of data.
For each patient, I want to determine whether their ICU admission was:
A readmission during the same hospital stay. This could be identified by an icu_adm time occurring prior to their previous hosp_dis time, or by multiple rows with the same hosp_dis time.
A transfer to a different ICU for management of the same illness. I am defining this as an icu_adm time occurring within 24 hours of their previous hosp_dis time. These patients' icu_dis and hosp_dis times should be the same, as their hospital discharge occurred from ICU.
A new admission of the same patient
I am able to use lubridate to compare times without difficulty, but I am stuck on how to do the between-row comparisons, especially for patients with multiple ICU admissions (who have new admissions, readmissions, and transfers all in the time period of interest).
Some example data:
ID site icu_adm icu_dis hosp_adm hosp_dis
1 A 2016-02-02 15:38:00 2016-02-06 14:25:00 2016-02-02 15:17:00 2016-02-06 14:25:00
1 B 2016-02-06 16:17:00 2016-02-16 14:16:00 2016-02-06 16:16:00 2016-03-16 17:50:00
2 C 2009-08-09 14:27:00 2009-08-10 15:06:00 2009-08-03 02:51:00 2009-09-02 00:00:00
2 C 2009-08-18 20:32:00 2009-08-27 15:10:00 2009-08-03 02:51:00 2009-09-02 00:00:00
3 A 2010-02-20 21:00:00 2010-03-03 13:00:00 2010-02-18 03:00:00 2010-03-18 15:21:00
3 B 2010-05-05 17:00:00 2010-05-08 09:13:00 2010-05-03 11:21:00 2010-05-20 17:18:00
Desired output would be:
ID … readmission transferred new_adm
1 0 0 1
1 0 1 0
2 0 0 1
2 1 0 0
3 0 0 1
3 0 0 1
I'm not entirely sure this will work with all of your data, but thought this might be helpful.
Using the tidyverse (the dplyr package in this case), you can start by grouping by ID to look at transfers. Based on your definition, if the icu_adm time is less than 24 hours after the previous row's discharge time (hosp_dis), then it is considered an ICU transfer. You can use lag to compare with the previous row, assuming dates/times are in chronological order (if not, you can use arrange to order them).
Next, you can group by ID, hosp_adm, and hosp_dis to look at readmissions. After grouping, every row after the first row (for the same hospital admission) is considered an ICU readmission.
Then, everything left that is not a transfer or readmission can be considered a new ICU admission.
Let me know if this is what you had in mind.
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(transfer = ifelse(abs(difftime(icu_adm, lag(hosp_dis, default = first(hosp_dis)),
                                        units = "hours")) < 24, 1, 0)) %>%
  group_by(ID, hosp_adm, hosp_dis) %>%
  mutate(readmission = ifelse(row_number() > 1, 1, 0),
         new_adm = ifelse(transfer != 1 & readmission != 1, 1, 0))
Output
ID site icu_adm icu_dis hosp_adm hosp_dis transfer readmission new_adm
<int> <chr> <dttm> <dttm> <dttm> <dttm> <dbl> <dbl> <dbl>
1 1 A 2016-02-02 15:38:00 2016-02-06 14:25:00 2016-02-02 15:17:00 2016-02-06 14:25:00 0 0 1
2 1 B 2016-02-06 16:17:00 2016-02-16 14:16:00 2016-02-06 16:16:00 2016-03-16 17:50:00 1 0 0
3 2 C 2009-08-09 14:27:00 2009-08-10 15:06:00 2009-08-03 02:51:00 2009-09-02 00:00:00 0 0 1
4 2 C 2009-08-18 20:32:00 2009-08-27 15:10:00 2009-08-03 02:51:00 2009-09-02 00:00:00 0 1 0
5 3 A 2010-02-20 21:00:00 2010-03-03 13:00:00 2010-02-18 03:00:00 2010-03-18 15:21:00 0 0 1
6 3 B 2010-05-05 17:00:00 2010-05-08 09:13:00 2010-05-03 11:21:00 2010-05-20 17:18:00 0 0 1
Data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), site = c("A",
"B", "C", "C", "A", "B"), icu_adm = structure(c(1454427480, 1454775420,
1249828020, 1250627520, 1266699600, 1273078800), tzone = "", class = c("POSIXct",
"POSIXt")), icu_dis = structure(c(1454768700, 1455632160, 1249916760,
1251385800, 1267621200, 1273309980), tzone = "", class = c("POSIXct",
"POSIXt")), hosp_adm = structure(c(1454426220, 1454775360, 1249267860,
1249267860, 1266462000, 1272885660), tzone = "", class = c("POSIXct",
"POSIXt")), hosp_dis = structure(c(1454768700, 1458150600, 1251849600,
1251849600, 1268925660, 1274375880), tzone = "", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -6L))
I have a dataset that looks like this:
id land datetime
pb1 0 2004-04-05 01:44:00
pb1 1 2004-04-05 02:00:00
pb1 1 2004-04-05 16:00:00
pb2 1 2004-04-05 18:01:00
pb2 1 2004-04-05 20:00:00
library(data.table)
DT = data.table(
  id = c("pb1", "pb1", "pb1", "pb2", "pb2"),
  land = c(0L, 1L, 1L, 1L, 1L),
  # stored as POSIXct so the date arithmetic in the answers below works
  datetime = as.POSIXct(sprintf("2004-04-05 %02d:%02d:00",
                                c(1, 2, 16, 18, 20),
                                c(44, 0, 0, 1, 0)))
)
I would like to make a column that cumulatively adds time (in days) but ONLY if there is a '1' in the land column. I also would like the count to reset when the id changes.
I have tried a variety of methods using data.table, rleid, and even a nested for loop with no success. I have gotten errors using code like this:
DT[, total :=land*diff(as.numeric(datetime)), .(id, rleid(land))]
I have tried variations of the solution here: Calculating cumulative time in R
I'm not sure the best way to calculate the time interval (no success with difftime or lubridate).
I want the end result to look like this:
id land datetime cumtime.land
pb1 0 2004-04-05 01:44:00 0
pb1 1 2004-04-05 02:00:00 0
pb1 1 2004-04-06 16:00:00 1.58333
pb2 1 2004-04-05 18:00:00 0
pb2 1 2004-04-05 20:00:00 0.08333
I could not replicate @Japp's comment, but you can easily do this with dplyr.
Depending on what your exact expected output is, you could stop before the summarize call:
library(dplyr)

df <- read.table(text = "id land datetime
pb1 0 '2004-04-05 01:44:00'
pb1 1 '2004-04-05 02:00:00'
pb1 1 '2004-04-06 16:00:00'
pb1 1 '2004-04-07 16:00:00'
pb2 1 '2004-04-05 18:00:00'
pb2 1 '2004-04-05 20:00:00'", header = TRUE) %>%
  mutate(datetime = as.POSIXct(datetime, format = '%Y-%m-%d %H:%M:%S'))

x <- df %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days"))) %>%
  mutate(cumtime.land = time.land + ifelse(is.na(lag(time.land)), 0, lag(time.land)))
id land datetime time.land cumtime.land
<fct> <int> <dttm> <dbl> <dbl>
1 pb1 0 2004-04-05 01:44:00 0 0
2 pb1 1 2004-04-05 02:00:00 0 0
3 pb1 1 2004-04-06 16:00:00 1.58 1.58
4 pb1 1 2004-04-07 16:00:00 1 2.58
5 pb2 1 2004-04-05 18:00:00 0 0
6 pb2 1 2004-04-05 20:00:00 0.0833 0.0833
The key is to use the dplyr::lag() function which takes the "line just above" in the table (which implies that you have to arrange() it beforehand).
By wrapping this inside the ifelse, I'm checking that land and previous land were not 0 (and that we are not in the first line of the id, or lag(anything) will be missing).
I then just reuse the lag() function to get the cumtime.land variable.
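As an aside (not part of the original answer): the lag()-based cumtime.land only adds in one previous step, which is enough for this data, but if an id can have longer runs of land == 1 you probably want a true running sum. A sketch under the same assumptions, replacing the last mutate with cumsum():
x <- df %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")),
         cumtime.land = cumsum(time.land))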
I believe you're after:
DT[land == 1,
   cumtime.land := cumsum(c(0, diff(as.numeric(datetime)))) / 86400,
   by = id]
as.numeric(datetime) converts it to seconds, so we divide by 86400 to convert to days.
Somewhat more "official" in the sense of leveraging time/date classes directly is to use difftime and shift:
DT[land == 1, by = id,
cumtime.land :=
cumsum(as.double(difftime(
datetime, shift(datetime, fill = datetime[1L]), units = 'days'
)))]
I switched the order of the by argument simply to help with formatting.
We use datetime[1L] to fill so that the initial difference is 0; we need as.double because cumsum errors as it's not confident how to deal with difftime objects as input.
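A small illustration of that point (a sketch):
d <- as.difftime(c(2, 5), units = "days")
# cumsum(d)            # errors: cumsum is not defined for "difftime" objects
cumsum(as.double(d))   # 2 7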
See also:
Calculate cumsum() while ignoring NA values
https://stackoverflow.com/a/40227629/3576984
So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement it with the apply() family and have not yet had any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in comment.
Data with new mid-minute reading:
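(The data frame itself is not shown here; below is a hypothetical reconstruction of df, assuming the readings from the question plus the extra one at 1:06:30.)
df <- data.frame(
  ts = c("1:01:00", "1:02:00", "1:03:00", "1:04:00", "1:06:00", "1:06:30",
         "1:07:00", "1:10:00", "1:11:00", "1:12:00"),
  data = 123
)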
library(dplyr)
library(tidyr)   # for replace_na()

df %>%
  # Take the text above and convert to datetime
  mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
  # Count how many observations per minute
  group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
  summarize(obs_per_min = sum(!is.na(data))) %>%
  # Add rows for any missing minutes, count as zero observations
  padr::pad(interval = "1 min") %>%
  replace_na(list(obs_per_min = 0)) %>%
  # Count cumulative observations, and calc how many in window that
  # begins 5 minutes ago and ends at end of current minute
  mutate(cuml_count = cumsum(obs_per_min),
         prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
         in_window = cuml_count - lag(prior_cuml, 5)) %>%
  # Exclude unneeded columns and rows
  select(-cuml_count, -prior_cuml) %>%
  filter(obs_per_min > 0)
Output (now reflects add'l reading at 1:06:30)
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
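For completeness, the rolling rule itself (no per-minute binning) can also be sketched with the apply() family, as the question asks. dat below is a hypothetical reconstruction of the question's table; ts is parsed to POSIXct so the arithmetic is in seconds.
ts <- as.POSIXct(paste("2018-09-26", c("01:01", "01:02", "01:03", "01:04",
                                       "01:06", "01:07", "01:10", "01:11", "01:12")))
dat <- data.frame(ts = ts, data = 123)
dat$count <- sapply(seq_len(nrow(dat)), function(i) {
  t <- dat$ts[i]
  if (difftime(t, min(dat$ts), units = "mins") < 5) NA_integer_   # too close to the first reading
  else sum(dat$ts >= t - 5 * 60 & dat$ts <= t)                    # rows in the trailing 5 minutes
})
dat$count
# NA NA NA NA  5  5  3  4  4   (matches the count column in the question)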
I'm getting started with R, so please bear with me.
For example, I have this data.table (or data.frame) object:
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column by hour and sum count_starts and count_ends for each hourly time and station:
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE: I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the time series directly.
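For reference, that conversion might look like this (a sketch; it assumes the Time strings look like the sample above):
transactions$Time <- as.POSIXct(transactions$Time, format = "%m/%d/%Y %H:%M")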
Using base R, we can still do the above; the only difference is that the hour will be one less for all of them, since the times are rounded down to the start of the hour rather than up:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time = cut(strptime(dat$Time, "%m/%d/%Y %H:%M"), "hour")
aggregate(. ~ Time + Station, dat, sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date-time column to an hourly time step, and then group and summarise the data.
library(dplyr)
library(lubridate)

dt2 <- dt %>%
  mutate(Time = mdy_hm(Time)) %>%
  mutate(Time = ceiling_date(Time, unit = "hour")) %>%
  group_by(Time, Station) %>%
  summarise(`sum(count_starts)` = sum(count_starts),
            `sum(count_ends)` = sum(count_ends)) %>%
  ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm converts the string to a date-time class; it means "month-day-year hour-minute", which matches the structure of the string. ceiling_date rounds a date-time object up to the unit specified. group_by groups the data by the given variables, and summarise computes the summary for each group.
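A quick illustration of the two key calls (output shown as comments; mdy_hm parses to UTC by default):
library(lubridate)
mdy_hm("01/01/2015 00:40")                        # "2015-01-01 00:40:00 UTC"
ceiling_date(mdy_hm("01/01/2015 00:40"), "hour")  # "2015-01-01 01:00:00 UTC"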
There are basically two things required:
1) Round the Time down to the start of its 1-hour window:
library(data.table)
library(lubridate)

data <- data.table(
  Time = c('01/01/2015 00:30', '01/01/2015 00:40', '01/01/2015 00:55',
           '01/01/2015 01:17', '01/01/2015 01:37'),
  Station = c('A', 'A', 'B', 'A', 'A'),
  count_starts = c(2, 2, 1, 3, 1),
  count_ends = c(3, 1, 1, 1, 1)
)

data[, Time_conv := as.POSIXct(strptime(Time, '%d/%m/%Y %H:%M'))]
data[, Time_round := floor_date(Time_conv, unit = "1 hour")]
2) Aggregate the data table obtained above, by the rounded time and station, to get the desired result:
New_data = data[, list(count_starts_sum = sum(count_starts),
                       count_ends_sum = sum(count_ends)),
                by = c('Time_round', 'Station')]
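Note that floor_date in step 1 rounds each time down, so the hour labels come out one hour earlier than in the desired output; if you want them rounded up instead, swapping in ceiling_date should work (a sketch, same objects as above):
data[, Time_round := ceiling_date(Time_conv, unit = "1 hour")]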