Select value from time range dataframe in R

I have a dataframe of datetimes
tdata_df <- data.frame(timestamp=seq(c(ISOdate(2018,4,20)), by = (60*229), length.out = 6))
tdata_df
timestamp
1 2018-04-20 21:00:00
2 2018-04-21 00:49:00
3 2018-04-21 04:38:00
4 2018-04-21 08:27:00
5 2018-04-21 12:16:00
6 2018-04-21 16:05:00
Then I would like to look up a value for each timestamp from this time range table:
time_range_df <- data.frame(start=c("08:30","11:35","15:10","05:00"),
end=c("11:29","15:09","02:29","08:29"),value=c(1,2,3,4))
time_range_df
start end value
1 08:30 11:29 1
2 11:35 15:09 2
3 15:10 02:29 3
4 05:00 08:29 4
The desired result looks like this:
timestamp value
1 2018-04-20 21:00:00 3
2 2018-04-21 00:49:00 3
3 2018-04-21 04:38:00 NA
4 2018-04-21 08:27:00 4
5 2018-04-21 12:16:00 2
6 2018-04-21 16:05:00 3
Any help would be greatly appreciated.

The sqldf package makes this kind of range join straightforward. The approach is:
Convert the start and end times in time_range_df to seconds since midnight.
Add a column to tdata_df holding each timestamp's seconds since midnight.
Join the two data frames where that value falls inside a range, treating a range that crosses midnight (start > end) as two pieces: start to 86400 and 0 to end.
library(lubridate)
time_range_df$start <- as.numeric(seconds(hm(time_range_df$start)))
time_range_df$end <- as.numeric(seconds(hm(time_range_df$end)))
tdata_df$timeSinceMidNigh <- as.numeric(seconds(hms(format(ymd_hms(tdata_df$timestamp),
format = "%H:%M:%S"))))
library(sqldf)
sqlquery <- "SELECT D1.timestamp, Q.value FROM tdata_df D1
LEFT JOIN (SELECT * FROM tdata_df D, time_range_df R
WHERE (R.start < R.end AND D.timeSinceMidNigh between R.start AND R.end) OR
(R.start > R.end AND D.timeSinceMidNigh between R.start AND 86400) OR
(R.start > R.end AND D.timeSinceMidNigh between 0 and R.end)) Q
ON D1.timestamp = Q.timestamp"
sqldf(sqlquery)
# timestamp value
# 1 2018-04-20 13:00:00 2
# 2 2018-04-20 16:49:00 3
# 3 2018-04-20 20:38:00 3
# 4 2018-04-21 00:27:00 3
# 5 2018-04-21 04:16:00 NA
# 6 2018-04-21 08:05:00 4
Data:
tdata_df <- data.frame(timestamp=seq(c(ISOdate(2018,4,20)), by = (60*229), length.out = 6))
time_range_df <- data.frame(start=c("08:30","11:35","15:10","05:00"),
end=c("11:29","15:09","02:29","08:29"),value=c(1,2,3,4))
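The same wrap-around logic can also be sketched in base R, without sqldf. This is only an illustration under stated assumptions: the helper names to_minutes and lookup_value are hypothetical, the times of day are passed as "HH:MM" strings rather than extracted from the timestamps, and when ranges overlap the first match wins.

```r
# Convert "HH:MM" strings to minutes since midnight (hypothetical helper).
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

time_range_df <- data.frame(
  start = c("08:30", "11:35", "15:10", "05:00"),
  end   = c("11:29", "15:09", "02:29", "08:29"),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Return the value of the first range containing each time, or NA if none does.
lookup_value <- function(hhmm, ranges) {
  t <- to_minutes(hhmm)
  s <- to_minutes(ranges$start)
  e <- to_minutes(ranges$end)
  vapply(t, function(ti) {
    # A range with start > end crosses midnight, so it matches on either side.
    hit <- ifelse(s <= e, ti >= s & ti <= e, ti >= s | ti <= e)
    if (any(hit)) ranges$value[which(hit)[1]] else NA_real_
  }, numeric(1))
}

# Times of day taken from the question's timestamps.
times <- c("21:00", "00:49", "04:38", "08:27", "12:16", "16:05")
result <- lookup_value(times, time_range_df)
result
```

With real POSIXct timestamps, format(timestamp, "%H:%M") would supply the strings, and the result then depends on the time zone used for formatting.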

Related

Create variable for day of the experiment

I have a large data set that spanned a month in time with the data stamped in a column called txn_date like the below. (this is a toy reproduction of it)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
Ideally I would like a column in my data frame for each date in the data. I think that could be done by first creating a single column that is 1 for the first date that appears, 2 for the next calendar day, and so on.
So something like this
dat1 <- read.table(text = "var1 txn_date day
5 2020-10-25 1
1 2020-10-25 1
3 2020-10-26 2
4 2020-10-27 3
1 2020-10-27 3
3 2020-10-31 7
3 2020-11-01 8
8 2020-11-02 9 ", header = TRUE)
I'm not quite sure how to get this. The txn_date column is as.Date in my actual data frame. I think if I could get the single day column like is listed above (then convert it to a factor) then I could always one hot encode the actual levels of that column if I need to. Ultimately I need to use the day of the experiment as a regressor in a regression I'm going to run.
Something along the lines of y ~ x + day_1 + day_2 +...+ error
Would this be suitable?
library(tidyverse)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
dat1$txn_date <- as.Date(dat1$txn_date)
dat1 %>%
mutate(days = txn_date - txn_date[1] + 1)
# var1 txn_date days
#1 5 2020-10-25 1 days
#2 1 2020-10-25 1 days
#3 3 2020-10-26 2 days
#4 4 2020-10-27 3 days
#5 1 2020-10-27 3 days
#6 3 2020-10-31 7 days
#7 3 2020-11-01 8 days
#8 8 2020-11-02 9 days
We can create a sequence of dates spanning the min and max of 'txn_date' and match each date against it:
dates <- seq(min(as.Date(dat1$txn_date)),
max(as.Date(dat1$txn_date)), by = '1 day')
dat1$day <- with(dat1, match(as.Date(txn_date), dates))
dat1$day
#[1] 1 1 2 3 3 7 8 9
Alternatively, use the factor route:
with(dat1, as.integer(factor(txn_date, levels = as.character(dates))))
#[1] 1 1 2 3 3 7 8 9
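Since the question ultimately wants the day of the experiment as a set of regressors, one possible follow-up is to expand the day index into indicator columns. A minimal sketch, assuming base R's model.matrix() is acceptable for the one-hot encoding; the name dummies is illustrative and not from the original posts.

```r
dat1 <- data.frame(
  var1 = c(5, 1, 3, 4, 1, 3, 3, 8),
  txn_date = as.Date(c("2020-10-25", "2020-10-25", "2020-10-26", "2020-10-27",
                       "2020-10-27", "2020-10-31", "2020-11-01", "2020-11-02"))
)

# Day of the experiment: calendar days elapsed since the earliest date, plus one.
dat1$day <- as.integer(dat1$txn_date - min(dat1$txn_date)) + 1L

# One indicator column per observed day; "- 1" drops the intercept so that
# every observed day gets its own column.
dummies <- model.matrix(~ factor(day) - 1, data = dat1)
dim(dummies)
```

In a regression it is usually enough to put factor(day) in the formula, for example lm(y ~ x + factor(day)), and let R build the indicators itself, dropping one level as the baseline.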

Complex conditional groupby in R

Here is the problem I am trying to solve.
I want to take table 1 to table 2.
Table 1 :
df
# icustay_id starttime endtime vaso_rate vaso_amount
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 3 293.0896
# 2 1 2019-09-11 13:30:00 2019-09-12 01:20:00 9 602.9983
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939
Table 2 :
df
# icustay_id starttime endtime vaso_rate vaso_amount
# 1 1 2019-09-10 13:20:00 2019-09-12 01:20:00 3 293.0896
# 2 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360
# 3 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494
# 4 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939
I am trying to build a function that will:
For every unique patient (unique icustay_id), merge consecutive rows ONLY if the medication was stopped for less than an hour between them.
When rows merge:
Some columns will retain the same value (i.e. the patient identifiers)
Some columns must be modified:
Keep the earlier starttime
Keep the later endtime
Average the vaso_rate
Sum the vaso_amount
To do so, I decided to add an identifier column that takes the value 1 when the condition is met; once all rows have been checked, I group by icustay_id and that new column.
My code as written, however, does not assign the appropriate ID with respect to the condition.
Here is the sample df creation code :
set.seed(1)
df <- data.frame(
icustay_id = c(1, 1, 1, 2, 3),
starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-11 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
vaso_rate = sample(1:10, 5, replace = TRUE),
vaso_amount = runif(5, 0, 1000)
)
Here is the function code that I have right now :
merge_pressor_doses <- function(df){
df %>% arrange(icustay_id,starttime)
for (i in unique(df$icustay_id))
{
for (j in which(df$icustay_id==i))
{
start <- df$starttime[as.numeric(j)+1]
end <- df$endtime[as.numeric(j)]
stopduration <- as.numeric(difftime(start, end, units = 'mins'))
bool <- stopduration < 60
df <- df%>%mutate(
group = case_when(
bool = TRUE ~ 1,
bool = FALSE ~ 0)
)
}
}
return(df)
}
This should result in :
df
# icustay_id starttime endtime vaso_rate vaso_amount group
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 3 293.0896 1
# 2 1 2019-09-11 13:30:00 2019-09-12 01:20:00 9 602.9983 1
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360 0
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494 1
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939 1
But in my case the 3rd row is assigned a value of 1...
If I can manage to make this portion of the code work, I could proceed with this portion of the code to achieve my objective.
The eventual second portion of the code would be :
group_by(group, icustay_id) %>%
summarise(
starttime = min(starttime),
endtime = max(endtime),
vaso_rate = mean(vaso_rate),
sum_vaso_amount = sum(vaso_amount))
Thank you in advance!!
I'd create a new column, pause, that records how much time passed since the previous medication ended for the same patient. This column then defines the group ids: cumsum(pause >= 1) starts at 0 and increments each time the pause is 1 hour or longer, so each such gap starts a new group.
set.seed(1)
df <- data.frame(
icustay_id = c(1, 1, 1, 2, 3),
starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-11 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
vaso_rate = sample(1:10, 5, replace = TRUE),
vaso_amount = runif(5, 0, 1000)
)
library(dplyr)
library(tidyr)
df <-
df %>%
group_by(icustay_id) %>%
mutate(pause = difftime(starttime, lag(endtime), units = "hours")) %>%
replace_na(list(pause = 0)) %>%
mutate(vaso_id = cumsum(pause >= 1))
# A tibble: 5 x 7
# Groups: icustay_id [3]
# icustay_id starttime endtime vaso_rate vaso_amount pause vaso_id
# <dbl> <dttm> <dttm> <int> <dbl> <drtn> <int>
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 9 898. 0.0000000 hours 0
# 2 1 2019-09-11 13:30:00 2019-09-11 01:20:00 4 945. 0.1666667 hours 0
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 7 661. 87.3333333 hours 1
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 1 629. 0.0000000 hours 0
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 2 61.8 0.0000000 hours 0
Then we can use the code you provided.
df %>%
group_by(icustay_id, vaso_id) %>%
summarise(
starttime = min(starttime),
endtime = max(endtime),
vaso_rate = mean(vaso_rate),
sum_vaso_amount = sum(vaso_amount)
)
# A tibble: 4 x 6
# Groups: icustay_id [3]
# icustay_id vaso_id starttime endtime vaso_rate sum_vaso_amount
# <dbl> <int> <dttm> <dttm> <dbl> <dbl>
# 1 1 0 2019-09-10 13:20:00 2019-09-11 13:20:00 6.5 1843.
# 2 1 1 2019-09-14 16:40:00 2019-09-15 16:40:00 7 661.
# 3 2 0 2019-09-10 12:40:00 2019-09-13 13:20:00 1 629.
# 4 3 0 2019-09-10 01:20:00 2019-09-11 13:20:00 2 61.8
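For comparison, the same pause-then-cumsum idea can be sketched in base R on fixed data. Two assumptions are made here so that the result is reproducible: the rate and amount values are typed in from Table 1 instead of being randomly generated, and the second end time is taken from Table 1 (2019-09-12 01:20) rather than from the sample-creation code. The time zone is set to UTC only for the example.

```r
df <- data.frame(
  icustay_id = c(1, 1, 1, 2, 3),
  starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40",
                           "2019-09-10 12:40", "2019-09-10 01:20"), tz = "UTC"),
  endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-12 01:20", "2019-09-15 16:40",
                         "2019-09-13 13:20", "2019-09-11 13:20"), tz = "UTC"),
  vaso_rate = c(3, 9, 4, 2, 9),
  vaso_amount = c(293.0896, 602.9983, 208.9360, 864.1494, 405.2939)
)
df <- df[order(df$icustay_id, df$starttime), ]

# End time of the previous row within each patient (NA for a patient's first row).
prev_end <- ave(as.numeric(df$endtime), df$icustay_id,
                FUN = function(x) c(NA, head(x, -1)))

# Pause in hours since the previous dose ended; a first row counts as no pause.
pause <- (as.numeric(df$starttime) - prev_end) / 3600
pause[is.na(pause)] <- 0

# A pause of one hour or more starts a new group within the patient.
df$vaso_id <- ave(as.numeric(pause >= 1), df$icustay_id, FUN = cumsum)

# Collapse each (patient, group) pair into one row.
groups <- split(df, list(df$icustay_id, df$vaso_id), drop = TRUE)
merged <- do.call(rbind, lapply(groups, function(g) {
  data.frame(icustay_id = g$icustay_id[1], vaso_id = g$vaso_id[1],
             starttime = min(g$starttime), endtime = max(g$endtime),
             vaso_rate = mean(g$vaso_rate), sum_vaso_amount = sum(g$vaso_amount))
}))
merged
```

Grouping by both icustay_id and vaso_id matters because vaso_id restarts at 0 for every patient.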

How to go about adding row values in sequence a certain number of times and filling in every row in R?

I have been attempting to use R to clean some data. My data set looks like this:
DateTime Day ...
2018-10-01 10:00:00 0
2018-10-01 10:00:05 0
2018-10-01 10:00:10 0
2018-10-01 10:00:15 0
2018-10-01 10:00:20 0
2018-10-01 10:00:25 0
2018-10-01 10:00:30 0
2018-10-01 10:00:35 0
It's in 5 second bins so 24 hours = 17280 bins. I'm trying to add a Day column that basically just gives a value starting at the beginning as 1 and counts through 24 hours and gives '1,1,1,1,1...1' for Day 1 and '2,2,2,2,2...2' starting at 10:00:00 on Day 2 for each day and goes all the way through each data set.
So my desired output would look something like this:
DateTime Day
2018-10-01 10:00:00 1
2018-10-01 10:00:05 1
2018-10-01 10:00:10 1
2018-10-01 10:00:15 1
... ... ...
2018-10-02 9:59:50 1
2018-10-02 9:59:55 1
2018-10-02 10:00:00 2
2018-10-02 10:00:05 2
... ... ...
2018-10-03 9:59:50 2
2018-10-03 9:59:55 2
2018-10-03 10:00:00 3
2018-10-03 10:00:05 3
To do this, I came up with a code to create a vector of numbers using rep():
days<- round(nrow(df)/17280)
sdays <- rep(1:days, each = 17280, times = 1)
df$Day <- sdays
This works only when the data cover an exact multiple of 24 hours, so that no rounding is needed. However, not every data set I'm going through has a perfect 24 h range, and I'd prefer not to drop data since each bin is necessary for what I'm looking at. I have been trying to find a more appropriate approach, with little success (although I'm sure it's out there somewhere, since it's pretty basic).
You can cut() date by 24 hours:
Data
dat <- data.frame(
Date = seq(ISOdatetime(2018, 10, 01, 10, 0, 0, "GMT"),
ISOdatetime(2018, 10, 03, 10, 0, 5, "GMT"),
by = 5
))
Cut Dates by 24 hours
dat$Day <- cut(dat$Date, "24 hours", labels = FALSE)
Output
head(dat, 4)
dat[(nrow(dat) %/% 2 - 2):(nrow(dat) %/% 2 + 1), ]
tail(dat, 4)
Date Day
1 2018-10-01 10:00:00 1
2 2018-10-01 10:00:05 1
3 2018-10-01 10:00:10 1
4 2018-10-01 10:00:15 1
.......................
17279 2018-10-02 09:59:50 1
17280 2018-10-02 09:59:55 1
17281 2018-10-02 10:00:00 2
17282 2018-10-02 10:00:05 2
.......................
34559 2018-10-03 09:59:50 2
34560 2018-10-03 09:59:55 2
34561 2018-10-03 10:00:00 3
34562 2018-10-03 10:00:05 3
Just use a simple difftime calculation and round up to the nearest integer.
data
dat <- data.frame(
Date = seq(as.POSIXct("2018-10-01 10:00:00", format = "%Y-%m-%d %H:%M:%S"),
as.POSIXct("2018-10-03 10:00:05", format = "%Y-%m-%d %H:%M:%S"),
by = 5
))
code
dat$day <- as.numeric( ceiling( difftime( dat$Date, dat$Date[1] - 5, units = "days") ) )
output
# Date day
# 1 2018-10-01 10:00:00 1
# 2 2018-10-01 10:00:05 1
# 3 2018-10-01 10:00:10 1
# 4 2018-10-01 10:00:15 1
# 5 2018-10-01 10:00:20 1
# 6 2018-10-01 10:00:25 1
# ....
# 17278 2018-10-02 09:59:45 1
# 17279 2018-10-02 09:59:50 1
# 17280 2018-10-02 09:59:55 1
# 17281 2018-10-02 10:00:00 2
# 17282 2018-10-02 10:00:05 2
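A variant of the same calculation avoids the 5-second offset by using integer division on the elapsed seconds. This is a minimal sketch, assuming the rows are sorted and that day 1 starts at the first timestamp; the name elapsed is illustrative, and UTC is set only to keep the example free of daylight-saving shifts.

```r
dat <- data.frame(
  Date = seq(as.POSIXct("2018-10-01 10:00:00", tz = "UTC"),
             as.POSIXct("2018-10-03 10:00:05", tz = "UTC"),
             by = 5)
)

# Seconds elapsed since the first observation.
elapsed <- as.numeric(difftime(dat$Date, dat$Date[1], units = "secs"))

# Each full block of 86400 seconds (24 hours) starts a new day, counting from 1.
dat$Day <- elapsed %/% 86400 + 1

table(dat$Day)
```

Because the day is derived from the timestamps rather than from the row count, a data set with missing bins or a partial last day still gets the right labels.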

R: Create a New Column in R to determine Semester Based on Two Dates

I have some data. ID and date and I'm trying to create a new field for semester.
df:
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501
I also have a semester table:
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20180804 Summer-2019
I'd like to create a new field in df: if df$date is between semester$start and semester$end, place the respective semester$season_year value in df.
I tried to see if the lubridate package could help, but that seems to be more for calculations.
I saw this question and it seems to be the closest to what I want but, to make things more complicated, not all of our semesters are six months long.
Does this work?
library(lubridate)
semester$start <- ymd(semester$start)
semester$end <- ymd(semester$end)
df$date <- ymd(df$date)
LU <- Map(`:`, semester$start, semester$end)
LU <- data.frame(value = unlist(LU),
index = rep(seq_along(LU), lapply(LU, length)))
df$semester <- semester$season_year[LU$index[match(df$date, LU$value)]]
A solution using a non-equi update join from the data.table package:
library(data.table)
setDT(df)
setDT(semester)
df[,date:=as.IDate(as.character(date), format = "%Y%m%d")]
semester[,':='(start = as.IDate(as.character(start), format = "%Y%m%d"),
end=as.IDate(as.character(end), format = "%Y%m%d"))]
df[semester, on=.(date >= start, date <= end), season_year := i.season_year]
df
# id date season_year
# 1: 1 2016-08-22 Fall-2016
# 2: 2 2017-01-09 Winter-2017
# 3: 3 2017-08-28 Fall-2017
# 4: 4 2017-09-25 Fall-2017
# 5: 5 2018-01-08 Winter-2018
# 6: 6 2018-04-02 Spring-2018
# 7: 7 2016-07-11 Summer-2016
# 8: 8 2015-08-31 Fall-2015
# 9: 9 2016-01-11 Winter-2016
# 10: 10 2016-05-02 Spring-2016
# 11: 11 2016-08-29 Fall-2016
# 12: 12 2017-01-09 Winter-2017
# 13: 13 2017-05-01 Spring-2017
Data:
df <- read.table(text="
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501",
header = TRUE, stringsAsFactors = FALSE)
semester <- read.table(text="
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20180804 Summer-2019",
header = TRUE, stringsAsFactors = FALSE)
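If the semester table is sorted by start and its ranges do not overlap, base R's findInterval() offers another way to do the lookup. The sketch below is an illustration under those assumptions, using a shortened copy of the semester table and four of the dates; it compares the yyyymmdd integers directly, which works because their numeric order is the same as their date order.

```r
semester <- data.frame(
  start = c(20160128, 20160527, 20160802, 20161225, 20170126),
  end   = c(20160526, 20160801, 20161224, 20170125, 20170525),
  season_year = c("Spring-2016", "Summer-2016", "Fall-2016",
                  "Winter-2017", "Spring-2017"),
  stringsAsFactors = FALSE
)
df <- data.frame(id = 1:4, date = c(20160822, 20170109, 20160711, 20170501))

# Index of the last semester whose start is <= date (0 if before the first start).
idx <- findInterval(df$date, semester$start)
idx[idx == 0] <- NA

# Keep the match only when the date also falls on or before that semester's end.
inside <- !is.na(idx) & df$date <= semester$end[idx]
df$season_year <- ifelse(inside, semester$season_year[idx], NA_character_)
df
```

The end-date check means a date that falls in a gap between semesters, or after the last one, comes back as NA instead of being assigned to the preceding semester.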

Recode Date (time) variable into a new discrete variable

I have a time variable: "00:00:29","00:06:39","20:43:15"....
and I want to recode it into a new vector of work shifts based on the time:
07:00:00 - 13:00:00 - 1
13:00:00 - 20:00:00 - 2
23:00:00 - 7:00:00 - 3
thanks for any idea :)
Assuming the time variables are strings as shown, this seems to work:
secNr <- function(x){ sum(as.numeric(unlist(strsplit(x,":",fixed=TRUE))) * c(3600,60,1)) }
workShift <- function(x)
{
n <- which.max(secNr(x) >= c(secNr("23:00:00"),secNr("20:00:00"),secNr("13:00:00"),secNr("07:00:00"),secNr("00:00:00")))
c(3,NA,2,1,3)[n]
}
"workShift" computes the work shift of one such time string. If you have a vector of time strings, use "sapply". Example:
> Time <- sprintf("%i:%02i:00", 0:23, sample(0:59,24))
> Shift <- sapply(Time,"workShift")
> Shift
0:37:00 1:17:00 2:35:00 3:09:00 4:08:00 5:28:00 6:03:00 7:43:00 8:27:00 9:38:00 10:48:00 11:50:00 12:58:00 13:32:00 14:05:00 15:39:00 16:56:00
3 3 3 3 3 3 3 1 1 1 1 1 1 2 2 2 2
17:00:00 18:22:00 19:02:00 20:42:00 21:11:00 22:15:00 23:01:00
2 2 2 NA NA NA 3
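The per-string function and sapply call can also be replaced by a vectorised lookup. The following is a sketch only, assuming the same shift boundaries as above (with 20:00 to 23:00 left unassigned, as in the question) and using base R's findInterval(); the names to_seconds and work_shift are hypothetical.

```r
# Convert "HH:MM:SS" strings to seconds since midnight (hypothetical helper).
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Vectorised shift lookup: each interval is closed on the left, open on the right.
work_shift <- function(x) {
  breaks <- to_seconds(c("00:00:00", "07:00:00", "13:00:00", "20:00:00", "23:00:00"))
  shift_of_interval <- c(3, 1, 2, NA, 3)  # 20:00 to 23:00 has no shift defined
  shift_of_interval[findInterval(to_seconds(x), breaks)]
}

shifts <- work_shift(c("00:00:29", "00:06:39", "20:43:15",
                       "07:00:00", "13:30:00", "23:59:59"))
shifts
```

As in the original function, a time exactly on a boundary, such as 07:00:00, belongs to the shift that starts at that time.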
