How to create a new variable based on time and preexisting variables? - r

I have a dataset with repeated measurements on multiple individuals over time. It looks something like this:
ID Time Event
1 Jan 1 2012, 4pm Abx
1 Jan 2 2012, 2pm Test
1 Jan 26 2012 3 pm Test
1 Jan 29 2012 10 pm Abx
1 Jan 30 2012, 3 pm Test
1 Jan 5 2012 3 pm Test
2 Jan 1 2012, 4pm Abx
2 Jan 2 2012, 2pm Test
2 Jan 26 2012 3 pm Test
The dataset is currently based around events. It will later be filtered down to just tests. What I need to do is make a new variable that is 1 when certain events (Abx, in this case) occur within a certain time range of tests. So if the event 'Abx' occurs within, let's say, 48 hours of a Test event, the new variable should equal 1. Otherwise, it should equal zero.
I'm hoping to produce something like this:
ID Time Event New_variable
1 Jan 1 2012, 4pm Abx 1
1 Jan 2 2012, 2pm Test 1
1 Jan 26 2012 3 pm Test 0
1 Jan 29 2012 10 pm Abx 1
1 Jan 30 2012, 3 pm Test 1
1 Jan 5 2012 3 pm Test 0
2 Jan 1 2012, 4pm Abx 1
2 Jan 2 2012, 2pm Test 1
2 Jan 26 2012 3 pm Test 0
I know that I could probably solve this with dplyr's mutate combined with ifelse statements, and if I just wanted a variable that reads "1" whenever the antibiotic event occurs, I could do that like this:
test %>%
  mutate(New_variable = ifelse(Event == 'Abx', 1, 0)) -> test2
But I don't know how to factor in time so that Test events get a 1 when they fall within 48 hours of an Abx event. I'm also not sure how to ensure the condition is applied only within the same ID. How can I do this?
Any help is appreciated!
Update: Thank you so much for the suggestions! I tried them out on the data and they worked. I also modified the suggested helper function to allow additional options (for more than one type of Abx):
abxRows <- type == "Abx" | type == "Abx2"

To the data provided, I added two "Abx" events that should not be flagged as 1 (one that is not within 48 hours of any test, and one that is not in the same ID group as the test that is within 48 hours).
library(dplyr)
library(lubridate)
library(purrr)

eventData <-
  data.frame(stringsAsFactors = FALSE,
             ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1),
             Time = c("Jan 1 2012 4 pm", "Jan 2 2012, 2pm",
                      "Jan 26 2012 3 pm", "Jan 29 2012 10 pm",
                      "Jan 30 2012 3 pm", "Jan 5 2012 3 pm",
                      "Jan 1 2012 4 pm", "Jan 2 2012, 2pm",
                      "Jan 26 2012 3 pm", "Feb 12 2012 1pm",
                      "Jan 16 2012 3 pm", "Jan 16 2012 1 pm"),
             Event = c("Abx", "Test", "Test", "Abx", "Test", "Test",
                       "Abx", "Test", "Test", "Abx", "Abx", "Test")
  ) %>%
  mutate(Time = mdy_h(Time),
         window = if_else(Event == "Test",
                          interval(Time - hours(48), Time + hours(48)),
                          interval(NA, NA))
  )
First, make sure the Time column is in a date-time format. Then create a column of the lubridate Interval class that defines a 48-hour window around each "Test" event.
Define the helper function that will check if the event occurred within the window.
chkFun <- function(eventTime, intervals, grp, type){
  abxRows <- type == "Abx"
  testRows <- !abxRows
  # flag Abx rows whose time falls inside any Test window of the same ID
  hits <- map2_lgl(eventTime, grp,
                   ~any(.x %within% intervals[grp %in% .y], na.rm = TRUE)) &
    abxRows
  # flag Test rows whose window contains an Abx event of the same ID
  testHits <- map_lgl(which(testRows),
                      ~any(eventTime[abxRows & (grp[.x] == grp)] %within%
                             intervals[.x]))
  hits[testRows] <- testHits
  as.integer(hits)
}
This function first tests whether the "Abx" events occur within any test interval of the same ID. It then determines which "Test" rows have an interval that contains an "Abx" event. The function returns the combination of these, cast as integers.
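For intuition, %within% simply tests whether a date-time falls inside a lubridate Interval; a minimal standalone check, using values taken from the sample data:

library(lubridate)
testWindow <- interval(ymd_hm("2012-01-02 14:00") - hours(48),
                       ymd_hm("2012-01-02 14:00") + hours(48))
ymd_hm("2012-01-01 16:00") %within% testWindow   # TRUE: this Abx is within 48 h of the Test
ymd_hm("2012-01-26 15:00") %within% testWindow   # FALSE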
Last, just use a mutate statement with the helper function, dropping the window column
eventData %>%
  mutate(New_variable = chkFun(Time, window, ID, Event)) %>%
  select(-window)
Alternatively, the helper function could take the data frame itself as an argument and assume the column names (see the sketch below). In the form above, though, if you define it first in your script, it could also be used inside the original definition of eventData.
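A minimal sketch of that variant, assuming the data frame uses the column names Time, window, ID and Event as above, and reusing chkFun():

chkFunDf <- function(df) {
  df %>%
    mutate(New_variable = chkFun(Time, window, ID, Event)) %>%
    select(-window)
}

# usage:
chkFunDf(eventData)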
Results:
#> ID Time Event New_variable
#> 1 1 2012-01-01 16:00:00 Abx 1
#> 2 1 2012-01-02 14:00:00 Test 1
#> 3 1 2012-01-26 15:00:00 Test 0
#> 4 1 2012-01-29 22:00:00 Abx 1
#> 5 1 2012-01-30 15:00:00 Test 1
#> 6 1 2012-01-05 15:00:00 Test 0
#> 7 2 2012-01-01 16:00:00 Abx 1
#> 8 2 2012-01-02 14:00:00 Test 1
#> 9 2 2012-01-26 15:00:00 Test 0
#> 10 2 2012-02-12 13:00:00 Abx 0
#> 11 2 2012-01-16 15:00:00 Abx 0
#> 12 1 2012-01-16 13:00:00 Test 0

I don't have a copy of your data, so I'm not sure what format your dates are in...
I would recommend converting the date to the right format using as.POSIXct(Time, format = "%b %d %Y, %I%p"). For more info on the format codes, look up ?strptime, but I think that is right for your column.
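For example (a quick check; this assumes an English locale for the month name and AM/PM marker, and rows written like "Jan 26 2012 3 pm", with no comma and a space before pm, would need a slightly different format string):

as.POSIXct("Jan 1 2012, 4pm", format = "%b %d %Y, %I%p")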
Assume your data frame looks like this (I know I have changed parts of it, but this is for simplicity):
# start, end and interval are assumed here (not defined in the original); these values reproduce the output below
start <- as.POSIXct("2012-01-01 00:00:00"); end <- as.POSIXct("2012-01-31"); interval <- 60  # seconds
df <- data.frame(ID = c(rep(1, 6), rep(2, 3)),
                 Time = c(seq(from = start, by = interval * 6840, to = end)[1:6],
                          seq(from = start, by = interval * 6840, to = end)[1:3]),
                 Event = rep(c("Abs", "Test", "Test"), 3))
This would look like this
ID Time Event
1 1 2012-01-01 00:00:00 Abs
2 1 2012-01-05 18:00:00 Test
3 1 2012-01-10 12:00:00 Test
4 1 2012-01-15 06:00:00 Abs
5 1 2012-01-20 00:00:00 Test
6 1 2012-01-24 18:00:00 Test
7 2 2012-01-01 00:00:00 Abs
8 2 2012-01-05 18:00:00 Test
9 2 2012-01-10 12:00:00 Test
So you can use the following code to test whether a Test falls within 48 hours of an Abs
df[which(df$Event=="Test"),]$Time %in%
  unlist(Map(`:`,
             df[which(df$Event=="Abs"),]$Time - 48*60*60,
             df[which(df$Event=="Abs"),]$Time + 48*60*60))
So this will return FALSE for all, but that is because the synthetic data is at larger time steps.
To unpack this...
df[which(df$Event=="Test"),]$Time gives the times of the tests.
%in% checks whether each value that precedes it appears in the set of values that follows it.
What follows it is: unlist(Map(`:`, df[which(df$Event=="Abs"),]$Time-48*60*60, df[which(df$Event=="Abs"),]$Time+48*60*60))
This creates a vector of times +/- 48 hours around each Abs. To add or subtract 48 hours, note that arithmetic on POSIXct objects is done in seconds, hence 48*60*60.
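For example:

t <- as.POSIXct("2012-01-01 16:00:00", tz = "UTC")
t + 48*60*60
# [1] "2012-01-03 16:00:00 UTC"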

Related

Formatting date column with different formats (including missing day information) - lubridate

I'm relatively new to R. I downloaded a clinical trial dataset, but I noticed that the date formats in the relevant column are mixed up: most of them are like "September 1, 2012", but some are missing the day information (e.g. October 2015).
I want to express them all in the same way (e.g. yyyy-mm-dd) to work with them. That went fine; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col", to which I can pass the intended name for the created (formatted) column, but the column just ends up named output_col every time.
Do you know how I could handle this, i.e. pass the intended name of the output column right into the function?
Is there a better way to solve my problem?
I even tried a more complex orders argument for lubridate::parse_date_time, like
parse_date_time(input_col, orders="mdy", "my")
but this didn't work.
Here's the code:
library("tidyverse")
library("lubridate")
Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
"June 24, 2010","January 2005")
df_dates <- data.frame(Observation, Date_original)
# looking for a comma in the cell
comma_detect <- function(a_string){
str_detect(a_string, ",")
}
# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
if_else(comma_detect(input_col),
parse_date_time(input_col, orders="mdy"),
parse_date_time(input_col, orders="my"))
}
# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
mutate(df, output_col = date_correction_row(input_col))
}
df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()
OUTPUT
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
In the code below we assume that output_col equals "Date". All three approaches set the column name from that variable, give no warnings and return Date class.
1) Try each format and take the one that does not give NA. This uses only base R.
output_col <- "Date"
within(df_dates, assign(output_col, pmin(na.rm = TRUE,
                                         as.Date(Date_original, "%B %d, %Y"),
                                         as.Date(paste(Date_original, 1), "%B %Y %d"))))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
2) This can also be done in lubridate. It is important that my is the first rather than the second argument to coalesce: my returns NA for values that do not match its format, whereas mdy returns a wrong date for them, so if mdy came first, coalesce would never get to my. This approach is shorter than (3), but you might prefer the robustness of (3), since it does not depend on what is returned for non-matching dates.
library(dplyr)
library(lubridate)

output_col <- "Date"
df_dates %>%
  mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE),
                                  mdy(Date_original)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
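The ordering point can be seen by calling the two parsers on individual values; the comments reflect the behaviour described above:

my("October 2014", quiet = TRUE)   # "2014-10-01"
my("June 24, 2010", quiet = TRUE)  # NA -- does not match the m-y pattern
mdy("June 24, 2010")               # "2010-06-24"
mdy("October 2014")                # per the explanation above, a wrong date rather
                                   # than NA, so my() must come first in coalesce()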
3) If you prefer your own method of first checking for a comma, here is a more compact variation of it. It uses my and mdy instead of parse_date_time, since my and mdy give Date class results, which are more appropriate here than the POSIXct of parse_date_time given that there are no times.
library(dplyr)
library(lubridate)

output_col <- "Date"
df_dates %>%
  mutate(!!output_col := if_else(grepl(",", Date_original),
                                 mdy(Date_original),
                                 my(Date_original, quiet = TRUE)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
When the date structure is known, I like to explicitly correct it first and then parse. Here I use a regex to substitute in a day of 1 when the day is missing, and then we just parse as normal.
library(tidyverse)

df_dates %>%
  mutate(
    output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>%
      as.Date(., format = '%B %d, %Y')
  )
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
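To see what the substitution itself produces before the dates are parsed, the same regex can be run on the raw strings:

gsub("(?<!,)\\s(?=\\d{4})", " 1, ", c("October 2014", "June 24, 2010"), perl = TRUE)
# [1] "October 1, 2014" "June 24, 2010"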

How to divide monthly totals by the seasonal monthly ratio in R

I am trying to de-seasonalize my data by dividing my monthly totals by the average seasonality ratio for that month. I have two data frames: avgseasonality, which has 12 rows (one average seasonality ratio per month), and df1, which holds the monthly order totals (157 rows, shown below). The problem is that the seasonality ratios only cover 12 rows while the order totals cover many more.
deseasonlize <- transform(avgseasonalityratio,
                          deseasonlizedtotal = df1$OrderTotal / avgseasonality$seasonalityratio)
This runs, but it does not pair the months appropriately: it divides the first order total (December) by the first ratio (April).
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need the order total and the ratio matched by month. The calculation for each month would be, for example (December), 512758 / 0.8316988 = 616518.864762. The output of the calculations should go in a new column that corresponds with the month and order total. Any help is greatly appreciated!
The easiest way would be to merge() your data first, then do the operation. You can use the base R merge() function, though I will show it here using the tidyverse left_join() function. I see that one of your columns has the strange name d$Month; renaming this to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1,2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"),each=2), OrderTotal = 1:4)
df_1 %>%
  left_join(df_2, by = "Month") %>%
  mutate(eseasonlizedtotal = OrderTotal / seasonalityratio)
#> Month seasonalityratio OrderTotal eseasonlizedtotal
#> 1 Jan 1 1 1.0
#> 2 Jan 1 2 2.0
#> 3 Feb 2 3 1.5
#> 4 Feb 2 4 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
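For completeness, the base R merge() route mentioned above looks like this with the same df_1 and df_2 (note that merge() sorts by the key column by default):

merged <- merge(df_2, df_1, by = "Month")
merged$deseasonalizedtotal <- merged$OrderTotal / merged$seasonalityratio
merged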

creation of a time series adds day and time to the date

My data frame looks like this (with more columns and rows):
DT
Date PERMNO lag.ME.Jun
<S3: yearmon> <fctr> <dbl>
1 Gen 2000 34936 21.860
2 Feb 2000 34936 21.860
3 Mar 2000 34936 21.860
Then I create different time series for each column (for the variable lag.ME.Jun, for example):
v6 <- xts(newdata11$lag.ME.Jun, newdata11$Date)
However, it also adds a day and time to the index of the time series, which I did not provide. So v6 looks like:
34936
2000-01-01 01:00:00 21.86
2000-02-01 01:00:00 21.86
2000-03-01 01:00:00 21.86
How can I keep the day and time from appearing in the time series index, and have only the month and year?
It seems newdata11$Date is either a POSIXct type, or being converted to one. If yearmon is what you want, be explicit about it:
v6 <- xts(newdata11$lag.ME.Jun, as.yearmon(newdata11$Date))
E.g.
d = as.yearmon(c("mar07", "apr07", "may07"), "%b%y")
> d
[1] "Mar 2007" "Apr 2007" "May 2007"
> xts(1:3, d)
[,1]
Mar 2007 1
Apr 2007 2
May 2007 3
> d2= as.Date(c("2017-05-01", "2017-06-01", "2017-07-01"))
> xts(1:3, as.yearmon(d2))
[,1]
May 2017 1
Jun 2017 2
Jul 2017 3

How to fill missing and adjust irregular time intervals in a data.frame in R

I have several datasets, mostly with 15-minute time intervals. However, some datasets have missing readings (e.g., the 3rd row in the sample dataset was supposed to be "May 1 2015 00:40AM"). In addition, there are some time steps that are longer than 15 minutes (e.g., see the 3rd and 6th rows).
How can I add the missing time steps so that my Date column continues in 15-minute intervals, and at the same time adjust those time steps that are longer than 15 minutes back onto the 15-minute grid?
s <- data.frame(Date = c("May 1 2015 00:10AM", "May 1 2015 00:25AM",
                         "May 1 2015 00:56AM", "May 1 2015 01:10AM",
                         "May 1 2015 01:25AM", "May 1 2015 01:41AM",
                         "May 1 2015 01:55AM"),
                val = c(1:7))
My desired output would be the following:
> s
Date val
1 May 1 2015 00:10AM 1
2 May 1 2015 00:25AM 2
3 May 1 2015 00:40AM NA
4 May 1 2015 00:55AM 3
5 May 1 2015 01:10AM 4
6 May 1 2015 01:25AM 5
7 May 1 2015 01:40AM 6
8 May 1 2015 01:55AM 7
You could try the following:
First, turn the "Date" variable of your s data frame into POSIXct so you can work with it:
s <- data.frame(Date = c("May 1 2015 00:10AM", "May 1 2015 00:25AM",
                         "May 1 2015 00:56AM", "May 1 2015 01:10AM",
                         "May 1 2015 01:25AM", "May 1 2015 01:41AM",
                         "May 1 2015 01:55AM"),
                val = c(1:7)) %>%
  dplyr::mutate(Date = lubridate::parse_date_time(Date, "b d Y HM"))
Second, you can join this with another data frame that has all the time intervals you are expecting. First, we construct it, using a difference of time intervals (15 mins in this case):
one <- lubridate::parse_date_time("May 1 2015 00:10AM", orders = "b d Y HM")
two <- lubridate::parse_date_time("May 1 2015 00:25AM", orders = "b d Y HM")
dif <- two - one
Now the dataframe:
other_df <- data.frame(
  Date = seq(from = lubridate::parse_date_time("May 1 2015 00:10AM",
                                               orders = "b d Y HM"),
             to = lubridate::parse_date_time("May 1 2015 01:55AM",
                                             orders = "b d Y HM"),
             by = dif))
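As a side note, seq() on date-times also accepts a character step, so the dif object is not strictly needed; the same grid could be built with:

other_df <- data.frame(
  Date = seq(from = lubridate::parse_date_time("May 1 2015 00:10AM", orders = "b d Y HM"),
             to = lubridate::parse_date_time("May 1 2015 01:55AM", orders = "b d Y HM"),
             by = "15 min"))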
Join the two:
result <- dplyr::full_join(other_df, s)
> result
Date val
1 2015-05-01 00:10:00 1
2 2015-05-01 00:25:00 2
3 2015-05-01 00:40:00 NA
4 2015-05-01 00:55:00 NA
5 2015-05-01 01:10:00 4
6 2015-05-01 01:25:00 5
7 2015-05-01 01:40:00 NA
8 2015-05-01 01:55:00 7
9 2015-05-01 00:56:00 3
10 2015-05-01 01:41:00 6
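Note that this join keeps the off-grid readings (00:56 and 01:41) as extra rows rather than snapping them onto the 15-minute grid as in the desired output. One way to handle that part (a sketch, not from the original answer) is to move each reading in s to the nearest grid time before joining:

# index of the nearest grid time in other_df for each reading in s
nearest <- vapply(seq_along(s$Date), function(i) {
  which.min(abs(difftime(s$Date[i], other_df$Date, units = "mins")))
}, integer(1))

s_snapped <- s
s_snapped$Date <- other_df$Date[nearest]

dplyr::left_join(other_df, s_snapped, by = "Date")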

Using ISOdatetime when you only have date and hour in R

I have date-time data in 4 columns of my data frame. I want to create a single column with the date-time.
A small example (apologies for the poor formatting, I'm working off a phone):
2014 3 1 23 2.1
2014 3 2 0 4.7
2014 3 2 1 2.4
So the above gives the data (final column) for three time steps an hour apart, 2300 (11pm) on 1st March 2014, midnight that night and 0100 (1am) on the 2nd March 2014.
I want to create an extra column that would have "2009-03-01 23:00:00 GMT".
I tried using
Mytimes <- with(mydata, ISOdatetime(year, month, day, time, 0, 0))
where my columns are year, month, day, time, datavalue, but I get as output "2009-03-01 EST" with no hour data. I can then add this column to my data frame.
I'm not particularly worried about the time zone; that's just what the examples looked like.
See the comments from MrFlick and Joshua: the midnight time is not shown, but it is there.
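To illustrate that point (a quick check; GMT is used here only so the printed output does not depend on the local time zone):

x <- ISOdatetime(2014, 3, 2, 0, 0, 0, tz = "GMT")
x                               # prints "2014-03-02 GMT" -- midnight is not displayed
format(x, "%Y-%m-%d %H:%M:%S")  # "2014-03-02 00:00:00" -- but the hour is still stored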
Is this what you want:
> require(lubridate)
> input <- read.table(text = "2014 3 1 23 2.1
+ 2014 3 2 0 4.7
+ 2014 3 2 1 2.4")
> # create date column
> input$date <- ymd_h(paste(input$V1, input$V2, input$V3, input$V4))
>
> input
V1 V2 V3 V4 V5 date
1 2014 3 1 23 2.1 2014-03-01 23:00:00
2 2014 3 2 0 4.7 2014-03-02 00:00:00
3 2014 3 2 1 2.4 2014-03-02 01:00:00
