Identify only the first matching record - r

I have a large amount of time series data stored in a dataframe called "Tag.data" where one record is taken every 30 seconds over the course of several months. For example:
2013-09-30 23:59:00
2013-09-30 23:59:30
2013-10-01 00:00:00
2013-10-01 00:00:30
2013-10-01 00:01:00
2013-10-01 00:01:30
2013-10-01 00:02:00
...
2013-10-15 05:00:00
2013-10-15 05:00:30
2013-10-15 05:01:00
2013-10-15 05:01:30
2013-10-15 05:02:00
...
This data is stored in Tag.data$dt.
Within my data I would like to identify the 1st and 15th day of each month so that these can be used on a later plot.
I was successfully able to identify the first day of each month with this code:
locs <- tapply (X=Tag.data$dt, FUN=min, INDEX=format(Tag.data$dt, '%Y%m'))
at <- Tag.data$dt %in% locs
at <- at & format(Tag.data$dt, '%m') %in% c('01', '02', '03','04', '05', '06','07', '08', '09','10', '11', '12') & format(Tag.data$dt, '%d') == '01'
Unfortunately I was less successful when I attempted to also identify the 15th day of each month with this code:
locs <- tapply (X=Tag.data$dt, FUN=min, INDEX=format(Tag.data$dt, '%Y%m'))
at <- Tag.data$dt %in% locs
at <- at & format(Tag.data$dt, '%m') %in% c('01', '02', '03','04', '05', '06','07', '08', '09','10', '11', '12') & format(Tag.data$dt, '%d') == '01'|
format(Tag.data$dt, '%m') %in% c('01', '02', '03','04', '05', '06','07', '08', '09','10', '11', '12') & format(Tag.data$dt, '%d') == '15'
While this did identify both the 1st and the 15th days of each month, for some reason it identifies only one record for the 1st day of the month but every record for the 15th day of the month (of which there are a great many). I would like to identify only the first record for both the 1st and 15th days of each month. Any help would be much appreciated.

Judging from your code:
locs <- tapply (X=Tag.data$dt, FUN=min, INDEX=format(Tag.data$dt, '%Y%m'))
I assume Tag.data$dt is stored as one of the POSIX classes.
I would like to identify only the first record for both the 1st and 15th days of each month.
Probably slow, but this does the work.
ymd <- format(Tag.data$dt,"%Y%m%d")
index.01.15 <- !duplicated(ymd) & grepl("01$|15$", ymd)
You can use the logical vector to select the rows: Tag.data[index.01.15, ]
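As for why the original attempt returned every record on the 15th: in R, `&` is evaluated before `|`, so the "first record" condition only applied to the 1st. A minimal base R sketch of the difference follows; the `dt` vector is a made-up stand-in for Tag.data$dt, not the asker's data.

```r
# Hypothetical stand-in for Tag.data$dt: one record every 30 seconds
dt <- seq(as.POSIXct("2013-09-30 23:59:00", tz = "UTC"),
          as.POSIXct("2013-11-16 00:00:00", tz = "UTC"), by = "30 sec")
day <- format(dt, "%d")

# TRUE for the first record seen on each calendar day
first_of_day <- !duplicated(format(dt, "%Y%m%d"))

# Without parentheses this parses as (first_of_day & day == "01") | day == "15",
# so every record on the 15th is kept
unparenthesised <- first_of_day & day == "01" | day == "15"

# Grouping the day test applies the "first record" condition to both days
parenthesised <- first_of_day & (day == "01" | day == "15")

sum(unparenthesised)
sum(parenthesised)
dt[parenthesised]
```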

Try this. It makes use of lubridate. You can select all rows where the day is either 1 or 15.
library(lubridate)
options(stringsAsFactors=FALSE)
Tag.data = structure(list(dt = c("30/09/2013 23:59", "1/10/2013 0:00", "1/10/2013 0:00",
"1/10/2013 0:01", "1/10/2013 0:01", "1/10/2013 0:02", "2/10/2013 0:04",
"15/10/2013 5:00", "15/10/2013 5:00", "15/10/2013 5:01", "15/10/2013 5:01",
"15/10/2013 5:02")), .Names = "dt", class = "data.frame", row.names = c(NA,
-12L))
Tag.data$dt = parse_date_time(Tag.data$dt, '%d/%m/%Y %H%M')
at = Tag.data[day(Tag.data$dt) %in% c(1,15), ]
This is more flexible, as you can specify any day you wish to subset on. E.g., replace the values in c(1,15) with any other days, or use month(Tag.data$dt) %in% c(<INSERT MONTH NUMBER>) to subset on month. Note that this keeps every record on the selected days, not just the first one.

It looks like your data are already stored as dates of some sort (e.g., POSIXct). Something like this, but with even more rows?
Tag.data <- data.frame(dt=seq(ISOdate(2013,10,1), by = "30 min", length.out = 10000))
Then if you want just the first record from each 1st or 15th day, this might work:
daychars <- format(Tag.data$dt, '%d')
day1or15 <- daychars %in% c("01","15")
newday <- c(TRUE, (daychars[1:(length(daychars)-1)] != daychars[2:length(daychars)]))
format(Tag.data[day1or15 & newday,"dt"],"%m/%d/%Y %H:%M:%S")
The newday line helpfully does not require that the day begins at any particular time, but it does assume that your time series is ordered.
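If the ordering assumption is in doubt, sorting first is cheap. The sketch below uses a small, made-up, deliberately shuffled `dt` vector (not the asker's data) and writes the same day-change test with head() and tail().

```r
# Made-up timestamps, deliberately out of order
dt <- as.POSIXct(c("2013-10-15 05:00:30", "2013-10-01 00:00:30",
                   "2013-10-15 05:00:00", "2013-10-01 00:00:00",
                   "2013-10-02 00:00:00"), tz = "UTC")

# Sort so that "first record of the day" means the earliest one
dt <- sort(dt)
daychars <- format(dt, "%d")

# TRUE wherever the day differs from the previous record's day
newday <- c(TRUE, head(daychars, -1) != tail(daychars, -1))

first_records <- dt[newday & daychars %in% c("01", "15")]
first_records
```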

I suggest you use the excellent xts package for time series data in R.
You didn't provide reproducible data, so I made some of my own.
require(xts)
Tag.data <- xts(rnorm(1e5), order.by = Sys.time() + seq(30, 3e6, 30))
Sub-setting by day of the month is a simple one-liner.
days_1n15 <- Tag.data[.indexmday(Tag.data) %in% c(1, 15)]
This returns all records on the 1st and 15th day of any month.
Now we just need to pull out the first observations on each matching day.
firstOf <- do.call(rbind, lapply(split(days_1n15, 'days'), first))
Which contains the data you want:
R> firstOf
[,1]
2014-02-01 21:29:01 1.284222
2014-02-15 00:00:01 -1.262235
2014-03-01 00:00:01 -0.465001

Related

Identifying change in variable by time in R

I am looking at how extubation rates in an intensive care unit have changed over the course of the pandemic.
I have a data set which has hourly timestamps next to a category of airway types which simplified looks like this:
Time              AirwayStatus
2020/01/01 00:00  ETT/LMA
2020/01/01 01:00  ETT/LMA
2020/01/01 02:00  Own Airway
2020/01/01 03:00  Own Airway
2020/01/01 04:00  ETT/LMA
What I am effectively looking to do is find the times when the patient is extubated (ETT/LMA turns to Own Airway) and also when intubated (own airway to ETT/LMA). Eventually I want to be able to see how often an extubated patient has to be re-intubated.
Within 48 hours this is known as a failed extubation and we are expecting to see vastly different data during the pandemic compared to before.
The idea I have so far is to create a separate column holding the AirwayStatus of the prior hour, and then to count the rows where the two differ. This seems unsophisticated though, and I was hoping some of you clever people might have a nicer option.
Thank you in advance
Using dplyr from tidyverse:
Supposing you have a dataframe (or tibble) df and patient(?) id ID:
library(dplyr)
df <- tibble(
  ID = c(1, 1, 1, 1, 1),
  Time = c("2020/01/01 00:00", "2020/01/01 01:00", "2020/01/01 02:00", "2020/01/01 03:00", "2020/01/01 04:00"),
  AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway", "ETT/LMA"))
df <- df %>%
  group_by(ID) %>%
  arrange(Time) %>%
  mutate(
    Extubated = ifelse(AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA", TRUE, FALSE),
    Intubated = ifelse(AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway", TRUE, FALSE))
result <- df %>%
  summarise_at(c("Extubated", "Intubated"), sum, na.rm = TRUE)
result
Result:
# A tibble: 1 x 3
ID Extubated Intubated
<dbl> <int> <int>
1 1 1 1
This allows grouping by patient id which you will most likely do.
It's a bit longer than Oliver's answer though.
Your idea is the right way to go. You can skip storing the intermediate results, but they have to be computed anyway. Let's assume your data is called df; then we could do something like
# Build the example data directly (both columns contain spaces,
# so whitespace-delimited parsing of the pasted table is ambiguous)
df <- data.frame(
  Time = c("2020/01/01 00:00", "2020/01/01 01:00", "2020/01/01 02:00",
           "2020/01/01 03:00", "2020/01/01 04:00"),
  AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway", "ETT/LMA"))
# Convert time to a date format
df$Time <- as.POSIXct(df$Time)
n <- nrow(df)
# Find changes
df$change <- with(df, c(FALSE, AirwayStatus[seq(n - 1)] != AirwayStatus[seq(2, n)]))
# estimate the length of time since last change
df$hours_between_change[df$change] <- with(df, diff(c(NA, Time[change])) / 3600)
df
Time AirwayStatus change hours_between_change
1 2020-01-01 00:00:00 ETT/LMA FALSE NA
2 2020-01-01 01:00:00 ETT/LMA FALSE NA
3 2020-01-01 02:00:00 Own Airway TRUE NA
4 2020-01-01 03:00:00 Own Airway FALSE NA
5 2020-01-01 04:00:00 ETT/LMA TRUE 2
Note I store the intermediate results here. We likely could make it a bit more readable using dplyr but this does the job.
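With several patients in one table, the previous-row comparison should not cross from one patient to the next. A minimal base R sketch is below; the `ID` column and the second patient's rows are invented for illustration, and rows are assumed to be in time order within each patient.

```r
# Invented example: two patients, rows already in time order per patient
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2),
  AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway", "ETT/LMA",
                   "Own Airway", "ETT/LMA", "ETT/LMA")
)

# Previous status within each patient; NA on each patient's first row
prev <- ave(df$AirwayStatus, df$ID, FUN = function(s) c(NA, head(s, -1)))

# A transition needs a known previous status that differs in the right direction
df$extubated <- !is.na(prev) & prev == "ETT/LMA" & df$AirwayStatus == "Own Airway"
df$intubated <- !is.na(prev) & prev == "Own Airway" & df$AirwayStatus == "ETT/LMA"

# Count transitions per patient
counts <- aggregate(cbind(extubated, intubated) ~ ID, data = df, FUN = sum)
counts
```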
Here is an approach using dplyr.
First, you might want to consider a separate column to indicate an intubation or extubation "event." If someone is "Own Airway" and then the previous row has "ETT/LMA", we assume the person has been extubated. The opposite can also be determined for intubation.
Then, you can filter and only focus on these events.
For each event, you may want to capture when the event is "Extubation", and then following event is "Intubation", and the time difference is < 48 hrs. If this is true, then the extubation is actually a "failed extubation."
This may handle situations where someone has data that begins with "Own Airway" and gets intubated (if no extubation event, then cannot be failed extubation). It will also keep extubation events where the time difference is > 48 hrs as well.
library(tidyverse)
df %>%
  mutate(Event = case_when(
    AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA" ~ "Extubation",
    AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway" ~ "Intubation",
    TRUE ~ NA_character_)
  ) %>%
  filter(!is.na(Event)) %>%
  mutate(Event = ifelse(
    # Pin the units to hours so that 48 always means 48 hours
    Event == "Extubation" & lead(Event) == "Intubation" &
      as.numeric(difftime(lead(Time), Time, units = "hours")) < 48,
    "Failed Extubation",
    Event
  ))
Output
Time AirwayStatus Event
1 2020-01-01 02:00:00 Own Airway Failed Extubation
2 2020-01-01 04:00:00 ETT/LMA Intubation
Data
df <- structure(list(Time = structure(c(1577858400, 1577862000, 1577865600,
1577869200, 1577872800), class = c("POSIXct", "POSIXt"), tzone = ""),
AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway",
"ETT/LMA"), Event = c(NA, NA, "Extubated", NA, "Intubated"
)), row.names = c(NA, -5L), class = "data.frame")
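One detail to watch with the 48-hour test: subtracting two POSIXct values gives a difftime whose units R picks automatically, so a bare comparison against 48 is only in hours if the units are pinned. A small sketch with made-up timestamps:

```r
# Made-up timestamps exactly 72 hours apart
t1 <- as.POSIXct("2020-01-01 02:00:00", tz = "UTC")
t2 <- as.POSIXct("2020-01-04 02:00:00", tz = "UTC")

# Automatic units: a gap of a day or more is reported in days
auto_diff <- t2 - t1
units(auto_diff)

# Explicit units: always hours, whatever the size of the gap
hours_diff <- difftime(t2, t1, units = "hours")

# The same threshold gives different answers depending on the units
auto_says_failed <- as.numeric(auto_diff) < 48
hours_says_failed <- as.numeric(hours_diff) < 48
```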

If date is on Sunday, then change date to the following Monday

Sunday date in mydates is 2018-05-06. I would like 1 day added so that 2018-05-06 becomes 2018-05-07 (Monday). That is, if a date falls on a Sunday add one day.
library(dplyr)
library(lubridate)
mydates <- as.Date(c('2018-05-01','2018-05-02','2018-05-05','2018-05-06'))
# find which are weekend dates
x = as.character(wday(mydates,TRUE))
if(x == 'Sun') { mydates + 1 }
# the Sunday date in mydates is 2018-05-06. I would like 1 day added so
# that 2018-05-06 becomes 2018-05-07
Here's my error: Warning message:
In if (x == "Sun") { :
the condition has length > 1 and only the first element will be used
Try ifelse. Then convert to class Date.
as.Date(ifelse(x == 'Sun', mydates + 1, mydates), origin = '1970-01-01')
#[1] "2018-05-01" "2018-05-02" "2018-05-05" "2018-05-07"
x is a vector, so you can use an if_else statement to increment the Sundays as follows:
library(dplyr)
library(lubridate)
new_dates <- if_else(x == 'Sun', mydates + days(1), mydates)
First, identify which of your dates are Sundays. Then, selectively add 1:
library(lubridate)
mydates <- as.Date(c('2018-05-01','2018-05-02','2018-05-05','2018-05-06'))
i <- which(as.character(wday(mydates,TRUE))=="Sun")
mydates[i] <- mydates[i]+1
This outputs
"2018-05-01" "2018-05-02" "2018-05-05" "2018-05-07"
which, I believe, is the desired result.
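The weekday labels returned by wday(..., TRUE) depend on the session's locale, so a comparison with 'Sun' may not match everywhere. As a hedged alternative, the sketch below uses only base R and the numeric weekday from as.POSIXlt (where Sunday is 0); it is one possible approach, not the method from the answers above.

```r
mydates <- as.Date(c("2018-05-01", "2018-05-02", "2018-05-05", "2018-05-06"))

# wday counts days of the week from 0 to 6, starting on Sunday
is_sunday <- as.POSIXlt(mydates)$wday == 0

# Add one day where the date is a Sunday, zero days otherwise
shifted <- mydates + as.integer(is_sunday)
shifted
```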

Creating Subsets of data with multiple where/between statements

I have a dataset which consists of 2 days in 2 different months and the same time periods. It shows how many occupants were in a house during the time. I want to separate the data by date, time period AND houseid.
So I want to get all the records where the date is 01-02-2010, the time is between 14:00:00 and 19:00:00, and houseid is N60421A. At the moment every column is stored as character except for occupants, which is numeric.
http://www.sharecsv.com/s/aa6d4dc34acfbaf73ada1d2c8764b888/modecsv.csv
At the moment I have tried this, but I seem to get no results:
data2 = subset(data, dayMonthYear == "01/02/2010" && Houses == "N60421A")
In SQL I would do something like
SELECT *
From data
where dayMonthYear == "01/02/2010"
AND houses == "N60421A"
AND time > 14:00:00
AND time < 19:00:00
This should work for you...
#Combine date and time into a new POSIXct variable "Time1"
data$Time1 <- as.POSIXct(paste(data$dayMonthYear, data$Time), format="%d/%m/%Y %H:%M:%S")
#Subset
data2 <-subset(data, dayMonthYear == "01/02/2010" & Houses == "N60421A" & strftime(Time1, "%H") %in% c('14','15','16','17','18','19'))
You could also use the "chron" package and standard R subsetting...
#Approach 2
#Load Library
library(chron)
#Convert Time from factor while creating new variable "Time2"
data$Time2 <- chron(times = as.character(data$Time))
#Subset
data2 <- data[(data$dayMonthYear == "01/02/2010" & data$Houses == "N60421A" & data$Time2 >= "14:00:00" & data$Time2 <= "19:00:00" ),]
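The attempt in the question most likely fails because `&&` is not vectorised: it is meant for a single TRUE/FALSE, whereas row filtering needs the element-wise `&`. The sketch below shows the element-wise version on a small made-up data frame; the column names follow the question, the rows are invented, and the time comparison relies on Time being a zero-padded "HH:MM:SS" string.

```r
# Invented rows using the column names from the question
data <- data.frame(
  dayMonthYear = c("01/02/2010", "01/02/2010", "01/02/2010", "01/03/2010"),
  Time = c("13:30:00", "14:30:00", "19:30:00", "15:00:00"),
  Houses = c("N60421A", "N60421A", "N60421A", "N60421A"),
  occupants = c(1, 2, 3, 2),
  stringsAsFactors = FALSE
)

# Element-wise & gives one TRUE/FALSE per row; zero-padded times
# compare correctly as plain strings
keep <- data$dayMonthYear == "01/02/2010" &
  data$Houses == "N60421A" &
  data$Time > "14:00:00" & data$Time < "19:00:00"

data2 <- data[keep, ]
data2
```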

Match events with weather data

I have a list with event and times. So a little like this df:
event <- c("x", "y")
date <- c("12-12-2014", "13-12-2014")
time <- c("11:00", "14:00")
df_event <- data.frame(event, date, time)
What I would like to do now is match these events with weather data. Thing is however that the timestamps from the weather I have do not match the event dates. They are like:
date <- c("12-12-2014", "12-12-2015")
time <- c("12:00", "14:00")
degrees <- c(12, 13)
df_weather <- data.frame(date,time, degrees)
Does anybody have suggestions on how I can easily match them, so that I get the weather data that is closest in time to each event?
Looks like a duplicate of this question. Adapting one of those answers for you:
#First, convert your date+time into POSIXct so that we have an index to search
df_event$date2 <- as.POSIXct(strptime(paste(df_event$date, df_event$time),
format = "%d-%m-%Y %H:%M"))
df_weather$datePXct <- as.POSIXct(strptime(paste(df_weather$date, df_weather$time),
format = "%d-%m-%Y %H:%M"))
#Find variables in df_weather that match timestamp in df_event
df_event <- cbind(df_event, event.degrees = df_weather[ unlist(sapply((df_event$date2),
function(x) which.min(abs(x - df_weather$datePXct))) ), c("degrees")])
df_event
# event date time date2 event.degrees
#1 x 12-12-2014 11:00 2014-12-12 11:00:00 12
#2 y 13-12-2014 14:00 2014-12-13 14:00:00 12
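A nearest-timestamp match always returns something, even when the closest reading is far away, so it can help to keep the size of the gap alongside the match. A minimal base R sketch using the example values from the question (the `gap_hours` column is an addition for illustration):

```r
fmt <- "%d-%m-%Y %H:%M"
event_time <- as.POSIXct(c("12-12-2014 11:00", "13-12-2014 14:00"),
                         format = fmt, tz = "UTC")
weather_time <- as.POSIXct(c("12-12-2014 12:00", "12-12-2015 14:00"),
                           format = fmt, tz = "UTC")
degrees <- c(12, 13)

# Index of the weather reading closest in time to each event
nearest <- vapply(seq_along(event_time), function(i) {
  which.min(abs(as.numeric(difftime(weather_time, event_time[i], units = "hours"))))
}, integer(1))

# Distance in hours between each event and its matched reading
gap_hours <- abs(as.numeric(difftime(weather_time[nearest], event_time,
                                     units = "hours")))

matched <- data.frame(event = c("x", "y"),
                      degrees = degrees[nearest],
                      gap_hours = gap_hours)
matched
```

A threshold on gap_hours could then be used to drop matches that are too far apart to be meaningful.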

How to select time range during weekdays and associated data on the next column

Here is an example of a subset data in .csv files. There are three columns with no header. The first column represents the date/time and the second column is load [kw] and the third column is 1= weekday, 0 = weekends/ holiday.
9/9/2010 3:00 153.94 1
9/9/2010 3:15 148.46 1
I would like to program in R, so that it selects the first and second column within time ranges from 10:00 to 20:00 for all weekdays (when the third column is 1) within a month of September and do not know what's the best and most efficient way to code.
dt <- read.csv("file", header = FALSE, sep=",")
#Select a column with weekday designation = 1, weekend or holiday = 0
y <- data.frame(dt[,3])
#Select a column with timestamps and loads
x <- data.frame(dt[,1:2])
t <- data.frame(dt[,1])
#convert timestamps into readable format
s <- strptime("9/1/2010 0:00", format="%m/%d/%Y %H:%M")
e <- strptime("9/30/2010 23:45", format="%m/%d/%Y %H:%M")
range <- seq(s,e, by = "min")
df <- data.frame(range)
The OP asks for the "best and most efficient way to code" this without showing any "inefficient code", so @Justin is right.
It seems that the OP is new to R (and it's officially the summer of love), so I'll give it a try. Here is a solution (I'm not sure about its efficiency):
index <- c("9/9/2010 19:00", "9/9/2010 21:15", "10/9/2010 11:00", "3/10/2010 10:30")
index <- as.POSIXct(index, format = "%d/%m/%Y %H:%M")
set.seed(1)
Data <- data.frame(Date = index, load = rnorm(4, mean = 120, sd = 10), weeks = c(0, 1, 1, 1))
## Data
## Date load weeks
## 1 2010-09-09 19:00:00 113.74 0
## 2 2010-09-09 21:15:00 121.84 1
## 3 2010-09-10 11:00:00 111.64 1
## 4 2010-10-03 10:30:00 135.95 1
cond <- expression(format(Date, "%H:%M") < "20:00" &
format(Date, "%H:%M") > "10:00" &
weeks == 1 &
format(Date, "%m") == "09")
subset(Data, eval(cond))
## Date load weeks
## 3 2010-09-10 11:00:00 111.64 1
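Note that the question's timestamps look like month/day/year ("9/9/2010 3:00", with a range starting at "9/1/2010"), whereas the example above parses day/month/year. The sketch below applies the same filter with the month-first format and inclusive bounds; the rows and the default column names V1 to V3 are made up for illustration.

```r
# Made-up rows in the shape described in the question (no header)
raw <- data.frame(
  V1 = c("9/9/2010 3:00", "9/9/2010 10:15", "9/11/2010 12:00", "10/1/2010 12:00"),
  V2 = c(153.94, 148.46, 150.10, 149.00),
  V3 = c(1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Parse as month/day/year
stamp <- as.POSIXct(raw$V1, format = "%m/%d/%Y %H:%M", tz = "UTC")
hm <- format(stamp, "%H:%M")

# Weekday flag set, month is September, time within 10:00 to 20:00 inclusive
keep <- raw$V3 == 1 & format(stamp, "%m") == "09" & hm >= "10:00" & hm <= "20:00"

# Keep only the timestamp and load columns
selected <- raw[keep, 1:2]
selected
```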
