Difftime in R with missing times: how to add 'na.rm=TRUE'? - r

I would like to use difftime() to get the difference between two date/time variables, both of class POSIXct. But sometimes one (or both) of the values is missing (NA), as shown below.
Start time           Antibiotic time
2016-06-28 08:36:00  NA
2019-10-30 10:43:00  2019-10-30 10:11:56
NA                   NA
I want: start time - antibiotic time
Like:
Antibiotica$ABS <- difftime(Antibiotica$StartTime, Antibiotica$AntibioticTime, units=c("mins"), na.rm=TRUE)
But now, I get an error. I think it is because of the wrong use of na.rm=TRUE.
How to add this in the right way?

As Roland points out in the comments, it's not clear that you should remove NA values. If there is a start time but the antibiotic time is NA, then the time difference should also be NA. If both times are NA, then again the time difference should be NA.
If you were to remove all the NA values in the resulting difftime, then you will only get results for those rows with complete data, but then these will no longer match up to your Antibiotica data frame. In your little example data frame for example, you would only get a single non-NA result. How would you store that in a column?
From your example, your code should work like this:
Antibiotica$ABS <- difftime(Antibiotica$StartTime, Antibiotica$AntibioticTime)
Antibiotica
#> StartTime AntibioticTime ABS
#> 1 2016-06-28 08:36:00 <NA> NA mins
#> 2 2019-10-30 10:43:00 2019-10-30 10:11:56 31.06667 mins
#> 3 <NA> <NA> NA mins
If you're not getting this result, you might need to make sure that your columns are in an actual date-time format (e.g. ensure class(Antibiotica$StartTime) is not "character").
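A minimal sketch of that check, assuming the times arrive as character strings in "%Y-%m-%d %H:%M:%S" format. The vectors start_chr and abx_chr are made up for illustration and the UTC time zone is an assumption. After conversion with as.POSIXct(), difftime() returns NA wherever either input is NA.

```r
# Hypothetical character input; NA marks a missing time
start_chr <- c("2016-06-28 08:36:00", "2019-10-30 10:43:00", NA)
abx_chr   <- c(NA, "2019-10-30 10:11:56", NA)

# Convert to POSIXct; the time zone is set explicitly (assumed UTC here)
start <- as.POSIXct(start_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
abx   <- as.POSIXct(abx_chr,   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# difftime() has no na.rm argument; a missing input gives NA in the result
abs_mins <- as.numeric(difftime(start, abx, units = "mins"))
abs_mins
```

The result keeps one element per input row, so it can be stored directly as a data frame column.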
If, once you have the calculation, you only want to keep the complete cases, you can do
Antibiotica[complete.cases(Antibiotica),]
#> StartTime AntibioticTime
#> 2 2019-10-30 10:43:00 2019-10-30 10:11:56
Data used
Antibiotica <- structure(list(StartTime = structure(c(1467102960, 1572432180, NA),
class = c("POSIXct", "POSIXt"), tzone = ""),
AntibioticTime = structure(c(NA, 1572430316, NA),
class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA, -3L),
class = "data.frame")
Antibiotica
#> StartTime AntibioticTime
#> 1 2016-06-28 08:36:00 <NA>
#> 2 2019-10-30 10:43:00 2019-10-30 10:11:56
#> 3 <NA> <NA>
Created on 2022-01-31 by the reprex package (v2.0.1)

Related

How to change values in 1 column based on a date/time range in a different column in R

I have a dataframe with a DATE/TIME column and a column with some numeric values. I'd like to change some numeric values to "N/A" based on the range of DATE/TIME they were recorded in.
This is what my dataframe looks like
df = structure(list(Date_Time_GMT_3 = structure(c(1592226000, 1592226900,
1592227800, 1592228700, 1592229600, 1592230500), class = c("POSIXct",
"POSIXt"), tzone = "EST"), diff_20676892_AIR_X3lh = c(NA, 0.385999999999999,
0.193, 0.290000000000001, 0.385, 0.576000000000001), diff_20819828_B1LH_DOUBLE_CHECK = c(NA,
0, 0, 0, 0.0949999999999989, 0)), row.names = c(NA, 6L), class = "data.frame")
I want to change all values for diff_20819828_B1LH_DOUBLE_CHECK to N/A if they are between 2020-06-15 08:30:00 and 2020-06-15 09:00:00
I tried this code
df[df$Date_Time_GMT_3 > "2020-06-15 08:30:00"| < "2020-06-15 09:00:00"] = "NA"
but to no surprise this doesn't work. How can I fix this?
Your date column is in "EST", so you can do this:
df[df$Date_Time_GMT_3 > as.POSIXct("2020-06-15 08:30:00", tz="EST") &
df$Date_Time_GMT_3 < as.POSIXct("2020-06-15 09:00:00", tz="EST"),3] <- NA
Date_Time_GMT_3 diff_20676892_AIR_X3lh diff_20819828_B1LH_DOUBLE_CHECK
1 2020-06-15 08:00:00 NA NA
2 2020-06-15 08:15:00 0.386 0.000
3 2020-06-15 08:30:00 0.193 0.000
4 2020-06-15 08:45:00 0.290 NA
5 2020-06-15 09:00:00 0.385 0.095
6 2020-06-15 09:15:00 0.576 0.000
Note that there is only one row strictly between those times, row 4, and the code above changes the value(s) in the 3rd column for such row(s) to NA.
Your base R code isn't working because:
You didn't specify which column's values should be changed.
You're using an | instead of an &.
After a logical operator you need to repeat which vector to assess.
You're not telling R that those strings are date-times.
Langtang's solution is very neat. Another option using dplyr and lubridate is:
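library(dplyr)
library(lubridate)
# Build the interval in the column's time zone; %within% includes both endpoints
rng <- interval(ymd_hms("2020-06-15 08:30:00", tz = "EST"),
                ymd_hms("2020-06-15 09:00:00", tz = "EST"))
# na_if() compares values rather than testing a condition, so use if_else()
df %>% mutate(diff_20819828_B1LH_DOUBLE_CHECK = if_else(
  Date_Time_GMT_3 %within% rng,
  NA_real_,
  diff_20819828_B1LH_DOUBLE_CHECK
))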

how to select/subset certain dates in my data set in r

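I have a data set covering 10 years. I want to select or subset the data by fiscal year using the date variable. The date variable is a character. For instance, I want to select data for fiscal year 2020, which runs from 01-10-2019 to 30-09-2020. How can I do this in R?
Here is an example using the zoo package. Note that the offset used below (- 4/12 + 1) starts a new fiscal year in May, as the output shows; for a fiscal year that starts in October, as in the question, the offset should be + 3/12 instead: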
df1 <- structure(list(dateA = structure(c(14974, 18628, 14882, 16800,
17835, 16832, 16556, 15949, 16801), class = "Date")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
library(zoo)
df1 %>%
mutate(fiscal_year = as.integer(as.yearmon(dateA) - 4/12 +1))
output:
dateA fiscal_year
<date> <int>
1 2010-12-31 2011
2 2021-01-01 2021
3 2010-09-30 2011
4 2015-12-31 2016
5 2018-10-31 2019
6 2016-02-01 2016
7 2015-05-01 2016
8 2013-09-01 2014
9 2016-01-01 2016
As @r2evans said, you should post a minimal reprex.
However, given the little information you posted, this worked example may fit:
date_vect <- c('01-10-2019','30-07-2020','15-07-2019','03-03-2020')
date_vect[substr(date_vect,7,12) == "2020"]
This assumes that you have a vector of dates in string format. It picks all the strings whose last four characters equal 2020 (the calendar year you're interested in).
P.S: It's good practice to use the appropriate format when dealing with dates. You can unlock other features such as ordering with R date libraries.
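As a rough base R sketch of the fiscal-year selection the question asks for, assuming the character dates are in day-month-year format ("%d-%m-%Y") and that fiscal year N runs from 1 October of year N-1 to 30 September of year N. The vector date_chr is invented for illustration.

```r
# Hypothetical character dates in day-month-year format
date_chr <- c("01-10-2019", "30-07-2020", "15-07-2019",
              "03-03-2020", "30-09-2020", "01-10-2020")

# Parse to Date so that year and month can be extracted reliably
d <- as.Date(date_chr, format = "%d-%m-%Y")

# Months October to December belong to the next fiscal year
yr  <- as.integer(format(d, "%Y"))
mon <- as.integer(format(d, "%m"))
fiscal_year <- yr + (mon >= 10)

# Keep only the dates that fall in fiscal year 2020
date_chr[fiscal_year == 2020]
```

The same logical index, fiscal_year == 2020, can be used to subset the rows of a data frame that holds the date column.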

datetime object to minutes: I need 3 packages

I am wondering if I am missing something?!
I would like to know: is there a better/shorter way to get minutes from a datetime object:
Lessons studied so far:
Extract time (HMS) from lubridate date time object?
converting from hms to hour:minute in r, or rounding in minutes from hms
R: Convert hours:minutes:seconds
My tibble:
df <- structure(list(dttm = structure(c(-2209068000, -2209069200, -2209061520,
-2209064100, -2209065240), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
# A tibble: 5 x 1
dttm
<dttm>
1 1899-12-31 02:00:00
2 1899-12-31 01:40:00
3 1899-12-31 03:48:00
4 1899-12-31 03:05:00
5 1899-12-31 02:46:00
I would like to add a new column with minutes as integer:
My approach so far:
library(dplyr)
library(lubridate) # ymd_hms
library(hms) # as_hms
library(chron) # times
df %>%
mutate(dttm_min = as_hms(ymd_hms(dttm)),
dttm_min = 60*24*as.numeric(times(dttm_min)))
# A tibble: 5 x 2
dttm dttm_min
<dttm> <dbl>
1 1899-12-31 02:00:00 120
2 1899-12-31 01:40:00 100
3 1899-12-31 03:48:00 228
4 1899-12-31 03:05:00 185
5 1899-12-31 02:46:00 166
This gives me the result I want, but I need for this operation 3 packages. Is there a more direct way?
Here is a base R way -
You can extract the time using format, change the date to '1970-01-01' (since R counts date-times from '1970-01-01'), convert to numeric and divide by 60 to get the duration in minutes.
as.numeric(as.POSIXct(paste('1970-01-01', format(df$dttm, '%T')), tz = 'UTC'))/60
#[1] 120 100 228 185 166
Here are two ways.
Base R
with(df, as.integer(format(dttm, "%M")) + 60*as.integer(format(dttm, "%H")))
#[1] 120 100 228 185 166
Another base R option, using class "POSIXlt" as proposed here.
minute_of_day <- function(x){
y <- as.POSIXlt(x)
60*y$hour + y$min
}
minute_of_day(df$dttm)
#[1] 120 100 228 185 166
Package lubridate
lubridate::minute(df$dttm) + 60*lubridate::hour(df$dttm)
#[1] 120 100 228 185 166
If the package is loaded, this can be simplified, with the same output, to
library(lubridate)
minute(df$dttm) + 60*hour(df$dttm)
We can use
library(data.table)
as.numeric(as.ITime(format(df$dttm, '%T')))/60
[1] 120 100 228 185 166
For sake of completeness, the time of day (in minutes) can be calculated by taking the difftime() between the POSIXct datetime object and the beginning of the day, e.g.,
difftime(df$dttm, lubridate::floor_date(df$dttm, "day"), units = "min")
Time differences in mins
[1] 120 100 228 185 166
Besides base R only one other package is required.
According to help("difftime"), difftime() returns an object of class "difftime" with an attribute indicating the units.
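The same idea may also work without any extra package, by taking the start of the day from base R's trunc() instead of lubridate::floor_date(). This is a sketch with made-up UTC date-times, not a replacement for the answers above.

```r
# Hypothetical date-times in UTC
x <- as.POSIXct(c("2020-01-01 02:00:00", "2020-01-01 03:48:00"), tz = "UTC")

# trunc() drops the time of day, giving midnight of the same day
midnight <- as.POSIXct(trunc(x, units = "days"))

# as.numeric() strips the "difftime" class and leaves plain minutes
minute_of_day <- as.numeric(difftime(x, midnight, units = "mins"))
minute_of_day
```

Wrapping the result in as.numeric() gives a plain numeric vector, which is convenient when the column is used in later arithmetic.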

How can I add 1 to a column in R when A conditional is met?

I am trying to fill a new column in a data frame (in R) based on the following conditional:
df$B<- ifelse(difftime(df$A,lag(df$A))>minutes(30), increment(1), increment(0))
Here, the A column is time. So in A, every time the time difference between row i and row i-1 is greater than 30 minutes, I increment the new column B by one.
A B
1:00 1
1:31 2
1:40 2
2:30 3
Example
Any help is greatly appreciated, thank you.
In base R, you can use cumsum with difftime :
df$B <- cumsum(c(TRUE, difftime(df$A[-1], df$A[-nrow(df)], units = 'mins') > 30))
df
# A B
#1 2020-02-03 01:00:00 1
#2 2020-02-03 02:00:00 2
#3 2020-02-03 02:15:00 2
#4 2020-02-03 03:00:00 3
data
Make sure class(df$A) returns "POSIXct" :
df <- structure(list(A = structure(c(1580691600, 1580695200, 1580696100,
1580698800), class = c("POSIXct", "POSIXt"), tzone = "UTC")),
class = "data.frame", row.names = c(NA, -4L))
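To see why cumsum() produces the counter, here is a small sketch using the times from the question written as minutes past midnight (1:00, 1:31, 1:40, 2:30). The vector mins is an illustrative stand-in for the POSIXct column.

```r
# Times from the question, as minutes past midnight (illustrative stand-in)
mins <- c(60, 91, 100, 150)

# TRUE wherever the gap to the previous row exceeds 30 minutes
new_group <- diff(mins) > 30

# The first row always starts group 1; cumsum() counts TRUE as 1, FALSE as 0
B <- cumsum(c(TRUE, new_group))
B
```

Each TRUE raises the running total by one, so rows that follow a gap of more than 30 minutes start a new value of B.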

How to identify datapoints that drastically increase/decrease and make them NA? How to identify and nullify outliers?

I am cleaning data from multiple temperature sensors. I am trying to write code that will find places where the data drastically increased or decreased relative to the neighboring datapoint, and make that point NA/null.
I was trying to do this with a for loop and a couple of if statements, but there seem to be a few issues with this approach. Namely, the if statements don't really work with NA values. So if the first part of the loop makes one of the entries NA because it increased too much, the second part would return an error because it is trying to perform the operation with a NA entry.
I would prefer to make the outliers NA instead of deleting the entries, because I would like the option to replace the NA values with averages of the neighboring values later on.
Does anyone know of a different approach for identifying/nullifying data that changes too drastically/outlier data?
#maximum change per sampling interval
c<- 1.5
#make datapoints that increased/decreased too much from the previous datapoint NA
for(x in 2: length(cleandata)){
if((cleandata$tempdiff[x] - cleandata$tempdiff[x-1])>=c) cleandata$tempdiff<-NA
if((cleandata$tempdiff[x-1]-cleandata$tempdiff[x])>=c) cleandata$tempdiff<-NA
}
Here is a simplified piece of the dataset:
structure(list(TIMESTAMP = structure(c(1594911720, 1594911780,
1594911840, 1594911900, 1594911960, 1594127280, 1594127340, 1594127400,
1594127460, 1594127520, 1594127580), tzone = "", class = c("POSIXct",
"POSIXt")), sensor = c("TempDiffs.1.", "TempDiffs.1.", "TempDiffs.1.",
"TempDiffs.1.", "TempDiffs.1.", "TempDiffs.2.", "TempDiffs.2.",
"TempDiffs.2.", "TempDiffs.2.", "TempDiffs.2.", "TempDiffs.2."
), tempdiff = c(10.45, 12.5, 10.52, 10.48, 10.48, 12.47, 12.48,
12.49, 12.5, 12.52, 12.52)), row.names = c(NA, -11L), groups = structure(list(
sensor = c("TempDiffs.1.", "TempDiffs.2."), .rows = structure(list(
1:5, 6:11), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
My other concern with this process is the transition between sensors. My data is formatted to be long/tall, so there is one column identifying the sensor and another column with the temperature data. Each sensor has a different "typical range" of temperatures. So, when switching from one sensor to the next, this code would probably nullify the data because it changes drastically. I figured one way to deal with this would be to group the data by the sensor column before nullifying the outliers. I would appreciate any suggestions about that!
I think this should work.
c <- 1.5
d <- abs(cleandata$tempdiff[2:nrow(cleandata)] - cleandata$tempdiff[1:(nrow(cleandata)-1)])
d <- which(d > c)
cleandata$tempdiff[d] <- NA
d <- d + 1
cleandata$tempdiff[d] <- NA
> cleandata
TIMESTAMP sensor tempdiff
1 2020-07-16 11:02:00 TempDiffs.1. 10.45
2 2020-07-16 11:03:00 TempDiffs.1. 11.50
3 2020-07-16 11:04:00 TempDiffs.1. 10.52
4 2020-07-16 11:05:00 TempDiffs.1. 10.48
5 2020-07-16 11:06:00 TempDiffs.1. NA
6 2020-07-07 09:08:00 TempDiffs.2. NA
7 2020-07-07 09:09:00 TempDiffs.2. 12.48
8 2020-07-07 09:10:00 TempDiffs.2. 12.49
9 2020-07-07 09:11:00 TempDiffs.2. 12.50
10 2020-07-07 09:12:00 TempDiffs.2. 12.52
11 2020-07-07 09:13:00 TempDiffs.2. 12.52
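The grouping idea raised in the question could be sketched in base R with ave(), which applies a function separately within each sensor. The data frame toy, the helper flag_jumps and the name max_change are all made up for illustration; this variant compares each value only with the previous one.

```r
# Hypothetical long-format data: two sensors with different typical ranges
toy <- data.frame(
  sensor   = rep(c("s1", "s2"), each = 4),
  tempdiff = c(10.45, 10.50, 10.52, 10.48, 12.47, 12.48, 15.00, 12.50)
)
max_change <- 1.5

# Set a value to NA when it differs from the previous value by more than `limit`
flag_jumps <- function(v, limit) {
  jump <- c(FALSE, abs(diff(v)) > limit)
  v[jump] <- NA
  v
}

# ave() runs the function within each sensor, so the first value of a
# sensor is never compared with the last value of the previous sensor
toy$tempdiff <- ave(toy$tempdiff, toy$sensor,
                    FUN = function(v) flag_jumps(v, max_change))
toy
```

Because each value is compared only with its predecessor, the value that follows a spike is flagged as well (12.50 after 15.00 in this toy data). Whether that is acceptable depends on how the NA values are filled in later.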

Resources