dplyr::if_else changes datetime (POSIXct) values - r

I'm working with a dataset that has a lot of timestamps. There are some invalid timestamps which I try to identify and set to NA. Because if_else() forces me to have the same data type in both arms, I'm using as.POSIXct(NA) to encode such missing values.
Interestingly, the results differ when I invert the test (and change the true and false argument) in if_else().
Here is some code to illustrate my problems:
x <- tibble(
A = parse_datetime("2020-08-18 19:00"),
B = if_else(TRUE, A, as.POSIXct(NA)),
C = if_else(FALSE, as.POSIXct(NA), A)
)
> x
# A tibble: 1 x 3
A B C
<dttm> <dttm> <dttm>
1 2020-08-18 19:00:00 2020-08-18 19:00:00 2020-08-18 21:00:00
Any idea why C is two hours later?
Follow-up:
Based on the great answers below, I think a more readable solution should perhaps generate a missing datetime object with parse_datetime(NA_character_) and use this in the code instead of as.POSIXct().
R> NA_datetime_ <- parse_datetime(NA_character_)
R> x <- tibble(
A = parse_datetime("2020-08-18 19:00"),
B = if_else(TRUE, A, NA_datetime_),
C = if_else(FALSE, NA_datetime_, A)
)
R> map(x, lubridate::tz)
$A
[1] "UTC"
$B
[1] "UTC"
$C
[1] "UTC"

First, you need to know that parse_datetime() returns a date-time object whose tzone attribute defaults to UTC. You can use lubridate::tz(x$A) and attributes(x$A) to check it.
The documentation of if_else() says that the true and false arguments must be the same type, and that all other attributes are taken from true. Hence, in part C of your tibble:
C = if_else(FALSE, as.POSIXct(NA), A)
as.POSIXct(NA) doesn't have a tzone attribute, so A's tzone is dropped and the result is printed in the time zone of your region. Actually, C is not two hours later. The three columns hold the same instant but are displayed in different time zones. To fix it, give the NA value a tzone attribute, i.e. replace as.POSIXct(NA) with
as.POSIXct(NA_character_, tz = "UTC")
Note: You must use NA_character_ instead of NA because as.POSIXct() ignored the tz argument for a logical NA (at least in the R version used here), whereas tz is applied when the input is a character vector.
Finally, revise your code as
x <- tibble(
A = parse_datetime("2020-08-18 19:00"),
B = if_else(TRUE, A, as.POSIXct(NA_character_, tz = "UTC")),
C = if_else(FALSE, as.POSIXct(NA_character_, tz = "UTC"), A)
)
# # A tibble: 1 x 3
# A B C
# <dttm> <dttm> <dttm>
# 1 2020-08-18 19:00:00 2020-08-18 19:00:00 2020-08-18 19:00:00
Remember to check their time zones.
R > lubridate::tz(x$A)
[1] "UTC"
R > lubridate::tz(x$B)
[1] "UTC"
R > lubridate::tz(x$C)
[1] "UTC"
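To make the "same instant, different time zone" point concrete, here is a minimal base R sketch. It assumes a local zone of Europe/Berlin purely to reproduce the two-hour shift from the question; your own zone may differ.

```r
# Same instant, two different tzone attributes.
a <- as.POSIXct("2020-08-18 19:00:00", tz = "UTC")
b <- a
attr(b, "tzone") <- "Europe/Berlin"  # assumed zone, for illustration only

# The stored number of seconds since the epoch is unchanged ...
same_instant <- identical(as.numeric(a), as.numeric(b))

# ... only the printed clock time differs.
shown_a <- format(a, "%Y-%m-%d %H:%M:%S")
shown_b <- format(b, "%Y-%m-%d %H:%M:%S")

print(c(same_instant = same_instant))
print(c(A = shown_a, C = shown_b))
```

So comparing the columns with == would still report them as equal; only the display is affected.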

This is a time zone problem:
lubridate::tz(x$A)
[1] "UTC"
lubridate::tz(x$B)
[1] "UTC"
lubridate::tz(x$C)
[1] ""
This is due to the way if_else(condition, true, false) works: it takes the attributes from the true argument, which for C is as.POSIXct(NA) and therefore carries no time zone.

Related

Create a loop to record next record time and date (based on unique id) in same row

I'm trying to create a loop with an if/else statement that pulls the next row's time and records it as the timeout. If there is no next row for that car id, it should return "end"/"exit" instead.
Data and envisioned output
Here is my code, but it doesn't work at all; I'm probably not getting the fundamentals right.
for(i in 1:dim(df2)[1]){
if(df2$car.id[i] == df2$car.id[i +1]){
return$timein[i+1]
}else{
print("end")
}
}
)
Several notes below, but up front try this:
df2$Timeout <- ave(df2$Timein, df2$car.id, FUN = function(z) c(z[-1], NA))
The above code returns the next value of df2$Timein per df2$car.id, properly resetting when the next row is a different car.id. The order matters, so if you need the Timein sorted, then you should sort it before calling ave(.). (This will deal correctly with car.id being out of order and even mixed.)
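As a small check of that behaviour, here is a sketch on toy data. The values are hypothetical and car.id is deliberately interleaved to show that grouping does not depend on row adjacency.

```r
# Toy data (hypothetical values); the two car ids are interleaved on purpose.
df2 <- data.frame(
  car.id = c(1, 1, 2, 1, 2),
  Timein = as.POSIXct("2021-12-18 17:00:00", tz = "UTC") +
    c(0, 300, 600, 900, 1200)
)

# Within each car.id, shift Timein up by one and pad the end with NA.
df2$Timeout <- ave(df2$Timein, df2$car.id, FUN = function(z) c(z[-1], NA))

# Seconds between Timein and the next Timein of the same car.
gap_secs <- as.numeric(difftime(df2$Timeout, df2$Timein, units = "secs"))

print(df2)
print(gap_secs)
```

The last row of each car gets NA, which keeps the column a proper date-time column.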
Issues with your (image of) code. If the above doesn't work, you'll need to clarify these points.
nrow(df2) is more canonical than dim(df2)[1].
Instead of 1:nrow(.), though, I recommend for (i in seq_len(nrow(df2))), as it behaves better in at least one corner case (a frame with zero rows).
Your [i+1] indexing will go beyond the index limit and return NA, which will eventually error with missing value where TRUE/FALSE needed.
return$timein[i+1] seems wrong, unless you have a named-list or data.frame object that is return; I discourage that, as it can be confused (by people) with the base R primitive (function) return(.). If it is not an object, then you are using it wrong, and frankly I don't know what it should be since a for loop here seems unnecessary.
Your expected output is not fully clear, but I'll guess that you want either a timestamp or the literal "End". The latter will break your timestamps, converting them from POSIXt-class objects to strings. In general, a column in a frame cannot be mixed classes.
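Two of the points above (the seq_len corner case and the out-of-range [i+1] index) can be seen in a few lines. This is only an illustration on made-up vectors, not a replacement for the ave() approach.

```r
# Corner case: with zero rows, 1:n counts down, seq_len(n) is empty.
empty <- data.frame(car.id = numeric(0))
bad_idx  <- 1:nrow(empty)          # c(1, 0): the loop body would run twice
good_idx <- seq_len(nrow(empty))   # integer(0): the loop body never runs

# Indexing one past the end yields NA, and if (NA) is an error.
ids <- c(7, 7, 8)
past_end <- ids[length(ids) + 1]

# A guarded comparison: only look at i + 1 when it exists.
next_same <- logical(length(ids))
for (i in seq_along(ids)) {
  next_same[i] <- i < length(ids) && ids[i] == ids[i + 1]
}

print(bad_idx)
print(good_idx)
print(past_end)
print(next_same)
```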
Try using dplyr. Working with toy data.
library(dplyr)
dat %>% group_by( car.id ) %>%
mutate( Timeout=lead(as.character(Timein), default="END") ) %>% ungroup
# A tibble: 10 x 4
car.id car.type Timein Timeout
<dbl> <dbl> <dttm> <chr>
1 14359825 1 2021-12-18 17:28:58 2021-12-18 17:33:58
2 14359825 1 2021-12-18 17:33:58 2021-12-18 18:03:58
3 14359825 1 2021-12-18 18:03:58 2021-12-18 18:08:58
4 14359825 1 2021-12-18 18:08:58 2021-12-18 18:13:58
5 14359825 1 2021-12-18 18:13:58 END
6 243095743 2 2021-12-18 18:30:38 2021-12-18 18:37:18
7 243095743 2 2021-12-18 18:37:18 2021-12-18 19:17:18
8 243095743 2 2021-12-18 19:17:18 2021-12-18 19:23:58
9 243095743 2 2021-12-18 19:23:58 2021-12-18 19:30:38
10 243095743 2 2021-12-18 19:30:38 END
If you want a date-time Timeout column, you can always convert it back (the "END" entries become NA)
as.POSIXct( dat$Timeout, format="%F %T" )
[1] "2021-12-18 17:33:58 CET" "2021-12-18 18:03:58 CET"
[3] "2021-12-18 18:08:58 CET" "2021-12-18 18:13:58 CET"
[5] NA "2021-12-18 18:37:18 CET"
[7] "2021-12-18 19:17:18 CET" "2021-12-18 19:23:58 CET"
[9] "2021-12-18 19:30:38 CET" NA
or directly use
dat %>% group_by( car.id ) %>% mutate( Timeout=lead( Timein ) )
Data
dat <- structure(list(car.id = c(14359825, 14359825, 14359825, 14359825,
14359825, 243095743, 243095743, 243095743, 243095743, 243095743
), car.type = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), Timein = structure(c(1639844938.6685,
1639845238.6685, 1639847038.6685, 1639847338.6685, 1639847638.6685,
1639848638.6685, 1639849038.6685, 1639851438.6685, 1639851838.6685,
1639852238.6685), class = c("POSIXct", "POSIXt"))), row.names = c(NA,
10L), class = "data.frame")

How to calculate time difference in milliseconds using R when formats are different?

I have a problem in R that is killing me! Can you help me?
I found a question in StackOverflow that gave me a very good explanation.
Here is the link: How to parse milliseconds?
I was able to implement the following code that works very well.
z2 <- strptime("10/2/20 11:16:17.682", "%d/%m/%y %H:%M:%OS")
z1 <- strptime("10/2/20 11:16:16.683", "%d/%m/%y %H:%M:%OS")
When I calculate z2-z1, I get
Time difference of 0.9989998 secs
Similarly, when I use
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
z4 <- strptime("130 11:16:18.682", "%j %H:%M:%OS")
When I calculate z4-z3, I get
Time difference of 1.999 secs
What is my problem?
The first column has the format 130 18:25:50.408, with millions of rows!!!
The second column has the format 2020 130 18:25:51.357 that is like the first column but has the year 2020.
The first column is also from 2020, but as the year is not there R uses the current year.
First question,
How can I subtract both columns? I know how to subtract columns.
What I do not know is how to subtract these two times.
For example, second time is 2020 130 18:25:51.357
and first time is 130 18:25:50.408
I guess that I can do it programmatically converting it to a string, and eliminating the 2020. However, I am hoping that a quicker solution is available using base R or the lubridate package.
Second question,
"%j %H:%M:%OS" is the format for 130 11:16:16.683
What is the format for 2020 130 18:25:51.357?
As explained before this is working very well:
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
But, this is NOT working.
z7 <- strptime("2020 130 11:16:16.683", "%y %j %H:%M:%OS")
UPDATE 1
I solved the second question!
However, I have not figured out yet the first question.
For the second question, the mistake in the format was that instead of %y, I need to write %Y with upper case.
Here is one example:
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("2020 130 11:16:16.684", "%Y %j %H:%M:%OS")
difftime(later,earlier,units="secs")
The R result is:
Time difference of 0.9990001 secs
UPDATE 2
At this point, what is pending is the following:
I need to subtract two times that were recorded on the same day in 2020.
The second time does have the year, the first time does not.
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("130 11:16:16.684", "%j %H:%M:%OS")
difftime(later,earlier,units="secs")
R produces the following result:
Time difference of -31622399 secs
Why? As we are in 2021, and the year is missing, R assumes the current year (2021) for the vector earlier.
My columns have millions of rows.
At this point, my guess is that I would need to add 2020 with a concatenation or something like that. Is there any other method?
Thank you for your help!
Your object z2 is a POSIXlt ("POSIX list") object. This means that it is a list of the individual components of your time.
print.default(z2)
# $sec
# [1] 17.682
#
# $min
# [1] 16
#
# $hour
# [1] 11
#
# $mday
# [1] 10
#
# $mon
# [1] 1
#
# $year
# [1] 120
#
# $wday
# [1] 1
#
# $yday
# [1] 40
#
# $isdst
# [1] 0
#
# $zone
# [1] "GMT"
#
# $gmtoff
# [1] NA
#
# attr(,"class")
# [1] "POSIXlt" "POSIXt"
When you do a subtraction, z2 - z1, R dispatches the operation to a method called -.POSIXt, which itself calls difftime. This function converts z2 to a POSIXct ("POSIX count") object, that is, a count of seconds since the beginning of the epoch, 1970-01-01 (UTC).
options("digits" = 16)
print.default(as.POSIXct(z2))
# [1] 1581333377.682
# attr(,"class")
# [1] "POSIXct" "POSIXt"
# attr(,"tzone")
# [1] ""
difftime(z2, z1)
# Time difference of 0.9989998340606689 secs
R, like most software, works with double precision numerics. This means that arithmetic is imprecise, although approximately true. Most software will try to hide this imprecision by reducing the number of digits shown. That said, different numbers will give you different imprecision, so you might prefer referring directly to the list element of z2.
print.default(z2$sec - z1$sec)
# [1] 0.9989999999999988
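If the goal is a difference in whole milliseconds, one way to sidestep this imprecision is to compare with a tolerance or to round after scaling. A minimal sketch using the epoch seconds shown above:

```r
# Epoch seconds for the two example times (as printed above for z2, and
# one second minus one millisecond earlier for z1).
a <- 1581333377.682
b <- 1581333376.683
d <- a - b

exact_match <- d == 0.999                                   # exact comparison
close_match <- isTRUE(all.equal(d, 0.999, tolerance = 1e-6)) # tolerant comparison
millis      <- round(d * 1000)                               # whole milliseconds

print(c(exact_match = exact_match, close_match = close_match))
print(millis)
```

Exact equality fails because doubles near 1.6e9 are spaced roughly 2.4e-7 apart, while rounding to milliseconds recovers the intended value.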
You could therefore apply the time difference using your favourite data.frame tools.
options("digits" = 6)
# character columns
df1 <- data.frame(
col1 = c("10/2/20 11:16:17.682", "10/2/20 11:16:16.683"),
col2 = c("130 11:16:16.683", "130 11:16:18.682"),
stringsAsFactors = FALSE)
library(dplyr)
# convert columns to POSIXlt
df2 <- mutate(df1,
col1 = strptime(col1, "%d/%m/%y %H:%M:%OS"),
col2 = strptime(stringr::str_c("2020 ", col2), "%Y %j %H:%M:%OS"),
diff_days = unclass(difftime(col2, col1, units = "days")))
df2
# col1 col2 diff_days
# 1 2020-02-10 11:16:17 2020-05-09 11:16:16 88.9583
# 2 2020-02-10 11:16:16 2020-05-09 11:16:18 88.9584
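For the concrete case in the question (one column without a year, one with it), a base R sketch could borrow the year from the second column instead of hard-coding 2020. The vector names first and second are hypothetical, and tz = "UTC" is an assumption made to keep daylight saving changes out of the result.

```r
# Hypothetical column contents in the two formats from the question.
first  <- c("130 18:25:50.408", "130 11:16:16.684")
second <- c("2020 130 18:25:51.357", "2020 130 11:16:17.683")

# Take the four-digit year from the second column and prepend it to the first.
year <- substr(second, 1, 4)

t1 <- as.POSIXct(strptime(paste(year, first), "%Y %j %H:%M:%OS", tz = "UTC"))
t2 <- as.POSIXct(strptime(second,             "%Y %j %H:%M:%OS", tz = "UTC"))

diff_secs <- as.numeric(difftime(t2, t1, units = "secs"))
print(round(diff_secs, 3))
```

All steps are vectorised, so this should scale to millions of rows without an explicit loop.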

Assign vector of time intervals to non-overlapping groups

I have vectors of Intervals created by the R package lubridate:
library(lubridate)
ints <- new("Interval", .Data = c(61379.0158998966, 61379.0158998966,
174450.142500162, 2105574.12809992,
1986079.47369981),
start = structure(c(1477895188.5302, 1477895188.5302,
1478301991.7993, 1478488100.319,
1478607594.9734),
tzone = "America/New_York", class = c("POSIXct", "POSIXt")),
tzone = "America/New_York")
ints
#> [1] 2016-10-31 02:26:28 EDT--2016-10-31 19:29:27 EDT
#> [2] 2016-10-31 02:26:28 EDT--2016-10-31 19:29:27 EDT
#> [3] 2016-11-04 19:26:31 EDT--2016-11-06 18:54:01 EST
#> [4] 2016-11-06 22:08:20 EST--2016-12-01 07:01:14 EST
#> [5] 2016-11-08 07:19:54 EST--2016-12-01 07:01:14 EST
I'd like to pass this vector of Intervals to a function and have it return an identical-length vector of group membership, where group membership is determined by overlapping time intervals. In this example, the returned vector would be:
c(1, 1, 2, 3, 3)
lubridate is able to evaluate overlap of pairs of intervals with int_overlaps, but I'm hoping someone has already generalized this to identify groups of non-overlapping intervals.
We can use int_overlaps from lubridate. The idea is to check whether each interval overlaps the previous one (lag), which returns a logical vector; negating it and taking cumsum turns it into a group index.
library(lubridate)
library(dplyr)
cumsum(!int_overlaps(ints, lag(ints, default = first(ints)))) + 1
#[1] 1 1 2 3 3
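Comparing each interval only with its immediate predecessor works for this data, but it can miss an overlap with an earlier, longer interval. A more general sketch, using plain numeric start and end values rather than lubridate objects (group_overlaps is a hypothetical helper, not a lubridate function), compares each start with the running maximum of all earlier ends:

```r
# Hypothetical helper: assign overlapping [start, end] ranges to groups.
group_overlaps <- function(start, end) {
  stopifnot(length(start) == length(end))
  ord <- order(start)
  s <- start[ord]
  e <- end[ord]
  # Latest end seen among all earlier intervals (-Inf before the first one).
  prev_max_end <- c(-Inf, cummax(e)[-length(e)])
  # A new group begins whenever an interval starts after everything before it.
  grp_sorted <- cumsum(s > prev_max_end)
  grp <- integer(length(start))
  grp[ord] <- grp_sorted
  grp
}

# Start times and durations (in seconds) copied from the ints object above.
starts <- c(1477895188.5302, 1477895188.5302, 1478301991.7993,
            1478488100.319, 1478607594.9734)
durs   <- c(61379.0158998966, 61379.0158998966, 174450.142500162,
            2105574.12809992, 1986079.47369981)
groups <- group_overlaps(starts, starts + durs)

# A nested case where the third range overlaps the first but not the second.
nested <- group_overlaps(c(0, 1, 3), c(10, 2, 4))

print(groups)
print(nested)
```

With lubridate intervals, as.numeric(int_start(ints)) and as.numeric(int_end(ints)) should supply the two arguments.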

Vectorised time zone conversion with lubridate

I have a data frame with a column of date-time strings:
library(tidyverse)
library(lubridate)
testdf = data_frame(
mytz = c('Australia/Sydney', 'Australia/Adelaide', 'Australia/Perth'),
mydt = c('2018-01-17T09:15:00', '2018-01-17T09:16:00', '2018-01-17T09:18:00'))
testdf
# A tibble: 3 x 2
# mytz mydt
# <chr> <chr>
# 1 Australia/Sydney 2018-01-17T09:15:00
# 2 Australia/Adelaide 2018-01-17T09:16:00
# 3 Australia/Perth 2018-01-17T09:18:00
I want to convert these date-time strings to POSIX date-time objects with their respective timezones:
testdf %>% mutate(mydt_new = ymd_hms(mydt, tz = mytz))
Error in mutate_impl(.data, dots) :
Evaluation error: tz argument must be a single character string.
In addition: Warning message:
In if (tz != "UTC") { :
the condition has length > 1 and only the first element will be used
I get the same result if I use ymd_hms without a timezone and pipe it into force_tz. Is it fair to conclude that lubridate doesn't support any sort of vectorisation when it comes to timezone operations?
Another option is map2. It may be better to keep the results in a list column, because each element can then retain its own time zone; combined into a single date-time vector, they may get coerced to one tz.
library(tidyverse)
out <- testdf %>%
mutate(mydt_new = map2(mydt, mytz, ~ymd_hms(.x, tz = .y)))
If required, it can be unnested
out %>%
unnest(mydt_new)
The values in the list are
out %>%
pull(mydt_new)
#[[1]]
#[1] "2018-01-17 09:15:00 AEDT"
#[[2]]
#[1] "2018-01-17 09:16:00 ACDT"
#[[3]]
#[1] "2018-01-17 09:18:00 AWST"
The message "tz argument must be a single character string." indicates that more than one time zone was passed to ymd_hms(). To make sure that only one time zone is passed at a time, I used rowwise(). Note that I am not in an Australian time zone, so the values below are displayed in my local zone and may look different from yours.
testdf <- data_frame(mytz = c('Australia/Sydney', 'Australia/Adelaide', 'Australia/Perth'),
mydt = c('2018-01-17 09:15:00', '2018-01-17 09:16:00', '2018-01-17 09:18:00'))
testdf %>%
rowwise %>%
mutate(mydt_new = ymd_hms(mydt, tz = mytz))
mytz mydt mydt_new
<chr> <chr> <dttm>
1 Australia/Sydney 2018-01-17 09:15:00 2018-01-17 06:15:00
2 Australia/Adelaide 2018-01-17 09:16:00 2018-01-17 06:46:00
3 Australia/Perth 2018-01-17 09:18:00 2018-01-17 09:18:00
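The output above shows all three values in one zone because a POSIXct column carries a single tzone attribute. A base R sketch of the same idea, under the assumption that storing the instants in UTC and formatting per row on demand is acceptable:

```r
mytz <- c("Australia/Sydney", "Australia/Adelaide", "Australia/Perth")
mydt <- c("2018-01-17T09:15:00", "2018-01-17T09:16:00", "2018-01-17T09:18:00")

# Parse each string in its own zone and keep the epoch seconds.
secs <- vapply(seq_along(mydt), function(i) {
  as.numeric(as.POSIXct(mydt[i], format = "%Y-%m-%dT%H:%M:%S", tz = mytz[i]))
}, numeric(1))

# One POSIXct vector, stored and displayed in UTC.
utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
utc_text <- format(utc, "%Y-%m-%d %H:%M:%S")

# Local wall-clock text per row, recovered only when needed.
local_text <- vapply(seq_along(utc), function(i) {
  format(utc[i], "%Y-%m-%d %H:%M:%S", tz = mytz[i])
}, character(1))

print(utc_text)
print(local_text)
```

This keeps the column sortable and comparable, while the per-row zone stays in mytz.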

Epoch time and local time, different time zone

I want to convert epoch time to the local time. As you can see here, I have different time zones and I want to get a local time for each row. How can I do the conversion considering each time zone?
df <- data.frame(Epoch_Time = c(1460230930,1460231830, 1459929664),
Time_Zone = c("UTC−12:00", "UTC+10:00", "UTC-9:00"))
You need to store your epoch time as POSIXct; then you can manipulate it more easily.
library(dplyr)
library(lubridate)
df <- tibble(
time_epoch = as.POSIXct(
c(1460230930,1460231830, 1459929664), tz = "UTC", origin = "1970-01-01"
),
time_zone = c("UTC-12:00", "UTC+10:00", "UTC-09:00")
)
df <- mutate(df,
time_zone = as.numeric(substr(time_zone, 4, 6)),
time_local = as.character(time_epoch + hours(time_zone))
)
df
# # A tibble: 3 x 3
# time_epoch time_zone time_local
# <dttm> <dbl> <chr>
# 1 2016-04-09 19:42:10 -12 2016-04-09 07:42:10
# 2 2016-04-09 19:57:10 10 2016-04-10 05:57:10
# 3 2016-04-06 08:01:04 -9 2016-04-05 23:01:04
Notes:
I haven't put the effort in to properly generalise the conversion from your UTC strings, only enough for this example. Ideally, you want Olson names instead of offsets; base R lists them via OlsonNames().
time_local is stored as character because you cannot store a date/time column with multiple time zones: the zone is stored as a single value for the whole column, see attributes(df$time_epoch)
attributes(df$time_epoch)
# $class
# [1] "POSIXct" "POSIXt"
#
# $tzone
# [1] "UTC"
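A slightly more general way to read the offset strings is sketched below. parse_utc_offset is a hypothetical helper that assumes the form "UTC" + sign + hours + ":" + minutes, and it also accepts the Unicode minus sign (U+2212) that appears in the question's first value.

```r
# Hypothetical helper: "UTC+10:00" style strings to an offset in seconds.
parse_utc_offset <- function(x) {
  x <- gsub("\u2212", "-", x, fixed = TRUE)  # normalise Unicode minus
  m <- regmatches(x, regexec("^UTC([+-])([0-9]{1,2}):([0-9]{2})$", x))
  vapply(m, function(p) {
    if (length(p) == 0) return(NA_real_)     # string did not match
    sgn <- if (p[2] == "-") -1 else 1
    sgn * (as.numeric(p[3]) * 3600 + as.numeric(p[4]) * 60)
  }, numeric(1))
}

epoch <- as.POSIXct(c(1460230930, 1460231830, 1459929664),
                    origin = "1970-01-01", tz = "UTC")
zones <- c("UTC\u221212:00", "UTC+10:00", "UTC-9:00")

offsets <- parse_utc_offset(zones)
# Shift the instant by the offset and print it as wall-clock text.
time_local <- format(epoch + offsets, "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(offsets)
print(time_local)
```

Fixed offsets ignore daylight saving rules, which is why Olson names remain preferable when they are available.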
