I have a dataset with dates stored in the DB as UTC, but the actual timezone of each row is different.
mydat <- data.frame(
time_stamp=c("2022-08-01 05:00:00 UTC","2022-08-01 17:00:00 UTC","2022-08-02 22:30:00 UTC","2022-08-04 05:00:00 UTC","2022-08-05 02:00:00 UTC"),
timezone=c("America/Chicago","America/New_York","America/Los_Angeles","America/Denver","America/New_York")
)
I want to apply the timezone to the UTC saved timestamps, over the entire column.
I looked into the with_tz function in the lubridate package, but I don't see how to reference the "timezone" column, rather than hardcoding in a value.
For example, if I try
with_tz(mydat$time_stamp, tzone = mydat$timezone)
I get the following error
Error in as.POSIXlt.POSIXct(x, tz) : invalid 'tz' value
However, if I try
mydat$time_stamp2 <- with_tz(mydat$time_stamp,"America/New_York")
that will render a new column without issue. How can I do this just referencing column values?
Welcome to StackOverflow. This is a nice, common, and tricky problem! The following should do what you ask for:
Code
mydat <- data.frame(time_stamp=c("2022-08-01 05:00:00 UTC",
"2022-08-01 17:00:00 UTC",
"2022-08-02 22:30:00 UTC",
"2022-08-04 05:00:00 UTC",
"2022-08-05 02:00:00 UTC"),
timezone=c("America/Chicago", "America/New_York",
"America/Los_Angeles", "America/Denver",
"America/New_York"))
mydat$utc <- anytime::utctime(mydat$time_stamp, tz="UTC")
mydat$format <- ""
for (i in seq_len(nrow(mydat)))
mydat[i, "format"] <- strftime(mydat[i,"utc"],
"%Y-%m-%d %H:%M:%S",
tz=mydat[i,"timezone"])
Output
> mydat
time_stamp timezone utc format
1 2022-08-01 05:00:00 UTC America/Chicago 2022-08-01 05:00:00 2022-08-01 00:00:00
2 2022-08-01 17:00:00 UTC America/New_York 2022-08-01 17:00:00 2022-08-01 13:00:00
3 2022-08-02 22:30:00 UTC America/Los_Angeles 2022-08-02 22:30:00 2022-08-02 15:30:00
4 2022-08-04 05:00:00 UTC America/Denver 2022-08-04 05:00:00 2022-08-03 23:00:00
5 2022-08-05 02:00:00 UTC America/New_York 2022-08-05 02:00:00 2022-08-04 22:00:00
>
Comment
We first parse your data as UTC; I once wrote a helper function for that in my anytime package (there are alternatives, but this is how I do it...). We then need to format from the given (numeric!) UTC representation to the given timezone. We need a loop for this because the tz argument to strftime() is not vectorized.
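If you would rather avoid the explicit for loop, the same per-row formatting can be sketched in base R with vapply(). This is only a sketch under assumptions: the names `utc` and `local_chr` are mine, and it assumes the trailing " UTC" label can simply be stripped before parsing.

```r
mydat <- data.frame(
  time_stamp = c("2022-08-01 05:00:00 UTC", "2022-08-01 17:00:00 UTC",
                 "2022-08-02 22:30:00 UTC", "2022-08-04 05:00:00 UTC",
                 "2022-08-05 02:00:00 UTC"),
  timezone = c("America/Chicago", "America/New_York", "America/Los_Angeles",
               "America/Denver", "America/New_York"),
  stringsAsFactors = FALSE
)

# Drop the literal " UTC" suffix and parse the rest as a UTC instant
utc <- as.POSIXct(sub(" UTC$", "", mydat$time_stamp),
                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# format() accepts a single tz, so it is called once per row
local_chr <- vapply(
  seq_len(nrow(mydat)),
  function(i) format(utc[i], format = "%Y-%m-%d %H:%M:%S", tz = mydat$timezone[i]),
  character(1)
)
local_chr
```

The result is a character vector, which is the only way to show several zones in one column; the work is still done row by row, just without the loop syntax.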
Dirk gave a great answer that uses (mostly) base R tooling, if that is a requirement of yours. I wanted to also add an answer that uses the clock package that I developed because it doesn't require working rowwise over your data frame. clock has a function called sys_time_info() that retrieves low level information about a UTC time point in a specific time zone. It is one of the few functions where it makes sense to have a vectorized zone argument (which you need here) and returns an offset from UTC that will be useful here for converting to a "local" time.
As others have mentioned, you won't be able to construct a date-time vector that stores multiple time zones in it, but if you just need to see what the local time would have been in those zones, this can still be useful.
library(clock)
mydat <- data.frame(
time_stamp=c("2022-08-01 05:00:00 UTC","2022-08-01 17:00:00 UTC","2022-08-02 22:30:00 UTC","2022-08-04 05:00:00 UTC","2022-08-05 02:00:00 UTC"),
timezone=c("America/Chicago","America/New_York","America/Los_Angeles","America/Denver","America/New_York")
)
# Parse into a "sys-time" type, which can be thought of as a UTC time point
mydat$time_stamp <- sys_time_parse(mydat$time_stamp, format = "%Y-%m-%d %H:%M:%S")
mydat
#> time_stamp timezone
#> 1 2022-08-01T05:00:00 America/Chicago
#> 2 2022-08-01T17:00:00 America/New_York
#> 3 2022-08-02T22:30:00 America/Los_Angeles
#> 4 2022-08-04T05:00:00 America/Denver
#> 5 2022-08-05T02:00:00 America/New_York
# "Low level" information about DST, the time zone abbreviation,
# and offset from UTC in that zone. This is one of the few functions where
# it makes sense to have a vectorized `zone` argument.
info <- sys_time_info(mydat$time_stamp, mydat$timezone)
info
#> begin end offset dst abbreviation
#> 1 2022-03-13T08:00:00 2022-11-06T07:00:00 -18000 TRUE CDT
#> 2 2022-03-13T07:00:00 2022-11-06T06:00:00 -14400 TRUE EDT
#> 3 2022-03-13T10:00:00 2022-11-06T09:00:00 -25200 TRUE PDT
#> 4 2022-03-13T09:00:00 2022-11-06T08:00:00 -21600 TRUE MDT
#> 5 2022-03-13T07:00:00 2022-11-06T06:00:00 -14400 TRUE EDT
# Add the offset to the sys-time and then convert to a character column
# (these times don't really represent sys-time anymore since they are now localized)
mydat$localized <- as.character(mydat$time_stamp + info$offset)
mydat
#> time_stamp timezone localized
#> 1 2022-08-01T05:00:00 America/Chicago 2022-08-01T00:00:00
#> 2 2022-08-01T17:00:00 America/New_York 2022-08-01T13:00:00
#> 3 2022-08-02T22:30:00 America/Los_Angeles 2022-08-02T15:30:00
#> 4 2022-08-04T05:00:00 America/Denver 2022-08-03T23:00:00
#> 5 2022-08-05T02:00:00 America/New_York 2022-08-04T22:00:00
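For readers without clock installed, the offset idea above can be roughly approximated in base R. This is a sketch, not clock's implementation: `utc_offset()` is a hypothetical helper of mine that derives the offset by reading the local wall-clock time as if it were UTC and subtracting the true instant.

```r
utc <- as.POSIXct(c("2022-08-01 05:00:00", "2022-08-02 22:30:00"), tz = "UTC")
zones <- c("America/Chicago", "America/Los_Angeles")
fmt <- "%Y-%m-%d %H:%M:%S"

# Hypothetical helper: offset in seconds =
# (local wall clock re-read as UTC) - (true UTC instant)
utc_offset <- function(t, zone) {
  wall <- format(t, format = fmt, tz = zone)
  as.numeric(as.POSIXct(wall, format = fmt, tz = "UTC")) - as.numeric(t)
}

offsets <- vapply(seq_along(utc), function(i) utc_offset(utc[i], zones[i]), numeric(1))
offsets

# Shift the instants by the offsets and print them in UTC to get local clock times
localized <- format(utc + offsets, format = fmt, tz = "UTC")
localized
```

As in the clock version, the shifted values no longer represent real UTC instants, so they are kept as character strings.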
Related
Suppose there is a csv file named ta_sample.csv as under:
"BILL_DT","AMOUNT"
"2015-07-27T18:30:00Z",16000
"2015-07-07T18:30:00Z",6110
"2015-07-26T18:30:00Z",250
"2015-07-22T18:30:00Z",1000
"2015-07-06T18:30:00Z",2640000
I read the above with read_csv_arrow, customizing the column types (which is always needed for actual production data):
library(arrow)
read_csv_arrow(
"ta_sample.csv",
col_names = c("BILL_DT", "AMOUNT"),
col_types = "td",
skip = 1,
timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
the result is as under:
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-28 00:00:00 16000
2 2015-07-08 00:00:00 6110
3 2015-07-27 00:00:00 250
4 2015-07-23 00:00:00 1000
5 2015-07-07 00:00:00 2640000
The issue here is that the dates are shifted forward by one day and the time of day disappears. It is worth mentioning that data.table::fread() and readr::read_csv() both read it properly, e.g.,
library(readr)
read_csv("ta_sample.csv")
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-27 18:30:00 16000
2 2015-07-07 18:30:00 6110
3 2015-07-26 18:30:00 250
4 2015-07-22 18:30:00 1000
5 2015-07-06 18:30:00 2640000
Parsing example values from the BILL_DT column with strptime also works perfectly, as under:
strptime(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "%Y-%m-%dT%H:%M:%SZ")
[1] "2015-07-27 18:30:00 IST" "2015-07-07 18:30:00 IST"
What parameters in read_csv_arrow need to be adjusted to get results identical to that given by readr::read_csv() ?
There are a few things going on here, but they all relate to timezones + how they are interpreted by various parts of R + Arrow + other packages.
When Arrow reads in timestamps, it treats the values as if they were UTC. Arrow does not yet have the ability to specify alternative timezones when parsing[1], so it stores these values as timezoneless (and assumes UTC). In this case, since the timestamps you have are UTC (according to ISO 8601, the Z at the end means UTC), they are stored correctly in Arrow as timezoneless UTC timestamps. The values of the timestamps are the same (that is, they represent the same time in UTC); the difference is in how they are displayed: as the time in UTC or as the time in the local timezone.
When the timestamps are converted into R, the timezonelessness is preserved:
> from_arrow <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
>
> attr(from_arrow$BILL_DT, "tzone")
NULL
R defaults to displaying timestamps without a tzone attribute in the local timezone (for me it's currently CDT, for you it looks like it's IST). And, note that timestamps with an explicit timezone are displayed in that timezone.
> from_arrow$BILL_DT
[1] "2015-07-27 13:30:00 CDT" "2015-07-07 13:30:00 CDT"
[3] "2015-07-26 13:30:00 CDT" "2015-07-22 13:30:00 CDT"
[5] "2015-07-06 13:30:00 CDT"
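A small base R sketch of the same point, independent of Arrow: changing the tzone attribute changes only how the instant is printed, not the instant itself. "Asia/Kolkata" is my assumption for the zone behind the IST label.

```r
x <- as.POSIXct("2015-07-27 18:30:00", tz = "UTC")

# Copy the value and change only the display attribute
y <- x
attr(y, "tzone") <- "Asia/Kolkata"  # assumed zone for IST

# Same underlying number of seconds since the epoch
same_instant <- as.numeric(x) == as.numeric(y)

# Different printed clock time
shown_utc <- format(x, format = "%Y-%m-%d %H:%M:%S")
shown_ist <- format(y, format = "%Y-%m-%d %H:%M:%S")
c(shown_utc, shown_ist)
```

This is why an 18:30 UTC timestamp shows up as midnight on the following day in a session running on IST.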
If you would like to display the UTC timestamps, you can do a few things:
Explicitly set the tzone attribute (or you could use lubridate::with_tz() for the same operation):
> attr(from_arrow$BILL_DT, "tzone") <- "UTC"
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
You can set the timezone in your R session so that R uses UTC when it displays the time (note: the tzone attribute is still unset here, but the display is UTC because the session timezone is set to UTC):
> Sys.setenv(TZ="UTC")
> from_arrow <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
> attr(from_arrow$BILL_DT, "tzone")
NULL
You could read the data into an Arrow table, and cast the timestamp to have an explicit timezone in Arrow before pulling the data into R with collect(). This csv -> Arrow table -> data.frame is what happens under the hood, so there are no additional conversions going on here (other than the cast). And it can be useful + more efficient to do operations on the Arrow table if you have other transformations you are applying, though it is more code than the first two.
> library(arrow)
> library(dplyr)
> tab <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ as_data_frame = FALSE)
>
> tab_df <- tab %>%
+ mutate(BILL_DT_cast = cast(BILL_DT, timestamp(unit = "s", timezone = "UTC"))) %>%
+ collect()
> attr(tab_df$BILL_DT, "tzone")
NULL
> attr(tab_df$BILL_DT_cast, "tzone")
[1] "UTC"
> tab_df
# A tibble: 5 × 3
BILL_DT AMOUNT BILL_DT_cast
<dttm> <dbl> <dttm>
1 2015-07-27 13:30:00 16000 2015-07-27 18:30:00
2 2015-07-07 13:30:00 6110 2015-07-07 18:30:00
3 2015-07-26 13:30:00 250 2015-07-26 18:30:00
4 2015-07-22 13:30:00 1000 2015-07-22 18:30:00
5 2015-07-06 13:30:00 2640000 2015-07-06 18:30:00
This is also made a bit more confusing because base R's strptime() doesn't parse timezones (which is why you're seeing the same clock time but with IST in your example above). lubridate's[2] parsing functions do respect this, and you can see the difference here:
> lubridate::parse_date_time(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "YmdHMS")
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
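If you want to stay in base R, one option is to treat the trailing Z as a literal character in the format and state the timezone yourself. This is a sketch that assumes every value really does end in Z; `z_times` and `parsed` are illustrative names.

```r
z_times <- c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z")

# The "Z" in the format only matches the literal character;
# tz = "UTC" is what tells R how to interpret the clock time
parsed <- as.POSIXct(z_times, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

shown <- format(parsed, format = "%Y-%m-%d %H:%M:%S", usetz = TRUE)
shown
```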
[1] Though we have two issues related to adding this functionality https://issues.apache.org/jira/browse/ARROW-12820 and https://issues.apache.org/jira/browse/ARROW-13348
[2] And, lubridate's docs even mention this:
ISO8601 signed offset in hours and minutes from UTC. For example -0800, -08:00 or -08, all represent 8 hours behind UTC. This format also matches the Z (Zulu) UTC indicator. Because base::strptime() doesn't fully support ISO8601 this format is implemented as an union of 4 orders: Ou (Z), Oz (-0800), OO (-08:00) and Oo (-08). You can use these four orders as any other but it is rarely necessary. parse_date_time2() and fast_strptime() support all of the timezone formats.
https://lubridate.tidyverse.org/reference/parse_date_time.html
I can't seem to find a solution on here; please let me know if there already is one. Thanks!
I have a column in R with dates and times, some of which are wrong. I am hoping to change the incorrect ones to the correct ones.
The data is in POSIXct. I have duplicated the original column so I can modify the new one.
An example of what the data looks like,
id <- c(1:5)
opdate <- c(2018-03-01 11:50:00 UTC, 2018-03-02 09:35:00 UTC, 2018-02-27 17:00:00 UTC, 2018-03-06 09:00:00 UTC, 2018-03-08 08:40:00 UTC)
I want to change the date for opdate[3] to "2018-03-08 10:05:00 UTC"
I have tried
opdate[3] <- "2018-03-08 10:05:00 UTC"
and
opdate [3] <- "2018-03-08 10:05:00 UTC", format = "%Y-%m-%d %H:%M:%S"
but it's not working. I think it is probably something basic, but I don't know what.
Ideally, the output for the amended opdate would be
2018-03-01 11:50:00 UTC, 2018-03-02 09:35:00 UTC, 2018-03-08 10:05:00 UTC, 2018-03-06 09:00:00 UTC, 2018-03-08 08:40:00 UTC
but it comes up with an error saying there is an unexpected numeric constant.
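One possible way to do this, as a sketch: build the example vector from quoted strings (unquoted dates are not valid R, which is a likely source of the "unexpected numeric constant" error) and assign a POSIXct replacement with an explicit tz. The vector below is rebuilt from the values in the question, on the assumption that all of them are UTC.

```r
# Date-times must be quoted strings before conversion to POSIXct
opdate <- as.POSIXct(
  c("2018-03-01 11:50:00", "2018-03-02 09:35:00", "2018-02-27 17:00:00",
    "2018-03-06 09:00:00", "2018-03-08 08:40:00"),
  tz = "UTC"
)

# Replace the third element; giving tz explicitly avoids depending on the
# session's local timezone when the replacement is parsed
opdate[3] <- as.POSIXct("2018-03-08 10:05:00", tz = "UTC")

format(opdate, format = "%Y-%m-%d %H:%M:%S", usetz = TRUE)
```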
I have this database of time stamps (AlertTime), and I know what time zone these are in (TimeZone). I know how to convert these dates to POSIXct if they were all UTC, but I'm struggling to get them identified as their local time stamps because most functions don't accept a vector for tz.
I do need both the local time stamp properly formatted (AlertTimeLocal) and the UTC equivalent (AlertTimeUTC).
AlertTime              TimeZone             AlertTimeLocal (desired)  AlertTimeUTC (desired)
11 May 2020, 06:22 PM  America/Denver       2020-05-11 18:22:00 MDT   2020-05-12 00:22:00 UTC
11 MAY 2020, 04:11 AM  America/Los_Angeles  2020-05-11 04:11:00 PDT   2020-05-11 11:11:00 UTC
10 MAY 2020, 03:38 PM  America/Chicago      2020-05-10 15:38:00 CDT   2020-05-10 20:38:00 UTC
I was using this code but it doesn't seem to do anything anymore:
FreshAir$AlertTimeLocal <- mapply(function(x,y) {format(x, tz=y, usetz=TRUE)}, FreshAir$AlertTime, FreshAir$TimeZone)
Would a hacky solution be to set all the RAW time stamps to UTC, then convert them to the equivalent time zone in the other direction?
We can use force_tzs from lubridate
library(lubridate)
library(dplyr)
df1 %>%
mutate(AlertTimeLocal = dmy_hm(AlertTime),
AlertTimeUTC = force_tzs(AlertTimeLocal, tzones = TimeZone))
# AlertTime TimeZone AlertTimeLocal AlertTimeUTC
#1 11 May 2020, 06:22 PM America/Denver 2020-05-11 18:22:00 2020-05-12 00:22:00
#2 11 MAY 2020, 04:11 AM America/Los_Angeles 2020-05-11 04:11:00 2020-05-11 11:11:00
#3 10 MAY 2020, 03:38 PM America/Chicago 2020-05-10 15:38:00 2020-05-10 20:38:00
Update
If we need to store as separate time zones, we can use a list column
library(purrr)
df2 <- df1 %>%
mutate(AlertTime2 = dmy_hm(AlertTime),
AlertTimeUTC = force_tzs(AlertTime2, tzones = TimeZone),
AlertTimeLocal = map2(AlertTime2, TimeZone, ~ force_tz(.x, tzone = .y)))
df2$AlertTimeLocal
#[[1]]
#[1] "2020-05-11 18:22:00 MDT"
#[[2]]
#[1] "2020-05-11 04:11:00 PDT"
#[[3]]
#[1] "2020-05-10 15:38:00 CDT"
data
df1 <- structure(list(AlertTime = c("11 May 2020, 06:22 PM",
"11 MAY 2020, 04:11 AM",
"10 MAY 2020, 03:38 PM"), TimeZone = c("America/Denver",
"America/Los_Angeles",
"America/Chicago")), class = "data.frame", row.names = c(NA,
-3L))
I think a tidy solution might look cleaner, but if you want a base R solution, here's an alternative using @akrun's df1:
df1$AlertTimeLocal <- df1$AlertTimeUTC <-
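# do.call(c, ...) combines the per-row results; passing the list straight to c.POSIXct() fails in R >= 4.1
do.call(c, unname(Map(as.POSIXct, df1$AlertTime, tz = df1$TimeZone, format = "%d %b %Y, %I:%M %p")))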
attr(df1$AlertTimeUTC, "tzone") <- "UTC"
attr(df1$AlertTimeLocal, "tzone") <- "US/Mountain"
df1
# AlertTime TimeZone AlertTimeUTC AlertTimeLocal
# 1 11 May 2020, 06:22 PM America/Denver 2020-05-12 00:22:00 2020-05-11 18:22:00
# 2 11 MAY 2020, 04:11 AM America/Los_Angeles 2020-05-11 11:11:00 2020-05-11 05:11:00
# 3 10 MAY 2020, 03:38 PM America/Chicago 2020-05-10 20:38:00 2020-05-10 14:38:00
Something that has not been discussed, though: in R, you cannot have different time zones within one vector of POSIXt. That is, in a vector, time zone is an attribute of the vector, not of the element. If you need individual time zones for each time in that column, you'll need to do a list-column. This works but is not always supported well by utilities/functions that work on data.frame.
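A small sketch of that difference, using two of the example times (`a`, `b`, `v` and `l` are illustrative names): combining into one vector keeps both instants but at most one time zone, while a list keeps each element's own tzone attribute.

```r
a <- as.POSIXct("2020-05-11 18:22:00", tz = "America/Denver")
b <- as.POSIXct("2020-05-11 04:11:00", tz = "America/Los_Angeles")

# One vector: the instants survive, but there is only one (or no) tzone attribute
v <- c(a, b)
attr(v, "tzone")

# A list: each element carries its own tzone attribute
l <- list(a, b)
zones_kept <- vapply(l, function(t) attr(t, "tzone"), character(1))
zones_kept
```

In a data.frame, such a list can be stored as a list-column, for example with `df$col <- l`, with the caveat about tool support mentioned above.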
I'm working with some timestamps in POSIXct format. Right now they are all showing up as being in the timezone "UTC", but in reality some are known to be in the "America/New_York" timezone. I'd like to correct the timestamps so that they all read as the correct times.
I initially used an ifelse() statement along with lubridate::with_tz(). This didn't work as expected because ifelse() didn't return values in POSIXct.
Then I tried dplyr::if_else() based on other posts here, and that's not working as expected either.
I can change a single timestamp or even a list of timestamps to a different timezone using with_tz() (so I know it works), but when I use it within if_else() the output is such that all the values are returned given the "yes" argument in if_else().
library(lubridate)
library(dplyr)
x <- data.frame("ts" = as.POSIXct(c("2017-04-27 13:44:00 UTC",
"2017-03-10 12:22:00 UTC", "2017-03-22 10:24:00 UTC"), tz = "UTC"),
"tz" = c("UTC","EST","UTC"))
x <- mutate(x, ts_New = if_else(tz == "UTC", with_tz(ts, "America/New_York"), ts))
Expected results are below where ts_New has timestamps adjusted to new time zone but only when values in tz = "UTC". Timestamps with tz = "America/New_York" shouldn't change.
ts tz ts_NEW
1 2017-04-27 13:44:00 UTC 2017-04-27 09:44:00
2 2017-03-10 12:22:00 EST 2017-03-10 12:22:00
3 2017-03-22 10:24:00 UTC 2017-03-22 06:24:00
Actual results are below where all ts_New timestamps are adjusted to new time zone regardless of value in tz
x
ts tz ts_New
1 2017-04-27 13:44:00 UTC 2017-04-27 09:44:00
2 2017-03-10 12:22:00 EST 2017-03-10 07:22:00
3 2017-03-22 10:24:00 UTC 2017-03-22 06:24:00
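This doesn't answer your original question about why with_tz doesn't work with if_else, but here is one workaround. We subtract 4 hours (the difference between UTC and EDT, which is the New York offset on these dates) where tz == "UTC". Note that a fixed offset ignores daylight saving changes, so it would be off by an hour for winter dates.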
library(dplyr)
library(lubridate)
x %>% mutate(ts_New = if_else(tz == "UTC", ts - hours(4), ts))
# ts tz ts_New
#1 2017-04-27 13:44:00 UTC 2017-04-27 09:44:00
#2 2017-03-10 12:22:00 EST 2017-03-10 12:22:00
#3 2017-03-22 10:24:00 UTC 2017-03-22 06:24:00
Or in base R
x$ts_New <- x$ts
inds <- x$tz == "UTC"
x$ts_New[inds] <- x$ts_New[inds] - 4 * 60 * 60
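A daylight-saving-aware variant can be sketched in base R. It rests on two points: with_tz() only changes how an instant is displayed, and a POSIXct column has a single tzone attribute, so every row ends up displayed in the same zone whichever branch it came from. The sketch assumes the rows labelled "EST" already hold New York clock times that were merely stamped as UTC; `mislabelled` and `fixed` are names I made up.

```r
x <- data.frame(
  ts = as.POSIXct(c("2017-04-27 13:44:00", "2017-03-10 12:22:00",
                    "2017-03-22 10:24:00"), tz = "UTC"),
  tz = c("UTC", "EST", "UTC"),
  stringsAsFactors = FALSE
)
ny <- "America/New_York"
fmt <- "%Y-%m-%d %H:%M:%S"

# Rows not labelled UTC: keep the clock time, reinterpret it as New York time
mislabelled <- x$tz != "UTC"
fixed <- x$ts
fixed[mislabelled] <- as.POSIXct(format(x$ts[mislabelled], format = fmt, tz = "UTC"),
                                 format = fmt, tz = ny)

# One tzone attribute for the whole column: show every instant in New York time
attr(fixed, "tzone") <- ny
x$ts_New <- fixed
format(x$ts_New, format = fmt)
```

Because the zone name is used instead of a fixed 4-hour shift, dates that fall under EST and dates that fall under EDT are both handled.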
I have a dataframe df with a certain number of columns. One of them, ts, is timestamps:
1462147403122 1462147412990 1462147388224 1462147415651 1462147397069 1462147392497
...
1463529545634 1463529558639 1463529556798 1463529558788 1463529564627 1463529557370.
I have also at my disposal the corresponding datetime in the datetime column:
"2016-05-02 02:03:23 CEST" "2016-05-02 02:03:32 CEST" "2016-05-02 02:03:08 CEST" "2016-05-02 02:03:35 CEST" "2016-05-02 02:03:17 CEST" "2016-05-02 02:03:12 CEST"
...
"2016-05-18 01:59:05 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:16 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:24 CEST" "2016-05-18 01:59:17 CEST"
As you can see, my dataframe contains data across several days. Let's say there are 3. I would like to add a column containing the number 1, 2 or 3: 1 if the line belongs to the first day, 2 for the second day, etc.
Thank you very much in advance,
Clement
One way to do this is to keep track of total days elapsed each time the date changes, as demonstrated below.
library(lubridate)  # provides tz() and `tz<-` used below
# Fake data
dat = data.frame(datetime = c(seq(as.POSIXct("2016-05-02 01:03:11"),
as.POSIXct("2016-05-05 01:03:11"), length.out=6),
seq(as.POSIXct("2016-05-09 01:09:11"),
as.POSIXct("2016-05-16 02:03:11"), length.out=4)))
tz(dat$datetime) = "UTC"
Note, if your datetime column is not already in a datetime format, convert it to one using as.POSIXct.
Now, create a new column with the day number, counting the first day in the sequence as day 1.
dat$day = c(1, cumsum(as.numeric(diff(as.Date(dat$datetime, tz="UTC")))) + 1)
dat
datetime day
1 2016-05-02 01:03:11 1
2 2016-05-02 15:27:11 1
3 2016-05-03 05:51:11 2
4 2016-05-03 20:15:11 2
5 2016-05-04 10:39:11 3
6 2016-05-05 01:03:11 4
7 2016-05-09 01:09:11 8
8 2016-05-11 09:27:11 10
9 2016-05-13 17:45:11 12
10 2016-05-16 02:03:11 15
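The numbering above counts elapsed calendar days, so days with no data leave gaps (4, then 8). If you instead want consecutive labels 1, 2, 3, ... for the days that actually occur, one possible base R sketch is below; `elapsed_day` and `dense_day` are illustrative names and the four timestamps are a shortened fake sample.

```r
datetime <- as.POSIXct(c("2016-05-02 01:03:11", "2016-05-02 15:27:11",
                         "2016-05-03 05:51:11", "2016-05-09 01:09:11"),
                       tz = "UTC")
d <- as.Date(datetime, tz = "UTC")

# Elapsed calendar days since the first date (gaps are preserved)
elapsed_day <- as.integer(d - min(d)) + 1L

# Consecutive numbering of the distinct dates that are present
dense_day <- match(d, sort(unique(d)))

data.frame(datetime, elapsed_day, dense_day)
```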
I specified the timezone in the code above to avoid getting tripped up by potential silent shifts between my local timezone and UTC. For example, note the silent shift from my default local time zone ("America/Los_Angeles") to UTC when converting a POSIXct datetime to a date:
# Fake data
datetime = seq(as.POSIXct("2016-05-02 01:03:11"), as.POSIXct("2016-05-05 01:03:11"), length.out=6)
tz(datetime)
[1] ""
date = as.Date(datetime)
tz(date)
[1] "UTC"
data.frame(datetime, date)
datetime date
1 2016-05-02 01:03:11 2016-05-02
2 2016-05-02 15:27:11 2016-05-02
3 2016-05-03 05:51:11 2016-05-03
4 2016-05-03 20:15:11 2016-05-04 # Note day is different due to timezone shift
5 2016-05-04 10:39:11 2016-05-04
6 2016-05-05 01:03:11 2016-05-05
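The same care applies when starting from the millisecond timestamps in the question. The sketch below takes the first value from each of the two rows of ts shown there and assumes "Europe/Paris" as the zone behind the CEST label; the day number of the second value differs depending on which zone defines the calendar day.

```r
ts_ms <- c(1462147403122, 1463529545634)

# Milliseconds since the epoch -> seconds -> POSIXct
datetime <- as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "Europe/Paris")
shown <- format(datetime, format = "%Y-%m-%d %H:%M:%S", usetz = TRUE)
shown

# Day number relative to the first timestamp, for a given zone
day_number <- function(x, zone) {
  d <- as.Date(x, tz = zone)
  as.integer(d - d[1]) + 1L
}
day_paris <- day_number(datetime, "Europe/Paris")
day_utc <- day_number(datetime, "UTC")
rbind(day_paris, day_utc)
```

The second timestamp is 01:59 on 18 May in Paris but 23:59 on 17 May in UTC, which is why the two numberings disagree.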