Read character datetimes without timezones in R

I am trying to import a text file containing datetimes into R. Times are stored as character strings without timezone information, but we know they are in French time (Europe/Paris).
An issue arises on days with a time change: e.g. the clock goes back from 2018-10-28 03:00:00 CEST to 2018-10-28 02:00:00 CET, so the character data contain duplicates and R cannot tell whether a given time is CEST or CET.
Consider the following example:
data_in <- "date,val
2018-10-28 01:30:00,25
2018-10-28 02:00:00,26
2018-10-28 02:30:00,27
2018-10-28 02:00:00,28
2018-10-28 02:30:00,29
2018-10-28 03:00:00,30"
library(readr)
data <- read_delim(data_in, ",", locale = locale(tz = "Europe/Paris"))
We end up having duplicates in our dates:
data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CET" "2018-10-28 02:00:00 CEST"
[5] "2018-10-28 02:30:00 CET" "2018-10-28 03:00:00 CET"
Expected output would be:
data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST" "2018-10-28 02:00:00 CET"
[5] "2018-10-28 02:30:00 CET" "2018-10-28 03:00:00 CET"
Any idea how to solve the issue (besides telling people to use UTC or ISO formats)? I guess the only way is to assume the dates are sorted, so we can tell that the first ones are CEST.

If you are certain that your times are always increasing, then you can look for an apparent decrease (in time of day), manually append the UTC offset to the string, and then parse as usual. I added some logic to look for this decrease only around 2-3am so that if you have multiple days of data spanning midnight, you would not get a false alarm.
library(dplyr) # for cumany()
data <- read.csv(text = data_in)
fakedate <- as.POSIXct(gsub("^[-0-9]+ ", "2000-01-01 ", data$date))
decreases <- cumany(grepl(" 0[23]:", data$date) & c(FALSE, diff(fakedate) < 0))
data$date <- paste(data$date, ifelse(decreases, "+0100", "+0200"))
data
# date val
# 1 2018-10-28 01:30:00 +0200 25
# 2 2018-10-28 02:00:00 +0200 26
# 3 2018-10-28 02:30:00 +0200 27
# 4 2018-10-28 02:00:00 +0100 28
# 5 2018-10-28 02:30:00 +0100 29
# 6 2018-10-28 03:00:00 +0100 30
as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S %z", tz="Europe/Paris")
# [1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST"
# [4] "2018-10-28 02:00:00 CET" "2018-10-28 02:30:00 CET" "2018-10-28 03:00:00 CET"
My use of "2000-01-01" was just to pick some non-DST day so that we can parse the timestamp into POSIXt and calculate a diff on it. (If we didn't insert a date, we could still use as.POSIXct with a format, but if you ever ran this on one of the two DST days, you might get different results, since as.POSIXct("01:02:03", format="%H:%M:%S") always assumes "today".)
This is obviously a bit fragile with its assumptions, but perhaps it'll be good enough for what you need.
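The same idea can be sketched in base R only, without dplyr. This is a minimal sketch, not a definitive implementation: the helper name parse_paris_fallback is hypothetical, and it assumes the input is sorted, uses the "YYYY-MM-DD HH:MM:SS" layout, and covers a single autumn changeover in Europe/Paris.

```r
# Hypothetical helper: base-R variant of the approach above.
# Assumes sorted input covering one autumn changeover in Europe/Paris.
parse_paris_fallback <- function(x) {
  # Time of day in seconds, taken from fixed character positions.
  tod <- as.numeric(substr(x, 12, 13)) * 3600 +
    as.numeric(substr(x, 15, 16)) * 60 +
    as.numeric(substr(x, 18, 19))
  # A drop in time of day around 2-3am marks the switch from CEST to CET.
  drop <- grepl(" 0[23]:", x) & c(FALSE, diff(tod) < 0)
  after_switch <- cumsum(drop) > 0
  # Append the UTC offset, then parse with %z.
  stamped <- paste(x, ifelse(after_switch, "+0100", "+0200"))
  as.POSIXct(stamped, format = "%Y-%m-%d %H:%M:%S %z", tz = "Europe/Paris")
}

x <- c("2018-10-28 01:30:00", "2018-10-28 02:00:00", "2018-10-28 02:30:00",
       "2018-10-28 02:00:00", "2018-10-28 02:30:00", "2018-10-28 03:00:00")
parsed <- parse_paris_fallback(x)

# If the offsets were assigned correctly, consecutive instants are 30 minutes apart.
as.numeric(diff(parsed), units = "mins")
```

Checking that the differences are constant is a cheap sanity test: a wrongly assigned offset would show up as a 90-minute gap or a negative step.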

Related

Coercing a date object into a POSIXct object

When converting a Date object to a POSIXct object, I expected the hours to be zero.
It turns out the hours are either 1 or 2, depending on summer/winter time.
For example:
oct.days <- (as.Date("2018-10-26")+0:5)
as.POSIXct(oct.days)
[1] "2018-10-26 02:00:00 CEST" "2018-10-27 02:00:00 CEST" "2018-10-28 02:00:00 CEST"
[4] "2018-10-29 01:00:00 CET" "2018-10-30 01:00:00 CET" "2018-10-31 01:00:00 CET"
(I'm in Germany; the switch to winter time took place on Oct 28th at 3 am.)
Rounding down fixed the issue:
round(as.POSIXct(oct.days),"days")
but I wonder why the Date object carries extra hours?
Thanks!
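One likely explanation, offered as a sketch rather than a definitive answer: a Date is stored as a count of days since 1970-01-01, and as.POSIXct() maps it to the instant of midnight UTC. Displayed in a zone that is one or two hours ahead of UTC, that instant reads 01:00 or 02:00. Going through the character representation with an explicit tz yields local midnight instead. The zone "Europe/Berlin" below is an assumption standing in for the asker's local time zone.

```r
oct.days <- as.Date("2018-10-26") + 0:5

# The underlying instant is midnight UTC: seconds since the epoch are whole days.
secs_into_utc_day <- as.numeric(as.POSIXct(oct.days)) %% 86400
secs_into_utc_day

# Assumption: "Europe/Berlin" stands in for the asker's local zone.
# Converting via character interprets each date as local midnight.
local_midnight <- as.POSIXct(format(oct.days), tz = "Europe/Berlin")
format(local_midnight, "%H:%M:%S")
```

Unlike rounding, this does not depend on the offset being small enough to round in the right direction.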

Find where a timestamp vector changes timezones

How can I find the element in a timestamp vector where it switches to a different time zone due to the time changeover?
Sample data:
ts <- structure(c(1521921600, 1521925200, 1521928800, 1521932400, 1521936000,
1521939600, 1521943200, 1521946800, 1521950400, 1521954000, 1521957600
), class = c("POSIXct", "POSIXt"))
Output:
"2018-03-24 21:00:00 CET" "2018-03-24 22:00:00 CET" "2018-03-24 23:00:00 CET" "2018-03-25 00:00:00 CET" "2018-03-25 01:00:00 CET" "2018-03-25 03:00:00 CEST" "2018-03-25 04:00:00 CEST" "2018-03-25 05:00:00 CEST" "2018-03-25 06:00:00 CEST" "2018-03-25 07:00:00 CEST" "2018-03-25 08:00:00 CEST"
The first 5 elements are in CET, and then it switches to CEST. So the answer here would be 5 or 6; either would be fine.
In the sample data the difference is always 1 hour, but I need this to work for other time intervals as well, for example 15 or 30 minutes.
seq(min(ts), to = max(ts), by = 15*60)
seq(min(ts), to = max(ts), by = 30*60)
The expected answer for 15 min would be 20/21.
The expected answer for 30 min would be 10/11.
You can use lubridate's dst():
library(lubridate)
which(!duplicated(dst(ts)))[2]
This gives the index of the first element whose DST status differs from that of the first element, i.e. the point where the time zone changes.
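A base-R alternative can be sketched with the isdst field of POSIXlt. The helper name dst_change_index is hypothetical, and the zone "Europe/Berlin" is an assumption chosen to match the CET/CEST output in the question (the original ts carries no time zone and is shown in the asker's local zone).

```r
# Same instants as the sample data, with the zone made explicit.
# Assumption: "Europe/Berlin" matches the CET/CEST output shown in the question.
ts <- as.POSIXct(1521921600 + 3600 * 0:10,
                 origin = "1970-01-01", tz = "Europe/Berlin")

# Hypothetical helper: index of the first element after each DST flip.
dst_change_index <- function(x) {
  isdst <- as.POSIXlt(x)$isdst
  which(diff(isdst) != 0) + 1
}

dst_change_index(ts)                                     # 6
dst_change_index(seq(min(ts), max(ts), by = 15 * 60))    # 21
dst_change_index(seq(min(ts), max(ts), by = 30 * 60))    # 11
```

Because it looks at every flip rather than only the first, the helper returns several indices if the vector spans more than one changeover.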

Timezone issue in R: I want to use UTC+0100 even in summer, although CET switches to CEST automatically

I've got some data with POSIXct timestamps in "CET" (Central European Time = Winter time = UTC+0100) and "CEST" (Central European Summer Time = UTC+0200). Since I've had some trouble with plots and calculations because of that daylight savings time, I want all of the timestamps to be in UTC+0100 time.
Here is an example for my timestamps on switch-back-to-winter-time-day:
> tdf$time_posix_vec[1:20]
[1] "2015-10-25 00:00:00 CEST" "2015-10-25 00:15:00 CEST" "2015-10-25 00:30:00 CEST" "2015-10-25 00:45:00 CEST" "2015-10-25 01:00:00 CEST"
[6] "2015-10-25 01:15:00 CEST" "2015-10-25 01:30:00 CEST" "2015-10-25 01:45:00 CEST" "2015-10-25 02:00:00 CEST" "2015-10-25 02:15:00 CEST"
[11] "2015-10-25 02:30:00 CEST" "2015-10-25 02:45:00 CEST" "2015-10-25 02:00:00 CET" "2015-10-25 02:15:00 CET" "2015-10-25 02:30:00 CET"
[16] "2015-10-25 02:45:00 CET" "2015-10-25 03:00:00 CET" "2015-10-25 03:15:00 CET" "2015-10-25 03:30:00 CET" "2015-10-25 03:45:00 CET"
To demonstrate the issue I picked an example timestamp:
> tx <- tdf$time_posix_vec[7]
> tx
[1] "2015-10-25 01:30:00 CEST"
I already tried lubridate's with_tz function, but if I use it with "CET", this is what happens:
> with_tz(tx, tzone = "CET")
[1] "2015-10-25 01:30:00 CEST"
I assume the timezone handler knows that in my location CET becomes CEST between the last week of March and the last week of October.
To solve the issue I could use Algeria's timezone, since Algeria uses CET without daylight saving time (as Wikipedia told me). However, this could change in the future, and I wonder whether this solution would be a bit unsafe because of that.
> with_tz(tx, tzone = "Africa/Algiers")
[1] "2015-10-25 00:30:00 CET"
The best way, I thought, would be to use "UTC+1", but the behaviour of with_tz is exactly the opposite of what I expected:
> with_tz(tx, tzone = "UTC+1")
[1] "2015-10-24 22:30:00 UTC"
To get 00:30:00 I would have to use:
> with_tz(tx, tzone = "UTC-1")
[1] "2015-10-25 00:30:00 UTC"
but then also the label "UTC" is wrong, because in UTC it would be
> with_tz(tx, tzone = "UTC")
[1] "2015-10-24 23:30:00 UTC"
Why is "UTC+1" switching the timestamp to UTC-0100 instead of UTC+0100?
And is there a function that forces the timestamp to UTC+0100 and also puts the correct timezone label on the timestamp, so the result would be "2015-10-25 00:30:00 UTC+1"?
Thanks in advance,
greetings, Peter
I think I found the solution. For example, I now use
t1 <- as.POSIXct("2016-07-12 17:43", tz = "Etc/GMT-1")
It confused me that "Etc/GMT-1" is the same as UTC+0100; these POSIX-style timezone names invert the sign, so a minus means east of Greenwich.
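To convert existing timestamps rather than create new ones, the same zone name can be set on the object. This is a minimal base-R sketch; the zone "Europe/Berlin" used to build the example timestamp is an assumption. With lubridate loaded, with_tz(tx, tzone = "Etc/GMT-1") should be the equivalent call.

```r
# Example timestamp from the question: 01:30 CEST.
# Assumption: "Europe/Berlin" stands in for the asker's CET/CEST zone.
tx <- as.POSIXct("2015-10-25 01:30:00", tz = "Europe/Berlin")

# Changing the tzone attribute changes only the display, not the instant.
fixed <- tx
attr(fixed, "tzone") <- "Etc/GMT-1"   # POSIX sign convention: this is UTC+01:00

format(fixed, "%Y-%m-%d %H:%M:%S")    # "2015-10-25 00:30:00"
as.numeric(fixed) == as.numeric(tx)   # TRUE: same instant
```

The zone abbreviation printed for "Etc/GMT-1" depends on the installed time zone database, so the label may not read "UTC+1".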

Create sequence of dates and times in R without time zones

I need to create a sequence of dates and times in R, increasing in 15 minute periods.
Currently, I am doing this:
datestimes=seq(as.POSIXlt("2011-01-01 00:00:00"), as.POSIXlt("2015-09-30 23:45:00"), by="15 min")
I should have one reading for each time in the year. The problem is that since it is adjusting for BST, I get two values for certain dates in October.
anm=aggregate(datestimes, by=list(datestimes$datestimes), FUN=length)
anm[which(anm$datestimes>1),]
Group.1 datestimes X.Date.
28993 2011-10-30 01:00:00 2 2
28994 2011-10-30 01:15:00 2 2
28995 2011-10-30 01:30:00 2 2
28996 2011-10-30 01:45:00 2 2
63933 2012-10-28 01:00:00 2 2
63934 2012-10-28 01:15:00 2 2
63935 2012-10-28 01:30:00 2 2
63936 2012-10-28 01:45:00 2 2
98873 2013-10-27 01:00:00 2 2
98874 2013-10-27 01:15:00 2 2
98875 2013-10-27 01:30:00 2 2
98876 2013-10-27 01:45:00 2 2
133813 2014-10-26 01:00:00 2 2
133814 2014-10-26 01:15:00 2 2
133815 2014-10-26 01:30:00 2 2
133816 2014-10-26 01:45:00 2 2
I tried using the as.chron command since this does not use timezones, but it will not allow increments of 15 minutes which is what I need.
You wrote: "The problem is that since it is adjusting for BST, I get two values for certain dates in October."
That's because the 'fall back' (mnemonic for the daylight saving time adjustment that adds an hour in the fall) happens under human time, and that is what you get by default unless you override it.
R> seq(as.POSIXlt("2012-10-28 00:00:00", tz="UTC"),
+ as.POSIXlt("2012-10-28 03:00:00", tz="UTC"), by="15 min")
[1] "2012-10-28 00:00:00 UTC" "2012-10-28 00:15:00 UTC"
[3] "2012-10-28 00:30:00 UTC" "2012-10-28 00:45:00 UTC"
[5] "2012-10-28 01:00:00 UTC" "2012-10-28 01:15:00 UTC"
[7] "2012-10-28 01:30:00 UTC" "2012-10-28 01:45:00 UTC"
[9] "2012-10-28 02:00:00 UTC" "2012-10-28 02:15:00 UTC"
[11] "2012-10-28 02:30:00 UTC" "2012-10-28 02:45:00 UTC"
[13] "2012-10-28 03:00:00 UTC"
R>
The example I show here covers the same subset as above but without the fall back, as we now impose UTC as the time zone. And UTC has, by construction, no daylight saving adjustment.
Maybe try this (UTC timezone should not allow any duplicate):
datestimes=seq(as.POSIXlt("2015-09-01 00:00:00", tz="UTC"),
as.POSIXlt("2015-10-30 23:45:00", tz="UTC"),
by="15 min")
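The effect can be checked on a short window around one changeover. This is a sketch under the assumption that the asker's local zone is "Europe/London" (the question mentions BST); it counts repeated wall-clock labels in each sequence.

```r
# Assumption: "Europe/London" is the local zone (the question mentions BST).
london <- seq(as.POSIXct("2012-10-28 00:00:00", tz = "Europe/London"),
              as.POSIXct("2012-10-28 03:00:00", tz = "Europe/London"),
              by = "15 min")
utc <- seq(as.POSIXct("2012-10-28 00:00:00", tz = "UTC"),
           as.POSIXct("2012-10-28 03:00:00", tz = "UTC"),
           by = "15 min")

# Wall-clock labels without the zone: the 01:00-01:45 hour occurs twice in London.
sum(duplicated(format(london, "%Y-%m-%d %H:%M")))   # 4
sum(duplicated(format(utc, "%Y-%m-%d %H:%M")))      # 0

length(london)   # 17: the local day contains an extra hour
length(utc)      # 13
```

Note that the local sequence contains no duplicated instants; only the labels repeat once the zone is dropped.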

melt.data.frame() changes how POSIXct columns are printed

Melting the dataframe t.wide changes how the column "time" (class POSIXct) is printed.
t.wide <- data.frame(product=letters[1:5],
result=c(2, 4, 0, 0, 1),
t1=as.POSIXct("2014-05-26") + seq(0, 10800, length.out=5),
t2=as.POSIXct("2014-05-27") + seq(0, 10800, length.out=5),
t3=as.POSIXct("2014-05-28") + seq(0, 10800, length.out=5))
library(reshape2)
t.long <- melt(t.wide, measure.vars=c("t1", "t2", "t3"), value.name="time")
t.long$time
[1] 1401055200 1401057900 1401060600 1401063300 1401066000 1401141600 1401144300
[8] 1401147000 1401149700 1401152400 1401228000 1401230700 1401233400 1401236100
[15] 1401238800
attr(,"class")
[1] "POSIXct" "POSIXt"
Strangely, if print() is called explicitly, the object is printed as expected (timestamps, not their numeric representation).
print(t.long$time)
[1] "2014-05-26 00:00:00 CEST" "2014-05-26 00:45:00 CEST" "2014-05-26 01:30:00 CEST"
[4] "2014-05-26 02:15:00 CEST" "2014-05-26 03:00:00 CEST" "2014-05-27 00:00:00 CEST"
[7] "2014-05-27 00:45:00 CEST" "2014-05-27 01:30:00 CEST" "2014-05-27 02:15:00 CEST"
[10] "2014-05-27 03:00:00 CEST" "2014-05-28 00:00:00 CEST" "2014-05-28 00:45:00 CEST"
[13] "2014-05-28 01:30:00 CEST" "2014-05-28 02:15:00 CEST" "2014-05-28 03:00:00 CEST"
Setting the attributes to the same value as before magically changes how the object is printed.
attributes(t.long$time) <- attributes(t.long$time)
t.long$time
[1] "2014-05-26 00:00:00 CEST" "2014-05-26 00:45:00 CEST" "2014-05-26 01:30:00 CEST"
[4] "2014-05-26 02:15:00 CEST" "2014-05-26 03:00:00 CEST" "2014-05-27 00:00:00 CEST"
[7] "2014-05-27 00:45:00 CEST" "2014-05-27 01:30:00 CEST" "2014-05-27 02:15:00 CEST"
[10] "2014-05-27 03:00:00 CEST" "2014-05-28 00:00:00 CEST" "2014-05-28 00:45:00 CEST"
[13] "2014-05-28 01:30:00 CEST" "2014-05-28 02:15:00 CEST" "2014-05-28 03:00:00 CEST"
Can anyone explain this behavior?
UPDATE:
I opened this as Issue #50 on the git repo hadley/reshape2.
UPDATE: FIXED
This issue has been fixed in the development version of reshape2.
Thanks @kevin-ushey!
I believe the reason is that after the reshaping, for whatever reason, R does not treat t.long$time as a classed object. For some reason the OBJECT flag (which indicates that the vector has a "class" attribute) in the SEXP header for your vector is not being set. When you copy the attributes back to it, the OBJECT flag gets set and the correct print method is dispatched...
# No "OBJ" in SEXP header (the '[NAM(2),ATT]' part below)
.Internal(inspect( t.long$time ) )
##10359e548 14 REALSXP g0c6 [NAM(2),ATT] (len=15, tl=0) 1.40106e+09,...
# Now we have "OBJ" in the SEXP header, indicating a class attribute,
# so the print method for POSIXct gets dispatched...
attributes(t.long$time) <- attributes(t.long$time)
.Internal(inspect( t.long$time ) )
##1118d7f50 14 REALSXP g0c6 [OBJ,NAM(2),ATT] (len=15, tl=0) 1.40106e+09,...
From the R Internals document...
The actual autoprinting is done by PrintValueEnv in file print.c. If the object to be printed has the S4 bit set and S4 methods dispatch is on, show is called to print the object. Otherwise, if the object bit is set (so the object has a "class" attribute), print is called to dispatch methods: for objects without a class the internal code of print.default is called.
Check the difference between:
print.default(t.long$time)
# [1] 1401058800 1401061500 1401064200 1401066900 1401069600 1401145200 1401147900 1401150600 1401153300 1401156000 1401231600 1401234300
#[13] 1401237000 1401239700 1401242400
#attr(,"class")
#[1] "POSIXct" "POSIXt"
print.POSIXct(t.long$time)
# [1] "2014-05-26 00:00:00 BST" "2014-05-26 00:45:00 BST" "2014-05-26 01:30:00 BST" "2014-05-26 02:15:00 BST" "2014-05-26 03:00:00 BST"
# [6] "2014-05-27 00:00:00 BST" "2014-05-27 00:45:00 BST" "2014-05-27 01:30:00 BST" "2014-05-27 02:15:00 BST" "2014-05-27 03:00:00 BST"
#[11] "2014-05-28 00:00:00 BST" "2014-05-28 00:45:00 BST" "2014-05-28 01:30:00 BST" "2014-05-28 02:15:00 BST" "2014-05-28 03:00:00 BST"
Now I can only speculate, but perhaps this is due to some internal code in reshape2 and is related to this warning:
One thing to watch is that if you copy attributes from one object to another you may (un)set the "class" attribute and so need to copy the object and S4 bits as well. There is a macro/function DUPLICATE_ATTRIB to automate this.
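The object bit can also be inspected from R code with is.object(). The sketch below does not reproduce the reshape2 bug (setting a class from R always sets the bit); it only illustrates which print path a properly classed POSIXct vector takes. The fixed "UTC" zone is an assumption made to keep the output independent of the local time zone.

```r
# Assumption: a fixed "UTC" zone keeps the printed output reproducible.
x <- as.POSIXct("2014-05-26", tz = "UTC") + seq(0, 10800, length.out = 5)

is.object(x)            # TRUE: the object bit is set, so print() dispatches
is.object(unclass(x))   # FALSE: a plain numeric vector

# Output of the class-aware method versus the internal default printer.
out_method  <- capture.output(print(x))
out_default <- capture.output(print.default(x))
out_method[1]
out_default[1]
```

In the broken t.long$time vector, autoprinting behaved like the print.default() call here even though the class attribute was present.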
