How do I parse a Java SimpleDateFormat date in R?

The extra T and Z are throwing me off:
strptime("2017-06-08T11:55:53.179000Z", "%Y-%m-%dT%H:%M:%SZ")
Returns NA

Just use the anytime() function of the anytime package:
R> anytime("2017-06-08T11:55:53.179000Z")
[1] "2017-06-08 11:55:53.178 CDT"
R>
The whole point of the anytime package is to parse such common formats without requiring a format. Dealing with the trailing Z comes for free via the Boost parser.
If you want it interpreted as UTC, use the corresponding utctime() function:
R> utctime("2017-06-08T11:55:53.179000Z")
[1] "2017-06-08 06:55:53.178 CDT"
R>
If you want it stored as UTC, set the timezone accordingly:
R> utctime("2017-06-08T11:55:53.179000Z", tz="UTC")
[1] "2017-06-08 11:55:53.178 UTC"
R>

According to ?strptime, you can parse it with the %OS conversion specification. Setting the digits.secs option only controls how many fractional digits are printed:
Specific to R is %OSn, which for output gives the seconds truncated to
0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it
uses the setting of getOption("digits.secs"), or if that is unset, n =
0). Further, for strptime %OS will input seconds including fractional
seconds. Note that %S does not read fractional parts on output.
options(digits.secs = 6)
strptime("2017-06-08T11:55:53.179000Z", "%Y-%m-%dT%H:%M:%OSZ")
# [1] "2017-06-08 11:55:53.179 EDT"
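A minimal base-R sketch of the %OS approach, assuming the trailing Z should be taken to mean UTC. The tz = "UTC" argument is my addition and is not part of the answer above. The Z in the format string is matched as a literal character, so the time zone has to be supplied separately:

```r
x <- "2017-06-08T11:55:53.179000Z"

# %OS reads the fractional seconds; the trailing "Z" is matched literally,
# so the zone is passed explicitly (assumption: Z means UTC)
lt <- strptime(x, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# POSIXlt keeps the seconds in their own field, fraction included
lt$sec

# Convert to POSIXct for packages that require it; the zone carries over
ct <- as.POSIXct(lt)
attr(ct, "tzone")
```

Inspecting lt$sec avoids any confusion caused by how fractional seconds are printed.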

Related

as.POSIXct behaving inconsistently

This might sound like a duplicate issue, but I have gone through many POSIXct-related bugs and did not come across this one. If you find one, I would really appreciate being pointed in that direction. as.POSIXct is behaving very awkwardly in my case. See the example below:
options(digits.secs = 3)
test_time <- "2017-01-26 23:00:00.010"
test_time <- as.POSIXct(test_time, format = "%Y-%m-%d %H:%M:%OS")
This returns:
"2017-01-26 23:00:00.00"
Now I try the following option and it returns NA. I have no idea why it behaves like that when all I need is for it to convert to "2017-01-26 23:00:00.010".
test_time <- "2017-01-26 23:00:00.010"
test_time <- as.POSIXct(test_time, format = "%Y-%m-%d %H:%M:%OS3")
Now it works fine when I do this:
as.POSIXlt(strptime(test_time,format = "%Y-%m-%d %H:%M:%OS"), format = "%Y-%m-%d %H:%M:%OS")
But for my purpose I need to have this as a POSIXct object because some libraries I am working with only take POSIXct objects. Converting POSIXlt to POSIXct again results in the same problem as before.
Is there an issue with my system settings? The date is also not one of those daylight savings times one to throw an error. Why would it work with one format and not others? Any leads/suggestions are welcome!
Running on Windows 10 64-bit
The issue here has to do with the maximum precision that POSIXct can handle. It is backed by a double under the hood, representing the number of seconds since the epoch, midnight on 1970-01-01 UTC. Fractional seconds are represented as fractional parts of that double, i.e. 63.02 represents 1970-01-01 00:01:03.02 UTC.
options(digits = 22, digits.secs = 3)
.POSIXct(63.02, tz = "UTC")
#> [1] "1970-01-01 00:01:03.02 UTC"
63.02
#> [1] 63.02000000000000312639
Now, when working with doubles there are limits to the precision that they can represent exactly. You can see this with the above example; typing in 63.02 in the console doesn't return exactly the same number, and instead returns something close, but with some extra bits at the end.
So now let's take a look at your example. If we start as "low level" as possible, the first thing as.POSIXct() does is call strptime(), which returns a POSIXlt object. That keeps each "field" of the date-time as a separate element (i.e. year is kept separate from month, day, second, etc). We can see that it parsed correctly and our sec field holds 0.01.
# `digits.secs` to print 3 fractional digits (has no effect on parsing)
# `digits` to print 22 fractional digits for double values
options(digits.secs = 3, digits = 22)
x <- "2017-01-26 23:00:00.010"
# looks good
lt <- strptime(x, format = "%Y-%m-%d %H:%M:%OS", tz = "America/New_York")
lt
#> [1] "2017-01-26 23:00:00.01 EST"
# This is a POSIXlt, which is a list holding fields like year,month,day,...
class(lt)
#> [1] "POSIXlt" "POSIXt"
# sure enough...
lt$sec
#> [1] 0.01000000000000000020817
But now convert that to POSIXct. At this point, the individual fields are collapsed into a single double, which might have precision issues.
# now convert to POSIXct (i.e. a single double holding all the info)
# looks like we lost the fractional seconds?
ct <- as.POSIXct(lt)
ct
#> [1] "2017-01-26 23:00:00.00 EST"
# no, they are still there, but the precision in the `double` data type
# isn't enough to be able to represent this exactly as `1485489600.010`
unclass(ct)
#> [1] 1485489600.009999990463
#> attr(,"tzone")
#> [1] "America/New_York"
So the fractional part of the double underlying ct is close to .010, but it can't be represented exactly and ends up slightly less than .010. That value gets (I presume) truncated rather than rounded when the POSIXct is printed, which matches the ?strptime description of %OSn output and makes it look like you lost the fractional seconds.
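If you have to stay with POSIXct, one way to confirm that the milliseconds are still there is to round the underlying double yourself instead of relying on the printed value. This is only a sketch in base R; the variable names are mine and tz = "UTC" is an arbitrary choice for the illustration:

```r
x <- "2017-01-26 23:00:00.010"
ct <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Seconds since the epoch as a plain double
secs <- as.numeric(ct)

# The stored fraction is not exactly 0.010, but it is very close to it
frac <- secs - floor(secs)

# Rounding to whole milliseconds recovers the intended value
frac_ms <- round(frac * 1000)
frac_ms
```

Rounding works here because the representation error is far smaller than half a millisecond for present-day timestamps.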
Because these issues are so troublesome, I recommend using the low level API of the clock package (note that I wrote this package). It has support for fractional seconds up to nanoseconds without loss of precision (by using a different data structure than POSIXct).
https://clock.r-lib.org/
library(clock)
x <- "2017-01-26 23:00:00.010"
nt <- naive_time_parse(x, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond")
nt
#> <time_point<naive><millisecond>[1]>
#> [1] "2017-01-26 23:00:00.010"
# If you need it in a time zone
as_zoned_time(nt, zone = "America/New_York")
#> <zoned_time<millisecond><America/New_York>[1]>
#> [1] "2017-01-26 23:00:00.010-05:00"

How to convert date to datetime to seconds since UNIX epoch in R with lubridate?

I'm noticing this very confusing behavior.
library(lubridate)
x = as_date(-25567)
as.integer(as_datetime(x)) # Returns NA
How can I get this to return the seconds since (or in this case before) UNIX epoch?
This works with base R, now that we covered that you really want as.Date("1970-01-01").
R> as.POSIXct("1900-01-01 00:00:00")
[1] "1900-01-01 CST"
R> as.numeric(as.POSIXct("1900-01-01 00:00:00"))
[1] -2208967200
R>
I vaguely recall some OS-level irritations for dates prior to the epoch. This may fail for you on the world's most commonly used OS but that is not really R's fault...
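One likely explanation for the NA, sketched in base R: R integers are 32-bit, and the number of seconds between 1900-01-01 and the epoch falls outside that range, so as.integer() overflows while as.numeric() does not. The snippet uses base as.Date() and as.POSIXct() rather than lubridate, which is my substitution:

```r
# -25567 days relative to the epoch is 1900-01-01
x <- as.Date(-25567, origin = "1970-01-01")

# Seconds since the epoch as a double (negative: before the epoch)
secs <- as.numeric(as.POSIXct(x, tz = "UTC"))
secs

# Largest value a 32-bit R integer can hold
.Machine$integer.max

# The magnitude exceeds that limit, so the integer conversion gives NA
too_big <- abs(secs) > .Machine$integer.max
overflowed <- suppressWarnings(as.integer(secs))
```

Keeping the result as a double with as.numeric() sidesteps the overflow.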

How can I keep timezone shifts when converting characters to POSIXct

I have a large dataframe with a column containing date-times, encoded as a factor variable. My Sys.timezone() is "Europe/Berlin". The date-times have this format:
2015-05-05 17:27:04+05:00
where +05:00 represents the timeshift from GMT. Importantly, I have multiple timezones in my dataset, so I cannot set a specific timezone and ignore the last 6 characters of the strings. This is what I tried so far:
# Test Date
test <- "2015-05-05 17:27:04+05:00"
# Removing the ":" to make it readable by %z
A <- paste(substr(test,1,22),substr(test,24,25),sep = "");A
# Returns
# "2015-05-05 17:27:04+0500"
output <- as.POSIXct(as.character(A, "%Y-%B-%D %H:%M:%S%z"))
# Returns
# "2015-05-05 17:27:04 CEST"
The output of "CEST" for +0500 is incorrect. Moreover, when I run this code on the whole column I see that every date is coded as CEST, regardless of the offset.
How can I keep the specified timezone when converting to POSIXct?
To make this easier you can use the lubridate package.
E.g.
library("lubridate")#load the package
ymd_hms("2015-05-05 17:27:04+05:00",tz="GMT")#set the date format
[1] "2015-05-05 12:27:04 GMT"
The +05:00 offset is applied during parsing, so the instant is preserved, although it is now expressed in GMT. Finally:
as.POSIXct(ymd_hms("2015-05-05 17:27:04+05:00",tz="GMT"),tz = "GMT")#transform the date into another timezone
[1] "2015-05-05 12:27:04 GMT"
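For comparison, here is a base-R sketch of the %z route the question was attempting. It assumes every string ends in an offset of the form +hh:mm or -hh:mm. The second test value and the variable names are made up for illustration. A POSIXct vector carries a single tzone attribute, so the per-row offset cannot live inside it; the sketch keeps the offsets in a separate column:

```r
test <- c("2015-05-05 17:27:04+05:00", "2015-05-05 17:27:04-03:00")

# Drop the colon in the offset so %z can read it: "+05:00" -> "+0500"
cleaned <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", test)

# The format goes to as.POSIXct(); %z applies each string's own offset,
# and tz = "UTC" expresses the resulting instants in UTC
instants <- as.POSIXct(cleaned, format = "%Y-%m-%d %H:%M:%S%z", tz = "UTC")

# Keep the original offsets alongside the instants
offsets <- sub("^.*([+-][0-9]{2}:[0-9]{2})$", "\\1", test)
result <- data.frame(instant = instants, offset = offsets)
result
```

With the instants in one column and the offsets in another, the local wall-clock time can be reconstructed per row when needed.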

lubridate yyyy-MM-dd'T'HH:mm:ssX conversion unexpected. Bug?

Very unexpected behaviour when parsing "yyyy-MM-dd'T'HH:mm:ssX"-string (ISO 8601)
> as_datetime("2017-03-22T15:48:00.000Z")
[1] "2017-03-21 23:00:00 UTC"
> packageDescription("lubridate")$Version
[1] "1.6.0"
Could someone explain the rationale for this?
edit: Seems like a bug, see issue #536
update: resolved in lubridate commit here (May 2017). Works with lubridate 1.7.4, probably some earlier versions as well.
Without digging into the guts of as_datetime,
I think this may be a combination of (1) as_datetime
not being able to handle (i.e., ignore) the T in your format;
(2) conversion from local to UTC time zone.
dstr <- "2017-03-22T15:48:00.000Z"
library(lubridate)
as_datetime(dstr)
## [1] "2017-03-22 04:00:00 UTC"
If as_datetime() ignores everything after the T
that gets us to midnight on 2017-03-22. However, this is
taken as midnight in my local time zone which is GMT+04,
so the resulting time is 04:00:00. Presumably your local time
is GMT-01.
If you manually substitute a space for the T things work better (you can use
stringr::str_replace if you prefer)
as_datetime(sub("T"," ",dstr))
## [1] "2017-03-22 19:48:00 UTC"
Or use strptime:
strptime(dstr,format="%Y-%m-%dT%H:%M:%S")
## [1] "2017-03-22 15:48:00 EDT"
(note that strptime automatically discards trailing characters)
For what it's worth Dirk Eddelbuettel's anytime package handles this case:
anytime(dstr)
## [1] "2017-03-22 15:48:00 EDT"
If you have imported your data in the format presented here and you want to use lubridate to convert it into a date-time object I would recommend using the ymd_hms function of lubridate.
In your case it would look like this:
ymd_hms("2017-03-22T15:48:00.000Z")
[1] "2017-03-22 15:48:00 UTC"

Why can't the R package lubridate parse a vector with multiple formats?

I'm using package lubridate to parse a vector of heterogeneously-formatted dates and convert them to string, like this:
parse_date_time(c('12/17/1996 04:00:00 PM','4/18/1950 0130'), c('%m/%d/%Y %I:%M:%S %p','%m/%d/%Y %H%M'))
This is the result:
[1] NA NA
Warning message:
All formats failed to parse. No formats found.
If I remove the %p in the 1st format string, it incorrectly parses the 1st date string, and still doesn't parse the 2nd, like so:
[1] "1996-12-17 04:00:00 UTC" NA
Warning message:
1 failed to parse.
The 4PM time in the string is parsed to 4AM in the result.
Has anyone experienced this strange behavior?
This is probably related to your system locale.
parse_date_time {lubridate}
p : AM/PM indicator in the locale. Used in conjunction with I and not with H. An empty string in some locales.
Because different languages have different strings for AM/PM, if your locale is not English, lubridate will not pick up the AM/PM indicator even if you specify it.
An OS locale can cover the display language, time format and time zone. I'm using English Windows with a US time zone and a Chinese locale, so I had been fighting with AM/PM in time parsing too.
Sys.getlocale("LC_TIME")
[1] "Chinese (Simplified)_China.936"
You can specify locale in parse_date_time {lubridate}, but it didn't work for me at first:
Sys.setlocale("LC_TIME", "en_US")
[1] ""
Warning message:
In Sys.setlocale("LC_TIME", "en_US") :
OS reports request to set locale to "en_US" cannot be honored
locales {base}
The locale describes aspects of the internationalization of a program. Initially most aspects of the locale of R are set to "C" (which is the default for the C language and reflects North-American usage).
strptime for uses of category = "LC_TIME".
Then I found this and used it successfully:
Sys.setlocale("LC_TIME", "C")
[1] "C"
After this the parsing works:
parse_date_time('12/17/1996 04:00:00 PM', '%m/%d/%Y %I:%M:%S %p')
[1] "1996-12-17 16:00:00 UTC"
You can also specify time zone and locale
parse_date_time('12/17/1996 04:00:00 PM', '%m/%d/%Y %I:%M:%S %p', tz = "America/New_York", locale = "C")
[1] "1996-12-17 16:00:00 EST"
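A quick way to check what AM/PM indicator the current locale expects is to format a known afternoon time with %p. The base-R sketch below switches LC_TIME to "C" temporarily and restores the previous setting afterwards. It uses strptime rather than parse_date_time, which is my substitution:

```r
# Remember the current setting so it can be restored
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# What does this locale print for an afternoon time?
pm_indicator <- format(as.POSIXct("2017-01-01 16:00:00", tz = "UTC"), "%p")
pm_indicator

# With a matching indicator, %I and %p parse 04 PM as hour 16
parsed <- strptime("12/17/1996 04:00:00 PM", "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
parsed_hour <- parsed$hour

# Put the original locale back
invisible(Sys.setlocale("LC_TIME", old_locale))
```

If pm_indicator comes back empty or in another language under your usual locale, strings containing "PM" will not match %p there.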
The problem with %p part is locale related. See this issue.
The inability to parse has to do with the way the lubridate guesser works.
There are two ways lubridate infers formats, flex and exact. With flex matching, all numeric elements can have flexible length (for example, both 4 and 04 for day will work), but then there must be non-numeric separators between the elements. For the exact matcher there need not be non-numeric separators, but elements must have an exact number of digits (like 04).
Unfortunately you cannot combine both matchers within one expression. It would be extremely hard to fix this and preserve the current flexibility of the lubridate parser.
In your example
> parse_date_time('4/18/1950 0130', 'mdY HM')
[1] NA
Warning message:
All formats failed to parse. No formats found.
you want to perform flex matching on the date part 4/18/1950 and exact matching on time part 0130.
Please note that if your date-time is in fully flex, or fully exact format the parsing will work as expected:
> parse_date_time('04/18/1950 0130', 'mdY HM')
[1] "1950-04-18 01:30:00 UTC"
> parse_date_time('4/18/1950 1:30', 'mdY HM')
[1] "1950-04-18 01:30:00 UTC"
lubridate 1.4.1 "fixes" this by adding a new argument to parse_date_time, exact=FALSE. When set to TRUE, the orders argument is interpreted as containing exact strptime formats and no guessing or training is performed. This way you can add as many exact formats as you want, and you will also gain in speed because no guessing is performed at all.
> parse_date_time(c('12/17/1996 04:00:00','4/18/1950 0130'),
+ c('%m/%d/%Y %I:%M:%S','%m/%d/%Y %H%M'),
+ exact = TRUE)
[1] "1996-12-17 04:00:00 UTC" "1950-04-18 01:30:00 UTC"
Relatedly, there was an explicit request asking for such an option.
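The same "try each exact format in turn" idea can be sketched in base R with strptime. The helper name parse_first_match is hypothetical and the function is only an illustration, not lubridate's implementation. LC_TIME is set to "C" so that %p recognises "PM", as discussed above:

```r
Sys.setlocale("LC_TIME", "C")

# Hypothetical helper: try each strptime format in order and keep the
# first successful parse for every element
parse_first_match <- function(x, formats, tz = "UTC") {
  out <- .POSIXct(rep(NA_real_, length(x)), tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = tz))
  }
  out
}

dates <- c("12/17/1996 04:00:00 PM", "4/18/1950 0130")
formats <- c("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y %H%M")
res <- parse_first_match(dates, formats)
res
```

Because strptime ignores trailing characters, a short format can match the prefix of a longer string, so it is safer to list the more specific formats first.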

Resources