cast string directly to IDateTime - r

I am using the new version of data.table and especially the AWESOME fread function. My files contain dates that are loaded as strings (because I don't know how to do it otherwise) looking like 01APR2008:09:00:00.
I need to sort the data.table on those datetimes and, for the sort to be efficient, to cast them to the IDateTime format (or anything else I don't know about yet).
> strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
[1] "2008-04-01 09:00:00"
> IDateTime(strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S"))
idate itime
1: 2008-04-01 09:00:00
> IDateTime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
Error in charToDate(x) :
character string is not in a standard unambiguous format
It looks like I cannot do DT[ , newType := IDateTime(strptime(oldType, "%d%b%Y:%H:%M:%S"))].
My questions are then:
Is there a way to cast directly to IDateTime from fread, such that I can sort afterward efficiently?
If not, what is the most efficient way to go knowing that I would like to be able to sort DT by this datetime column

Unfortunately (for efficiency) strptime produces a POSIXlt type, which is unsupported by data.table and always will be due to its size (40 bytes per date!) and structure. Although as.POSIXct produces the much better POSIXct, it still does it via POSIXlt when given a character string. More info here :
http://stackoverflow.com/a/12788992/403310
Looking to base functions such as as.Date, it uses strptime too, creating an integer offset from epoch (oddly) stored as double. The IDate (and friends) class in data.table aims to achieve integer epoch offsets stored as, um, integer. Suitable for fast sorting by base::sort.list(method = "radix") (which is really a counting sort). IDate doesn't really aim to be fast at (usually one off) conversion.
So to convert string dates/times, rightly or wrongly, I tend to roll my own helper function.
If the string date is "2012-12-24" I'd lean towards: as.integer(gsub("-", "", col)) and proceed with YYYYMMDD integer dates. Similarly times can be HHMMSS as an integer. Two columns: date and time separately can be useful if you generally want to roll = TRUE within a day, but not to the previous day. Grouping by month is simple and fast: by = date %/% 100L. Adding and subtracting days is troublesome, but it is anyway because rarely do you want to add calendar days, rather weekdays or business days. So that's a lookup to your business day vector anyway.
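As a rough base R illustration of the YYYYMMDD idea above (the vector name col and its sample values are made up for the example):

```r
# Hypothetical sample data: ISO-style date strings.
col <- c("2012-12-24", "2012-12-03", "2013-01-15")

# Drop the separators and store the date as a plain YYYYMMDD integer.
date <- as.integer(gsub("-", "", col, fixed = TRUE))

# Integer division by 100 drops the day, leaving YYYYMM for grouping by month.
yyyymm <- date %/% 100L

# Integers of this form sort chronologically.
ord <- order(date)
```

In a data.table the same expression would sit in by =, as described above; the vectors here only show the arithmetic.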
In your case the character month would need a conversion to 1:12. There isn't a separator in your dates "01APR2008", so a substring would be one way followed by a match or fmatch on the month name. Are you in control of the file format? If so, numbers are better in an unambiguous format that sorts naturally such as %Y-%m-%d, or %Y%m%d.
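A minimal sketch of such a helper for the "01APR2008:09:00:00" layout, using substr plus match on the month name. The function name parse_ddmonyyyy is hypothetical, and it assumes fixed-width input with English three-letter month abbreviations:

```r
# Hypothetical helper: turn "01APR2008:09:00:00" into integer date and time.
# Assumes fixed-width fields and English month abbreviations (month.abb).
parse_ddmonyyyy <- function(x) {
  day   <- as.integer(substr(x, 1L, 2L))
  month <- match(toupper(substr(x, 3L, 5L)), toupper(month.abb))
  year  <- as.integer(substr(x, 6L, 9L))
  hh    <- as.integer(substr(x, 11L, 12L))
  mm    <- as.integer(substr(x, 14L, 15L))
  ss    <- as.integer(substr(x, 17L, 18L))
  list(date = year * 10000L + month * 100L + day,  # YYYYMMDD
       time = hh * 10000L + mm * 100L + ss)        # HHMMSS
}

res <- parse_ddmonyyyy(c("01APR2008:09:00:00", "15DEC2007:17:30:05"))
```

Because month.abb is a constant in base R rather than a locale lookup, this avoids the locale dependence of %b. fmatch from the fastmatch package could replace match on large inputs.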
I haven't yet got to how best do this in fread, so date/times are left as character currently because I'm not yet sure how to detect the date format or which type to output. It does need to output either integer or double dates though, rather than inefficient character. I suspect that my use of YYYYMMDD integers are seen as unconventional, so I'm a little hesitant to make that the default. They have their place, and there are pros and cons of epoch based dates too. Dates don't have to be always epoch based is all I'm suggesting.
What do you think? Btw, thanks for encouragement on fread; was nice to see.

I don't know how your file is structured, but from your comment you want to use the date field as a key. Why not read it as a time series and format it while reading?
Here I use zoo to do it. (Here I assume that the date column is the first one; otherwise see the index.column argument.)
ff <- function(x) as.POSIXct(strptime(x,"%d%b%Y:%H:%M:%S"))
h <- read.zoo(text = "03avril2008:09:00:00 125
02avril2008:09:30:00 126
05avril2008:09:10:00 127
04avril2008:09:20:00 128
01avril2008:09:00:00 128"
,FUN=ff)
You get your dates in the right format and sorted.
The conversion from POSIXct to IDateTime is natural:
IDateTime(index(h))
idate itime
1: 2008-04-01 09:00:00
2: 2008-04-02 09:30:00
3: 2008-04-03 09:00:00
4: 2008-04-04 09:20:00
5: 2008-04-05 09:10:00
Sure, you still do two conversions here, but the first happens while reading the data, and the second doesn't have to deal with any format problem.

Related

Using R for a Date format of 07-JUL-16 06.05.54.000000 AM

I have 2 Date variables in a .csv file with formats of "07-JUL-16 06.05.54.000000 AM". I want to use these in a regression model. Should I be reading these into a data frame as factors or characters? How can I take a difference of the 2 dates in each case?
Read them in as characters (e.g. stringsAsFactors=FALSE or tidyverse functions), then use as.POSIXct, e.g.
as.POSIXct("07-JUL-16 06.05.54.000000 AM",format="%d-%b-%y %I.%M.%OS %p")
## [1] "2016-07-07 06:05:54 EDT"
(I'm assuming that you are intending a day-month-year format rather than a month-day-year format -- but actually I don't have any evidence to support that thought!)
Once you've done this, subtracting the values should just work (give you an object of difftime) -- but be careful with units when converting to numeric!
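A small sketch of the subtraction and the units caveat. The second timestamp is invented for the example, and the time zone is pinned to UTC and LC_TIME to "C" so that the month abbreviation and AM/PM parse the same way everywhere:

```r
# Force English month abbreviations and AM/PM markers for parsing.
invisible(Sys.setlocale("LC_TIME", "C"))

fmt <- "%d-%b-%y %I.%M.%OS %p"
start <- as.POSIXct("07-JUL-16 06.05.54.000000 AM", format = fmt, tz = "UTC")
end   <- as.POSIXct("08-JUL-16 06.35.54.000000 PM", format = fmt, tz = "UTC")  # made-up value

# State the units explicitly instead of relying on the automatic choice.
elapsed <- difftime(end, start, units = "hours")
hours   <- as.numeric(elapsed)
minutes <- as.numeric(elapsed, units = "mins")
```

Plain end - start picks its own units depending on the size of the difference, which is why naming them matters before the values go into a model.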
For what it's worth, lubridate::ymd_hms thinks it can guess the format, but guesses wrong (?? assuming I guessed right above: with a two-digit year, and without any year values greater than 31, there's really nothing to distinguish years and days ...)

How can I change character class date variables to POSIXlt class when there are multiple date formats?

I'm struggling with converting character class dates of many different format types (e.g., yyyy/mm/dd; mm/dd/yyyy; yyyy-mm-dd; mm-dd-yyyy; yy-mm-dd; mm-dd-yy; etc.) to POSIXlt class. Ideally, I would like to convert all birth_dates to POSIXlt class with yyyy/mm/dd format (see sample data below). Is there any simple way to do this in R?:
id birth_date start_date age
102 08/09/1993 2013/09/01 20
103 1995-02-21 2013/09/01 18
104 01-15-94 2013/09/01 19
105 88-12-30 2013/09/01 24
Here is what I have been doing thus far. Unfortunately, this doesn't seem to work (I wind up with more NAs than there should be) given all of the different ways in which the original date is formatted:
library(lubridate)
data$birth_date1<-as.Date(data$birth_date,format="%Y-%m-%d") #Convert character class to date class
data$birth_date2<-ymd(swc3$birth_date1) #Convert date class to POSIXlt class using lubridate pkg
That's horrible. Could be worse though. At least there are delimiters in there, like "-" and "/".
Short Answer
Yes, there's an easy way to parse that in R. Apply parse_date_time() separately to each birth date, giving it a decent orders list to choose from, and carefully set the order of the guesses. You'll need to convert the "integer-time" to a useful time when you're done.
See the Long Answer for details.
Long Answer
This is why the lubridate package has parse_date_time(). But there are problems. Let's see:
require(lubridate)
# WRONG! doesn't work as intended.
as.Date(
parse_date_time(data$birth_date,
orders=c("ymd", "mdy", "mdY", "Ymd")
)
)
[1] "1993-08-09" "1995-02-21" "1994-01-15" "0088-12-30"
That looks great, except for the last one. What's going on?
parse_date_time() is selecting a "best fit" set of orders and formats to use when parsing the dates, and the last element is the odd one out.
To make this work as intended, you'll need to apply parse_date_time() one-by-one to each date, because each date format was apparently selected more-or-less at random. This will be slower, but it will give more useful answers.
# RIGHT. Some conversion of results required.
parsed <- sapply(data[,"birth_date"],
parse_date_time,
orders=c("ymd", "mdy", "mdY", "Ymd") )
parsed
08/09/1993 1995-02-21 01-15-94 88-12-30
744854400 793324800 758592000 599443200
Ok, those look like Unix-time integers, which are the unclass()'d version of what parse_date_time() produces. And none are negative, so they must all have happened after 1970. This is encouraging. Convert:
# Conversion of results
parsed <- as.POSIXct(parsed, origin="1970-01-01", tz = "GMT")
as.Date(parsed)
08/09/1993 1995-02-21 01-15-94 88-12-30
"1993-08-09" "1995-02-21" "1994-01-15" "1988-12-30"
lubridate and parse_date_time() are very good at what they do.
Since you asked for POSIXlt, not Date types:
as.POSIXlt(parsed)
08/09/1993 1995-02-21
"1993-08-09 10:00:00 AEST" "1995-02-21 11:00:00 AEDT"
01-15-94 88-12-30
"1994-01-15 11:00:00 AEDT" "1988-12-30 11:00:00 AEDT"
Though I personally prefer only having dates when the actual time isn't important; these are assumed to be all happening at midnight UTC, and are converted to my time zone (Eastern Australia).

convert string to time in r

I have an array of time strings, for example 115521.45 which corresponds to 11:55:21.45 in terms of an actual clock.
I have another array of time strings in the standard format (HH:MM:SS.0) and I need to compare the two.
I can't find any way to convert the original time format into something useable.
I've tried using strptime but all it does is add a date (the wrong date) and get rid of time decimal places. I don't care about the date and I need the decimal places:
for example
t <- strptime(105748.35, '%H%M%OS') = ... 10:57:48
using %OSn (n = 1,2 etc) gives NA.
Alternatively, is there a way to convert a time such as 10:57:48 to 105748?
Set the options to allow digits in seconds, and then add the date you wish before converting (so that the start date is meaningful).
options(digits.secs=3)
strptime(paste0('2013-01-01 ',105748.35), '%Y-%m-%d %H%M%OS')
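If the date really doesn't matter, a different route from strptime is to turn both formats into seconds since midnight and compare the numbers. This is only a sketch; the helper names hhmmss_to_seconds and clock_to_seconds are hypothetical:

```r
# Hypothetical helper: "115521.45" (HHMMSS.ss) -> seconds since midnight.
hhmmss_to_seconds <- function(x) {
  x  <- as.numeric(x)
  hh <- x %/% 10000
  mm <- (x %% 10000) %/% 100
  ss <- x %% 100
  hh * 3600 + mm * 60 + ss
}

# Hypothetical helper: "11:55:21.45" (HH:MM:SS.ss) -> seconds since midnight.
clock_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

a <- hhmmss_to_seconds("115521.45")
b <- clock_to_seconds("11:55:21.45")
same <- isTRUE(all.equal(a, b))  # tolerant comparison for floating point

# The reverse question, "10:57:48" to "105748", is a plain substitution.
compact <- gsub(":", "", "10:57:48", fixed = TRUE)
```

all.equal is used rather than == because the fractional seconds are floating-point values.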

Converting time format to numeric with R

In most cases, we convert numeric time to POSIXct format using R. However, if we want to compare two time points, then we would prefer the numeric time format. For example, I have a date format like "2001-03-13 10:31:00",
begin <- "2001-03-13 10:31:00"
Using R, I want to covert this into a numeric (e.g., the Julian time), perhaps something like the passing seconds between 1970-01-01 00:00:00 and 2001-03-13 10:31:00.
Do you have any suggestions?
The Julian calendar began in 45 BC (709 AUC) as a reform of the Roman calendar by Julius Caesar. It was chosen after consultation with the astronomer Sosigenes of Alexandria and was probably designed to approximate the tropical year (known at least since Hipparchus). see http://en.wikipedia.org/wiki/Julian_calendar
If you just want to remove ":" , " ", and "-" from a character vector then this will suffice:
end <- gsub("[: -]", "" , begin, perl=TRUE)
#> end
#[1] "20010313103100"
You should read the section about 1/4 of the way down in ?regex about character classes. Since the "-" is special in that context as a range operator, it needs to be placed first or last.
After your edit then the answer is clearly what @joran wrote, except that you would need first to convert to a DateTime class:
as.numeric(as.POSIXct(begin))
#[1] 984497460
The other point to make is that comparison operators do work for Date and DateTime classed variables, so the conversion may not be necessary at all. This compares 'begin' to a time one second later and correctly reports that begin is earlier:
as.POSIXct(begin) < as.POSIXct(begin) +1
#[1] TRUE
Based on the revised question this should do what you want:
begin <- "2001-03-13 10:31:00"
as.numeric(as.POSIXct(begin))
The result is a unix timestamp, the number of seconds since epoch, assuming the timestamp is in the local time zone.
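Since the result depends on the time zone, here is a short sketch that pins it to UTC so the number is the same on every machine (the choice of UTC is an assumption; use whichever zone the data was recorded in):

```r
begin <- "2001-03-13 10:31:00"

# Interpret the string as UTC rather than the session's local time zone.
secs_utc <- as.numeric(as.POSIXct(begin, tz = "UTC"))

# Round trip: seconds since the epoch back to a formatted string.
back <- format(as.POSIXct(secs_utc, origin = "1970-01-01", tz = "UTC"),
               "%Y-%m-%d %H:%M:%S")
```

Here secs_utc is 984479460, which is 18000 seconds (five hours) less than the 984497460 shown earlier, consistent with that value having been computed in a UTC-5 local zone.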
Maybe this could also work:
library(lubridate)
...
df <- '24:00:00'
as.numeric(hms(df))
hms() will convert your data from one time format into another, this will let you convert it into seconds. See full documentation.
I tried this because I had trouble with data that was in that format but ran over 24 hours.
The example from ?as.POSIX help gives
as.POSIXct(strptime(begin, "%Y-%m-%d %H:%M:%S"))
so for you it would be
as.numeric(as.POSIXct(strptime(begin, "%Y-%m-%d %H:%M:%S")))

How to add/subtract time from a POSIXlt time while keeping its class in R?

I am manipulating some POSIXlt DateTime objects. For example I would like to add an hour:
my.lt = as.POSIXlt("2010-01-09 22:00:00")
new.lt = my.lt + 3600
new.lt
# [1] "2010-01-09 23:00:00 EST"
class(new.lt)
# [1] "POSIXct" "POSIXt"
The thing is I want new.lt to be a POSIXlt object. I know I could use as.POSIXlt to convert it back to POSIXlt, but is there a more elegant and efficient way to achieve this?
POSIXct-classed objects are internally a numeric value that allows numeric calculations. POSIXlt-objects are internally lists. Unfortunately for your desires, Ops.POSIXt (which is what is called when you use "+") coerces to POSIXct with this code:
if (inherits(e1, "POSIXlt") || is.character(e1))
e1 <- as.POSIXct(e1)
Fortunately, if you just want to add an hour there is a handy alternative to adding 3600. Instead use the list structure and add 1 to the hour element:
> my.lt$hour <- my.lt$hour +1
> my.lt
[1] "2010-01-09 23:00:00"
This approach is very handy when you want to avoid thorny questions about DST changes, at least if you want adding days to give you the same time-of-day.
Edit (adding @sunt's code demonstrating that Ops.POSIXlt is careful with time "overflow".)
my.lt = as.POSIXlt("2010-01-09 23:05:00")
my.lt$hour=my.lt$hour+1
my.lt
# [1] "2010-01-10 00:05:00"
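A short sketch of what happens to the list fields in the overflow case above, with the time zone pinned to UTC as an assumption to keep DST out of the picture:

```r
my.lt <- as.POSIXlt("2010-01-09 23:05:00", tz = "UTC")
my.lt$hour <- my.lt$hour + 1

# The object keeps its class, but the stored field is now out of range.
still_lt <- inherits(my.lt, "POSIXlt")
raw_hour <- my.lt$hour

# A round trip through POSIXct gives a POSIXlt with normalised fields.
normalised <- as.POSIXlt(as.POSIXct(my.lt))
```

So the printed value rolls over to the next day, while the stored hour stays at 24 until the object is rebuilt, which matters if later code reads the fields directly.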
Short answer: No
Long answer:
POSIXct and POSIXlt objects are two specific types of the more general POSIXt class (not in a strictly object oriented inheritance sense, but in a quasi-object oriented implementation sense). Code freely switches between these. When you add to a POSIXlt object, the actual function used is +.POSIXt, not one specifically for POSIXlt. Inside this function, the argument is converted into a POSIXct and then dealt with (added to).
Additionally, POSIXct is the number of seconds from a specific date and time. POSIXlt is a list of date parts (seconds, minutes, hours, day of month, month, year, day of week, day of year, DST info) so adding to that directly doesn't make any sense. Converting it to a number of seconds (POSIXct) and adding to that does make sense.
It may not be significantly more elegant, but
seq.POSIXt( from=Sys.time(), by="1 hour", length.out=2 )[2]
IMHO is more descriptive than
Sys.time()+3600; # 60 minutes * 60 seconds
because the code itself documents that you're going for a "POSIX" "seq"uence incremented "by 1 hour", but it's a matter of taste. Works just fine on POSIXlt, but note that it returns a POSIXct either way. Also works for "days". See help(seq.POSIXt) for details on how it handles months, daylight savings, etc.
?POSIXlt tells you that:
Any conversion that needs to go between the two date-time classes requires a timezone: conversion from "POSIXlt" to "POSIXct" will validate times in the selected timezone.
So I guess that, since 3600 is not a POSIXlt object, there is an automatic conversion.
I would stick with simple:
new.lt = as.POSIXlt(my.lt + 3600)
class(new.lt)
[1] "POSIXlt" "POSIXt"
It's not that much of a hassle to add as.POSIXlt before your time operation.

Resources