How to quickly convert different time formats in large data frames? - r

I want to calculate lengths in different time dimensions, but I have problems dealing with the two slightly different time formats in my data frame column.
The original data frame column has about a million rows with the two formats (shown in the example code) mixed together.
Example code:
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")
length <- c(15.8, 132.1, 12.5, 33.2)
df <- data.frame(time, length)
df$time <- format(as.POSIXlt(strptime(df$time,"%Y-%m-%dT%H:%M:%SZ", tz="")))
df
The formats "2018-10-04T12:13:41.333Z" and "2018-10-04T12:13:45.479Z" result in NA.
Is there a solution that would also be applicable to a big data frame where the two formats are mixed up?

We may use %OS instead of %S to account for decimals in seconds.
help("strptime")
Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0).
as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"
This base R code is considerably faster than the package solutions; try it yourself.
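The fractional seconds are parsed but not shown by default; if you want them printed, set the digits.secs option (documented in ?strptime):
options(digits.secs = 3)  # show milliseconds when printing POSIXct
as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# fractional seconds are now displayed; the last digit can be off by one
# due to the floating-point representation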
Update 1
time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")
This one is trickier. ?strptime says we should use %z for offsets from UTC, but somehow it won't work with as.POSIXct. Instead we could do this:
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
  {os <- as.numeric(strsplit(substring(time2, 24), ":")[[1]])  # offset of the first element
   (os[1]*60 + os[2])*60}
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"
which cuts the unreadable part from the string, converts it to seconds, and adds it to the "POSIXct" object. Note that this applies the first element's UTC offset to the whole vector.
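A vectorized variant (my own sketch, not part of the original answer) that mirrors the same add-the-offset logic but uses each string's own offset, via base utils::strcapture():
off <- strcapture("([+-])(\\d{2}):(\\d{2})$", time2,
                  proto = data.frame(sign = character(), h = integer(), m = integer()))
secs <- ifelse(off$sign == "-", -1, 1) * (off$h * 3600 + off$m * 60)
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + secs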
If the offsets contain only whole hours, as in time2, we could also say:
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"
That the code is slightly longer now should not obscure the fact that it runs practically as fast as the one at the top of the answer.
Update 2
You could wrap the three variants into a function that splits the input by nchar() and converts each group, such as this one:
fixDateTime <- function(x) {
  s <- split(x, nchar(x))
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20`, format="%Y-%m-%dT%H:%M:%SZ")
  if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ")
  if ("29" %in% names(s))
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
      {os <- as.numeric(strsplit(substring(s$`29`, 24), ":")[[1]])
       (os[1]*60 + os[2])*60}
  return(unsplit(s, nchar(x)))
}
res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 CEST" "2018-10-04 00:00:00 CEST" "2018-10-01 00:00:00 CEST"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 12:13:41" "2018-10-01 14:42:37"
Compared to the packages, only fixDateTime can handle all three of the date-time formats defined in the Data section. According to the concluding benchmark, the function is still very fast.
Note: The function logically fails if different date formats share the same nchar; in that case it should be customized (e.g. by another split condition, see the sketch below)! Not tested: daylight saving time behavior when adding seconds to POSIXct.
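One possible guard (my own sketch, not from the answer): dispatch on regular expressions instead of nchar(), so that same-length formats cannot collide:
grp <- ifelse(grepl("[+-]\\d{2}:\\d{2}$", time3), "offset",
       ifelse(grepl("\\.\\d+Z$", time3), "fractional", "plain"))
split(time3, grp)  # then convert each group as in fixDateTime()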
Benchmark
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# fixDateTime 35.46387 35.94761 40.07578 36.05923 39.54706 68.46211 10 c
# as.POSIXct 20.32820 20.45985 21.00461 20.62237 21.16019 23.56434 10 b # to compare
# lubridate 11.59311 11.68956 12.88880 12.01077 13.76151 16.54479 10 a # produces NAs!
# anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272 10 d # produces NAs!
Data
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z")
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
"2018-10-01T11:42:37.000+03:00")
Benchmark code
n <- 1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)
library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime = fixDateTime(t2),
                               as.POSIXct = as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
                               lubridate = parse_date_time(t2, "ymd_HMS"),
                               anytime = anytime(t2),
                               times = 10L)

You can use the anytime package:
library(anytime)
time<- c("2018-07-29T15:02:05Z",
"2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
anytime(time)
#[1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST" "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"

Or you can use lubridate:
time<- c("2018-07-29T15:02:05Z",
"2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
length<-c(15.8,132.1,12.5,33.2)
df<-data.frame(time,length)
library(lubridate)
# df$time2<-as_datetime(df$time)
df$time2 <-parse_date_time(df$time, "ymd_HMS")
df

Related

Converting timestamp in microseconds to date and time in r

I'm trying to convert a timestamp in microseconds to the following format in R:
YYYY-MM-DD HH:MM:SS
I've tried different approaches but couldn't succeed. Here is my code:
options(digits=16)
value = 1521222492687300
as.POSIXct(value, tz = "UTC", origin="1970-01-01 00:00:00")
And I get this as return:
[1] "48207591-10-13 12:15:00 UTC"
Even when divided by 1000, as some posts suggested, I'm still getting a nonsensical result:
as.POSIXct(value/1000, tz = "UTC", origin="1970-01-01 00:00:00")
[1] "50175-08-15 19:31:27.300048 UTC"
Any suggestion to solve this problem?
As Gabor hinted, you need to divide by 1e6, not 1e3:
R> v <- 1521222492687300
R> v
[1] 1.52122e+15
R> anytime::anytime(v / 1e6)
[1] "2018-03-16 12:48:12.6872 CDT"
R>
The same of course works with as.POSIXct etc., but you need to supply the (redundant) origin:
R> as.POSIXct(v / 1e6, origin="1970-01-01")
[1] "2018-03-16 12:48:12.6872 CDT"
R>
One way to see your scale is to convert current time:
R> w <- as.numeric(Sys.time())
R> c(v, w)
[1] 1.52122e+15 1.52346e+09
R>
which makes the scaling difference more obvious.
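Wrapped as a small helper (the name us_to_posixct is mine, not from the answer), with sub-second printing turned on:
options(digits.secs = 6)  # display the sub-second part
us_to_posixct <- function(us, tz = "UTC") {
  as.POSIXct(us / 1e6, origin = "1970-01-01", tz = tz)
}
us_to_posixct(1521222492687300)
# 2018-03-16 17:48:12.6873 UTC (trailing digits subject to double precision)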

Round time by X hours in R?

While doing predictive modeling on timestamped data, I want to write a function in R (possibly using data.table) that rounds the date by X number of hours. E.g. rounding by 2 hours should give this:
"2014-12-28 22:59:00 EDT" becomes "2014-12-28 22:00:00 EDT"
"2014-12-28 23:01:00 EDT" becomes "2014-12-29 00:00:00 EDT"
It's very easy to do when rounding by 1 hour, using the round.POSIXt(.date, "hour") function.
However, writing a generic function like I'm doing below, with multiple if statements, gets quite ugly:
d7.dateRoundByHour <- function (.date, byHours) {
  if (byHours == 1)
    return (round.POSIXt(.date, "hour"))
  hh = hour(.date); dd = mday(.date); mm = month(.date); yy = year(.date)
  hh = round(hh/byHours, digits=0) * byHours
  if (hh >= 24) {
    hh = 0; dd = dd+1
  }
  if ((mm==2 & dd==28) |
      (mm %in% c(1,3,5,7,8,10,12) & dd==31) |
      (mm %in% c(2,4,6,9,11) & dd==30)) { # NB: it won't work on 29 Feb in a leap year.
    dd = 1; mm = mm+1
  }
  if (mm==13) {
    mm = 1; yy = yy+1
  }
  str = sprintf("%i-%02.0f-%02.0f %02.0f:%02.0f:%02.0f EDT", yy, mm, dd, hh, 0, 0)
  as.POSIXct(str, format="%Y-%m-%d %H:%M:%S")
}
Can anyone show a better way to do that?
(Perhaps by converting to numeric and back to POSIXt, or with some other POSIXt functions?)
Use the round_date function from the lubridate package. Assuming you had a data.table with a column named date you could do the following:
dt[, date := round_date(date, '2 hours')]
A quick example will give you exactly the results you were looking for:
x <- as.POSIXct("2014-12-28 22:59:00 EDT")
round_date(x, '2 hours')
This is actually really easy with just base R. The basic idea for rounding by "odd lots" is that you
scale down by an appropriate scale factor,
round to the nearest integer in the downscaled unit,
scale back up and re-convert.
Or in two R code statements:
R> pt <- as.POSIXct(c("2014-12-28 22:59:00", "2014-12-28 23:01:00 EDT"))
R> pt # just to check
[1] "2014-12-28 22:59:00 CST" "2014-12-28 23:01:00 CST"
R>
R> scalefactor <- 60*60*2 # 2 hours of 60 minutes times 60 seconds
R>
R> as.POSIXct(round(as.numeric(pt)/scalefactor) * scalefactor, origin="1970-01-01")
[1] "2014-12-28 22:00:00 CST" "2014-12-29 00:00:00 CST"
R>
The key last line does just what I outlined: it converts the POSIXct to a numeric representation, scales it down, rounds, then scales back up and converts to a POSIXct again.
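Wrapped up as a reusable helper (a sketch of my own, not from the answer):
round_by_hours <- function(pt, hours) {
  sf <- hours * 3600                  # seconds per rounding unit
  tz <- attr(pt, "tzone"); if (is.null(tz)) tz <- ""
  as.POSIXct(round(as.numeric(pt) / sf) * sf, origin = "1970-01-01", tz = tz)
}
round_by_hours(pt, 2)
# [1] "2014-12-28 22:00:00 CST" "2014-12-29 00:00:00 CST"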

R drops hours, minutes, and seconds from date

While converting a dataframe to xts I realized that there is something wrong with the formatter. Here's an example dataframe:
effective_date price
"1990-01-01" "100"
"1990-01-02 00:05:00" "200"
This is example output from a package that I use.
Converting this to xts is straight-forward
xts(df["price"], order_by=as.POSIXct(df["effective_date"], format="%Y-%m-%d %H:%M:%S")
However this errors out, saying NAs can't be in row names, and the result is:
<NA> 100
1990-01-02 00:05:00 200
Obviously xts can't figure out what to do with the weird date there (midnight) and it won't coerce it.
If I add tz="UTC" to as.POSIXct it doesn't work. Additionally, as.POSIXlt doesnt change anything here either.
What can I do to coerce that midnight date to the correct format?
Two issues:
1) You cannot parse a date alone as POSIXct with a given format:
R> as.POSIXct(c("2017-01-02", "2017-01-03 04:05:06"), format="%Y-%m-%d %H:%M:%S")
[1] NA "2017-01-03 04:05:06 CST"
R>
2) You can however use the anytime() function to do it:
R> anytime::anytime(c("2017-01-02", "2017-01-03 04:05:06"))
[1] "2017-01-02 00:00:00 CST" "2017-01-03 04:05:06 CST"
R>
Once you have a POSIXct, forming the xts is easy.
Also note that you have typos: you need a comma before the column indicator: df[, "price"].
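Putting it together (a sketch based on the question's data; note the xts argument is order.by, and you may also want as.numeric() on the price column):
library(xts)
library(anytime)
x <- xts(as.numeric(df[, "price"]),
         order.by = anytime(as.character(df[, "effective_date"])))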
Edit: I'm getting a little tired of #42's comment about Gabor's (fine) solution "dominating" this one, so here's a minimal benchmark:
R> library(microbenchmark)
R> v <- c("2017-01-02", "2017-01-03 04:05:06")
R> library(anytime)
R> print(microbenchmark(anytime(v), do.call("c", lapply(v, as.POSIXct))), digits=3)
Unit: microseconds
expr min lq mean median uq max neval cld
anytime(v) 33.6 36.8 42.1 45.6 46.6 80.7 100 a
do.call("c", lapply(v, as.POSIXct)) 571.5 579.1 586.4 586.8 589.5 695.7 100 b
R>
so in short "not really". It is using only R Base, which is a plus, put it is a) harder read and understand, b) more limited as it deals with exactly one format (in ISO style) and c) it is about thirteen times slower.
1) To get the "POSIXct" datetime vector try converting each datetime to "POSIXct" separately and then concatenate them together:
do.call("c", lapply(df$effective_date, as.POSIXct))
2) Another base solution that is even shorter and also substantially faster is the following, which relies on the fact that as.POSIXct will ignore junk at the end:
as.POSIXct(paste(df$effective, "00:00:00"))
Most of lubridate's parsing functions have a truncated parameter that takes the number of elements that may be missing from the end. Missing elements are replaced by zero.
Example with the data at hand:
lubridate::ymd_hms(c("2017-01-02", "2017-01-03 04:05:06"), truncated = 3)
## [1] "2017-01-02 00:00:00 UTC" "2017-01-03 04:05:06 UTC"
Assuming you want the timestamps, preprocess with something like:
temp <- c("1990-01-01", "1990-01-02 00:05:00")
# match a date string at the end of the string (indicated by $) and replace it
# with the full match (indicated by \\1) followed by " 00:00:00"
temp2 <- gsub("(\\d{4}\\-\\d{2}\\-\\d{2}$)", "\\1 00:00:00", temp)
# [1] "1990-01-01 00:00:00" "1990-01-02 00:05:00"

Convert Date with special format using R

I have several variables that exist in the following format:
/Date(1353020400000+0100)/
I want to convert this format to ddmmyyyy. I found a solution for the same problem using PHP, but I don't know anything about PHP, so I'm unable to adapt it; I need a solution I can use in R.
Any suggestions?
Thanks.
If the format is milliseconds since the epoch then anytime() or as.POSIXct() can help you:
R> anytime(1353020400000/1000)
[1] "2012-11-15 17:00:00 CST"
R> anytime(1353020400.000)
[1] "2012-11-15 17:00:00 CST"
R>
anytime() converts to local time, which is Chicago for me. You would have to deal with the UTC offset separately.
Base R can do it too, but you need the dreaded origin:
R> as.POSIXct(1353020400.000, origin="1970-01-01")
[1] "2012-11-15 17:00:00 CST"
R>
As far as I can tell from the linked question, this is milliseconds since the epoch:
x <- "/Date(1353020400000+0100)/"
spl <- strsplit(x, "[()+]")
as.POSIXct(as.numeric(sapply(spl,`[[`,2)) / 1000, origin="1970-01-01", tz="UTC")
#[1] "2012-11-15 23:00:00 UTC"
If you want to pick up the timezone difference as well, here's an attempt:
x <- "/Date(1353020400000+0100)/"
spl <- strsplit(x, "(?=[+-])|[()]", perl=TRUE)
tzo <- sapply(spl, function(x) paste(x[3:4],collapse="") )
dt <- as.POSIXct(as.numeric(sapply(spl,`[[`,2)) / 1000, origin="1970-01-01", tz="UTC")
as.POSIXct(paste(format(dt), tzo), tz="UTC", format = '%F %T %z')
#[1] "2012-11-15 22:00:00 UTC"
The lubridate package can come to the rescue as follows:
as.Date("1970-01-01") + lubridate::milliseconds(1353020400000)
Read: number of milliseconds since the epoch (1 January 1970, UTC+0).
A parsing function can now be made using regular expressions:
parse.myDate <- function(text) {
  num <- as.numeric(stringr::str_extract(text, "(?<=/Date\\()\\d+"))
  as.Date("1970-01-01") + lubridate::milliseconds(num)
}
Finally, format the date with:
format(theDate, "%d/%m/%Y %H:%M")
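For example, using the parse.myDate function just defined:
theDate <- parse.myDate("/Date(1353020400000+0100)/")
format(theDate, "%d/%m/%Y %H:%M")
# "15/11/2012 23:00" (in UTC; this variant ignores the +0100 offset)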
If you also need the time zone information, you can use this instead:
parse.myDate <- function(text) {
  parts <- stringr::str_match(text, "^/Date\\((\\d+)([+-])(\\d{4})\\)/$")
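  # caution (my note): POSIX "Etc/GMT" zone names have inverted signs
  # (Etc/GMT+2 means UTC-2), so verify the sign direction against your data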
  as.POSIXct(as.numeric(parts[,2])/1000, origin = "1970-01-01",
             tz = paste0("Etc/GMT", parts[,3], as.integer(parts[,4])/100))
}

Parse "next day" time

Is there a function (built-in or packaged) that would allow parsing a time like "25:15:00" as "1:15 on the next day"? Unfortunately, as.POSIXct doesn't like it with the %X specification (equivalent to %H:%M:%S):
> as.POSIXct('25:15:00', format='%X')
[1] NA
> as.POSIXct('15:15:00', format='%X')
[1] "2013-05-24 15:15:00 CEST"
and I can't find a suitable conversion specification in the strptime docs.
Not thoroughly tested, but you can try this function:
parse_time <- function(x, format = "%X") {
  hour <- as.numeric(substr(x, 1, 2))
  delta <- ifelse(hour >= 24, 24 * 3600, 0)
  hour <- hour %% 24
  date <- paste0(hour, substr(x, 3, nchar(x)))
  strptime(date, format = format) + delta
}
parse_time(c('25:15:00', "23:10:00"))
##[1] "2013-05-25 01:15:00 GMT" "2013-05-24 23:10:00 GMT"
Now there is:
library(devtools)
install_github("krlmlr/kimisc")
library(kimisc)
hms.to.seconds('25:15:00')
It uses a slightly different approach than dickoa's code: The argument is filtered by gsub using a suitable regular expression, and the actual conversion doesn't involve strptime at all. See the code.
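For comparison, a minimal base-R equivalent of the hours-minutes-seconds conversion (my own sketch, not the kimisc implementation):
hms_to_seconds <- function(x) {
  vapply(strsplit(x, ":"), function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]   # hours beyond 24 are fine here
  }, numeric(1))
}
hms_to_seconds("25:15:00")
# [1] 90900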
