Parsing ambiguous timestamps - r

I have been provided a dataset with an ambiguous date format, e.g:
d_raw <- c("1102001 23:00", "1112001 0:00")
I would like to try to parse this date into a POSIXlt object in R. The source of the file assures me that the file is in chronological order, that the date format is month, then day, then year, and that there are no gaps in the time series.
Is there any way to parse this date format, using the ordering to resolve ambiguities? E.g. the first element above should parse to c("2001-01-10 23:00:00", "2001-01-11 00:00:00") rather than c("2001-01-10 23:00:00", "2001-11-01 00:00:00").

How about this (using regular expressions)
d_raw <- c("192001 16:00", "1102001 23:00", "1112001 0:00")
re <- "^(.+?)([1-9]|[1-3][0-9])(\\d{4}) (\\d{1,2}):(\\d{2})$"
m <- regexec(re, d_raw)
parts <- regmatches(d_raw, m)
lapply(parts, function(x) {
x<-as.numeric(x[-1])
ISOdate(x[3], x[1], x[2], x[4], x[5])
})
# [[1]]
# [1] "2001-01-09 16:00:00 GMT"
#
# [[2]]
# [1] "2001-01-10 23:00:00 GMT"
#
# [[3]]
# [1] "2001-01-11 GMT"
If you had more test cases that would be helpful just to make sure the regular expression correctly works.

I pity you for your horrible data vendor, so I decided to try and fix this for you.
# make up some horrid data
d_bad <- as.POSIXlt(seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by=1))
d_raw <- paste0(d_bad$mon+1, d_bad$mday, d_bad$year+1900)
d_new <- d_raw
# not ambiguous when nchar is 6
d_new <- ifelse(nchar(d_new)==6,
paste0("0", substr(d_new,1,1), "0", substr(d_new,2,nchar(d_new))), d_new)
# now not ambiguous when nchar is 7 and it doesn't begin with a "1"
d_new <- ifelse(nchar(d_new)==7 & substr(d_new,1,1) != "1",
paste0("0",d_new), d_new)
# now guess a leading zero and parse
d_new <- ifelse(nchar(d_new)==7, paste0("0",d_new), d_new)
d_try <- as.Date(d_new, "%m%d%Y")
# now only days in October, November, and December might be wrong
bad <- cumsum(c(1L,as.integer(diff(d_try)))-1L) < 0L
# put the leading zero in the day, but remember "bad" rows have an
# extra leading zero, so make sure to skip it
d_try2 <- ifelse(bad,
paste0(substr(d_new,2,3),"0", substr(d_new,4,nchar(d_new))), d_new)
# convert to Date, POSIXlt, whatever and do a happy dance
d_YAY <- as.Date(d_try2, "%m%d%Y")
data.frame(d_raw, d_new, d_try, bad, d_try2, d_YAY)
# d_raw d_new d_try bad d_try2 d_YAY
# 1 112014 01012014 2014-01-01 FALSE 01012014 2014-01-01
# 2 122014 01022014 2014-01-02 FALSE 01022014 2014-01-02
# 3 132014 01032014 2014-01-03 FALSE 01032014 2014-01-03
# 4 142014 01042014 2014-01-04 FALSE 01042014 2014-01-04
# 5 152014 01052014 2014-01-05 FALSE 01052014 2014-01-05
# 6 162014 01062014 2014-01-06 FALSE 01062014 2014-01-06
I only did this with Dates in order to keep the example data set small. Doing this for POSIXlt would be very similar, except you would need to change the as.Date calls to as.POSIxlt and adjust the format accordingly.

Related

Converting character to dates with hours and minutes

I'm having trouble converting character values into date (hour + minutes), I have the following codes:
start <- c("2022-01-10 9:35PM","2022-01-10 10:35PM")
end <- c("2022-01-11 7:00AM","2022-01-11 8:00AM")
dat <- data.frame(start,end)
These are all in character form. I would like to:
Convert all the datetimes into date format and into 24hr format like: "2022-01-10 9:35PM" into "2022-01-10 21:35",
and "2022-01-11 7:00AM" into "2022-01-11 7:00" because I would like to calculate the difference between the dates in hrs.
Also I would like to add an ID column with a specific ID, the desired data would like this:
ID <- c(101,101)
start <- c("2022-01-10 21:35","2022-01-10 22:35")
end <- c("2022-01-11 7:00","2022-01-11 8:00")
diff <- c(9,10) # I'm not sure how the calculations would turn out to be
dat <- data.frame(ID,start,end,diff)
I would appreciate all the help there is! Thanks!!!
You can use lubridate::ymd_hm. Don't use floor if you want the exact value.
library(dplyr)
library(lubridate)
dat %>%
mutate(ID = 101,
across(c(start, end), ymd_hm),
diff = floor(end - start))
start end ID diff
1 2022-01-10 21:35:00 2022-01-11 07:00:00 101 9 hours
2 2022-01-10 22:35:00 2022-01-11 08:00:00 101 9 hours
The base R approach with strptime is:
strptime(dat$start, "%Y-%m-%d %H:%M %p")
[1] "2022-01-10 09:35:00 CET" "2022-01-10 10:35:00 CET"

How to calculate time difference in milliseconds using R when formats are different?

I have a problem in R that is killing me! Can you help me?
I found a question in StackOverflow that gave me a very good explanation.
Here is the link: How to parse milliseconds?
I was able to implement the following code that works very well.
z2 <- strptime("10/2/20 11:16:17.682", "%d/%m/%y %H:%M:%OS")
z1 <- strptime("10/2/20 11:16:16.683", "%d/%m/%y %H:%M:%OS")
When I calculate z2-z1, I get
Time difference of 0.9989998 secs
Similarly, when I use
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
z4 <- strptime("130 11:16:18.682", "%j %H:%M:%OS")
When I calculate z4-z3, I get
Time difference of 1.999 secs
What is my problem?
The first column has the format 130 18:25:50.408, with millions of rows!!!
The second column has the format 2020 130 18:25:51.357 that is like the first column but has the year 2020.
The first column is also from 2020, but as the year is not there R uses the current year.
First question,
How can I substract both columns? I know how to substract columns.
What I do not know is to subtract these two times.
For example, second time is 2020 130 18:25:51.357
and first time is 130 18:25:50.408
I guess that I can do it programmatically converting it to a string, and eliminating the 2020. However, I am hoping that a quicker solution is available using base R or the lubridate package.
Second question,
"%j %H:%M:%OS" is the format for 130 11:16:16.683
What is the format for 2020 130 18:25:51.357?
As explained before this is working very well:
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
But, this is NOT working.
z7 <- strptime("2020 130 11:16:16.683", "%y %j %H:%M:%OS")
UPDATE 1
I solved the second question!
However, I have not figured out yet the first question.
For the second question, the mistake in the format was that instead of %y, I need to write %Y with upper case.
Here is one example:
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("2020 130 11:16:16.684", "%Y %j %H:%M:%OS")
difftime(later,earlier,units="secs")
The R results is:
Time difference of 0.9990001 secs
UPDATE 2
At this point, what is pending is the following:
I need to substract two times that were made the same day on 2020.
The second time does have the year, the first time does not.
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("130 11:16:16.684", "%j %H:%M:%OS")
difftime(later,earlier,units="secs")
R produces the following result:
Time difference of -31622399 secs
Why? As we are on 2021, R formats the vector earlier as the current year, 2021 because the year is not there.
My columns has millions of rows.
At this point, my guess is that I would need to add 2020 with a concatenation or something like that. Is there any other method?
Thank you for your help!
Your object z2 is a POSIX list object. What this means is that it is a list of the time elements of your time.
print.default(z2)
# $sec
# [1] 17.682
#
# $min
# [1] 16
#
# $hour
# [1] 11
#
# $mday
# [1] 10
#
# $mon
# [1] 1
#
# $year
# [1] 120
#
# $wday
# [1] 1
#
# $yday
# [1] 40
#
# $isdst
# [1] 0
#
# $zone
# [1] "GMT"
#
# $gmtoff
# [1] NA
#
# attr(,"class")
# [1] "POSIXlt" "POSIXt"
When you do a subtraction, z2 - z1 R dispatches this operation to a function called -.POSIXt, which itself calls difftime. This function converts z2 to a POSIX count object. What this means is that it gets converted to a count of seconds since the beginning of the epoch, by default "1970-01-01".
options("digits" = 16)
print.default(as.POSIXct(z2))
# [1] 1581333377.682
# attr(,"class")
# [1] "POSIXct" "POSIXt"
# attr(,"tzone")
# [1] ""
difftime(z2, z1)
# Time difference of 0.9989998340606689 secs
R, like most software, works with double precision numerics. This means that arithmetic is imprecise, although approximately true. Most software will try to hide this imprecision by reducing the number of digits shown. That said, different numbers will give you different imprecision, so you might prefer referring directly to the list element of z2.
print.default(z2$sec - z1$sec)
# [1] 0.9989999999999988
You could therefore apply the time difference using your favourite data.frame tools.
options("digits" = 6)
# character columns
df1 <- data.frame(
col1 = c("10/2/20 11:16:17.682", "10/2/20 11:16:16.683"),
col2 = c("130 11:16:16.683", "130 11:16:18.682"),
stringsAsFactors = FALSE)
library(dplyr)
# convert columns to POSIXlt
df2 <- mutate(df1,
col1 = strptime(col1, "%d/%m/%y %H:%M:%OS"),
col2 = strptime(stringr::str_c("2020 ", col2), "%Y %j %H:%M:%OS"),
diff_days = unclass(difftime(col2, col1, units = "days")))
df2
# col1 col2 diff_days
# 1 2020-02-10 11:16:17 2020-05-09 11:16:16 88.9583
# 2 2020-02-10 11:16:16 2020-05-09 11:16:18 88.9584

Subsetting a column in data frame of class chron by a vector of dates

I have a dataframe like the one given by:
x <- c(1:6)
y <- c("06/01/13 16:00:00",
"06/01/13 16:00:00",
"06/03/13 20:00:00",
"06/03/13 20:00:00",
"06/07/13 20:00:00",
"06/08/13 20:00:00")
dfrm <- data.frame(x,y)
dfrm
x y
1 06/01/13 16:00:00
2 06/01/13 16:00:00
3 06/03/13 20:00:00
4 06/03/13 20:00:00
5 06/07/13 20:00:00
6 06/08/13 20:00:00
I want to make y a chron object:
dfrm$y <- as.chron(dfrm$y, "%m/%d/%y %H:%M")
Then I have a vector of dates:
intensives <- c("06/01/13", "06/07/13")
Then I want to subset the data frame "dfrm" by the dates in the "intensives" vector.
What I would do it would something like:
subset(dfrm, y==dates(intensives))
or
subset(dfrm, y %in% dates(intensives))
but both give me a null result.
Note:In most person's setups where stringAsFactors=TRUE that conversion to chron would have failed. They would need to do this:
dfrm$y <- as.chron(as.character(dfrm$y), "%m/%d/%y %H:%M")
date-objects are not chron-objects, but chron objects can be coerced with the dates function
subset(dfrm, dates(y) %in% dates(intensives))
x y
1 1 (06/01/13 16:00:00)
2 2 (06/01/13 16:00:00)
5 5 (06/07/13 20:00:00)
That's because you're comparing datetimes to dates.
Do subset(dfrm, dates(y) %in% dates(intensives)) instead.
You first subset using == will never work, regardless of data type.

How to merge couples of Dates and values contained in a unique csv

We have a csv file with Dates in Excel format and Nav for Manager A and Manager B as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure [Dates, Manager A Nav, Manager B Nav].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck with the fact that Dates are in Excel format and don't know how to fix that stuff. Is the problem set in a correct way?
If all you want to do is to convert the dates to proper dates you can do this easily enough. The thing you need to know is the origin date. Your numbers represent the integer and fractional number of days that have passed since the origin date. Usually this is Jan 0 1990!!! Go figure, but be careful as I don't think this is always the case. You can try this...
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct, but you must also specify the timezone (UTC by default);
data$Date <- as.POSIXct(data$Date, origin = "1899/12/30" )
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877

R some time stamps changing to NA

My Code is reading in a CSV file and converting the time stamp column to the R time format
DF <- read.csv("DF.CSV",head=TRUE,sep=",")
DF[51082,1]
[1] 03/01/2012 19:29
DF[1,1]
[1] 02/24/12 00:29
It reads it in properly and the above 2 rows are displayed as expected
DF$START <- as.POSIXct(strptime(paste(DF$START),format="%m/%d/%y %H:%M"))
DF[1,1]
[1] "2012-02-24 00:29:00 GMT"
DF[51082,1]
[1] NA
After converting them to the R time format using strptime and then displaying them again some of the values have NA and there was no error message displayed or reason for it that I can figure out
You have (at least) two different date formats,
one in %Y (4-digit years), one in %y (2-digit years).
Unless 12 really means 12AD, you need to try both.
DF <- data.frame(
START = c(
"03/01/2012 19:29",
"02/24/12 00:29"
),
stringsAsFactors = FALSE
)
coalesce <- function (x, ...) {
z <- class(x)
for (y in list(...)) {
x <- ifelse(is.na(x), y, x)
}
class(x) <- z
x
}
DF$START <- coalesce(
as.POSIXct(strptime(DF$START, format="%m/%d/%y %H:%M")),
as.POSIXct(strptime(DF$START, format="%m/%d/%Y %H:%M"))
)
# START
# 1 2012-03-01 19:29:00
# 2 2012-02-24 00:29:00
Try to use this:
> DF$START <- as.POSIXct(strptime(paste(DF$START),format="%m/%d/%Y %H:%M"))
This adds year with century.

Resources