Can I apply a function over a vector using base tryCatch? - r

I'm trying to parse dates (using lubridate functions) from a vector which has mixed date formats.
departureDate <- c("Aug 17, 2020 12:00:00 AM", "Nov 19, 2019 12:00:00 AM", "Dec 21, 2020 12:00:00 AM",
"Dec 24, 2020 12:00:00 AM", "Dec 24, 2020 12:00:00 AM", "Apr 19, 2020 12:00:00 AM", "28/06/2019",
"16/08/2019", "04/02/2019", "10/04/2019", "28/07/2019", "26/07/2019",
"Jun 22, 2020 12:00:00 AM", "Apr 5, 2020 12:00:00 AM", "May 1, 2021 12:00:00 AM")
Not noticing this at first, I tried lubridate::mdy_hms(departureDate), which returned NA for every date whose format differs from the one the parser expects.
Because the format can change at arbitrary positions in the vector, I tried the following statement:
departureDate <- tryCatch(mdy_hms(departureDate),
warning = function(w){return(dmy(departureDate))})
This produced even more NAs, because only the call in the warning handler was applied to the whole vector. Is there a way to solve this using my approach?
Thanks in advance

We can use lubridate::parse_date_time which can take multiple formats.
lubridate::parse_date_time(departureDate, c('%b %d, %Y %I:%M:%S %p', '%d/%m/%Y'))
#[1] "2020-08-17 UTC" "2019-11-19 UTC" "2020-12-21 UTC" "2020-12-24 UTC"
#[5] "2020-12-24 UTC" "2020-04-19 UTC" "2019-06-28 UTC" "2019-08-16 UTC"
#[9] "2019-02-04 UTC" "2019-04-10 UTC" "2019-07-28 UTC" "2019-07-26 UTC"
#[13] "2020-06-22 UTC" "2020-04-05 UTC" "2021-05-01 UTC"
Since the month names in departureDate are in English, the locale needs to be English as well.
See "How to change the locale of R?" if you have a non-English locale.

The ideal situation is that the code should be able to deal with every format on its own, without letting it fall to an exception.
Another issue to take into account is that the mdy_hms() function returns dates in the POSIXct data type, whereas dmy() returns the Date type, so they wouldn't mix well together.
The code below applies mdy_hms(), then converts it to Date. It then tests for NA's and applies the second function dmy() on the missing values. More rules can be added in the pipeline at will if more formats are to be recognized.
library(lubridate)  # mdy_hms(), dmy()
library(dplyr)      # %>%
dates.converted <-
mdy_hms(departureDate, tz = ) %>%
as.Date() %>%
ifelse(!is.na(.), ., dmy(departureDate)) %>%
structure(class = "Date")
print(dates.converted)
Output
[1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28" "2019-08-16"
[9] "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05" "2021-05-01"
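The same "fill in whatever is still NA" idea can also be written with dplyr::coalesce(), which keeps the Date class and so avoids the ifelse()/structure() round trip. This is only a sketch: the three-element vector is a hypothetical subset of departureDate, quiet = TRUE is used to silence the expected parse warnings, and an English locale is assumed for the month names.

```r
library(lubridate)
library(dplyr)

# Hypothetical subset of departureDate, for illustration only
departureDate <- c("Aug 17, 2020 12:00:00 AM", "28/06/2019", "Apr 5, 2020 12:00:00 AM")

# coalesce() takes, element by element, the first non-NA value
dates.converted <- coalesce(
  as.Date(mdy_hms(departureDate, quiet = TRUE)),  # first attempt, converted to Date
  dmy(departureDate, quiet = TRUE)                # fallback for the remaining NAs
)
print(dates.converted)
```

Further fallbacks can be appended as extra arguments to coalesce(), as long as each one returns a Date vector of the same length.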

One method would be to iterate through a list of candidate formats and apply it only to dates not previously parsed correctly.
fmts <- c("%b %d, %Y %H:%M:%S %p", "%d/%m/%Y")
dates <- rep(Sys.time()[NA], length(departureDate))
for (fmt in fmts) {
isna <- is.na(dates)
if (!any(isna)) break
dates[isna] <- as.POSIXct(departureDate[isna], format = fmt)
}
dates
# [1] "2020-08-17 12:00:00 PDT" "2019-11-19 12:00:00 PST" "2020-12-21 12:00:00 PST"
# [4] "2020-12-24 12:00:00 PST" "2020-12-24 12:00:00 PST" "2020-04-19 12:00:00 PDT"
# [7] "2019-06-28 00:00:00 PDT" "2019-08-16 00:00:00 PDT" "2019-02-04 00:00:00 PST"
# [10] "2019-04-10 00:00:00 PDT" "2019-07-28 00:00:00 PDT" "2019-07-26 00:00:00 PDT"
# [13] "2020-06-22 12:00:00 PDT" "2020-04-05 12:00:00 PDT" "2021-05-01 12:00:00 PDT"
as.Date(dates)
# [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28"
# [8] "2019-08-16" "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05"
# [15] "2021-05-01"
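I encourage you to put the most likely formats first in the fmts vector: as soon as every element has been parsed, the loop stops (break) and no further formats are attempted.
Note that %p is only honored together with %I, not with %H. With %H, as above, "12:00:00 AM" is read as noon, which is what the output shows; use %I instead if the time of day matters.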
Edit: if AM/PM is not recognized in your locale, one option is to first remove it from the strings:
departureDate <- gsub("\\s[AP]M$", "", departureDate)
departureDate
# [1] "Aug 17, 2020 12:00:00" "Nov 19, 2019 12:00:00" "Dec 21, 2020 12:00:00"
# [4] "Dec 24, 2020 12:00:00" "Dec 24, 2020 12:00:00" "Apr 19, 2020 12:00:00"
# [7] "28/06/2019" "16/08/2019" "04/02/2019"
# [10] "10/04/2019" "28/07/2019" "26/07/2019"
# [13] "Jun 22, 2020 12:00:00" "Apr 5, 2020 12:00:00" "May 1, 2021 12:00:00"
and then use a simpler format:
fmts <- c("%b %d, %Y %H:%M:%S", "%d/%m/%Y")
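As for the original question: tryCatch() guards the whole expression, so the first warning discards the result of mdy_hms() and the handler's value replaces the entire vector. To get a per-element fallback, tryCatch() has to be called once per element. The sketch below shows that shape in base R. The sample vector and the helpers parse_or_warn() and parse_one() are hypothetical, and numeric formats are used so that the result does not depend on the locale.

```r
# Hypothetical sample with two numeric formats (locale-independent)
x <- c("2019-06-28", "16/08/2019", "2020-12-24")

# Base as.Date() returns NA silently on a format mismatch, so this
# illustrative wrapper raises a warning instead, as mdy_hms() does
parse_or_warn <- function(s, fmt) {
  out <- as.Date(s, format = fmt)
  if (is.na(out)) warning("could not parse: ", s)
  out
}

# tryCatch() now guards a single element; the handler re-parses only that element
parse_one <- function(s) {
  tryCatch(
    parse_or_warn(s, "%Y-%m-%d"),
    warning = function(w) parse_or_warn(s, "%d/%m/%Y")
  )
}

# vapply() returns plain numbers, so the Date class is restored afterwards
parsed <- as.Date(
  vapply(x, function(s) as.numeric(parse_one(s)), numeric(1), USE.NAMES = FALSE),
  origin = "1970-01-01"
)
parsed
```

This works, but it calls the parser once per element, so the vectorized approaches above are likely to be faster on long vectors.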

Related

Is there a way to deal with this date format in R?

I have a data frame that has the date column as a char class. I've tried parsing with as.Date but the number of NAs is worrisome. The dates are in the following formats: "2003-10-19" and "October 05, 2018"
date <- c("October 05, 2018", "2003-10-19")
as.Date(date)
This is what I tried, but most of my results came back as NA.
Here is an option:
date <- c("October 05, 2018", "2003-10-19", "10/9/95", "6 Oct.2010")
lubridate::parse_date_time(date, orders = c("mdy", "ymd", "dmy"))
#> [1] "2018-10-05 UTC" "2003-10-19 UTC" "1995-10-09 UTC" "2010-10-06 UTC"
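as.Date has a tryFormats argument. It is not vectorized over elements (the format that fits the first non-NA element is applied to the whole vector), but it can be applied to each element with e.g. lapply.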
date <- c("October 05, 2018", "2003-10-19", "02/04/20", "11/09/2002",
"14.05.2021", "Nov 1, 2022", "March 1, 2004")
lapply(date, as.Date, tryFormats=c("%Y-%m-%d", "%B %d, %Y", "%d/%m/%y",
"%m/%d/%Y", "%d.%m.%Y", "%b %d, %Y"))
[[1]]
[1] "2018-10-05"
[[2]]
[1] "2003-10-19"
[[3]]
[1] "2020-04-02"
[[4]]
[1] "2020-09-11"
[[5]]
[1] "2021-05-14"
[[6]]
[1] "2022-11-01"
[[7]]
[1] "2004-03-01"
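If a Date vector is preferred over a list, the per-element results can be combined with do.call(c, ...). The sketch below uses a hypothetical, locale-independent subset of the formats. Note from the output above that "11/09/2002" came back as 2020-09-11: the two-digit "%d/%m/%y" format matched the start of the four-digit year. A format list therefore needs care when two-digit and four-digit year formats are both present.

```r
# Hypothetical input restricted to numeric formats with four-digit years
x    <- c("2003-10-19", "11/09/2002", "14.05.2021")
fmts <- c("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# lapply() tries the formats on each element separately;
# do.call(c, ...) then combines the list of Dates into one Date vector
parsed <- do.call(c, lapply(x, as.Date, tryFormats = fmts))
parsed
```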

Convert any string to date in R

Is there some function that will attempt to guess a date from a string? I found lubridate::parse_date_time(), which sounds like it would do the job, but you need to specify the exact format you are expecting. This is fine if all your strings are in a similar format, but not if it's human-entered data where anything is possible. I am looking for behavior like Excel, where anything that resembles a date is automatically converted to a date.
For example, c("April 11, 2020", "Apr 11", "4/11/20", "04-11", "April 11, 1 p.m.", "04/11/2020, 1:00pm") should all be 2020-04-11. Do I just have to create an elaborate regex or is there some more intelligent method?
Building on @jpmam1's comment, it looks like you can just use lubridate::parse_date_time with as many patterns as you like. If you specify enough of them, it will match most inputs you are likely to see.
mydates <- c("April 11, 2020", "Apr 11", "4/11/20", "04-11", "April 11, 1 p.m.", "04/11/2020, 1:00pm")
parse_date_time(mydates,c("mdy","mdY","Bdy","bd","md","Bdh","mdYHM"))
#[1] "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 01:00:00 UTC"
#[6] "2020-04-11 01:00:00 UTC"
It matches yearless dates with 0000, something you could fix afterwards.
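One way to do that afterwards is with lubridate's year() accessor and its replacement form. This is a sketch only: it relies on the yearless entries coming back with year 0 as shown above, and the choice of 2020 as the intended year is an assumption taken from the question.

```r
library(lubridate)

mydates <- c("April 11, 2020", "Apr 11", "4/11/20", "04-11",
             "April 11, 1 p.m.", "04/11/2020, 1:00pm")
parsed <- parse_date_time(mydates, c("mdy", "mdY", "Bdy", "bd", "md", "Bdh", "mdYHM"))

# Assumption: entries parsed without a year were meant to fall in 2020
no_year <- which(year(parsed) == 0)   # which() also drops any NA results
year(parsed[no_year]) <- 2020
parsed
```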

Extract POSIXct information from large vector

I have a large POSIXct vector v2 with 438000 elements, created as follows:
t.start <- as.POSIXct("2016-08-16 15:00:00 CEST")
v1 <- seq(from = t.start, length.out = 2920, by = "3 hours")
v2 <- rep(v1, each = 150)
From v2, I would like to extract the 12 elements that - for the first time they appear - contain the first day of each month. Specifically, I look for:
The numeric position in v2 these 12 elements have
The actual date of these elements in %d %b format, e.g. "01 Sep"
These two things should be extracted separately, i.e. stored in two different vectors afterwards. I think v1 and v2 contain daylight saving POSIXct elements but that should not affect the general operation. Any hint on how I can bypass the daylight savings would be a nice little add-on!
Any idea on how to do that?
We can start by extracting the day number from each element with format(v2, "%d"). Then, to determine where the first days of the month are, we can compare that to "01". Then we can take the diff() of that logical vector, remembering to concatenate 0L out front to account for the missing first element. Wrap that in which(), and you have the indices of the first element of each first day.
w <- which(c(0L, diff(format(v2, "%d") == "01")) == 1L)
w
# [1] 18451 54451 91651 127801 165001 202201 235801 272851
# [9] 308851 346051 382051 419251
Now w holds the locations of the 12 elements we need. Let's take a look at those elements of v2, just to confirm we've got it right.
v2[w]
# [1] "2016-09-01 00:00:00 PDT" "2016-10-01 00:00:00 PDT"
# [3] "2016-11-01 00:00:00 PDT" "2016-12-01 02:00:00 PST"
# [5] "2017-01-01 02:00:00 PST" "2017-02-01 02:00:00 PST"
# [7] "2017-03-01 02:00:00 PST" "2017-04-01 00:00:00 PDT"
# [9] "2017-05-01 00:00:00 PDT" "2017-06-01 00:00:00 PDT"
# [11] "2017-07-01 00:00:00 PDT" "2017-08-01 00:00:00 PDT"
Looks good. Note the 2am entries: v1 advances in fixed 3-hour steps, so the wall-clock times shift by an hour once Daylight Saving Time ends and shift back when it resumes. That is fine here. Now let's get to your desired format ...
format(v2[w], "%d %b")
# [1] "01 Sep" "01 Oct" "01 Nov" "01 Dec" "01 Jan" "01 Feb"
# [7] "01 Mar" "01 Apr" "01 May" "01 Jun" "01 Jul" "01 Aug"
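On the daylight-saving add-on: one option, sketched below, is to build the sequence in a time zone without DST. Using tz = "UTC" is an assumption about what is acceptable for the data. With it, every first-of-month element falls at 00:00 and the same which()/diff() step applies unchanged.

```r
# Same construction as in the question, but anchored in UTC (no DST shifts)
t.start <- as.POSIXct("2016-08-16 15:00:00", tz = "UTC")
v1 <- seq(from = t.start, length.out = 2920, by = "3 hours")
v2 <- rep(v1, each = 150)

# TRUE wherever the element falls on the first day of a month
is_first <- format(v2, "%d") == "01"

# diff() is 1 exactly where is_first switches from FALSE to TRUE
w <- which(c(0L, diff(is_first)) == 1L)

positions <- w                       # numeric positions in v2
labels    <- format(v2[w], "%d %b")  # e.g. "01 Sep" in an English locale
unique(format(v2[w], "%H:%M"))       # a single wall-clock time, no 2am entries
```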

Formatting time with strptime when some times are missing and convert AM/PM to 24 hour format

I have a timestamp vector like
time_stamp <- c("7/1/2013", "7/1/2013 12:00:30 AM", "7/1/2013 12:01:00 AM", "7/1/2013 12:01:30 AM", "8/1/2013","8/1/2013 11:02:30 PM")
I want to format this to date class. I tried
strptime(time_stamp, format = "%d/%m/%Y %H:%M:%S", tz = "GMT")
but since two timestamps have missing times it results in NAs, which should be substituted by default: 12:00:00.
I can run a loop such as:
for (i in 1:length(time_stamp))
{
if(nchar(time_stamp[i])<11)
{
time_stamp[i] <- paste(time_stamp[i], " 12:00:00 AM")
}
}
time_stamp <- format(strptime(time_stamp, format = "%d/%m/%Y %I:%M:%S %p", tz = "GMT"), "%d/%m/%Y %H:%M:%S", tz = "GMT")
Is there a faster and cleaner way to accomplish this? The vector is a part of large dataset so I don't want to loop over it.
lubridate::parse_date_time can take multiple token orders, with or without the %:
lubridate::parse_date_time(time_stamp, orders = c("dmy IMS p", "dmy"))
## [1] "2013-01-07 00:00:00 UTC" "2013-01-07 00:00:30 UTC" "2013-01-07 00:01:00 UTC"
## [4] "2013-01-07 00:01:30 UTC" "2013-01-08 00:00:00 UTC" "2013-01-08 23:02:30 UTC"
Or use its truncated parameter:
lubridate::parse_date_time(time_stamp, orders = 'dmy IMS p', truncated = 4)
which returns the same thing.
Or use a bit of regex replacement and then process as normal:
as.POSIXct(sub("(\\d{4}$)", "\\1 00:00:00", time_stamp),
format = "%d/%m/%Y %H:%M:%S", tz = "GMT")
#[1] "2013-01-07 00:00:00 GMT" "2013-01-07 12:00:30 GMT" "2013-01-07 12:01:00 GMT"
#[4] "2013-01-07 12:01:30 GMT" "2013-01-08 00:00:00 GMT" "2013-01-08 11:02:30 GMT"
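Note that the "%H" format above ignores the AM/PM marker, so "11:02:30 PM" comes back as 11:02:30. If the 24-hour conversion matters, a variant of the same regex idea is to pad the date-only entries with midnight in 12-hour notation and parse with %I and %p. This is a sketch that assumes a locale in which AM/PM is recognized.

```r
time_stamp <- c("7/1/2013", "7/1/2013 12:00:30 AM", "7/1/2013 12:01:00 AM",
                "7/1/2013 12:01:30 AM", "8/1/2013", "8/1/2013 11:02:30 PM")

# Entries that end in the four-digit year have no time part: append midnight
padded <- sub("(\\d{4})$", "\\1 12:00:00 AM", time_stamp)

# %I with %p reads the 12-hour clock; %H on output writes the 24-hour clock
parsed <- as.POSIXct(padded, format = "%d/%m/%Y %I:%M:%S %p", tz = "GMT")
format(parsed, "%d/%m/%Y %H:%M:%S")
```

Both sub() and as.POSIXct() are vectorized, so no explicit loop over the vector is needed.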

How do I convert the below string as date-time in R?

I have the following character vector that contains date/time. I want to convert them to a date format and I tried the below methods : as.Date() and as.POSIXct()
time <- c("Oct 01,2015 15:38:31 ", "Oct 05,2015 11:07:14", "Oct 11,2015 14:15:51 ", "Oct 11,2015 14:19:53 ", "Oct 12,2015 11:23:28", "Oct 19,2015 16:32:51 ")
#as.Date() is skipping the time part
time_1<-as.Date(time,"%b %d,%Y %H:%M:%S")
time_1
[1] "2015-10-01" "2015-10-05" "2015-10-11" "2015-10-11" "2015-10-12" "2015-10-19"
#POSIXct is showing an error
time_2<-as.POSIXlt(time,"%b %d,%Y %H:%M:%S")
The as.Date() function is skipping the time part and POSIX is throwing an error (which is obvious).
How do I convert the above string as a proper date+time format?
We can specify the format in both as.POSIXlt and as.POSIXct
as.POSIXlt(time,format="%b %d,%Y %H:%M:%S")
#[1] "2015-10-01 15:38:31 IST" "2015-10-05 11:07:14 IST"
#[3] "2015-10-11 14:15:51 IST" "2015-10-11 14:19:53 IST"
#[5] "2015-10-12 11:23:28 IST" "2015-10-19 16:32:51 IST"
If the format argument is not named, the string "%b %d,%Y %H:%M:%S" is matched by position to the second argument, which is the time zone (tz), as the usage in ?as.POSIXct shows:
as.POSIXct(x, tz = "", ...)
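A minimal sketch of the named-argument call, on a hypothetical two-element subset of the vector. The explicit tz = "UTC" is an assumption added here to make the result reproducible, and an English locale is assumed for "%b".

```r
# Two of the strings from the question (one with a trailing space)
time <- c("Oct 01,2015 15:38:31 ", "Oct 05,2015 11:07:14")

# Naming both arguments avoids the format string being matched to tz
parsed <- as.POSIXct(time, format = "%b %d,%Y %H:%M:%S", tz = "UTC")

# POSIXct keeps the time of day, unlike as.Date()
format(parsed, "%Y-%m-%d %H:%M:%S")
```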
