Is there a function that will attempt to guess a date from a string? I found lubridate::parse_date_time(), which sounds like it would do the job, but you need to specify the exact format you are expecting. This is fine if all your strings share a format, but not for human-entered data where anything is possible. I am looking for behavior like Excel's, where anything that resembles a date is automatically converted to a date.
For example, c("April 11, 2020", "Apr 11", "4/11/20", "04-11", "April 11, 1 p.m.", "04/11/2020, 1:00pm") should all be 2020-04-11. Do I just have to create an elaborate regex or is there some more intelligent method?
Building on @jpmam1's comment, it looks like you can just use lubridate::parse_date_time with as many orders as you need. If you specify enough of them, it will match almost anything.
mydates <- c("April 11, 2020", "Apr 11", "4/11/20", "04-11", "April 11, 1 p.m.", "04/11/2020, 1:00pm")
lubridate::parse_date_time(mydates, c("mdy","mdY","Bdy","bd","md","Bdh","mdYHM"))
#[1] "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 01:00:00 UTC"
#[6] "2020-04-11 01:00:00 UTC"
It matches yearless dates with 0000, something you could fix afterwards.
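One way to do that fix-up is sketched below. The helper name fill_missing_year and the choice of 2020 as the default year are assumptions made for illustration; they are not part of lubridate.

```r
library(lubridate)

# Hypothetical helper: replace the placeholder year 0 with a chosen default.
fill_missing_year <- function(x, default_year) {
  missing_year <- !is.na(x) & year(x) == 0
  if (any(missing_year)) {
    year(x[missing_year]) <- default_year
  }
  x
}

mydates <- c("April 11, 2020", "Apr 11", "4/11/20", "04-11",
             "April 11, 1 p.m.", "04/11/2020, 1:00pm")
parsed <- parse_date_time(mydates, c("mdy", "mdY", "Bdy", "bd", "md", "Bdh", "mdYHM"))

# Assumption: yearless entries belong to 2020.
fixed <- fill_missing_year(parsed, default_year = 2020)
```

Entries that already carry a real year are left untouched, so the helper can be applied to the whole vector.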
I wanted to know which lubridate function can be used to convert these strings to date format.
Using as_date on the above string gives a warning:
Warning message:
All formats failed to parse. No formats found
However, I am able to convert a string like this: "2020 Apr 10 11:22:23" using the as_datetime function.
With lubridate, it is just the order of day, month, year that matters. If we have multiple formats, use parse_date_time
library(lubridate)
parse_date_time(date1, orders = c('dmy', 'mdy'))
[1] "2020-04-21 UTC" "2020-04-21 UTC"
data
date1 <- c("21 Apr 2020", "April 21, 2020")
This is non-lubridate, but: if you don't know the order (d-m-y vs m-d-y vs y-m-d) in advance, or if it could be mixed within a single vector, you could try the anytime package:
anytime::anydate(c("21 Apr 2020","April 21, 2020"))
## [1] "2020-04-21" "2020-04-21"
(Apparently lubridate::parse_date_time() can handle mixed formats as well: it seems to allow slightly more control of which formats are checked for.)
It was this simple. Thank you guys :)
library(lubridate)
a <- "21 Apr 2020"
day1 <- dmy(a)
b <- "April 21, 2020"
day2 <- mdy(b)
I have several date variables in a data.frame.
They look for example like this:
[1] "10/14/18 17:55:28" "10/15/18 19:27:56"
[3] "11/04/18 15:47:46" "Thu Feb 7 14:51:55 2019"
[5] "Thu Feb 7 17:14:15 2019" "Thu Feb 7 15:46:09 2019"
[7] "Thu Feb 7 11:42:27 2019" "Thu Feb 7 13:24:16 2019"
[9] "Thu Feb 7 18:02:29 2019" "Mon Oct 15 08:48:43 2018"
[11] "10/17/18 17:08:38" "12/08/18 08:08:11"
[13] "10/11/18 21:25:30" "10/14/18 19:15:30"
[15] "10/16/18 11:18:01" "10/16/18 18:19:27"
[17] "Tue Oct 16 19:49:24 2018" "Wed Oct 17 21:36:32 2018"
[19] "Sat Oct 13 11:22:35 2018" "Fri Dec 7 17:12:33 2018"
At the moment this is a character variable. I want to convert it with as.Date so that I can subtract the variables from each other.
I already found this:
as.Date( DATE$Sess1, format = "%m/%d/%y")
I would prefer to keep not only the date but also the time.
The real problem is that they include Apple and Windows format which makes it even more complicated.
I would prefer dplyr solutions ;)
You can use lubridate's parse_date_time and include all the formats that the data could take.
x <- c("10/14/18 17:55:28" , "10/15/18 19:27:56" ,
"11/04/18 15:47:46" , "Thu Feb 7 14:51:55 2019",
"Thu Feb 7 17:14:15 2019", "Thu Feb 7 15:46:09 2019")
lubridate::parse_date_time(x,c('mdyT', 'amdTY'))
#[1] "2018-10-14 17:55:28 UTC" "2018-10-15 19:27:56 UTC" "2018-11-04 15:47:46 UTC"
#[4] "2019-02-07 14:51:55 UTC" "2019-02-07 17:14:15 UTC" "2019-02-07 15:46:09 UTC"
See ?parse_date_time for details on the available formats.
To get the dates, you can wrap as.Date around it.
as.Date(lubridate::parse_date_time(x,c('mdyT', 'amdTY')))
#[1] "2018-10-14" "2018-10-15" "2018-11-04" "2019-02-07" "2019-02-07" "2019-02-07"
For keeping the time, it's best to use a different date format, e.g. POSIXlt or POSIXct. You can also extend the format string to include the time (e.g. format = "%m/%d/%y %H:%M:%S") - see https://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html for more details on these codes.
as.POSIXlt(DATE$Sess1, format = "%m/%d/%y %H:%M:%S")
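Once the values are date-times, they can be subtracted as the question asks. A minimal base R sketch, using two of the strings from the question and assuming UTC as the time zone:

```r
fmt <- "%m/%d/%y %H:%M:%S"

# tz = "UTC" is an assumption; use the zone the data was recorded in.
start <- as.POSIXct("10/14/18 17:55:28", format = fmt, tz = "UTC")
end   <- as.POSIXct("10/15/18 19:27:56", format = fmt, tz = "UTC")

# difftime() makes the unit explicit instead of letting R choose one.
elapsed <- difftime(end, start, units = "hours")
elapsed
```

Plain end - start also works, but then R picks the unit itself, which can differ from row to row of a data.frame.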
As for handling different formats, because the ones you have aren't unambiguous on their own, I suggest having a vector of possible formats, then trying each in turn until one works.
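A rough base R sketch of that idea follows. The function name parse_with_formats is made up for illustration, the two format strings are guesses based on the sample data in the question, and %a/%b assume an English LC_TIME locale.

```r
# Hypothetical helper: try each format in turn, only on elements still NA.
parse_with_formats <- function(x, formats, tz = "UTC") {
  out <- .POSIXct(rep(NA_real_, length(x)), tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = tz))
  }
  out
}

# Assumed formats for the two styles seen in the question.
formats <- c("%m/%d/%y %H:%M:%S", "%a %b %d %H:%M:%S %Y")
x <- c("10/14/18 17:55:28", "Thu Feb 7 14:51:55 2019")
parsed <- parse_with_formats(x, formats)
parsed
```

Strings that match none of the formats stay NA, which makes them easy to find afterwards.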
If you're using the tidyverse, use {lubridate} to reformat. There are two different date/time formats in your example, so you'll need to format them twice.
lubridate::as_datetime(DATE$Sess1, format = "%a %b %e %H:%M:%S %Y")
and then for all the NA results...
lubridate::as_datetime(DATE$Sess1, format = "%m/%d/%y %H:%M:%S")
I'm trying to parse dates (using lubridate functions) from a vector which has mixed date formats.
departureDate <- c("Aug 17, 2020 12:00:00 AM", "Nov 19, 2019 12:00:00 AM", "Dec 21, 2020 12:00:00 AM",
"Dec 24, 2020 12:00:00 AM", "Dec 24, 2020 12:00:00 AM", "Apr 19, 2020 12:00:00 AM", "28/06/2019",
"16/08/2019", "04/02/2019", "10/04/2019", "28/07/2019", "26/07/2019",
"Jun 22, 2020 12:00:00 AM", "Apr 5, 2020 12:00:00 AM", "May 1, 2021 12:00:00 AM")
I didn't notice the mixed formats at first, so I tried to parse with lubridate::mdy_hms(departureDate), which resulted in NA values for the dates whose format differs from the one the parser expects.
As the format may change at arbitrary positions in the vector, I tried the following statement:
departureDate <- tryCatch(mdy_hms(departureDate),
warning = function(w){return(dmy(departureDate))})
This produced even more NAs, because once the warning is raised only the handler's dmy() call is applied, and it is applied to the whole vector. Is there a way to solve this using my approach?
Thanks in advance
We can use lubridate::parse_date_time which can take multiple formats.
lubridate::parse_date_time(departureDate, c('%b %d, %Y %I:%M:%S %p', '%d/%m/%Y'))
#[1] "2020-08-17 UTC" "2019-11-19 UTC" "2020-12-21 UTC" "2020-12-24 UTC"
#[5] "2020-12-24 UTC" "2020-04-19 UTC" "2019-06-28 UTC" "2019-08-16 UTC"
#[9] "2019-02-04 UTC" "2019-04-10 UTC" "2019-07-28 UTC" "2019-07-26 UTC"
#[13] "2020-06-22 UTC" "2020-04-05 UTC" "2021-05-01 UTC"
Since in departureDate month name is in English, you need the locale to be English as well.
Refer to "How to change the locale of R?" if you have a non-English locale.
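As a sketch of the locale point, the snippet below switches LC_TIME to "C" (which uses English month names) just for the duration of the parse and then restores it. It uses base R strptime rather than lubridate, and the wrapper name parse_in_c_locale is an assumption made for illustration.

```r
# Hypothetical wrapper: parse with English month names, then restore the locale.
parse_in_c_locale <- function(x, format, tz = "UTC") {
  old_locale <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old_locale), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  as.POSIXct(strptime(x, format = format, tz = tz))
}

parsed <- parse_in_c_locale("Aug 17, 2020 12:00:00 AM", "%b %d, %Y %I:%M:%S %p")
parsed
```

Restoring the locale in on.exit() keeps the change from leaking into the rest of the session, even if parsing fails.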
The ideal situation is that the code should be able to deal with every format on its own, without letting it fall to an exception.
Another issue to take into account is that the mdy_hms() function returns dates in the POSIXct data type, whereas dmy() returns the Date type, so they wouldn't mix well together.
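The class difference is easy to check directly. A small sketch, assuming an English locale so that "Aug" is recognized:

```r
library(lubridate)

with_time <- mdy_hms("Aug 17, 2020 12:00:00 AM")  # date-time
date_only <- dmy("28/06/2019")                    # calendar date

class(with_time)  # includes "POSIXct"
class(date_only)  # "Date"

# Converting the date-time to Date first lets the two be combined safely.
combined <- c(as_date(with_time), date_only)
combined
```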
The code below applies mdy_hms(), then converts it to Date. It then tests for NA's and applies the second function dmy() on the missing values. More rules can be added in the pipeline at will if more formats are to be recognized.
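library(dplyr)
library(lubridate)
dates.converted <-
  mdy_hms(departureDate, tz = "UTC") %>%         # first format, e.g. "Aug 17, 2020 12:00:00 AM"
  as.Date() %>%
  ifelse(!is.na(.), ., dmy(departureDate)) %>%   # fall back to dd/mm/yyyy where the first parse gave NA
  structure(class = "Date")                      # ifelse() drops the Date class, so restore it
print(dates.converted)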
Output
[1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28" "2019-08-16"
[9] "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05" "2021-05-01"
One method would be to iterate through a list of candidate formats and apply it only to dates not previously parsed correctly.
fmts <- c("%b %d, %Y %I:%M:%S %p", "%d/%m/%Y")  # %I (12-hour clock) pairs with %p; with %H the AM/PM marker is ignored
dates <- rep(Sys.time()[NA], length(departureDate))
for (fmt in fmts) {
  isna <- is.na(dates)
  if (!any(isna)) break
  dates[isna] <- as.POSIXct(departureDate[isna], format = fmt)
}
dates
#  [1] "2020-08-17 PDT" "2019-11-19 PST" "2020-12-21 PST" "2020-12-24 PST"
#  [5] "2020-12-24 PST" "2020-04-19 PDT" "2019-06-28 PDT" "2019-08-16 PDT"
#  [9] "2019-02-04 PST" "2019-04-10 PDT" "2019-07-28 PDT" "2019-07-26 PDT"
# [13] "2020-06-22 PDT" "2020-04-05 PDT" "2021-05-01 PDT"
as.Date(dates)
# [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28"
# [8] "2019-08-16" "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05"
# [15] "2021-05-01"
I encourage you to put the most-likely formats first in the fmts vector.
The way this is set up, as soon as all elements are correctly found, no further formats are attempted (i.e., break).
Edit: if there is a difference in LOCALE where AM/PM are not locally recognized, then one method would be to first remove them from the strings:
departureDate <- gsub("\\s[AP]M$", "", departureDate)
departureDate
# [1] "Aug 17, 2020 12:00:00" "Nov 19, 2019 12:00:00" "Dec 21, 2020 12:00:00"
# [4] "Dec 24, 2020 12:00:00" "Dec 24, 2020 12:00:00" "Apr 19, 2020 12:00:00"
# [7] "28/06/2019" "16/08/2019" "04/02/2019"
# [10] "10/04/2019" "28/07/2019" "26/07/2019"
# [13] "Jun 22, 2020 12:00:00" "Apr 5, 2020 12:00:00" "May 1, 2021 12:00:00"
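and then use a simpler format (with AM/PM stripped, "12:00:00" is read as noon rather than midnight, which is harmless if only the date is needed):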
fmts <- c("%b %d, %Y %H:%M:%S", "%d/%m/%Y")
I have text (news) data and want to extract dates from the text. Dates can be in any format, such as April 10 2018, 10-04-2018 , 10/04/2018, 2018/04/10, 04.10.2018, etc.
An example string would be:
My Friend is coming on july 10 2018 or 10/07/2018
We can extract the candidates with str_extract_all and then convert them with anydate:
library(anytime)
library(stringr)
anydate(str_extract_all(str1, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]])
#[1] "2018-07-10" "2018-10-07"
data
str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
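For purely numeric dates, a base R variant of the same extract-then-parse idea might look like the sketch below. The pattern and the day-first reading of "10/07/2018" are assumptions; note that anydate above read the same string month-first.

```r
str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"

# Assumed pattern: 1-2 digits, separator, 1-2 digits, separator, 4-digit year.
pattern <- "\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}\\b"
hits <- regmatches(str1, gregexpr(pattern, str1, perl = TRUE))[[1]]

# Normalize separators, then parse as day/month/year (an assumption).
normalized <- gsub("[.-]", "/", hits)
numeric_dates <- as.Date(normalized, format = "%d/%m/%Y")
numeric_dates
```

Being explicit about the day/month order avoids silent misreads when both readings are valid dates.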
parsedate works well for these things.
library(parsedate)
dates = c("April 10 2018", "10-04-2018", "10/04/2018", "2018/04/10", "04.10.2018")
parsedate::parse_date(dates)
[1] "2018-04-10 UTC" "2018-10-04 UTC" "2018-10-04 UTC" "2018-04-10 UTC" "2018-10-04 UTC"
parsedate is a nice package, but it fails on the following string, returning a date even though the text contains none:
txt = "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
> parsedate::parse_date(txt)
[1] "2020-03-19 UTC"
I would like to ask a question about converting numeric data into Date format.
I would like to convert numeric data like:
time1<-c(715, 1212, 0416)
to
July-2015, Dec-2012, Apr-2016
I tried this code, but it does not work:
time2<-as.Date(as.character(time1), format="%m%y")
Does anyone have some ideas to solve this issue?
Part of the issue is that "July 2015", "December 2012", and "April 2016" are not dates since the specific day is missing. Another approach is to convert to zoo::yearmon. Here, the numeric input needs to be converted to a string with leading zero so that the month is from 01 to 12:
library(zoo)
ym <- as.yearmon(sprintf("%04d",time1),format="%m%y")
ym
##[1] "Jul 2015" "Dec 2012" "Apr 2016"
The result is of class yearmon, which can then be coerced to Date:
class(ym)
##[1] "yearmon"
d <- as.Date(ym)
d
##[1] "2015-07-01" "2012-12-01" "2016-04-01"
class(d)
##[1] "Date"
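The same two fixes (pad to four digits, supply a day) can also be sketched in base R without zoo. Using the first of the month as the day is an assumption:

```r
time1 <- c(715, 1212, 0416)

# Pad to four digits so the month is always two characters wide.
padded <- sprintf("%04d", time1)

# Prepend day "01" (an assumption) so that as.Date() has a complete date.
d <- as.Date(paste0("01", padded), format = "%d%m%y")
d

# Month-year labels; month abbreviations follow the current locale.
format(d, "%b-%Y")
```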
Try lubridate::parse_date_time():
library(lubridate)
time2 <- parse_date_time(time1, orders = "my")
format.Date(time2, "%b-%Y")
[1] "juil.-2015" "déc.-2012" "avril-2016" # my locale lang is French