I have text (news) data and want to extract dates from the text. Dates can be in any format, such as April 10 2018, 10-04-2018 , 10/04/2018, 2018/04/10, 04.10.2018, etc.
An example string would be:
My Friend is coming on july 10 2018 or 10/07/2018
We can extract the date-like substrings with str_extract_all and then convert them with anydate:
library(anytime)
library(stringr)
anydate(str_extract_all(str1, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]])
#[1] "2018-07-10" "2018-10-07"
data
str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
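If you would rather not depend on a guessing parser, here is a minimal base R sketch of the same idea for purely numeric dates. The helper name extract_numeric_dates is hypothetical, and the sketch assumes that ambiguous strings such as 10/07/2018 are day-first; swap the format string if your data is month-first.

```r
# Hypothetical helper: pull numeric dates out of free text and parse them.
# Assumption: dd/mm/yyyy order for ambiguous strings (day first).
extract_numeric_dates <- function(text) {
  pattern <- paste0(
    "\\b([0-9]{1,2}[-/.][0-9]{1,2}[-/.][0-9]{4}",   # 10/07/2018, 10-07-2018, 10.07.2018
    "|[0-9]{4}[-/.][0-9]{1,2}[-/.][0-9]{1,2})\\b"   # 2018/07/10
  )
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # Use one separator so that two format strings cover every layout
  norm <- gsub("[-/.]", "-", hits)
  year_first <- grepl("^[0-9]{4}-", norm)
  parsed <- as.Date(norm, format = "%d-%m-%Y")
  parsed[year_first] <- as.Date(norm[year_first], format = "%Y-%m-%d")
  parsed
}

str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
extract_numeric_dates(str1)
```

Note that this only picks up the numeric date; the "july 10 2018" part still needs a parser that understands month names.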
parsedate works well for these things.
library(parsedate)
dates = c("April 10 2018", "10-04-2018", "10/04/2018", "2018/04/10", "04.10.2018")
parsedate::parse_date(dates)
[1] "2018-04-10 UTC" "2018-10-04 UTC" "2018-10-04 UTC" "2018-04-10 UTC" "2018-10-04 UTC"
parsedate is a nice package, but it returns a spurious date for the following string, which contains no complete date:
txt = "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
> parsedate::parse_date(txt)
[1] "2020-03-19 UTC"
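One way to avoid false positives like this is to check that the text contains something date-shaped before handing it to a permissive parser. Below is a rough base R sketch; has_explicit_date is a hypothetical helper, and both patterns are assumptions about what counts as a date in your data (the numeric pattern would also accept things like version numbers).

```r
# Hypothetical guard: TRUE only when the text contains something shaped like a date.
has_explicit_date <- function(x) {
  # e.g. 10/04/2018, 2018-04-10, 04.10.18
  numeric_date <- "[0-9]{1,4}[-/.][0-9]{1,2}[-/.][0-9]{2,4}"
  # e.g. "April 10 2018", "apr. 10, 2018" (English month names only)
  months <- paste(tolower(month.abb), collapse = "|")
  named_date <- paste0("(", months, ")[a-z]*\\.? +[0-9]{1,2},? +[0-9]{4}")
  grepl(numeric_date, x) | grepl(named_date, x, ignore.case = TRUE)
}

txt <- "composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
has_explicit_date(c(txt, "April 10 2018", "10/04/2018"))
```

Only the strings for which this returns TRUE would then be passed on to parse_date.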
Related
I am trying to use the parsedate package to convert several different datetime formats into one uniform format. The issue is that some dates are in English (my machine's language) and some are in Spanish. Allow me to illustrate:
I have two vectors:
#English dates
dates<-c("2016 jun 15 8:39 p.m","2016 apr 2 8:39 a.m","2016 dec 2 8:39 a.m")
#Spanish dates
fechas<-c("2016 junio 15 8:39 p.m","2016 abril 2 8:39 a.m","2016 diciembre 2 8:39 a.m")
I noticed that parse_date() correctly converts the vector dates into the desired output format, but it does not parse the vector of Spanish dates correctly, even after changing the LC_TIME locale to "Spanish", as shown below:
#Parsing english dates
parsedate::parse_date(dates)
> parsedate::parse_date(dates)
[1] "2016-06-15 08:39:00 UTC" "2016-04-02 08:39:00 UTC" "2016-12-02 08:39:00 UTC"
#Parsing spanish dates
Sys.setlocale("LC_TIME", "Spanish")
parsedate::parse_date(fechas)
> Sys.setlocale("LC_TIME", "Spanish")
[1] "Spanish_Spain.1252"
> parsedate::parse_date(fechas)
[1] "2016-01-15 08:39:00 UTC" "2016-01-02 08:39:00 UTC" "2016-01-02 08:39:00 UTC"
The Spanish output is wrong: it should match the output for the English dates, but every month comes back as January. I have tried several ways to change my machine's locale to Spanish, with no luck.
I will be very thankful if you can help me.
See here https://github.com/tidyverse/lubridate/issues/781
Sys.setlocale("LC_TIME", "Spanish_Spain.1252")
format <- "%a#%A#%b#%B#%p#"
enc2utf8(unique(format(lubridate:::.date_template, format = format)))
str(lubridate:::.get_locale_regs("Spanish_Spain.1252"))
library(lubridate)
Sys.getlocale("LC_TIME")
[1] "Spanish_Spain.1252"
parse_date_time(fechas, 'ymd HM')
[1] "2016-06-15 08:39:00 UTC" "2016-04-02 08:39:00 UTC" "2016-12-02 08:39:00 UTC"
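If changing the locale is not an option, a locale-independent alternative is to replace the Spanish month names with month numbers before parsing. This is only a sketch in base R: the lookup vector meses and the helper spanish_month_to_number are hypothetical names, full (unabbreviated) month names are assumed, and the a.m./p.m. marker is ignored, just as in the outputs shown above.

```r
# Full Spanish month names, in calendar order (assumption: no abbreviations)
meses <- c("enero", "febrero", "marzo", "abril", "mayo", "junio", "julio",
           "agosto", "septiembre", "octubre", "noviembre", "diciembre")

# Hypothetical helper: swap each Spanish month name for its two-digit number
spanish_month_to_number <- function(x) {
  for (i in seq_along(meses)) {
    x <- sub(paste0("\\b", meses[i], "\\b"), sprintf("%02d", i), x,
             ignore.case = TRUE, perl = TRUE)
  }
  x
}

fechas <- c("2016 junio 15 8:39 p.m", "2016 abril 2 8:39 a.m",
            "2016 diciembre 2 8:39 a.m")
# Trailing text after the minutes (the a.m./p.m. marker) is not parsed
parsed <- as.POSIXct(spanish_month_to_number(fechas),
                     format = "%Y %m %d %H:%M", tz = "UTC")
parsed
```

Because the months are numeric by the time they reach as.POSIXct, the result does not depend on LC_TIME.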
Hello guys, I hope everyone is having a good one. I am trying to work with some AM/PM formats in lubridate in R, but I can't seem to come up with a proper solution. I hope you can correct me and help me out, please!
I have a HUGE dataset whose date-times come in a very unusual format, which goes as follows:
First a number for the day; then the month, either abbreviated or fully spelled out; then a 12-hour time; and finally the string "a. m." or "p. m.", possibly with extra spaces or missing dots, such as "a. m". As an example, take a look at this vector:
dates<-c("02 dec 05:47 a. m",
"7 November 09:47 p. m.",
"3 jul 12:28 a.m.",
"23 sept 08:53 a m.",
"7 may 09:05 PM")
These make up more than 95% of the unusual datetime formats in the data set. I have been trying to parse them with lubridate, using the function
ydm_hm(paste(2021,dates))
because all dates are from 2021, but I always get:
[1] NA NA NA
[4] NA "2021-05-07 21:05:00 UTC"
Warning message:
4 failed to parse.
The 4 that fail to parse give me NAs, and the only one that parses is correct. I notice that this one has PM in uppercase letters without dots, but most of the time my formats will look like this:
ydm_hm("7 may 09:05 p.m.")
and this gives me NA...
So I feel the only way to get these dates to work is to change their structure with a regex, converting every "a. m." / "p. m." variant into plain "AM" or "PM". After analyzing the data I realized that all "p.m" or "a. m." strings come ONE or TWO spaces after the 12-hour time, which always has a length of 5 characters. So the regex pattern has to account for the following:
the string begins with one or two digits, then spaces, then letters (the month, abbreviated or fully spelled out), then spaces, then 5 characters (the 12-hour time), and finally letters, spaces and dots covering all possible a.m. and p.m. variants. I have tried to convert the structure of the dates with no luck. If you could help me I would be so thankful. I don't know whether there is another way, or another package in R, that would solve this without regex, so thank you everyone for your help!
My desired output would be:
"2021-12-02 05:47:00 UTC"
"2021-11-07 09:47:00 UTC"
"2021-07-03 12:28:00 UTC"
"2021-09-23 08:53:00 UTC"
"2021-05-07 21:05:00 UTC"
In this case, parse_date from parsedate works
library(parsedate)
parse_date(paste(2021, dates))
-output
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 09:47:00 UTC"
[3] "2021-07-03 12:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
Or, if the second value should be PM, use str_remove_all to strip the dots and spaces between the letters so that the marker becomes plain am/pm:
library(stringr)
parse_date(paste(2021, str_remove_all(dates,
"(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 21:47:00 UTC"
[3] "2021-07-03 00:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
With ydm_hm, the issue is that the am/pm markers contain dots and spaces in varying positions (one has a space but no dot), and these may not get parsed. We can normalise the format by removing those dots and spaces:
library(lubridate)
library(stringr)
ydm_hm(paste(2021, str_remove_all(dates,
"(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 21:47:00 UTC"
[3] "2021-07-03 00:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
Since you raised the issue of regular expression, I thought I might try one way to do that
library(stringr)
# get boolean for pm dates
pm = str_detect(dates,"(?<=\\d\\d:\\d\\d\\s{1,2})[pP]")
# convert dates to dates without am/pm
dates = str_extract(dates,"^.*:\\d\\d")
# add pm back to pm dates and am to am dates
dates[pm] <- paste(dates[pm], "PM")
dates[!pm] <- paste(dates[!pm], "AM")
# now your original approach works
ydm_hm(paste(2021,dates))
Output
[1] "2021-12-02 05:47:00 UTC" "2021-11-07 21:47:00 UTC" "2021-07-03 00:28:00 UTC" "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
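For completeness, the same clean-up can be done in one step with base R. This is just a sketch: normalise_meridiem is a hypothetical helper name, and it assumes the a.m./p.m. marker is always the last thing in the string, as in the sample vector.

```r
# Hypothetical helper: rewrite any trailing "a. m", "p. m.", "a m.", "PM", ...
# as a plain " AM" or " PM".
normalise_meridiem <- function(x) {
  # \\U upper-cases everything that follows it in the replacement (needs perl = TRUE)
  sub("\\s*([AaPp])[. ]*[Mm]\\.?\\s*$", " \\U\\1M", x, perl = TRUE)
}

dates <- c("02 dec 05:47 a. m",
           "7 November 09:47 p. m.",
           "3 jul 12:28 a.m.",
           "23 sept 08:53 a m.",
           "7 may 09:05 PM")
normalise_meridiem(dates)
```

The cleaned strings can then be pasted together with the year and passed to the parser of your choice, as in the answers above.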
I have time series data at 10-minute intervals. When I try to convert it to a date-time type using df$Time1 <- dmy_hm(df$Time, tz="Asia/Calcutta"), it returns NA at the midnight (00:00) interval. I have tried both df$Time1 <- dmy_hm(df$Time, tz="Asia/Calcutta") and df$Time1 = as.POSIXct(df$Time, format="%d-%m-%y %H:%M"). Please guide me on this; I am clueless about what is happening at 02-07-16 00:00.
One option would be using parse_date_time from lubridate which can take multiple formats
library(lubridate)
parse_date_time(df$Time, c('dmy_HM', 'dmy'))
#[1] "2016-07-01 23:30:00 UTC" "2016-07-01 23:40:00 UTC"
#[3] "2016-07-01 23:50:00 UTC" "2016-07-02 00:00:00 UTC"
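Judging by the sample data, the midnight row arrives as a bare date with no time part, which is why a single date-time format returns NA for it. A base R alternative, sketched here under that assumption, is to append "00:00" to the date-only strings before parsing. The variable names are made up for the example, and tz = "UTC" is used to match the output above.

```r
time_strings <- c("01-07-16 23:30", "01-07-16 23:40", "01-07-16 23:50", "02-07-16")

# Rows that do not end in HH:MM are assumed to be midnight
date_only <- !grepl("[0-9]{1,2}:[0-9]{2}$", time_strings)
time_strings[date_only] <- paste(time_strings[date_only], "00:00")

parsed <- as.POSIXct(time_strings, format = "%d-%m-%y %H:%M", tz = "UTC")
parsed
```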
data
df <- data.frame(Time = c("01-07-16 23:30", "01-07-16 23:40", "01-07-16 23:50",
"02-07-16"))
I am working with a messy Excel file with multiple date formats:
2016-10-17T12:38:41Z
Mon Oct 17 08:03:08 GMT 2016
10-Sep-15
13-Oct-09
18-Oct-2016 05:42:26 UTC
I want to convert all of the above to yyyy-mm-dd format. I am using the following code for the conversion, but a lot of values come back as NA.
as.Date(parse_date_time(df$date,c('mdy', 'ymd_hms','a b d HMS y','d b y HMS')))
How can I convert all of them together? I have read other threads on similar cases, but nothing seems to work for mine.
Please help
If I add 'dmy' to the list then at least all of the cases in your example are successfully parsed:
z <- c("2016-10-17T12:38:41Z", "Mon Oct 17 08:03:08 GMT 2016",
"10-Sep-15", "13-Oct-09", "18-Oct-2016 05:42:26 UTC")
library(lubridate)
parse_date_time(z,c('mdy', 'dmy', 'ymd_HMS','a b d HMS y','d b y HMS'))
## [1] "2016-10-17 12:38:41 UTC" "2016-10-17 08:03:08 UTC"
## [3] "2015-09-10 00:00:00 UTC" "2009-10-13 00:00:00 UTC"
## [5] "2016-10-18 05:42:26 UTC"
Your big problem will be the third and fourth elements: are these actually meant to be 'ymd' and 'dmy' respectively? I'm not sure how any logic will let you auto-detect these differences ... out of context, "15 Sep 2010" and "10 September 2015" both seem perfectly reasonable possibilities ...
For what it's worth I also tried the new anytime package - it only handled the first and last element.
Removing the times first makes it possible to specify only three alternatives in orders to parse the sample data in the question. This interprets 10-Sep-15 and 13-Oct-09 as dmy but if you want them interpreted as ymd then uncomment the commented out line:
orders <- c("dmy", "mdy", "ymd")
# orders <- c("ymd", "dmy", "mdy")
as.Date(parse_date_time(gsub("..:..:..", " ", x), orders = orders))
giving:
[1] "2016-10-17" "2016-10-17" "2015-09-10" "2009-10-13" "2016-10-18"
or if the commented out line is uncommented then:
[1] "2016-10-17" "2016-10-17" "2010-09-15" "2013-10-09" "2016-10-18"
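To make the ambiguity concrete, here is a small base R sketch that builds both readings of the two-digit strings by hand. It avoids locale-dependent month parsing by looking the month up in month.abb, and it assumes every two-digit year belongs to the 2000s; the variable names are only illustrative.

```r
ambiguous <- c("10-Sep-15", "13-Oct-09")

parts <- strsplit(ambiguous, "-", fixed = TRUE)
first <- as.integer(vapply(parts, `[`, "", 1))
mon   <- match(tolower(vapply(parts, `[`, "", 2)), tolower(month.abb))
last  <- as.integer(vapply(parts, `[`, "", 3))

# Reading 1: day-month-year (assumption: years are 20xx)
as_dmy <- as.Date(ISOdate(2000 + last, mon, first))
# Reading 2: year-month-day (same assumption)
as_ymd <- as.Date(ISOdate(2000 + first, mon, last))

data.frame(ambiguous, as_dmy, as_ymd)
```

Both columns are valid dates, so only knowledge of where the data came from can tell you which reading is right.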
Note: The input is:
x <- c("2016-10-17T12:38:41Z ", "Mon Oct 17 08:03:08 GMT 2016", "10-Sep-15",
"13-Oct-09", "18-Oct-2016 05:42:26 UTC")
I am trying to do a simple operation in R. After loading a table I encountered a date column that combines several formats.
**Date**
1/28/14 6:43 PM
1/29/14 4:10 PM
1/30/14 12:09 PM
1/30/14 12:12 PM
02-03-14 19:49
02-03-14 20:03
02-05-14 14:33
I need to convert this to a format like 28-01-2014 18:43, i.e. %d-%m-%Y %H:%M
I tried this
tablename$Date <- as.Date(as.character(tablename$Date), "%d-%m-%y %h:%m")
but this fills the entire column with NA. Please help me get this right!
The lubridate package makes quick work of this:
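library(lubridate)
# sample values from the question
dates <- c("1/28/14 6:43 PM", "1/29/14 4:10 PM", "1/30/14 12:09 PM", "1/30/14 12:12 PM",
"02-03-14 19:49", "02-03-14 20:03", "02-05-14 14:33")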
d <- parse_date_time(dates, names(guess_formats(dates, c("mdy HM", "mdy IMp"))))
d
## [1] "2014-01-28 18:43:00 UTC" "2014-01-29 16:10:00 UTC"
## [3] "2014-01-30 12:09:00 UTC" "2014-01-30 12:12:00 UTC"
## [5] "2014-02-03 19:49:00 UTC" "2014-02-03 20:03:00 UTC"
## [7] "2014-02-05 14:33:00 UTC"
# put in desired format
format(d, "%m-%d-%Y %H:%M:%S")
## [1] "01-28-2014 18:43:00" "01-29-2014 16:10:00" "01-30-2014 12:09:00"
## [4] "01-30-2014 12:12:00" "02-03-2014 19:49:00" "02-03-2014 20:03:00"
## [7] "02-05-2014 14:33:00"
You'll need to adjust the vector in guess_formats if you come across other format variations.
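If you want the day-first layout the question asked for (28-01-2014 18:43), only the format string changes. As a cross-check, here is a base R sketch that handles both layouts without lubridate. parse_mixed is a hypothetical helper; it assumes both layouts are month-first, as in the answer above, and it handles the AM/PM marker by hand so that it does not depend on the locale.

```r
# Hypothetical helper for the two layouts in the question:
# "m/d/yy h:mm AM|PM" and "mm-dd-yy HH:MM" (assumption: both are month-first)
parse_mixed <- function(x) {
  is_pm <- grepl("PM$", x)
  is_am <- grepl("AM$", x)
  # Drop the AM/PM marker and unify separators so one format string fits all
  core <- gsub("/", "-", sub(" *[AP]M$", "", x))
  out <- as.POSIXct(core, format = "%m-%d-%y %H:%M", tz = "UTC")
  hour <- as.integer(format(out, "%H"))
  # 12-hour clock corrections: 1-11 PM -> +12 h, 12 AM -> -12 h
  pm_shift <- which(is_pm & hour < 12)
  am_shift <- which(is_am & hour == 12)
  out[pm_shift] <- out[pm_shift] + 12 * 3600
  out[am_shift] <- out[am_shift] - 12 * 3600
  out
}

dates <- c("1/28/14 6:43 PM", "1/30/14 12:09 PM", "02-03-14 19:49")
format(parse_mixed(dates), "%d-%m-%Y %H:%M")
```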