R converting datetime format in foreign language not working

I am trying to use the package parsedate to parse several different datetime formats into one uniform format. The issue is that some dates will be in English (my machine's language) and some will be in Spanish. Allow me to illustrate:
I have two vectors:
#English dates
dates<-c("2016 jun 15 8:39 p.m","2016 apr 2 8:39 a.m","2016 dec 2 8:39 a.m")
#Spanish dates
fechas<-c("2016 junio 15 8:39 p.m","2016 abril 2 8:39 a.m","2016 diciembre 2 8:39 a.m")
I noticed that the function parse_date() correctly converts the vector dates into the desired output format, but it does not work on the vector with Spanish dates, even after changing the LC_TIME locale to "Spanish", as shown below:
#Parsing english dates
parsedate::parse_date(dates)
> parsedate::parse_date(dates)
[1] "2016-06-15 08:39:00 UTC" "2016-04-02 08:39:00 UTC" "2016-12-02 08:39:00 UTC"
#Parsing spanish dates
Sys.setlocale("LC_TIME", "Spanish")
parsedate::parse_date(fechas)
> Sys.setlocale("LC_TIME", "Spanish")
[1] "Spanish_Spain.1252"
> parsedate::parse_date(fechas)
[1] "2016-01-15 08:39:00 UTC" "2016-01-02 08:39:00 UTC" "2016-01-02 08:39:00 UTC"
The Spanish output is wrong: it should match the output for the English dates. I have tried several ways to change my machine's time locale to Spanish, with no luck.
I will be very thankful if you can help me.

See here https://github.com/tidyverse/lubridate/issues/781
Sys.setlocale("LC_TIME", "Spanish_Spain.1252")
format <- "%a#%A#%b#%B#%p#"
enc2utf8(unique(format(lubridate:::.date_template, format = format)))
str(lubridate:::.get_locale_regs("Spanish_Spain.1252"))
library(lubridate)
Sys.getlocale("LC_TIME")
[1] "Spanish_Spain.1252"
parse_date_time(fechas, 'ymd HM')
[1] "2016-06-15 08:39:00 UTC" "2016-04-02 08:39:00 UTC" "2016-12-02 08:39:00 UTC"
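If changing the locale is not an option, one locale-independent workaround is to translate the Spanish month names to numbers before parsing. What follows is only a minimal base R sketch, not what parsedate or lubridate do internally. The lookup table meses and the helper parse_fechas are hypothetical names made up for illustration, and the sketch reads "p.m" as afternoon.

```r
# Hypothetical lookup table: Spanish month names -> month numbers
meses <- c(enero = 1, febrero = 2, marzo = 3, abril = 4, mayo = 5, junio = 6,
           julio = 7, agosto = 8, septiembre = 9, octubre = 10,
           noviembre = 11, diciembre = 12)

# Hypothetical helper: parse "YYYY <mes> D H:MM a.m/p.m" without relying on LC_TIME
parse_fechas <- function(x) {
  # Remember which entries are p.m., then strip the a.m./p.m. suffix
  es_pm <- grepl("p\\.?\\s*m\\.?\\s*$", x, ignore.case = TRUE)
  x <- sub("\\s*[ap]\\.?\\s*m\\.?\\s*$", "", x, ignore.case = TRUE)
  # Replace each month name with its two-digit number
  for (m in names(meses)) {
    x <- sub(paste0("\\b", m, "\\b"), sprintf("%02d", meses[[m]]), x,
             ignore.case = TRUE)
  }
  # Parse the purely numeric string as a 24-hour time
  out <- as.POSIXct(x, format = "%Y %m %d %H:%M", tz = "UTC")
  # Convert the 12-hour clock to 24 hours by hand
  hora <- as.integer(format(out, "%H"))
  ajuste <- ifelse(es_pm & hora < 12, 12, ifelse(!es_pm & hora == 12, -12, 0))
  out + ajuste * 3600
}

fechas <- c("2016 junio 15 8:39 p.m", "2016 abril 2 8:39 a.m",
            "2016 diciembre 2 8:39 a.m")
resultado <- parse_fechas(fechas)
format(resultado, "%Y-%m-%d %H:%M")
# "2016-06-15 20:39" "2016-04-02 08:39" "2016-12-02 08:39"
```

Handling the a.m./p.m. marker by hand avoids %p, whose accepted strings also depend on the locale.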

Related

Working with difficult AM/PM formats and REGEX with lubridate in R

Hello guys, I hope everyone is having a good one. I am trying to work with some AM/PM formats in lubridate in R, but I can't seem to come up with a proper solution. I hope you can correct me and help me out, please!
I have a HUGE dataset that stores date-times in a very unusual way. The format goes as follows:
First a number that represents the day; second an abbreviation of the month OR the month fully spelled out; then a 12-hour time; and finally the string "a. m." or "p. m.", possibly with extra spaces or missing dots, such as "a. m". As an example, please take a look at this vector:
dates<-c("02 dec 05:47 a. m",
"7 November 09:47 p. m.",
"3 jul 12:28 a.m.",
"23 sept 08:53 a m.",
"7 may 09:05 PM")
These make up more than 95% of the unusual datetime formats in the data set. I have been trying to use lubridate, specifically the function
ydm_hm(paste(2021,dates))
because all dates are from 2021, but I always get:
[1] NA NA NA
[4] NA "2021-05-07 21:05:00 UTC"
Warning message:
4 failed to parse.
The 4 that fail to parse give me NAs, and the only one that parses is correct. I notice that this one has AM/PM in uppercase letters without dots, but most of the time my formats will look like this:
ydm_hm("7 may 09:05 p.m.")
and this gives me NA...
So I feel as though the only way to get these dates to work is to change their structure with a regex, converting every "a. m." or "p. m." variant into plain "AM" or "PM". After analyzing the data I realized that all "p.m" or "a. m." strings come ONE or TWO spaces after the 12-hour time, which always has a length of 5 characters. So the pattern for the regex should consider the following:
the string begins with one or two digits, then spaces, then letters (the month, abbreviated or fully spelled out), then spaces, then 5 characters (the 12-hour time), and finally letters, spaces and dots covering all possible a.m and p.m formats. I have tried to convert the structure of the dates this way, with no luck. If you could help me I would be so freaking thankful. I don't know whether there is another way or another package in R that would resolve this without using regex, so thank you everyone for your help!
My desired output is:
"2021-12-02 05:47:00 UTC"
"2021-11-07 09:47:00 UTC"
"2021-07-03 12:28:00 UTC"
"2021-09-23 08:53:00 UTC"
"2021-05-07 21:05:00 UTC"

In this case, parse_date from parsedate works
library(parsedate)
parse_date(paste(2021, dates))
-output
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 09:47:00 UTC"
[3] "2021-07-03 12:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
Or, if the second value should be PM, use str_remove_all to strip the dots and spaces inside the a.m./p.m. marker
library(stringr)
parse_date(paste(2021, str_remove_all(dates,
"(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 21:47:00 UTC"
[3] "2021-07-03 00:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
With ydm_hm, the issue is that some of the am/pm markers contain spaces or lack dots, so they may not get parsed. We can normalize them by removing the dots and spaces
library(lubridate)
library(stringr)
ydm_hm(paste(2021, str_remove_all(dates,
"(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 21:47:00 UTC"
[3] "2021-07-03 00:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
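To see what the regular expression above does to each string, the same pattern can be applied with base R's gsub() (perl = TRUE is needed for the lookarounds). This is just an illustrative sketch; the variable name cleaned is made up here. The pattern deletes any run of dots and spaces that sits between two letters, so "a. m", "a.m" and "a m" all collapse to "am", while a trailing dot is left alone.

```r
dates <- c("02 dec 05:47 a. m",
           "7 November 09:47 p. m.",
           "3 jul 12:28 a.m.",
           "23 sept 08:53 a m.",
           "7 may 09:05 PM")

# Remove dots/spaces only when they have a letter on both sides
cleaned <- gsub("(?<=[A-Za-z])[. ]+(?=[A-Za-z])", "", dates, perl = TRUE)
print(cleaned)
# "02 dec 05:47 am"
# "7 November 09:47 pm."
# "3 jul 12:28 am."
# "23 sept 08:53 am."
# "7 may 09:05 PM"
```

The space between the month and the time survives because it is followed by a digit, not a letter.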

Since you raised the issue of regular expressions, here is one way to do it with them
library(stringr)
library(lubridate)
# flag the entries whose time is followed by "p" or "P" (the pm dates)
pm <- str_detect(dates, "(?<=\\d\\d:\\d\\d\\s{1,2})[pP]")
# keep everything up to the time, dropping the am/pm marker
dates <- str_extract(dates, "^.*:\\d\\d")
# add PM back to the pm dates and AM to the am dates
dates[pm] <- paste(dates[pm], "PM")
dates[!pm] <- paste(dates[!pm], "AM")
# now your original approach works
ydm_hm(paste(2021, dates))
Output
[1] "2021-12-02 05:47:00 UTC" "2021-11-07 21:47:00 UTC" "2021-07-03 00:28:00 UTC" "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"

Converting date into timestamp in R using strptime() function

I've read a txt file into R as a CSV. As I understand, R will not recognise the strings as timestamps automatically, so I’ll need to convert them from text values using the strptime() function.
Here's an input of my text file:
29/1/12 19:48
30/1/12 21:07
2/2/12 15:53
3/4/12 0:49
5/10/12 2:00
24/10/12 17:11
14/11/12 3:49
11/8/13 16:00
12/7/14 17:00
31/7/14 8:08
31/7/14 10:48
6/8/14 9:24
16/12/14 3:34
24/1/15 19:37
16/6/15 15:55
16/6/15 19:56
18/6/15 1:24
25/6/15 17:20
26/6/15 18:28
1/7/15 15:58
1/7/15 18:05
2/7/15 18:20
2/7/15 18:59
I have tried:
y <- strptime(timestamp, "%d/%m/%y %H:%M")
but keep on getting NA.
Can someone help me to solve this? Thank you.

# Make some sample data
timestamp <- c('29/1/12 19:48','30/1/12 21:07','2/2/12 15:53')
# Check that you are dealing with characters, not factors
str(timestamp)
#chr [1:3] "29/1/12 19:48" "30/1/12 21:07" "2/2/12 15:53"
# Best to specify the time zone when using strptime
strptime(timestamp, "%d/%m/%y %H:%M", tz = 'UTC')
#[1] "2012-01-29 19:48:00 UTC" "2012-01-30 21:07:00 UTC" "2012-02-02 15:53:00 UTC"
# The lubridate package is very useful for working with dates/times
library(lubridate)
lubridate::dmy_hm(timestamp)
#[1] "2012-01-29 19:48:00 UTC" "2012-01-30 21:07:00 UTC" "2012-02-02 15:53:00 UTC"
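When the values come from a file, it can help to run the whole path end to end and list any rows that fail to parse, so the offending strings can be inspected. A minimal base R sketch follows; the inline raw_text stands in for the real file, and the column name timestamp and the variable bad are assumptions made for illustration.

```r
# A few rows from the question, standing in for the real text file
raw_text <- "29/1/12 19:48
30/1/12 21:07
2/2/12 15:53
3/4/12 0:49"

# Read as character (not factor) and trim stray whitespace
df <- read.csv(text = raw_text, header = FALSE, col.names = "timestamp",
               stringsAsFactors = FALSE, strip.white = TRUE)

# Parse day/month/two-digit-year with a 24-hour time
df$parsed <- as.POSIXct(df$timestamp, format = "%d/%m/%y %H:%M", tz = "UTC")

# Any string that did not match the format shows up here
bad <- df$timestamp[is.na(df$parsed)]
print(bad)
print(format(df$parsed, "%Y-%m-%d %H:%M"))
```

If bad is not empty, the strings in it show which variations the format string still needs to cover.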

Extract Dates in any format from Text in R

I have text (news) data and want to extract dates from the text. Dates can be in any format, such as April 10 2018, 10-04-2018 , 10/04/2018, 2018/04/10, 04.10.2018, etc.
An example string would be:
My Friend is coming on july 10 2018 or 10/07/2018

We can extract the date strings with str_extract_all and then convert them to dates with anydate
library(anytime)
library(stringr)
anydate(str_extract_all(str1, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]])
#[1] "2018-07-10" "2018-10-07"
data
str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
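The extraction step can also be checked on its own with base R, which makes it easier to see which substrings the pattern picks up before any date conversion happens. This sketch reuses the same regular expression with regmatches() and gregexpr(); the variable name found is made up for illustration.

```r
str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"

# Same pattern as above: a word or number, then a 2-digit and a 4-digit
# number, each optionally separated by spaces or slashes
pattern <- "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}"
found <- regmatches(str1, gregexpr(pattern, str1))[[1]]
print(found)
# "july 10 2018" "10/07/2018"
```

Note that this pattern only allows spaces and slashes as separators, so formats such as 10-04-2018 or 04.10.2018 would need the character class extended.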

parsedate works well for these things.
library(parsedate)
dates = c("April 10 2018", "10-04-2018", "10/04/2018", "2018/04/10", "04.10.2018")
parsedate::parse_date(dates)
[1] "2018-04-10 UTC" "2018-10-04 UTC" "2018-10-04 UTC" "2018-04-10 UTC" "2018-10-04 UTC"

parsedate is a nice package, but it fails with the following string:
txt = "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
> parsedate::parse_date(txt)
[1] "2020-03-19 UTC"

Best way to deal with differing date data [duplicate]

I am trying to do some simple operations in R. After loading a table, I encountered a date column that combines several formats.
**Date**
1/28/14 6:43 PM
1/29/14 4:10 PM
1/30/14 12:09 PM
1/30/14 12:12 PM
02-03-14 19:49
02-03-14 20:03
02-05-14 14:33
I need to convert this to format like 28-01-2014 18:43 i.e. %d-%m-%y %h:%m
I tried this
tablename$Date <- as.Date(as.character(tablename$Date), "%d-%m-%y %h:%m")
but this fills the entire column with NA. Please help me to get this right!

The lubridate package makes quick work of this:
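library(lubridate)
# dates is assumed to hold the Date column as character strings
dates <- as.character(tablename$Date)
d <- parse_date_time(dates, names(guess_formats(dates, c("mdy HM", "mdy IMp"))))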
d
## [1] "2014-01-28 18:43:00 UTC" "2014-01-29 16:10:00 UTC"
## [3] "2014-01-30 12:09:00 UTC" "2014-01-30 12:12:00 UTC"
## [5] "2014-02-03 19:49:00 UTC" "2014-02-03 20:03:00 UTC"
## [7] "2014-02-05 14:33:00 UTC"
# put in desired format
format(d, "%m-%d-%Y %H:%M:%S")
## [1] "01-28-2014 18:43:00" "01-29-2014 16:10:00" "01-30-2014 12:09:00"
## [4] "01-30-2014 12:12:00" "02-03-2014 19:49:00" "02-03-2014 20:03:00"
## [7] "02-05-2014 14:33:00"
You'll need to adjust the vector in guess_formats if you come across other format variations.
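The same idea can be sketched in base R without lubridate: try each candidate format in turn and only fill in the entries that are still NA. This is a rough illustration, not a replacement for guess_formats; the formats vector is an assumption based on the two layouts shown in the question, and %p relies on the locale recognising "AM"/"PM".

```r
dates <- c("1/28/14 6:43 PM", "1/30/14 12:09 PM", "02-03-14 19:49")

# Candidate formats, tried in order (assumed from the sample data)
formats <- c("%m/%d/%y %I:%M %p", "%m-%d-%y %H:%M")

# Start with all-NA date-times, then fill in whatever each format can parse
parsed <- as.POSIXct(rep(NA_real_, length(dates)),
                     origin = "1970-01-01", tz = "UTC")
for (f in formats) {
  todo <- is.na(parsed)
  parsed[todo] <- as.POSIXct(dates[todo], format = f, tz = "UTC")
}

# Day-month-year layout, as requested in the question
print(format(parsed, "%d-%m-%Y %H:%M"))
# "28-01-2014 18:43" "30-01-2014 12:09" "03-02-2014 19:49"
```

Anything still NA after the loop matches none of the listed formats and points to a variation that needs its own entry.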

Changing dates in different time zones by adding to POSIXlt

I am running into an error when I try to localize times for "date" (a variable of class=POSIXlt) in my dataset. Example code is as follows:
# All dates are coded by survey software in EST(not local time)
date <- c("2011-07-26 07:23", "2011-07-29 07:34", "2011-07-29 07:40")
region <-c("USA-EST", "UK", "Singapore")
#Change the times based on time-zone differences
start_time<-strptime(date,"%Y-%m-%d %h:%m")
localtime=as.POSIXlt(start_time)
localtime<-ifelse(region=="UK",start_time+6,start_time)
localtime<-ifelse(region=="Singapore",start_time+12,start_time)
#Then, I need to extract the hour and weekday
weekday<-weekdays(localtime)
hour<-factor(localtime)
There must be something wrong with my "ifelse" statement, because I get the error: number of items to replace is not a multiple of replacement length. Please help!

How about using R's native time code? The trick is that you can't have more than one time-zone in a POSIX vector, so use a list instead:
region <- c("EST","Europe/London","Asia/Singapore")
(localtime <- lapply(seq(date),function(x) as.POSIXlt(date[x],tz=region[x])))
[[1]]
[1] "2011-07-26 07:23:00 EST"
[[2]]
[1] "2011-07-29 07:34:00 Europe/London"
[[3]]
[1] "2011-07-29 07:40:00 Asia/Singapore"
And to convert to a vector in a single timezone:
Reduce("c",localtime)
[1] "2011-07-26 13:23:00 BST" "2011-07-29 07:34:00 BST"
[3] "2011-07-29 00:40:00 BST"
Note that my system timezone is BST, but if yours is EST it will convert to that.

You can use the timezone handling built into POSIXct:
> start_time <- as.POSIXct(date, format = "%Y-%m-%d %H:%M", tz = "America/New_York")
> start_time
[1] "2011-07-26 07:23:00 EDT" "2011-07-29 07:34:00 EDT" "2011-07-29 07:40:00 EDT"
> format(start_time, tz="Europe/London", usetz=TRUE)
[1] "2011-07-26 12:23:00 BST" "2011-07-29 12:34:00 BST" "2011-07-29 12:40:00 BST"
> format(start_time, tz="Asia/Singapore", usetz=TRUE)
[1] "2011-07-26 19:23:00 SGT" "2011-07-29 19:34:00 SGT" "2011-07-29 19:40:00 SGT"
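Since the question ultimately needs the local hour and weekday per row, the two answers can be combined: store the instants once in the survey's time zone and format each one in its own region's zone. A minimal sketch, assuming the region labels map to the Olson names in the hypothetical zone_for table, and using %u (ISO weekday number, Monday = 1) so the result does not depend on the locale:

```r
date   <- c("2011-07-26 07:23", "2011-07-29 07:34", "2011-07-29 07:40")
region <- c("USA-EST", "UK", "Singapore")

# Hypothetical mapping from the survey's region labels to time zone names
zone_for <- c("USA-EST" = "America/New_York",
              "UK" = "Europe/London",
              "Singapore" = "Asia/Singapore")
zones <- unname(zone_for[region])

# All timestamps were recorded in US Eastern time
start_time <- as.POSIXct(date, format = "%Y-%m-%d %H:%M",
                         tz = "America/New_York")

# Format each instant in its own zone and pull out hour and weekday
local_hour <- vapply(seq_along(start_time), function(i)
  as.integer(format(start_time[i], "%H", tz = zones[i])), integer(1))
local_wday <- vapply(seq_along(start_time), function(i)
  as.integer(format(start_time[i], "%u", tz = zones[i])), integer(1))

print(local_hour)  # 7 12 19
print(local_wday)  # 2 5 5
```

Working per element sidesteps the restriction that a single POSIXct vector carries only one time zone attribute.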
