R: Regex to string detect date-time format in R

What would be a solution to detect the date-time format
14/07/2009 19:15:29
Is this a foolproof solution?
str_detect(s,regex("([0-9]{2}/[0-9]{2}/[0-9]{4}) [0-9]{2}:[0-9]{2}:[0-9]{2}"))
For example for the format
14.07.2009
I have written the regex to be
str_detect(date,regex("([0-9]{2}\\.[0-9]{2}\\.[0-9]{4})"))
I don't know much about regex in R or regex in general, just the very basics, so I would appreciate an easy approach with the logic explained in detail. Thanks in advance.

As a beginner, I sometimes found it helpful to assemble the pattern as follows:
c(
"[0-9]{2}", # day
"/",
"[0-9]{2}", # month
"/",
"[0-9]{4}", # year
" ",
"[0-9]{2}", # Hour
":",
"[0-9]{2}", # minute
":",
"[0-9]{2}" # second
) |> paste(collapse = "")
Returns the pattern:
[1] "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
stringr::str_detect("14/07/2009 19:15:29",
"[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}")
# [1] TRUE
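One caveat with the pattern as written: without anchors it matches anywhere inside a longer string, so a value with extra characters around the date-time still returns TRUE. A minimal sketch in base R (grepl() instead of str_detect(), and the sample strings are made up for illustration) showing the difference once ^ and $ are added:

```r
# Illustrative inputs (made up for this sketch)
s <- c("14/07/2009 19:15:29", "x14/07/2009 19:15:29y", "14/07/2009")

pattern <- "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Unanchored: TRUE whenever the pattern occurs anywhere in the string
unanchored <- grepl(pattern, s)

# Anchored: TRUE only when the whole string has the expected shape
anchored <- grepl(paste0("^", pattern, "$"), s)

print(unanchored)
print(anchored)
```

Note that even the anchored version only checks the shape: something like 99/99/2009 99:99:99 would still pass, so it is not a validity check.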
Update (as per comments)
Here is how you could use the lubridate package. dmy_hms() parses date-times in your format:
lubridate::dmy_hms("14/07/2009 19:15:29")
# [1] "2009-07-14 19:15:29 UTC"
But it will not parse invalid dates:
lubridate::dmy_hms("14/07/2009 19:15:70") # invalid seconds
# [1] NA
So to validate you could do:
(! is.na(lubridate::dmy_hms("14/07/2009 19:15:29")))
# [1] TRUE

Related

Is this an encoding issue?

I downloaded a text file that contains basically two columns—a date column and a contents column.
The date column was initially in the format: mm/dd/yy h:mm:ss am/pm. For example, one such date would be 10/16/2018 8:10:10 PM
I wanted to isolate the plain date. I split the text column using strsplit(), so now I have a vector with dates in the common mm/dd/yy format. I want to convert this using the as.Date(x, format = '%m/%d/%y') command.
I notice, however, that I get a good chunk of my character vector coming out as NA. I compared the NA values to the values surrounding it. Here is what I see:
normal_vector[1:3]
[1] "10/12/17" "‎10/12/17" "10/12/17"
The middle one (normal_vector[2]) is the problem one.
as.Date(normal_vector[1:3], format = "%m/%d/%y")
[1] "2017-10-12" NA "2017-10-12"
Could this be an encoding issue? I tried using as.Date(iconv(normal_vector[1:3], to = "UTF-8"), format = "%m/%d/%y"), but it does not appear to help. Furthermore, if I inspect the encoding of the character vector as it is, I get the following:
Encoding(normal_vector[1:3])
[1] "unknown" "UTF-8" "unknown"
Again, I just want to convert all three of these elements into a normal date object in R. They appear identical, and the encoding would have me think that a "UTF-8" character would be easily handled by an as.Date() function. What are some possible reasons that it refuses to be converted to a date?
Thanks!
There is indeed a strange invisible character at the start of your second string.
Look at the hex dump: the bytes e2 80 8e are the UTF-8 encoding of U+200E (LEFT-TO-RIGHT MARK).
fread from the data.table package can read this text just fine:
data.table::fread("./temp.csv", header = FALSE)
# V1 V2 V3
# 1: 10/12/17 ‎10/12/17 10/12/17
After reading, you can clean your data with a regex:
dt <- data.table::fread("./temp.csv", header = FALSE)
# V1 V2 V3
# 1: 10/12/17 ‎10/12/17 10/12/17
# strip every character that is not 0-9, a-z, A-Z or '/'
cleaned <- as.character( gsub( "[^0-9a-zA-Z/]", "", as.matrix( dt ) ) )
as.Date( cleaned, format = "%m/%d/%y" )
# [1] "2017-10-12" "2017-10-12" "2017-10-12"
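If you want to confirm what the hidden character is and remove only that one, something along these lines might help. It is a sketch that rebuilds the vector from the question by hand (the "\u200e" escape stands in for the invisible character) rather than reading the file:

```r
# Rebuild the example: the second element starts with U+200E (LEFT-TO-RIGHT MARK)
x <- c("10/12/17", "\u200e10/12/17", "10/12/17")

# The extra character shows up in the character count
n_chars <- nchar(x)

# The first three bytes of the problem string are the UTF-8 encoding of U+200E
first_bytes <- as.character(charToRaw(x[2])[1:3])

# Remove only that character, then convert
cleaned <- gsub("\u200e", "", x, fixed = TRUE)
dates <- as.Date(cleaned, format = "%m/%d/%y")

print(n_chars)
print(first_bytes)
print(dates)
```

Removing a single known character is more conservative than a whitelist, but it will not catch other invisible characters, so which one to prefer depends on how messy the file is.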

convert Single digit day in R

I have dates in the format Apr42016, Aug12017, Apr112018. I am trying to convert them to Y/m/d using R. I have tried the code below, but when the day is a single digit it returns NA. Could anyone help me, please?
strptime(data$date, "%b%e%Y")
as.Date (data$date, format="%b%d%Y")
as.POSIXct(data$date, format="%b%e%Y")
Thank you
You can modify the strings with sub (and add a 0 if necessary) before using as.Date:
myvec <- c("Apr42016", "Aug12017", "Apr112018") # the data
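# insert a 0 after the month letters only when exactly five digits remain
# (one-digit day + four-digit year); the $ keeps two-digit days untouched
myvec2 <- sub("(?<=[A-Za-z])(?=[0-9]{5}$)", "0", myvec, perl = TRUE)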
# [1] "Apr042016" "Aug012017" "Apr112018"
as.Date(myvec2, format = "%b%d%Y")
# [1] "2016-04-04" "2017-08-01" "2018-04-11"
If you can break up the numbers before as.Date, it will make things much easier. (Borrowing Sven's look-behind.)
sub("(?<=\\D)(\\d+)(\\d{4})$", "-\\1-\\2",
c("Apr42016", "Aug12017", "Apr112018"), perl=TRUE)
# [1] "Apr-4-2016" "Aug-1-2017" "Apr-11-2018"
From here, the format should be rather straight-forward:
as.Date(sub("(?<=\\D)(\\d+)(\\d{4})$", "-\\1-\\2", c("Apr42016", "Aug12017", "Apr112018"), perl = TRUE),
format="%b-%d-%Y")
# [1] "2016-04-04" "2017-08-01" "2018-04-11"
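Both answers above rely on %b, which matches month abbreviations according to the current locale. If that is a concern, a locale-independent variant could split the string into its three parts and look the month up in the built-in month.abb vector. This is only a sketch; the variable names are mine:

```r
x <- c("Apr42016", "Aug12017", "Apr112018")

# Capture: three letters, a one- or two-digit day, a four-digit year
parts <- regmatches(x, regexec("^([A-Za-z]{3})([0-9]{1,2})([0-9]{4})$", x))

# Element 1 of each match is the full string; 2 to 4 are the captured groups
mon  <- match(vapply(parts, function(p) p[2], character(1)), month.abb)
day  <- as.integer(vapply(parts, function(p) p[3], character(1)))
year <- as.integer(vapply(parts, function(p) p[4], character(1)))

# Build ISO strings so as.Date() needs no format argument
dates <- as.Date(sprintf("%04d-%02d-%02d", year, mon, day))
print(dates)
```

The trailing [0-9]{4}$ forces the last four digits to be the year, so whatever digits are left in front are taken as the day.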

How to insert "-" (hyphen symbol) within a character vector in R?

I have a vector
a <- "20160402"
I want to insert a "-" symbol in positions 5 and 8.
The result should look like this
"2016-04-02"
I was trying to use ins(a, "-", pos = c(5,8))
But this has not worked. Can anyone please help me.
Thank you
We can easily do the conversion with lubridate
library(lubridate)
ymd(a)
#[1] "2016-04-02"   (current lubridate returns a Date; older versions printed "2016-04-02 UTC")
Or use the correct format with as.Date
as.Date(a, '%Y%m%d')
#[1] "2016-04-02"
If we are looking for a regex solution, capture the characters as a group and use the backreference as the replacement
sub('(.{4})(.{2})(.{2})', '\\1-\\2-\\3', a)
#[1] "2016-04-02"
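For completeness, the same result can be had without regex or date classes by cutting the string at fixed positions with substr(). A small sketch, assuming every element is exactly eight characters long (the second value is made up to show that it is vectorised):

```r
a <- c("20160402", "19991231")

# Characters 1-4 are the year, 5-6 the month, 7-8 the day
with_hyphens <- paste(substr(a, 1, 4), substr(a, 5, 6), substr(a, 7, 8), sep = "-")
print(with_hyphens)
```

Unlike as.Date() or ymd(), this does not check that the result is a real date; it only rearranges characters.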

Convert date format to appropriate one?

I have a column filled with dates in the format 09nov1992 and want to convert them to the format 1992-Nov-09.
Any help would be appreciated.
Here's a simple way:
vec <- "09nov1992"
format(as.Date(vec, "%d%b%Y"), "%Y-%b-%d")
# [1] "1992-Nov-09"
An alternative version using regular expressions:
sub("(\\d+)(\\w)(\\w+?)(\\d+)", "\\4-\\U\\2\\L\\3-\\1", vec, perl = TRUE)
# [1] "1992-Nov-09"
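In the regex version, \\U and \\L switch the following replacement text to upper and lower case (this needs perl = TRUE). If that feels too cryptic, the same rearrangement can be spelled out with substr(), toupper() and tolower(). A sketch assuming a fixed ddmonyyyy layout; the second input is made up:

```r
vec <- c("09nov1992", "23MAR2004")

day  <- substr(vec, 1, 2)
mon  <- substr(vec, 3, 5)
year <- substr(vec, 6, 9)

# Capitalise the first letter of the month and lower-case the rest
mon_title <- paste0(toupper(substr(mon, 1, 1)), tolower(substr(mon, 2, 3)))

reformatted <- paste(year, mon_title, day, sep = "-")
print(reformatted)
```

This does not depend on the locale the way %b does, but it also does no validation of the month name.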

How can you insert a colon every two characters?

I have a column of time values, except that they are in character format and do not have the colons to separate H, M, S. The column looks similar to the following:
Time
024201
054722
213024
205022
205024
125440
I want to convert all the values in the column to look like actual time values in the format H:M:S. The values are already in HMS format, so it is simply a matter of inserting colons, but that is proving more difficult than I thought. I found a package that adds commas every three digits from the right to make Strings look like currency values, but nothing for time (without also adding a date value, which I do not want to do). Any help would be appreciated.
Since the data is time related, you should consider storing it in a POSIX format:
> df <- data.frame(Time=c("024201", "054722", "213024", "205022", "205024", "125440"))
> df$Time <- as.POSIXct(df$Time, format="%H%M%S")
> df
Time
1 2014-01-05 02:42:01
2 2014-01-05 05:47:22
3 2014-01-05 21:30:24
4 2014-01-05 20:50:22
5 2014-01-05 20:50:24
6 2014-01-05 12:54:40
To output just the times:
> format(df, "%H:%M:%S")
Time
1 02:42:01
2 05:47:22
3 21:30:24
4 20:50:22
5 20:50:24
6 12:54:40
A regular expression with lookaround works for this:
gsub('(..)(?=.)', '\\1:', x$Time, perl=TRUE)
The (?=.) means a character (matched by .) must follow, but is not considered part of the match (and is not captured).
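To see why the lookahead matters, compare it with the same pattern without it. A quick sketch on two of the values from the question:

```r
x <- c("024201", "125440")

# Without the lookahead every pair gets a colon, including the last one
no_lookahead <- gsub("(..)", "\\1:", x)

# With the lookahead the last pair is skipped, because nothing follows it
with_lookahead <- gsub("(..)(?=.)", "\\1:", x, perl = TRUE)

print(no_lookahead)
print(with_lookahead)
```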
Here is a regex solution:
x <- readLines(n=6)
024201
054722
213024
205022
205024
125440
gsub("(\\d\\d)(\\d\\d)(\\d\\d)", "\\1:\\2:\\3", x)
## [1] "02:42:01" "05:47:22" "21:30:24"
## [4] "20:50:22" "20:50:24" "12:54:40"
Here each (\\d\\d) says we're looking for 2 digits. The parentheses break the string into 3 parts. Then \\1: says take chunk 1 and place a colon after it.
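The pure regex approach will happily format anything with six digits, including values that are not real times. If that matters, a range check could be added before inserting the colons. A sketch with made-up inputs (one valid value, one out of range, one too short):

```r
x <- c("024201", "256199", "12544")

# Hours 00-23, minutes 00-59, seconds 00-59, and nothing else in the string
valid <- grepl("^([01][0-9]|2[0-3])[0-5][0-9][0-5][0-9]$", x)

# Insert colons only where the value is a plausible time, NA otherwise
formatted <- ifelse(valid, sub("^(..)(..)(..)$", "\\1:\\2:\\3", x), NA_character_)
print(formatted)
```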
Or via date/times classes:
time <- c("024201", "054722", "213024", "205022", "205024", "125440")
time <- as.POSIXct(paste("1970-01-01", time), format="%Y-%m-%d %H%M%S")
(time <- format(time, "%H:%M:%S"))
# [1] "02:42:01" "05:47:22" "21:30:24" "20:50:22" "20:50:24" "12:54:40"
This gives a chron "times" class vector:
> library(chron)
> times(gsub("(..)(..)(..)", "\\1:\\2:\\3", DF$Time))
[1] 02:42:01 05:47:22 21:30:24 20:50:22 20:50:24 12:54:40
The "times" class can display times without having to display the date and supports various methods on the times.
On the other hand, if only a character string is wanted then only the gsub part is needed.
