Regex pattern questions in r

I need to extract the author and the time from a string in R.
test = "Postedby BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."
I am currently using gsub() to find the desired output.
Expected output would be:
#author
"BeauHDon"
#Month
"November"
#Date
24
#Time
22:30
I got to gsub("Postedby (.*).*", "\\1", test) but the output is
"BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."
Also, I understand the time requires more coding after extracting 10:30.
Is it possible to add 12 if the next two characters are PM?
Thank you.

We can extract the pieces with capture groups (assuming every string follows the pattern shown in the example). From the start of the string (^), the pattern skips the first run of non-whitespace characters (\\S+, here "Postedby") and the spaces after it (\\s+), captures the author ((\\w+)), skips the weekday (\\s+\\w+\\s+), captures the month ((\\w+)) and the day ((\\d+)), skips everything up to the # ([^#]+#), and captures the time that follows it ((\\S+))
v1 <- scan(text=sub("^\\S+\\s+(\\w+)\\s+\\w+\\s+(\\w+)\\s+(\\d+)[^#]+#(\\S+).*",
"\\1,\\2,\\3,\\4", test), what = "", sep=",", quiet = TRUE)
As the last entry is the time, we can parse it with strptime, reformat it as 24-hour time, and assign it back to the last element
v1[4] <- format(strptime(v1[4], "%I:%M %p"), "%H:%M")
If needed, set the names of the element with author, Month etc.
names(v1) <- c("#author", "#Month", "#Date", "#Time")
v1
# #author #Month #Date #Time
#"BeauHDon" "November" "24" "22:30"
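On the side question about adding 12 when the time is PM: the following is a minimal base R sketch, assuming every string follows the layout shown in the question. The pattern rx and the variable names are illustrative and not part of the answer above. It converts 12-hour to 24-hour time arithmetically, so it does not depend on locale-specific AM/PM parsing.

```r
test <- "Postedby BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."

# One capture group per field: author, month, day, hour, minute, AM/PM marker
rx <- "^Postedby (\\S+) \\w+ (\\w+) (\\d{1,2}), \\d{4} #(\\d{1,2}):(\\d{2})([AP]M)"

# m[1] is the full match; m[2] to m[7] are the six capture groups
m <- regmatches(test, regexec(rx, test))[[1]]

# 12-hour to 24-hour: 12AM becomes 0, 12PM stays 12, other PM hours gain 12
offset <- if (m[7] == "PM") 12L else 0L
hour   <- as.integer(m[5]) %% 12L + offset
time24 <- sprintf("%02d:%s", hour, m[6])

result <- list(author = m[2], month = m[3], date = as.integer(m[4]), time = time24)
result$time
# [1] "22:30"
```

The modulo step is what makes 12:xxAM and 12:xxPM come out right, which a plain "add 12 if PM" rule would not.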

Related

Extract timestamp in file name

I have a vector containing file names:
Filenames = c("blabla_bli_20140524_002532_000.wav",
"20201025_231205.wav",
"ble_20190612_220013_012.wav",
"X-20150312_190225_Blablu.wav",
"0000125.wav")
Of course the real vector is longer. I need to extract the timestamp from each name so as to obtain the following result in date format:
Result = c("2014-05-24 00:25:32.000",
"2020-10-25 23:12:05.000",
"2019-06-12 22:00:13.012",
"2015-03-12 19:02:25.000",
NA)
Note that sometimes characters are present before or after the timestamp, and that some file names display the milliseconds while some do not (in this case they are assumed to be 000 milliseconds). I also need to obtain something like "NA" if the file name does not have the timestamp in it.
I found what looks like a good solution here, but it is in Python and I don't know how to translate it into R. I tried this, which does not work:
str_extract(Filenames, regex("_((\d+)_(\d+))"))
Error: '\d' is an unrecognized escape in character string starting ""_((\d"

You can use gsub to extract your datetime part by pattern, then rely (mostly) on the parsing of datetimes by POSIX standards. The only catch is that in POSIX, milliseconds must be expressed as ss.mmm (fractional seconds), so you need to replace the _ with a .
timestamps <- gsub(".*?(([0-9]{8}_[0-9]{6})(_([0-9]{3}))?).*?$", "\\2.\\4", Filenames)
timestamps
[1] "20140524_002532.000" "20201025_231205." "20190612_220013.012"
[4] "20150312_190225." "0000125.wav"
We have captured the datetime section (\\2), added a dot, then the (optional) millisecond section (\\4), without the underscore. Note that the mismatched filename remains unchanged - that's ok.
Now we specify the datetime format using the POSIX specification to first parse the strings, then print them in a different format:
times <- as.POSIXct(timestamps, format="%Y%m%d_%H%M%OS")
format(times, "%Y-%m-%d %H:%M:%OS3")
[1] "2014-05-24 00:25:32.000" "2020-10-25 23:12:05.000" "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000" NA
Note that the string that was not a timestamp just got turned into NA, so you can easily get rid of it here. Also, if you want to see the milliseconds printed out, you need to use the %OS (fractional seconds) format, plus the number of digits (0-6) you want printed - in this case, %OS3.
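One detail the call above leaves implicit is the time zone: without a tz argument, as.POSIXct interprets the strings in the session's local time zone, and a wall-clock time that does not exist locally (for example inside a daylight-saving gap) may be shifted or come back as NA, depending on the platform. A minimal sketch, assuming the recordings can be treated as UTC (that choice is an assumption, not something stated in the question):

```r
# Two of the cleaned strings produced by the gsub() step above
timestamps <- c("20140524_002532.000", "20190612_220013.012")

# Parse with an explicit time zone so the result does not depend on the machine
times <- as.POSIXct(timestamps, format = "%Y%m%d_%H%M%OS", tz = "UTC")

# Whole-second formatting; fractional seconds can be printed with %OS3 as above
format(times, "%Y-%m-%d %H:%M:%S")
# [1] "2014-05-24 00:25:32" "2019-06-12 22:00:13"
```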

You can use
library(stringr)
rx <- "(\\d{4})(\\d{2})(\\d{2})_(\\d{2})(\\d{2})(\\d{2})(?:_(\\d+))?"
Filenames = c("blabla_bli_20140524_002532_000.wav", "20201025_231205.wav", "ble_20190612_220013_012.wav", "X-20150312_190225_Blablu.wav", "0000125.wav")
m <- str_match(Filenames, rx)
result <- ifelse(is.na(m[,8]),
str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".000"),
str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".", m[,8]))
result
Output:
> result
[1] "2014-05-24 00:25:32.000"
[2] "2020-10-25 23:12:05.000"
[3] "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000"
[5] NA
The regex captures the parts of the datetime string into separate groups. If the last group (the (\\d+) inside the optional non-capturing group (?:_(\\d+))?, which matches the milliseconds) is matched, we append it preceded by a dot; otherwise, we append .000.
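The same capture-group idea can be sketched without stringr, using regexec and regmatches from base R. The helper name extract_timestamp is hypothetical, and the sketch assumes that the milliseconds, when present, are exactly three digits.

```r
# Hypothetical helper: returns "YYYY-MM-DD HH:MM:SS.mmm" or NA for each name
extract_timestamp <- function(x) {
  # Groups 1-6: date and time parts; group 8: optional milliseconds
  rx <- "([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})(_([0-9]{3}))?"
  m <- regmatches(x, regexec(rx, x))
  vapply(m, function(g) {
    # No match at all: regmatches() returns character(0)
    if (length(g) == 0) return(NA_character_)
    # g[1] is the full match, so group k sits at g[k + 1];
    # an unmatched optional group comes back as ""
    ms <- if (nzchar(g[9])) g[9] else "000"
    sprintf("%s-%s-%s %s:%s:%s.%s", g[2], g[3], g[4], g[5], g[6], g[7], ms)
  }, character(1))
}

Filenames <- c("blabla_bli_20140524_002532_000.wav",
               "20201025_231205.wav",
               "ble_20190612_220013_012.wav",
               "X-20150312_190225_Blablu.wav",
               "0000125.wav")
extract_timestamp(Filenames)
```

Because the result is built purely from strings, the milliseconds are copied as written and are not subject to floating-point rounding.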

Formatting and Replacing Multiple Dates within a Single String in R

I have a question very similar to this one. The difference with mine is that I can have text with multiple dates within one string. All the dates are in the same format, as demonstrated below
rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
All my sentences are lower case and all dates follow the %B %d %Y format. I'm able to extract all the dates using the following code:
> pattern <- paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>%
regex(ignore_case = TRUE)
> str_extract_all(rep, pattern)
[[1]]
[1] "june 11 2022" "august 4 2022" "august 25 2022"
What I want to do is replace every instance of a date formatted %B %d %Y with the format %Y-%m-%d. I've tried something like this:
str_replace_all(rep, pattern, as.character(as.Date(str_extract_all(rep, pattern),format = "%B %d %Y")))
Which throws the error do not know how to convert 'str_extract_all' to class "Date". This makes sense to me since I'm trying to replace multiple different dates and R doesn't know which one to replace each match with.
If I change the str_extract_all to just str_extract I get this:
"on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-06-11. on 2022-06-11 there will be a test "
Which again, makes sense since the str_extract is taking the first instance of a date, converting the format, and applying that same date across all instances of a date.
I would prefer if the solution used the stringr package just because most of my string tidying thus far has been using that package, BUT I am 100% open to any solution that gets the job done.

We may capture the date as a group ((...)): one or more word characters (\\w+), followed by a space, one or two digits (\\d{1,2}), another space, and four digits (\\d{4}). In the replacement, we pass a function that converts each captured match to Date class and returns it as a character string
library(stringr)
str_replace_all(rep, "(\\w+ \\d{1,2} \\d{4})", function(x) as.character(as.Date(x, "%B %d %Y")))
Output:
[1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
NOTE: It is better to give the object a different name, as rep is the name of a base R function
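For comparison, here is a base R sketch of the same replacement using gregexpr with the regmatches<- replacement form. The names txt, months and to_iso are illustrative. The month number is looked up in the built-in month.name constant rather than parsed with as.Date, so the sketch does not depend on the session locale; it assumes lower-case English month names, as in the question.

```r
txt <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "

months <- tolower(month.name)
# (january|february|...) followed by a 1-2 digit day and a 4-digit year
rx <- paste0("(", paste(months, collapse = "|"), ") [0-9]{1,2} [0-9]{4}")

# Hypothetical helper: "june 11 2022" -> "2022-06-11"
to_iso <- function(dates) {
  vapply(strsplit(dates, " ", fixed = TRUE), function(p) {
    sprintf("%s-%02d-%02d", p[3], match(p[1], months), as.integer(p[2]))
  }, character(1))
}

# Locate every date, then replace each match with its own conversion
m <- gregexpr(rx, txt)
regmatches(txt, m) <- lapply(regmatches(txt, m), to_iso)
txt
# [1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
```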

You can pass a named vector with multiple replacements to str_replace_all():
library(stringr)
rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
pattern <- paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>%
regex(ignore_case = TRUE)
extracted <- str_extract_all(rep, pattern)[[1]]
replacements <- setNames(as.character(as.Date(extracted, format = "%B %d %Y")),
extracted)
str_replace_all(rep, replacements)
#> [1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
Created on 2022-05-26 by the reprex package (v2.0.1)

Find and extract year within sentence for each cell in R

I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.

It's not clear what "the date is different for each cell" means. If it means that the value of the date varies but it is always the 7th field, then either (1) or (2) will work. If it means that the date is 8 consecutive digits anywhere in the text, or 8 consecutive digits surrounded by _ anywhere in the text, then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end, use read.table to read year, pick out the 7th field, and convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate() from tidyr (version 0.8.2 or later is needed).
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) This assumes that the date is the only sequence of 8 digits in the year field. If we know it is surrounded by _ delimiters, the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)

Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.

You can use sub to extract the date string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"
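Since the real data frame has many rows, it may help to see what happens to a cell that contains no date. A minimal base R sketch, assuming the date is always 8 digits surrounded by underscores; the second row and the new column name date are made up for illustration.

```r
DF <- data.frame(
  year = c("1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
           "no_date_here"),               # hypothetical row without a date
  stringsAsFactors = FALSE
)

# 8 digits delimited by underscores, anywhere in the string
rx <- "^.*_([0-9]{8})_.*$"

# sub() leaves non-matching strings unchanged, so guard with grepl()
digits <- ifelse(grepl(rx, DF$year), sub(rx, "\\1", DF$year), NA_character_)
DF$date <- as.Date(digits, format = "%Y%m%d")
DF$date
# [1] "1987-05-17" NA
```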

R - gsub() for to remove dates from data set

I am using the gsub() function to remove the unwanted text from the data. I just want to have the age in the brackets, not the dates of birth. However, this is in a large data set with differing birth days.
Example of the data:
Test1$Age
Sep 10, 1990(27)
Mar 26, 1987(30
Feb 24, 1997(20)

You can do this using str_extract() from the stringr package:
s <- "Sep 10, 1990(27)"
# get the age in parentheses
stringr::str_extract(s, "\\([0-9]+\\)")
# just the age, with parentheses removed
stringr::str_extract(s, "(?<=\\()[0-9]+")
And the output is:
> s <- "Sep 10, 1990(27)"
>
> # get the age in parentheses
> stringr::str_extract(s, "\\([0-9]+\\)")
[1] "(27)"
>
> # just the age, with parentheses removed
> stringr::str_extract(s, "(?<=\\()[0-9]+")
[1] "27"
The first regular expression matches paired parentheses containing one or more digits. The second regular expression uses positive lookbehind to match one or more digits following an opening parenthesis.
If your data is in a data.frame df with the column named age, then you could do the following:
df$age <- stringr::str_extract(df$age, "\\([0-9]+\\)")
Or, in tidyverse notation:
df <- df %>% mutate(age = stringr::str_extract(age, "\\([0-9]+\\)"))
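Note that the first pattern above requires the closing parenthesis, so an entry such as "Mar 26, 1987(30" from the question would not match it. The lookbehind pattern does not have that requirement. A base R sketch of the same lookbehind, assuming the age is always the run of digits right after the opening parenthesis (the variable names are illustrative):

```r
ages <- c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)")

# Lookbehind needs the PCRE engine, hence perl = TRUE
m <- regexpr("(?<=\\()[0-9]+", ages, perl = TRUE)

# regmatches() returns only the matched elements, so fill an NA vector
age <- rep(NA_integer_, length(ages))
age[m != -1] <- as.integer(regmatches(ages, m))
age
# [1] 27 30 20
```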

There seem to be two problems:
the date prior to the left parenthesis is not wanted
the right parenthesis is sometimes missing and it needs to be inserted
1) sub These can be addressed with sub. Match
any number of characters .* followed by
a literal left parenthesis [(] followed by
digits in a capture group (\\d+) followed by
an optional right parenthesis [)]?
and then replace that with a left parenthesis, the match to the capture group \\1 and a right parenthesis.
No packages are used.
pat <- ".*[(](\\d+)[)]?"
transform(test, Age = sub(pat, "(\\1)", Age))
If, instead, you wanted the age as a numeric field then:
transform(test, Age = as.numeric(sub(pat, "\\1", Age)))
2) substring/sub Another possibility is to take the 13th character onwards which gives everything from the left parenthesis to the end of the string and insert a ) if missing. )?$ matches a right parenthesis at the end of the string or just the end of the string if none. That is replaced with a right parenthesis. Again, no packages are used.
transform(test, Age = sub(")?$", ")", substring(Age, 13)))
A variation of this if we wanted a numeric Age instead would be to take everything from the 14th character and remove the final ) if present.
transform(test, Age = as.numeric(sub(")", "", substring(Age, 14))))
3) read.table Use read.table to read the Age field with sep = "(" and comment.char = ")" and pick off the second column read. This will give the numeric age and we can use sprintf to surround that with parentheses. If Age were character (as opposed to factor) then as.character(Age) could optionally be written as just Age.
Again, no packages are used. This one does not use regular expressions.
transform(test, Age =
sprintf("(%s)", read.table(text = as.character(Age), sep = "(", comment.char = ")")$V2))
Note: The input in reproducible form is:
test <- data.frame(Age = c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)"))

Inserting prefix of 19 into a string date

I have a vector of birth dates as character strings formatted "10-Feb-85".
When I use the as.Date() function in R it assumes the two digit year is after 2000 (none of these birth dates are after the year 2000).
example:
as.Date(x = "10-Feb-52", format = "%d-%b-%y")
returns: 2052-02-10
I'm not proficient in regular expressions but
I think that this is an occasion for a regular expression to insert a "19" after the second "-" or before the last two digits.
I've found a regex that counts forward three characters and inserts a letter:
gsub(pattern = "^(.{3})(.*)$", replacement = "\\1d\\2", x = "abcefg")
But I'm not sure how to count two from the end.
Any help is appreciated.

insert a "19" after the second "-" or before the last two digits.
Before the last two digits:
gsub(pattern = "-(\\d{2})$", replacement = "-19\\1", x = "10-Feb-52")
Here, - is matched first, then 2 digits ((\\d{2})) at the end of the string ($) are matched and captured into Group 1.
After the second -:
gsub(pattern = "^((?:[^-]*-){2})", replacement = "\\119", x = "10-Feb-52")
Here, 2 sequences ({2}) of 0+ chars other than - ([^-]*), each followed by a -, are matched from the start of the string (^) and captured into group 1. The replacement contains a backreference that restores the captured text, followed by the literal 19.
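Putting the first variant together with as.Date, as a minimal sketch: it assumes every birth date is in the 1900s, as stated in the question, and that the session locale uses English month abbreviations for %b. The second value in birth is made up for illustration.

```r
birth <- c("10-Feb-52", "03-Nov-85")   # second value is a made-up example

# Insert "19" in front of the two-digit year at the end of each string
fixed <- sub("-([0-9]{2})$", "-19\\1", birth)
fixed
# [1] "10-Feb-1952" "03-Nov-1985"

# Parse with a four-digit year (%Y) instead of the two-digit %y
# (%b assumes English month abbreviations in the current locale)
dates <- as.Date(fixed, format = "%d-%b-%Y")
```

sub is enough here because each string contains only one year to fix; gsub gives the same result.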
