str_replace: replacement depending on wildcard value [A-Z] - r

I have a number of strings containing the pattern "of" followed by an uppercase letter without spaces (in regex: "of[A-Z]"). I want to add spaces, e.g. "PrinceofWales" should become "Prince of Wales" etc.). However, I couldn't find how to add the value of [A-Z] that was matched into the replacement value:
library(tidyverse)
str_replace("PrinceofWales", "of[A-Z]", " of [A-Z]")
# Gives: Prince of [A-Z]ales
# Expected: Prince of Wales
str_replace("DukeofEdinburgh", "of[A-Z]", " of [A-Z]")
# Gives: Duke of [A-Z]dinburgh
# Expected: Duke of Edinburgh
Can someone enlighten me? :)

It needs to be captured as a group (([A-Z])) and replace with the backreference (\\1) of the captured group i.e. regex interpretation is in the pattern and not in the replacement
stringr::str_replace("PrinceofWales", "of([A-Z])", " of \\1")
[1] "Prince of Wales"
According to ?str_replace
replacement - A character vector of replacements. Should be either length one, or the same length as string or pattern. References of the form \1, \2, etc will be replaced with the contents of the respective matched group (created by ()).
Or another option is a regex lookaround
stringr::str_replace("PrinceofWales", "of(?=[A-Z])", " of ")
[1] "Prince of Wales"

Related

R separate words from numbers in string

I need to clean up some data strings that have words and numbers or just numbers.
below is a toy sample
library(tidyverse)
c("555","Word 123", "two words 123", "three words here 123") %>%
sub("(\\w+) (\\d*)", "\\1|\\2", .)
The result is this:
[1] "555" "Word|123" "two|words 123" "three|words here 123"
but I want to place the '|' before the last set of numbers like shown below
[1] "|555" "Word|123" "two words|123" "three words here|123"
We can use sub to match zero or more spaces (\\s*) followed by a digit we capture as a group ((\\d)) and in the replacement use the | followed by the backreference (\\1) of the captured group
sub("\\s*(\\d)", "|\\1", v1)
#[1] "|555" "Word|123"
#[3] "two words|123" "three words here|123"
data
v1 <- c("555","Word 123", "two words 123", "three words here 123")
You may use
^(.*?)\s*(\d*)$
Replace with \1|\2. See the regex demo.
In R:
sub("^(.*?)\\s*(\\d*)$", "\\1|\\2", .)
Details
^ - start of string
(.*?) - Capturing group 1: any 0+ chars, as few as possible
\s* - zero or more whitespaces
(\d*) - Capturing group 2: zero or more digits
$ - end of string.

How do I extract text between two characters in R

I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.
You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, CITY:\s*\K.* regex matches
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
See the regex demo online.
Note that since it is a PCRE regex, perl=TRUE is indispensible.
Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s{3}).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
reads as: extract anything preceded by "City: " (and three spaces) and followed by "\n"
An option can be as:
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"

Removing parentheses, text proceeding comma, and the comma in a string using string

I have a string that contains a persons name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation remaining, so it in this case, the only remaining string would be "OH"
I can get rid of the the parentheses and comma
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
str_extract(mock, "[A-Z]{2}(?=\\)$)")
See this R demo
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars as many as possible, then ,, 0+ whitespaces, and then captures 2 uppercase ASCII letters into Group 1 (referred to with \1 from the replacement pattern) and then matches ) at the end of string
[A-Z]{2}(?=\)$) - matches 2 uppercase ASCII letters if followed with the ) at the end of the string.
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state is in the second and third last characters then match everything, .*, and then a capture group of two characters (..) and then another character . and replace that with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuations. However, the extracted names have a space at the end. I was wondering if I can find the exact pattern of " (", a space and a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
See the regex demo
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
See the regex demo.
It matches
\p{Lu} - an uppercase letter
.*? - any char other than line break chars, as few as possible, up to the first occurrence of (but not including into the match, as (?=...) is a non-consuming construct)....
(?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
\\s* - 0+ whitespace chars
\\( - a literal (.

gsub only part of pattern

I want to use gsub to correct some names that are in my data. I want names such as "R. J." and "A. J." to have no space between the letters.
For example:
x <- "A. J. Burnett"
I want to use gsub to match the pattern of his first name, and then remove the space:
gsub("[A-Z]\\.\\s[A-Z]\\.", "[A-Z]\\.[A-Z]\\.", x)
But I get:
[1] "[A-Z].[A-Z]. Burnett"
Obviously, instead of the [A-Z]'s I want the actual letters in the original name. How can I do this?
Use capture groups by enclosing patterns in (...), and refer to the captured patterns with \\1, \\2, and so on. In this example:
x <- "A. J. Burnett"
gsub("([A-Z])\\.\\s([A-Z])\\.", "\\1.\\2.", x)
[1] "A.J. Burnett"
Also note that in the replacement you don't need to escape the . characters, as they don't have a special meaning there.
You can use a look-ahead ((?=\\w\\.)) and a look-behind ((?<=\\b\\w\\.)) to target such spaces and replace them with "".
x <- c("A. J. Burnett", "Dr. R. J. Regex")
gsub("(?<=\\b\\w\\.) (?=\\w\\.)", "", x, perl = TRUE)
# [1] "A.J. Burnett" "Dr. R.J. Regex"
The look-ahead matches a word character (\\w) followed by a period (\\.), and the look-behind matches a word-boundary (\\b) followed by a word character and a period.

Resources