Regex to capture a substring containing punctuation in R

I have a list of strings where each element contains uppercase names with and without punctuation, followed by a sentence.
names_list = list("MICKEY MOUSE is a Disney character",
"DAFFY DUCK is a Warner Bros. character",
"GARFIELD, ODI AND JOHN are characters from a USA cartoon comic strip.",
"BUGS-BUNNY AND FRIENDS Warner Bros. owns these characters.")
I want to extract only the capitalised names at the start of each string. I got as far as:
library('stringr')
str_extract(names_list, '([:upper:]+([:punct:]?[:upper:]?)[:space:])+')
[1] "MICKEY MOUSE " "DAFFY DUCK " "GARFIELD, ODI AND JOHN " "BUNNY AND FRIENDS "
I can't figure out how to specify the mid-word punctuation as in "BUGS-BUNNY" so that I can pull out the whole word. Help much appreciated!

You can capture a run of upper-case letters, punctuation and whitespace, and use a lookahead to stop the match just before a whitespace character that is followed by any upper- or lower-case letter.
library(stringr)
str_extract(names_list, '([[:upper:][:punct:][:space:]])+(?=\\s[A-Za-z])')
#[1] "MICKEY MOUSE" "DAFFY DUCK" "GARFIELD, ODI AND JOHN"
# "BUGS-BUNNY AND FRIENDS"

We can use sub from base R
sub("^([A-Z, -]+)\\s+.*", "\\1", unlist(names_list))
#[1] "MICKEY MOUSE" "DAFFY DUCK" "GARFIELD, ODI AND JOHN" "BUGS-BUNNY AND FRIENDS"
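For completeness, the lookahead idea can also be written without stringr. This is only a sketch, assuming every string begins with the capitalised names; the helper name extract_caps is made up for illustration, and perl = TRUE is needed because the default regex engine does not support lookaheads.

```r
names_list <- list("MICKEY MOUSE is a Disney character",
                   "DAFFY DUCK is a Warner Bros. character",
                   "GARFIELD, ODI AND JOHN are characters from a USA cartoon comic strip.",
                   "BUGS-BUNNY AND FRIENDS Warner Bros. owns these characters.")

# Hypothetical helper: returns the leading run of upper-case letters,
# punctuation and whitespace, or NA when a string does not start with one.
extract_caps <- function(x) {
  x <- unlist(x)
  m <- regexpr("^[[:upper:][:punct:][:space:]]+(?=\\s[A-Za-z])", x, perl = TRUE)
  found <- as.integer(m) != -1L
  out <- rep(NA_character_, length(x))
  out[found] <- regmatches(x, m)  # regmatches() only returns the matches
  out
}

extract_caps(names_list)
```

Anchoring with ^ keeps the match at the start of the string, and the NA handling means a non-matching element does not silently shorten the result.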

Related

Using R - gsub using code in replacement - Replace comma with full stop after pattern

I would like to manually correct a record by using R. Last name and first name should always be separated by a comma.
names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
Sometimes, however, a full stop has crept in as a separator, as in the case of "JOHNSON. Richard". I would like to do this automatically. Since the last name is always at the beginning of the line, I can simply access it via sub:
sub("^[[:upper:]]+\\.","^[[:upper:]]+\\,",names)
However, I cannot use a function for the replacement that specifically replaces the full stop with a comma.
Is there a way to insert a function into the replacement that does this for me?
Your sub is mostly correct, but you need a capture group (the parentheses) in the pattern and a backreference (\\1) in the replacement.
Because the upper-case letters are captured, \\1 stands for the original upper-case letters of each string. The only real change is the full stop \\. becoming a comma. In other words, we replace the upper-case letters ^([[:upper:]]+) AND the full stop \\. with the original content \\1 AND a comma (the comma does not need to be escaped, so \\1, works just as well as \\1\\,).
For more details you can visit this page.
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
sub("^([[:upper:]]+)\\.","\\1\\,",test_names)
[1] "ADAM, Smith J." "JOHNSON, Richard" "BROWN, Wilhelm K."
[4] "DAVIS, Daniel"
This can also be done with a function, like so:
names <- c("ADAM, Smith", "JOHNSON. Richard", "BROWN, Wilhelm", "DAVIS, Daniel")
replacedots <- function(mystring) {
  # operate on the argument, not on the global vector
  gsub("\\.", ",", mystring)
}
replacedots(names)
[1] "ADAM, Smith" "JOHNSON, Richard" "BROWN, Wilhelm" "DAVIS, Daniel"
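One difference between the two answers is worth seeing side by side. The sketch below uses a shortened copy of the question's data (the variable names are made up for illustration) to show that replacing every full stop also changes the stops after initials, whereas the anchored capture group only touches the one directly after the surname.

```r
names_with_initials <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.")

# Replaces every full stop, including the ones after initials
replaced_everywhere <- gsub(".", ",", names_with_initials, fixed = TRUE)

# Only replaces a full stop that directly follows the leading upper-case surname
replaced_after_surname <- sub("^([[:upper:]]+)\\.", "\\1,", names_with_initials)

print(replaced_everywhere)
# [1] "ADAM, Smith J,"    "JOHNSON, Richard"  "BROWN, Wilhelm K,"
print(replaced_after_surname)
# [1] "ADAM, Smith J."    "JOHNSON, Richard"  "BROWN, Wilhelm K."
```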

Extracting words between word/space patterns

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText  John Smith is a 60 y.o. MoreText"
b <- "YetMoreText  Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText  Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: removes a pattern that starts at the beginning of the string (^) with zero or more letters ([:alpha:]*), followed by one or more whitespace characters (\\s+).
Second regex: removes a pattern that consists of a whitespace character (\\s) followed by the sequence is and then any number of non-newline characters (.*) up to the end of the string ($).
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
You can use sub in base R as -
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches two or more whitespace characters.
(.*) is a capture group that captures everything up to the point where
is, followed by a number and y.o or y/o, is encountered.
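If the number of spaces in front of the name cannot be relied on, a lazy capture group is another option. This is a sketch under two assumptions: the junk text in front of the name is a single word without whitespace, and the name itself never contains the standalone word "is". The function name extract_name_lazy is hypothetical.

```r
x <- c("SomeOtherText John Smith is a 60 y.o. MoreText",
       "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
       "JustJunkText Billy Smtih III is 5 y/o MoreTextThree")

# Skip the first word, then lazily capture everything up to the first " is "
extract_name_lazy <- function(x) {
  sub("^\\S+\\s+(.*?)\\s+is\\s.*$", "\\1", x, perl = TRUE)
}

extract_name_lazy(x)
# [1] "John Smith"      "Will Smth Jr."   "Billy Smtih III"
```

The lazy quantifier (.*?) stops at the first " is ", so a later "is" in the trailing text does not extend the capture.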

Extract only the sentence portion of a section header

I have a small problem.
I have text that looks like:
B.1 My name is John
I want to only obtain:
My name is John
I'm having difficulty leaving out both the B and the 1, at the same time
You can do this with sub and a regular expression.
TestStrings = c("B.1 My name is John", "A.12 This is another sentence")
sub("\\b[A-Z]\\.\\d+\\s+", "", TestStrings)
[1] "My name is John" "This is another sentence"
The \\b indicates a word boundary (so that a prefix of several letters is not matched)
[A-Z] will match a single capital letter.
\\. will match a period
\\d+ will match one or more digits
\\s+ will match any trailing blank space.
The part that is matched will be replaced with the empty string.
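To make the effect of the word boundary concrete, here is a small sketch with a couple of made-up extra inputs. It also shows an anchored variant (using ^ instead of \\b), which is an assumption on my part about what is wanted when the header can only occur at the start of the string.

```r
headers <- c("B.1 My name is John",
             "A.12 This is another sentence",
             "AB.1 Two-letter prefix",
             "See B.1 for details")

# Word boundary: a single capital letter, wherever it starts a word
with_boundary <- sub("\\b[A-Z]\\.\\d+\\s+", "", headers, perl = TRUE)

# Anchor: a single capital letter only at the very start of the string
with_anchor <- sub("^[A-Z]\\.\\d+\\s+", "", headers, perl = TRUE)

print(with_boundary)
print(with_anchor)
```

Both versions leave "AB.1 Two-letter prefix" alone, but only the anchored one leaves "See B.1 for details" untouched; the word-boundary version turns it into "See for details".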
If you are sure that all the strings that you need have the same (or similar) initial part you can do
> a<-"B.1 My name is John"
> substr(a, 5, nchar(a))
[1] "My name is John"

Removing parentheses, text preceding comma, and the comma in a string using stringr

I have a string that contains a person's name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation to remain, so in this case the only remaining string would be "OH".
I can get rid of the parentheses and comma with
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
## => [1] "OH"
See this R demo
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars, as many as possible, then a comma, 0+ whitespace characters, then captures 2 uppercase ASCII letters into Group 1 (referred to with \\1 in the replacement pattern), and then matches ) at the end of the string
[A-Z]{2}(?=\\)$) - matches 2 uppercase ASCII letters only if they are followed by ) at the end of the string.
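The lookahead pattern can also be used in base R through regexpr and regmatches. This is a sketch only; the second record is invented for illustration, and perl = TRUE is required for the lookahead.

```r
mock <- c("Joe Smith (Cleveland, OH)",
          "Ann Lee (Louisville, KY)")  # second record is made up

# Find two upper-case letters that sit directly before the closing ")"
m <- regexpr("[A-Z]{2}(?=\\)$)", mock, perl = TRUE)
state <- regmatches(mock, m)

state
# [1] "OH" "KY"
```

Note that regmatches drops elements without a match, so the result can be shorter than the input if some records do not follow the format.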
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state is in the second and third last characters then match everything, .*, and then a capture group of two characters (..) and then another character . and replace that with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

Shifting text around in a sentence via R

I have an R dataframe with movie names like so:
Shawshank Redemption, The
Godfather II, The
Band of Brothers
I would like to display these names as:
The Shawshank Redemption
The Godfather II
Band of Brothers
Can anyone help with how to check each row of the dataframe to see whether there is a 'The' after a comma (as above) and, if there is, shift it to the front of the title?
You can use gsub:
df$movies2 = gsub("^([\\w\\s]+),*\\s*([Tt]he*($|(?=\\s\\(\\d{4}\\))))", "\\2 \\1", df$movies, perl = TRUE)
Result:
> df
movies movies2
1 Shawshank Redemption, The (1994) The Shawshank Redemption (1994)
2 Godfather II, The The Godfather II
3 Band of Brothers Band of Brothers
4 Dora, The Explorer Dora, The Explorer
5 Kill Bill Vol. 2 The Kill Bill Vol. 2 The
6 ,The Highlander ,The Highlander
7 Happening, the the Happening
Data:
df = data.frame(movies = c("Shawshank Redemption, The (1994)",
"Godfather II, The",
"Band of Brothers",
"Dora, The Explorer",
"Kill Bill Vol. 2 The",
",The Highlander",
"Happening, the"), stringsAsFactors = FALSE)
Notes:
The goal of the entire regex is to group the first part (part before ,) and the second part ('The' after , and only when it's at the end or before (year)) into separate capture groups which I can swap with \\2 and \\1
^([\\w\\s]+) matches any word character or spaces one or more times starting from the beginning of the string
,*\\s* matches comma and space both zero or more times
[Tt]he* matches "T" or "t", then "h", then zero or more "e" characters (the * applies only to the e), so in practice it matches "The" or "the"
Notice that it is followed by ($|(?=\\s\\(\\d{4}\\))) which matches the "end of string", $, or a positive lookahead, which checks whether the previous pattern is followed by \\s\\(\\d{4}\\)
\\s\\(\\d{4}\\) matches a space and (4 digits) including the parentheses. Double backslashes are needed to escape a single backslash
So ([Tt]he*($|(?=\\s\\(\\d{4}\\)))) matches "The" or "the" either at the end of string or if it is followed by (4 digits)
The parenthesized parts (other than the lookahead) are capture groups, so \\2 \\1 swaps the first capture group, ([\\w\\s]+), with the second, ([Tt]he*($|(?=\\s\\(\\d{4}\\))))
If a string has no "The" in that position, the pattern does not match at all, so gsub leaves it unchanged and the original string is returned.
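A somewhat simpler variant of the same idea is sketched below. It is not the pattern from the answer: it assumes the article always follows a comma, and it uses (.*) for the title so that punctuation such as "Vol. 2" is allowed. The function name move_article and the extra "Kill Bill Vol. 2, The" entry are made up for illustration.

```r
movies <- c("Shawshank Redemption, The (1994)",
            "Godfather II, The",
            "Band of Brothers",
            "Dora, The Explorer",
            "Kill Bill Vol. 2 The",
            "Kill Bill Vol. 2, The",
            ",The Highlander",
            "Happening, the")

# Move a trailing ", The" / ", the" to the front. The lookahead accepts the
# article only at the end of the string or right before a "(year)" suffix.
move_article <- function(x) {
  sub("^(.*),\\s*([Tt]he)(?=$|\\s\\(\\d{4}\\)$)", "\\2 \\1", x, perl = TRUE)
}

move_article(movies)
```

Titles that do not match are returned unchanged, so "Dora, The Explorer", "Kill Bill Vol. 2 The" and ",The Highlander" are left as they are, while "Kill Bill Vol. 2, The" becomes "The Kill Bill Vol. 2".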
This seems to work for me:
#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers")
#use grep to find those with ", The" at the end
the.end=grep(", The$",x)
#trim movie titles to remove ", The"
trimmed=strtrim(x[the.end],nchar(x[the.end])-5)
#add "The " to the beginning of the trimmed titles
final=paste("The",trimmed)
#replace the trimmed elements of the movie vector
x[the.end]<-final
#take a look
x
Note that this doesn't remove ", The" from anywhere in the name other than the end... I think that's the behaviour that you want. It will also miss any "The" without the comma, or lower case "the". To see what I mean, try this as your initial movie vector:
#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers",
"Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander",
"Happening, the")

Resources