Extracting words between word/space patterns - r

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.

You can chain multiple calls to stringr::str_remove.
First regex: remove a pattern that starts at the beginning of the string (^) with zero or more letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove a pattern that begins with a whitespace character (\\s) followed by the sequence is and any number of non-newline characters (.*), up to the end of the string ($).
library(stringr)
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
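For comparison, the same two-step removal can be sketched in base R with two sub calls, so it runs without stringr. This is only an illustration: the helper name strip_context is made up, and it assumes, as in the examples above, that the text before the name is a single run of letters and that the first " is" after the name starts the age phrase.

```r
# Example strings from the question, collected in one vector.
txt <- c("SomeOtherText John Smith is a 60 y.o. MoreText",
         "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
         "JustJunkText Billy Smtih III is 5 y/o MoreTextThree")

# Hypothetical helper (not part of any package).
strip_context <- function(x) {
  # Step 1: drop the leading run of letters and the whitespace after it.
  x <- sub("^[[:alpha:]]*\\s+", "", x)
  # Step 2: drop everything from the first whitespace + "is" to the end.
  sub("\\sis.*$", "", x)
}

result <- strip_context(txt)
print(result)
# expected values: "John Smith", "Will Smth Jr.", "Billy Smtih III"
```

Because sub is vectorised, all three strings are handled in one call.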

You can use sub in base R as -
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
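\\s{2,} matches two or more whitespace characters, so the pattern only applies when at least two consecutive spaces come before the name.
(.*) is a capture group that captures everything up to
the whitespace and "is" that are followed by a number and y.o or y/o.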
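The capture-group idea can also be anchored on the age phrase itself instead of on a run of spaces. The sketch below is a variant under stated assumptions, not the answer's own code: it assumes the junk before the name is a single whitespace-free token, that the age has one or two digits, and that the article "a" may be missing (as in the third example). The name extract_name_strict is hypothetical.

```r
txt <- c("SomeOtherText John Smith is a 60 y.o. MoreText",
         "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
         "JustJunkText Billy Smtih III is 5 y/o MoreTextThree")

# Hypothetical helper; perl = TRUE enables the lazy quantifier and (?: ).
extract_name_strict <- function(x) {
  pattern <- paste0(
    "^\\S+\\s+",          # first token (assumed to be junk) and the gap after it
    "(.*?)",              # capture group 1: the name, as short as possible
    "\\s+is\\s+",         # the word "is" surrounded by whitespace
    "(?:a\\s+)?",         # optional article "a"
    "\\d{1,2}\\s+",       # age with one or two digits
    "y[./]o.*$"           # "y.o" or "y/o" and the rest of the string
  )
  sub(pattern, "\\1", x, perl = TRUE)
}

names_found <- extract_name_strict(txt)
print(names_found)
# expected values: "John Smith", "Will Smth Jr.", "Billy Smtih III"
```

Strings that do not fit the pattern are returned unchanged, so it may be worth checking for that case before trusting the result.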

Related

Using R - gsub using code in replacement - Replace comma with full stop after pattern

I would like to manually correct a record by using R. Last name and first name should always be separated by a comma.
names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
Sometimes, however, a full stop has crept in as a separator, as in the case of "JOHNSON. Richard". I would like to do this automatically. Since the last name is always at the beginning of the line, I can simply access it via sub:
sub("^[[:upper:]]+\\.","^[[:upper:]]+\\,",names)
However, I cannot use a function for the replacement that specifically replaces the full stop with a comma.
Is there a way to insert a function into the replacement that does this for me?
Your sub is mostly correct, but you'll need a capture group (the parentheses and the backreference \\1) for the replacement.
Because we are capturing the upper-case letters, \\1 represents the upper-case letters from your original strings. The only change is from \\. to \\,. In other words, we replace the upper-case letters ^([[:upper:]]+) and the full stop \\. with the original content \\1 and a comma \\,.
For more details you can visit this page.
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
sub("^([[:upper:]]+)\\.","\\1\\,",test_names)
[1] "ADAM, Smith J." "JOHNSON, Richard" "BROWN, Wilhelm K."
[4] "DAVIS, Daniel"
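The question also asks whether a function can be applied to the matched text. One way to sketch that in base R is with regexpr and the replacement form of regmatches, which lets any function transform the matched pieces before they are written back. This is an illustrative alternative rather than the approach used in the answer above, and the variable names are placeholders.

```r
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")

# Locate an upper-case surname at the start that is directly followed by a full stop.
m <- regexpr("^[[:upper:]]+\\.", test_names)

fixed <- test_names
# regmatches(fixed, m) returns only the matched pieces (here "JOHNSON.").
# Any function can be applied to them before they are assigned back;
# this one swaps the trailing full stop for a comma.
regmatches(fixed, m) <- sub("\\.$", ",", regmatches(fixed, m))

print(fixed)
# expected values: "ADAM, Smith J.", "JOHNSON, Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel"
```

The full stops in the initials ("J.", "K.") are untouched because only the matched prefix is rewritten.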
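It can also be done with a function like this (note that it replaces every full stop, so it only suits names without abbreviated initials):
names <- c("ADAM, Smith", "JOHNSON. Richard", "BROWN, Wilhelm", "DAVIS, Daniel")
replacedots <- function(mystring) {
  # Use the function argument, not the global variable
  gsub("\\.", ",", mystring)
}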
replacedots(names)
[1] "ADAM, Smith" "JOHNSON, Richard" "BROWN, Wilhelm" "DAVIS, Daniel"

Regex in R - Extracting two letters between spaces

I am trying to extract the two letters between two spaces -
AAPL US Equity
1836 JP Equity
APPLE SOMETHING NOT
C US Equity
Result -
US
JP
US
What I tried was gsub("\\s[A-Z]{2}\\s", "\\1", vec) but that gives me -
AAPLEquity
1836Equity
APPLE SOMETHING NOT
CEquity
which seems the exact opposite of what I want.
We can use sub
out <- rep("", length(vec))
i1 <- grepl("\\b[A-Z]{2}\\b", vec)
out[i1] <- sub(".*\\s+([A-Z]{2})\\s+.*", "\\1", vec[i1])
out
#[1] "US" "JP" "" "US"
Or use str_extract to extract the two upper-case characters that come after a whitespace character (specified by the regex lookbehind) and are followed by a word boundary (\\b)
str_extract(vec, "(?<=\\s)([A-Z]{2})\\b")
#[1] "US" "JP" NA "US"
NOTE: Not copied syntax from others' answer
data
vec <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")
The gsub call removes the parts of the text matched by the regular expression. \s[A-Z]{2}\s matches a whitespace character, two uppercase ASCII letters and another whitespace character, so exactly that stretch is removed from each string in the character vector.
You may use
x <- c('AAPL US Equity','1836 JP Equity','APPLE SOMETHING NOT','C US Equity')
sub(".*\\s+([A-Z]{2})\\s.*|.*", "\\1", x)
# => [1] "US" "JP" "" "US"
Here, the .*\\s+([A-Z]{2})\\s.* alternative matches inputs that contain a two-letter "word" between whitespace characters and puts that word into Group 1 (\1), while the .* alternative matches all other inputs, so sub returns an empty string for them.
Or, you may use
library(stringr)
str_extract(x, "(?<=\\s)[A-Z]{2}(?=\\s)")
# => [1] "US" "JP" NA "US"
Here, (?<=\\s)[A-Z]{2}(?=\\s) matches, and str_extract extracts, the first two-letter uppercase word in each string that sits between whitespace characters.
If the words can be at the start/end of the string use
str_extract(x, "(?<!\\S)[A-Z]{2}(?!\\S)")
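If stringr is not available, a similar result can be sketched in base R with regexpr and regmatches, using perl = TRUE so that the lookarounds are supported. The helper name extract_code is hypothetical, and the last two input strings are made-up cases with the code at the start and at the end of the string.

```r
x <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity",
       "US Equity", "Equity JP")   # last two are illustrative additions

# Hypothetical helper: first two-letter uppercase word that is not
# touching any non-whitespace character on either side, or NA.
extract_code <- function(x) {
  m <- regexpr("(?<!\\S)[A-Z]{2}(?!\\S)", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  matched <- m != -1                 # regexpr returns -1 where nothing matched
  out[matched] <- regmatches(x, m)   # regmatches yields only the matched pieces
  out
}

codes <- extract_code(x)
print(codes)
# expected values: "US", "JP", NA, "US", "US", "JP"
```

Pre-filling the result with NA keeps the output the same length as the input, which mirrors what str_extract does.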

Extract only the sentence portion of a section header

I have a small problem.
I have text that looks like:
B.1 My name is John
I want to only obtain:
My name is John
I'm having difficulty leaving out both the B and the 1, at the same time
You can do this with sub and a regular expression.
TestStrings = c("B.1 My name is John", "A.12 This is another sentence")
sub("\\b[A-Z]\\.\\d+\\s+", "", TestStrings)
[1] "My name is John" "This is another sentence"
The \\b indicates a word boundary (so that the capital letter is not the end of a longer word)
[A-Z] will match a single capital letter.
\\. will match a period
\\d+ will match one or more digits
\\s+ will match any trailing blank space.
The part that is matched will be replaced with the empty string.
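To see the pieces working together, here is a small sketch that anchors a similar pattern at the start of the string and, as an assumption beyond the original question, also allows multi-level numbers such as C.3.4. The extra test strings and the name strip_header are made up for illustration.

```r
# Hypothetical helper (not from the answer above).
strip_header <- function(x) {
  # ^            start of the string
  # [A-Z]        one capital letter
  # (\\.\\d+)+   one or more ".digits" groups, e.g. ".1" or ".3.4"
  # \\s+         the whitespace that follows the header
  sub("^[A-Z](\\.\\d+)+\\s+", "", x)
}

headers <- c("B.1 My name is John",
             "A.12 This is another sentence",
             "C.3.4 Nested section",      # made-up multi-level header
             "No header here")            # made-up string without a header

cleaned <- strip_header(headers)
print(cleaned)
# expected values: "My name is John", "This is another sentence",
#                  "Nested section", "No header here"
```

Anchoring with ^ instead of \\b means a label that appears later in the sentence is left alone.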
If you are sure that all the strings that you need have the same (or similar) initial part you can do
> a<-"B.1 My name is John"
> substr(a, 5, nchar(a))
[1] "My name is John"

Removing parentheses, text preceding comma, and the comma in a string using stringr

I have a string that contains a person's name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation remaining, so in this case, the only remaining string would be "OH"
I can get rid of the parentheses and comma
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
str_extract(mock, "[A-Z]{2}(?=\\)$)")
See this R demo
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars as many as possible, then ,, 0+ whitespaces, and then captures 2 uppercase ASCII letters into Group 1 (referred to with \1 from the replacement pattern) and then matches ) at the end of string
[A-Z]{2}(?=\)$) - matches 2 uppercase ASCII letters if followed with the ) at the end of the string.
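If the name and city are needed as well, the same kind of pattern can capture all three fields at once. The following is a minimal base R sketch using regexec and regmatches; the second record and the column names are made up for illustration, and it assumes every string follows the "Name (City, ST)" layout described in the question (a non-matching string would need separate handling).

```r
mock <- c("Joe Smith (Cleveland, OH)",
          "Jane Doe (Louisville, KY)")   # second record is a made-up example

# Group 1: name, group 2: city, group 3: two-letter state.
pattern <- "^(.*) \\((.*),\\s*([A-Z]{2})\\)$"

# regexec returns the full match plus one entry per capture group.
parts <- regmatches(mock, regexec(pattern, mock))

# Drop the full match (first element) and stack the groups into a matrix.
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(fields) <- c("name", "city", "state")

states <- fields[, "state"]
print(states)
# expected values: "OH", "KY"
```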
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state is in the second and third last characters then match everything, .*, and then a capture group of two characters (..) and then another character . and replace that with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

R Regex searching text between delimiter

I have a data file which has text in the following format:
"name: alex age: 27 profession: it"
I want to pull the data that follows each ':' and exclude the field name before it (e.g. name, age, and profession), so that only the corresponding values are retrieved. The field names are not fixed; they can change.
I want data to be
alex 27 it
We can use gsub to match word (\\w+), then a :, one or more spaces (\\s+) followed by a word captured as a group ((\\w+)) and replace it with the backreference.
str1 <- "name: alex age: 27 profession: it"
gsub("\\w+:\\s+(\\w+)", "\\1", str1)
#[1] "alex 27 it"
NOTE: Here, we assume the pattern of the string is in key: value pair
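Under the same key: value assumption, the pairs can also be pulled apart into a named vector in base R, which keeps the field names available for lookup. This is a sketch only; the variable names are placeholders, and it assumes keys and values are single words made of letters, digits or underscores.

```r
str1 <- "name: alex age: 27 profession: it"

# Find every "key: value" pair in the string.
pairs <- regmatches(str1, gregexpr("\\w+:\\s+\\w+", str1))[[1]]

# Split each pair into its key (before the colon) and its value (after it).
keys   <- sub(":.*$", "", pairs)
values <- sub("^\\w+:\\s+", "", pairs)

record <- setNames(values, keys)
print(paste(values, collapse = " "))
# expected value: "alex 27 it"
```

A single field can then be read by name, for example record[["profession"]].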
Using str_split with a negative lookbehind regex, you can split the text into three pieces
st <- "name: alex age: 27 profession: it"
str_split(st,"(?<!:) ")
After that, it is easy to remove the text that we don't want with gsub
str_split(st,"(?<!:) ") %>% unlist() %>% gsub("^.*: ","",.)
Now, using the same technique but extracting the names and using setNames, we get a named list, which is very comfortable to work with
dta <- setNames(
  str_split(st,"(?<!:) ") %>%
    unlist() %>%
    gsub("^.*: ","",.) %>%
    as.list(),
  str_split(st,"(?<!:) ") %>%
    unlist() %>%
    gsub(":.*$","",.))
dta$profession
[1] "it"
A solution with str_extract_all from stringr. This matches alphanumerics ([[:alnum:]]) that are preceded by a : and a whitespace character (\\s) and end at a word boundary (\\b):
library(stringr)
string <- "name: alex age: 27 profession: it"
str_extract_all(string, "(?<=:\\s)[[:alnum:]]+\\b")[[1]]
# [1] "alex" "27" "it"
or:
paste(str_extract_all(string, "(?<=:\\s)[[:alnum:]]+\\b")[[1]], collapse = " ")
# [1] "alex 27 it"
