Problems in a regular expression to extract names using stringr - r

I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,

It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
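As a quick check, the pattern can be run against the sample vector from the question. This is a minimal sketch using base R's regexpr and regmatches (stringr's str_extract should work as well); the variable name starts_upper is only illustrative. Note that it also keeps "Senator" and "THIS IS A DOCUMENT", since both start with a capital letter.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
          "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# regexpr() finds the first match in each element; regmatches() then
# returns only the elements that actually matched.
starts_upper <- regmatches(text, regexpr("^[[:upper:]][^,]+", text))
print(starts_upper)
```

If "Senator" and the all-caps entry are unwanted, the pattern needs an extra condition on what follows the comma.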

I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!

You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
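If entries such as "Senator, 1.4balbal" should be left out, one possible tightening is to require that only digits follow the comma up to the end of the string. That requirement is an assumption about the data, not something stated in the question. The sketch below uses base R with perl = TRUE, and the names strict_pattern and towns_strict are illustrative.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
          "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# Same idea as the pattern explained above, but the lookahead now demands
# ", " followed by digits only until the end of the string.
strict_pattern <- "\\b\\p{Lu}[^,]++(?=, \\d+$)"
towns_strict <- unlist(regmatches(text, gregexpr(strict_pattern, text, perl = TRUE)))
print(towns_strict)
```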
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"
>

Related

R text mining - remove special characters and quotes

I'm doing a text mining task in R.
Tasks:
1) count sentences
2) identify and save quotes in a vector
Problems :
False full stops like "..." and periods in titles like "Mr." have to be dealt with.
There are definitely quotes in the text body data, and there'll be "..." in them. I was thinking of extracting those quotes from the main body and saving them in a vector. (There's some manipulation to be done with them too.)
IMPORTANT TO NOTE: My text data is in a Word document. I use readtext("path to .docx file") to load it in R. When I view the text, quotes appear as plain " rather than \", unlike in the reproducible text.
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
reproducible text
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is it's splitting by false full-stops
Solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it's not replacing [Miss. by [Miss
To identify quotes :
stri_extract_all_regex(text, '"\\S+"')
but that's not working either. (It does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vector is :
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I wanted the sentences separated (so I can count how many sentences in each paragraph).
And quotes also separated.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
You may match all your current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
That is, \w+ matches 1 or more word chars and \. matches a dot.
Next, if you just want to extract quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
The " matches a " and [^"]* matches 0 or more chars other than ".
If you plan to match your sentences together with quotes, you might consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
Details
\\s* - 0+ whitespaces
"[^"]*" - a ", 0+ chars other than " and a "
| - or
[^"?!.]+ - 1+ chars other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . chars
[^"[:alnum:]]* - 0+ chars other than " and alphanumeric chars
R sample code:
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
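Since the first task was to count sentences, the list returned by regmatches can be passed to lengths(). Here is a minimal sketch on a short made-up string. The sample text and the names sentence_pattern, pieces and n_sentences are illustrative, and titles such as "Mr." are assumed to have been cleaned up already, as in the sample code above.

```r
# Same sentence/quote pattern as in the sample code above
sentence_pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'

# Made-up text without title periods
sample_text <- "Mr Key has two children. He is happy! Is he?"

# One list element per input string, one vector entry per sentence or quote
pieces <- regmatches(sample_text, gregexpr(sentence_pattern, sample_text))

# lengths() counts the entries in each list element
n_sentences <- lengths(pieces)
print(n_sentences)
```

With a vector of paragraphs as input, lengths() would return one count per paragraph.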

How do I extract text between two characters in R

I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.
You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, CITY:\s*\K.* regex matches
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
See the regex demo online.
Note that since it is a PCRE regex, perl=TRUE is indispensable.
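Because \s* accepts any amount of whitespace, the same pattern also copes with records where the spacing after CITY: varies. A small sketch on a made-up input (the records string and the cities variable are hypothetical):

```r
# Hypothetical input with a varying amount of whitespace after "CITY:"
records <- "NAME: Bob\nCITY: PARIS\nCITY:    NEW YORK\n"

# \K drops "CITY:" and the whitespace from the reported match;
# .* then runs to the end of the line, since . does not match a line break here.
cities <- unlist(regmatches(records, gregexpr("CITY:\\s*\\K.*", records, perl = TRUE)))
print(cities)
```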
Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s{3}).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
Reads as: extract anything preceded by "CITY:" plus three whitespace characters and followed by "\n".
An option can be as:
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"
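The leading space left by this last option can be removed afterwards with trimws(). A minimal sketch, reusing the x string from the question as shown above (raw_cities and cities are illustrative names):

```r
x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"

# The lookbehind keeps "CITY:" out of the match but not the whitespace after it
raw_cities <- unlist(regmatches(x, gregexpr("(?<=CITY:).*(?=\n\n)", x, perl = TRUE)))

# trimws() strips leading and trailing whitespace from every element
cities <- trimws(raw_cities)
print(cities)
```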

Extract All Strings between a sequence of numbers

I'm working on a regular expression for a string in which a four-digit number followed by a name repeats multiple times.
The text pattern is a series of 4 numbers, then a string. I would like to extract the string after the four numbers. The four numbers must appear before the string. In the example below, I do not want to extract "Not this one", but would like the strings after four numbers.
string_to_inspect <-"Not This One 4586 This one 8888 Another one 8955 PS109 8566 Last One"
My ideal extraction is a character vector that looks like:
"This one" "Another one" "PS109" "Last One"
I have tried
str_extract_all(pattern = "[0-9]{4}(.*?)", string = string_to_inspect)
And it returns a single string that includes all the numbers
"4586 This one 8888 Another one 8955 PS109 8566 Last One"
I have tried various combinations but I know I must be missing something critical.
We can split the string by four digits, remove the first one, and then trim the white space.
library(stringr)
str_trim(str_split(string_to_inspect, pattern = "\\s[0-9]{4}\\s")[[1]][-1])
# [1] "This one" "Another one" "PS109" "Last One"
strsplit(string_to_inspect, " [0-9]+ ")
In case you don't want problems with strings mixed with numbers:
string_to_inspect <-"Not This One 4586 This one 8888 Another one 8955 PS109 8566 Last One"
str2insp <- strsplit(string_to_inspect, ' ')[[1]]
str2insp[!gsub('[[:digit:]]', '', str2insp) == '']
outputs:
#[1] "Not" "This" "One" "This" "one" "Another" "one" "PS109" "Last" "One"
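Another way to get only the strings that follow a four-digit number is to extract them directly with lookarounds instead of splitting. This is a sketch in base R with perl = TRUE; it assumes the numbers are standalone four-digit tokens separated from the names by single spaces, as in the example string. The names pattern and after_numbers are illustrative.

```r
string_to_inspect <- "Not This One 4586 This one 8888 Another one 8955 PS109 8566 Last One"

# (?<=\b\d{4} )  : the match must start right after a standalone 4-digit number
# .*?            : as few characters as possible
# (?= \d{4}\b|$) : stop before the next standalone 4-digit number or at the end
pattern <- "(?<=\\b\\d{4} ).*?(?= \\d{4}\\b|$)"

after_numbers <- unlist(regmatches(string_to_inspect,
                                   gregexpr(pattern, string_to_inspect, perl = TRUE)))
print(after_numbers)
```

"Not This One" is skipped because no four-digit number precedes it, and "PS109" is kept because its digits do not form a standalone four-digit token.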

Removing parentheses, text preceding comma, and the comma in a string using string

I have a string that contains a person's name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation remaining, so in this case, the only remaining string would be "OH"
I can get rid of the parentheses and comma
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
See this R demo
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars as many as possible, then ,, 0+ whitespaces, and then captures 2 uppercase ASCII letters into Group 1 (referred to with \1 from the replacement pattern) and then matches ) at the end of string
[A-Z]{2}(?=\)$) - matches 2 uppercase ASCII letters if followed with the ) at the end of the string.
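Both sub() and str_extract() are vectorized, so the same pattern can be applied to a whole column of records. One thing to watch is that sub() returns the input unchanged when nothing matches. The sketch below flags such rows with NA; the people vector and the names state_pattern and states are hypothetical.

```r
# Hypothetical records; the last one deliberately breaks the expected format
people <- c("Joe Smith (Cleveland, OH)", "Ann Lee (Louisville, KY)", "No State Here")

state_pattern <- ".+,\\s*([A-Z]{2})\\)$"

# Keep the captured state where the pattern matches, NA otherwise
states <- ifelse(grepl(state_pattern, people),
                 sub(state_pattern, "\\1", people),
                 NA_character_)
print(states)
```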
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state is in the second and third last characters then match everything, .*, and then a capture group of two characters (..) and then another character . and replace that with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

separating last sentence from a string in R

I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!). Hence I am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string as shown:-
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
  trimws(x[length(x)])
})
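Note that strsplit removes the punctuation it splits on, so the last sentence comes back without its closing mark. If the mark should be kept, one alternative is to match the final sentence directly. This is a rough sketch, not a general sentence splitter: last_sentence is a hypothetical helper, and abbreviations such as "Mr." would still confuse it.

```r
# Hypothetical helper: keep the final sentence together with its end marks
last_sentence <- function(x) {
  x <- trimws(x)
  # [^.!?]+ : the sentence body, [.!?]* : its closing marks, $ : end of string
  m <- regexpr("[^.!?]+[.!?]*$", x)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched elements, so index by m > 0
  out[m > 0] <- trimws(regmatches(x, m))
  out
}

paragraph <- c("This is a sentence. And another sentence!",
               "What if there are no spaces between sentences?It won't work.",
               "Single sentences are not a problem.")
result_last <- last_sentence(paragraph)
print(result_last)
```

Since the match is driven by the punctuation itself, this does not depend on a space between sentences.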
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
This regex anchors to the end of the string with $ and allows an optional '.' or '!' at the end. At the front, the positive lookbehind (?<=...) requires the match to start right after the ". " or "! " that ends the prior sentence, without including those characters in the match; the ^ alternative covers a string that holds a single sentence. The body [^.!]+ cannot cross a '.' or '!', so a match starting at ^ cannot run across earlier sentences.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library(stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^))[^.!]+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."

Resources