How do I extract text between two characters in R

I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.

You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, CITY:\s*\K.* regex matches
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
See the regex demo online.
Note that since it is a PCRE regex, perl=TRUE is indispensable.
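As a rough illustration of the \K approach on input with more than two cities, here is a minimal base R sketch. The helper name extract_cities and the three-city sample string x3 are assumptions made for this example; they are not part of the question's data.

```r
# Hypothetical sample with three CITY entries (not from the original data).
x3 <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\nCITY: NEW YORK\n"

# Hypothetical helper: return every value that follows "CITY:" on its line.
extract_cities <- function(s) {
  # \K drops "CITY:" and the whitespace after it from the reported match;
  # .* then takes the rest of the line (dot does not match "\n" in PCRE).
  m <- gregexpr("CITY:\\s*\\K.*", s, perl = TRUE)
  unlist(regmatches(s, m))
}

cities <- extract_cities(x3)
print(cities)
```

Because gregexpr() finds every match, the same call should cover strings with any number of CITY: entries.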

Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s{3}).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
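reads as: extract anything preceded by "CITY:" plus exactly three whitespace characters and followed by "\n". If only one space follows the colon, as in the string shown above, \\s{3} would need to be adjusted to match (e.g. to \\s).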

Another option is a lookbehind/lookahead pair (note that each match keeps its leading space):
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"
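Since the question asks for results without the leading space, one possible follow-up is to pass the matches through trimws(). This is only a sketch of that extra step; the variable names raw and cities are made up for the example.

```r
x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"

# Same lookbehind/lookahead pattern as above; matches still start with a space.
raw <- regmatches(x, gregexpr("(?<=CITY:).*(?=\n\n)", x, perl = TRUE))[[1]]

# trimws() strips leading and trailing whitespace from each element.
cities <- trimws(raw)
print(cities)
```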

Related

str_replace: replacement depending on wildcard value [A-Z]

I have a number of strings containing the pattern "of" followed by an uppercase letter without spaces (in regex: "of[A-Z]"). I want to add spaces, e.g. "PrinceofWales" should become "Prince of Wales", etc. However, I couldn't find how to add the value of [A-Z] that was matched into the replacement value:
library(tidyverse)
str_replace("PrinceofWales", "of[A-Z]", " of [A-Z]")
# Gives: Prince of [A-Z]ales
# Expected: Prince of Wales
str_replace("DukeofEdinburgh", "of[A-Z]", " of [A-Z]")
# Gives: Duke of [A-Z]dinburgh
# Expected: Duke of Edinburgh
Can someone enlighten me? :)
The letter needs to be captured as a group (([A-Z])) and reinserted with a backreference (\\1) to that group. Regex syntax is interpreted only in the pattern, not in the replacement, so [A-Z] in the replacement is inserted literally.
stringr::str_replace("PrinceofWales", "of([A-Z])", " of \\1")
[1] "Prince of Wales"
According to ?str_replace
replacement - A character vector of replacements. Should be either length one, or the same length as string or pattern. References of the form \1, \2, etc will be replaced with the contents of the respective matched group (created by ()).
Or another option is a regex lookaround
stringr::str_replace("PrinceofWales", "of(?=[A-Z])", " of ")
[1] "Prince of Wales"
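For readers without stringr, the same two ideas can be sketched with base R's gsub(). This swaps in gsub() for str_replace(), so it replaces every match rather than only the first; the vector name titles is an assumption for the example.

```r
titles <- c("PrinceofWales", "DukeofEdinburgh")

# Capture group plus backreference: \\1 reinserts the matched capital letter.
with_backref <- gsub("of([A-Z])", " of \\1", titles)

# Lookahead: the capital letter is checked but not consumed, so nothing
# has to be reinserted. Lookarounds need perl = TRUE in base R.
with_lookahead <- gsub("of(?=[A-Z])", " of ", titles, perl = TRUE)

print(with_backref)
print(with_lookahead)
```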

R / stringr: split string, but keep the delimiters in the output

I searched for a solution, but there does not seem to be a clear one for R.
I am trying to split a string on a pattern of, say, a space followed by a capital letter, using the stringr package.
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")
Normally I would get:
[[1]]
[1] "Foobar foobar," "oobar foobar"
The output I would like to get, however, should include the letter from the delimiter:
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Probably there is no out of box solution in stringr like back-referencing, so I would be happy to get any help.
You may split with 1+ whitespaces that are followed with an uppercase letter:
> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Here,
\\s+ - 1 or more whitespaces
(?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current location in string without adding it to the match value, thus, preserving it in the output.
Note that \s matches various whitespace chars, not just plain regular spaces. Also, it is safer to use [[:upper:]] rather than [:upper:] - if you plan to use the patterns with other regex engines (like PCRE, for example).
We could use a regex lookaround to split at the space between a , and upper case character
str_split(x, "(?<=,) (?=[A-Z])")[[1]]
#[1] "Foobar foobar," "Foobar foobar"
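The first lookahead split above can also be tried in base R. The sketch below assumes strsplit() with perl = TRUE as a stand-in for str_split(); the variable name parts is made up for the example.

```r
x <- "Foobar foobar, Foobar foobar"

# Split on whitespace only where an uppercase letter follows. The lookahead
# does not consume the letter, so it stays at the start of the next piece.
parts <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]
print(parts)
```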

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuations. However, the extracted names have a space at the end. I was wondering if I can find the exact pattern of " (", a space and a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
See the regex demo
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
See the regex demo.
It matches
\p{Lu} - an uppercase letter
.*? - any 0+ chars other than line break chars, as few as possible, up to (but not including, since (?=...) is a non-consuming construct) the first position where the lookahead succeeds
(?=\\s*\\() - a positive lookahead that requires, immediately to the right of the current location:
\\s* - 0+ whitespace chars
\\( - a literal (.
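To see what the lazy match plus lookahead returns, here is a small base R run on two of the sample strings. The vector name managers is an assumption for this sketch; it reuses values from the example data above.

```r
managers <- c("D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)")

# Uppercase letter, then as few chars as possible, stopping right before
# optional whitespace and a literal "(" (the lookahead is not part of the match).
pattern <- "\\p{Lu}.*?(?=\\s*\\()"
found <- unlist(regmatches(managers, gregexpr(pattern, managers, perl = TRUE)))
print(found)
```

Because \\s* sits inside the lookahead, the space before "(" is never included, so no trailing space remains in the extracted names.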

Regex, R, and Commas

I'm having some trouble with a regex string in R. I'm trying to use regex to extract the tags from a string (scraped from the web) as follows:
str <- "\n\n\n \n\n\n “Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss\n\n\n\n\n \n tags:\n attributed-no-source,\n cry,\n crying,\n experience,\n happiness,\n joy,\n life,\n misattributed-dr-seuss,\n optimism,\n sadness,\n smile,\n smiling\n \n \n 176513 likes\n \n\n\n\n\nLike\n\n"
# Why doesn't this work at all?
stringr::str_match(str, "tags:(.+)\\d")
[,1] [,2]
[1,] NA NA
# Why just the first tag? What happens at the comma?
stringr::str_match(str, "tags:\n(.+)")
[,1] [,2]
[1,] "tags:\n attributed-no-source," " attributed-no-source,"
So two questions -- why doesn't my first idea work, and why doesn't the second capture through the end of the string, rather than just the first comma?
Thanks!
Note that stringr regex flavor is that of ICU. Unlike TRE, . does not match line breaks in ICU regex patterns.
So, a possible fix is to use (?s) - a DOTALL modifier that makes . match any char including line break chars - at the start of your patterns:
str_match(str, "(?s)tags:(.+)\\d")
and
str_match(str, "(?s)tags:\n(.+)")
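The effect of (?s) can be checked on a shortened string. Note the assumptions: this sketch uses base R with perl = TRUE (PCRE) rather than stringr, since . also excludes line breaks by default there, and the string s is a made-up, trimmed-down stand-in for the scraped text.

```r
# Hypothetical shortened input with the same shape as the scraped text.
s <- "tags:\n cry,\n joy\n 12 likes"

# Without (?s), . cannot cross the "\n" right after "tags:", so no match.
no_dotall <- grepl("tags:(.+)\\d", s, perl = TRUE)

# With (?s), . also matches line breaks, so .+ can reach the digits.
dotall <- grepl("(?s)tags:(.+)\\d", s, perl = TRUE)

print(c(no_dotall, dotall))
```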
However, I feel as if you need to get all the strings below tags: as separate matches. I suggest using a base R regmatches / gregexpr with a PCRE regex like
(?:\G(?!\A),?|tags:)\R\h*\K[^\s,]+
See the regex demo on your data.
(?:\G(?!\A),?|tags:) - either the end of the previous successful match followed by an optional , (\G(?!\A),?) or (|) the literal substring tags:
\R - a line break sequence
\h* - 0+ horizontal whitespaces
\K - a match reset operator discarding all the text matched so far
[^\s,]+ - 1 or more chars other than whitespace and ,
See the R demo:
str <- "\n\n\n \n\n\n “Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss\n\n\n\n\n \n tags:\n attributed-no-source,\n cry,\n crying,\n experience,\n happiness,\n joy,\n life,\n misattributed-dr-seuss,\n optimism,\n sadness,\n smile,\n smiling\n \n \n 176513 likes\n \n\n\n\n\nLike\n\n"
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+"
vals <- regmatches(str, gregexpr(reg, str, perl=TRUE))
unlist(vals)
Result:
[1] "attributed-no-source" "cry" "crying"
[4] "experience" "happiness" "joy"
[7] "life" "misattributed-dr-seuss" "optimism"
[10] "sadness" "smile" "smiling"

Problems in a regular expression to extract names using stringr

I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
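As a quick check of what this anchored pattern returns on the sample vector, here is a base R sketch using regexpr() and regmatches(). The variable name starts is made up for the example.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# One match per element at most; regmatches() drops elements with no match.
m <- regexpr("^[[:upper:]][^,]+", text)
starts <- regmatches(text, m)
print(starts)
```

On this data the pattern returns every capitalized prefix before the first comma, including "Senator" and "THIS IS A DOCUMENT", so it does not by itself single out town names.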
I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with the PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"

Resources