Regex, R, and Commas - r

I'm having some trouble with a regex string in R. I'm trying to use regex to extract the tags from a string (scraped from the web) as follows:
str <- "\n\n\n \n\n\n “Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss\n\n\n\n\n \n tags:\n attributed-no-source,\n cry,\n crying,\n experience,\n happiness,\n joy,\n life,\n misattributed-dr-seuss,\n optimism,\n sadness,\n smile,\n smiling\n \n \n 176513 likes\n \n\n\n\n\nLike\n\n"
# Why doesn't this work at all?
stringr::str_match(str, "tags:(.+)\\d")
[,1] [,2]
[1,] NA NA
# Why just the first tag? What happens at the comma?
stringr::str_match(str, "tags:\n(.+)")
[,1] [,2]
[1,] "tags:\n attributed-no-source," " attributed-no-source,"
So two questions -- why doesn't my first idea work, and why doesn't the second capture through the end of the string, rather than just the first comma?
Thanks!

Note that stringr regex flavor is that of ICU. Unlike TRE, . does not match line breaks in ICU regex patterns.
So, a possible fix is to use (?s) - a DOTALL modifier that makes . match any char including line break chars - at the start of your patterns:
str_match(str, "(?s)tags:(.+)\\d")
and
str_match(str, "(?s)tags:\n(.+)")
However, I feel as if you need to get all the strings below tags: as separate matches. I suggest using a base R regmatches / gregexpr with a PCRE regex like
(?:\G(?!\A),?|tags:)\R\h*\K[^\s,]+
See the regex demo on your data.
(?:\G(?!\A),?|tags:) - match the end of the previous successful match with 1 or 0 , after it (\G(?!\A),?) or (|) tags: substring
\R - a line break sequence
\h* - 0+ horizontal whitespaces
\K - a match reset operator discarding all the text matched so far
[^\s,]+ - 1 or more chars other than whitespace and ,
See the R demo:
str <- "\n\n\n \n\n\n “Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss\n\n\n\n\n \n tags:\n attributed-no-source,\n cry,\n crying,\n experience,\n happiness,\n joy,\n life,\n misattributed-dr-seuss,\n optimism,\n sadness,\n smile,\n smiling\n \n \n 176513 likes\n \n\n\n\n\nLike\n\n"
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+"
vals <- regmatches(str, gregexpr(reg, str, perl=TRUE))
unlist(vals)
Result:
[1] "attributed-no-source" "cry" "crying"
[4] "experience" "happiness" "joy"
[7] "life" "misattributed-dr-seuss" "optimism"
[10] "sadness" "smile" "smiling"

Related

R regex - extract words beginning with # symbol

I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite result that I was expecting and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a boundary and it occurs "Between two characters in the string, where one is a word character and the other is not a word character." see here. Since an space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, not a letter, digit or underscore). Check this regex debugger.
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
See the regex demo.
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
See this regex demo and another regex demo here.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, it seems like the issue may be that the # symbol does not correspond to a word character, and thus matching the empty string at the beginning of a word (\\b) does not work because there is no empty string when # is preceding the word.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")

Can't figure out why regex group is not working in str_match

I have the following code with a regex
CHARACTER <- ^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?
You formed a repeated capturing group by placing + outside a group. Put it back:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
^
Note you may trim Will if you use a lazy match with \s* after the group:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
See this regex demo.
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
Alternatively, you may just extract Will with
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R:
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is
\\s* - 0+ whitespaces
(?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
(?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
(?:\\(CONT'D\\))? - an optional (CONT'D) substring
$ - end of string.

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuations. However, the extracted names have a space at the end. I was wondering if I can find the exact pattern of " (", a space and a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
See the regex demo
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
See the regex demo.
It matches
\p{Lu} - an uppercase letter
.*? - any char other than line break chars, as few as possible, up to the first occurrence of (but not including into the match, as (?=...) is a non-consuming construct)....
(?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
\\s* - 0+ whitespace chars
\\( - a literal (.

R Strsplit keep delimiter in second element

I have been trying to solve this little issue for almost 2 hours, but without success. I simply want to separate a string by the delimiter: one space followed by any character. In the second element I want to keep the delimiter, whereas in the first element it shall not appear. Example:
x <- "123123 123 A123"
strsplit(x," [A-Z]")
results in:
"123123 123" "A123"
However, this does not keep the letter A in the second element.
I have tried using
strsplit(x,"(?<=[A-Z])",perl=T)
but this does not really work for my issue. It would also be okay, if there is a space in the second element, it just need the character in it.
If you want to follow your approach, you need to match 1+ whitespaces followed (i.e. you need a lookahead here) with a letter to consume the whitespaces:
> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"
See the PCRE regex demo.
Details:
\s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
(?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
You may also match up to the last non-whitespace char followed with 1+ whitespaces and use \K match reset operator to discard the match before the whitespace:
> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"
If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".
Details:
^ - start of string
.* - any 0+ chars up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
\\S - a non-whitespace
\\K - here, drop all the text matched so far
\\s+ - 1 or more whitespaces.
See another PCRE regex demo.
I would go with stringi package:
library(stringi)
x <- c("123123 123 A123","34512 321 B521")#some modified input data
l1<-stri_split(x,fixed=" ")
[1] "123123" "123" "A123"
Then:
lapply(seq_along(1:length(l1)), function(x) c(paste0(l1[[x]][1]," ",l1[[x]][2]),l1[[x]][3]))
[[1]]
[1] "123123 123" "A123"
[[2]]
[1] "34512 321" "B521"

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?
ie. if the following was in a single cell
(1) Number = '1111111111, 0000000000' Text =....
(2) Number = '0000000000' Text =....
it would result in:
(1) 1111111111, 0000000000
(2) 0000000000
I tried:
x1<-str_match(x,"(?<=Number'\\s\\=\\s\\')(\\d|\\s|\\,)\\d\\'")
but that doesn't work.
We can try with str_extract_all
library(stringr)
sapply(str_extract_all(x, "[0-9]+"), toString)
#[1] "1111111111, 0000000000" "0000000000"
You may use a PCRE regex to extract the numbers after Number=' from your input text:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*)\K\d+
See the regex demo.
Pattern details:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*) - either of the two alternatives:
Number\s*=\s*' - Number and a = enclosed with 0+ whitespaces
| - or
\G(?!\A)\s*,\s* - end of the previous successful match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*)
\K - omit the text matched so far
\d+ - 1+ digits (returned as a match)
See the R demo:
> x <- c("(1) Number = '1111111111, 0000000000' Text =....", "(2) Number = '0000000000' Text =....")
> regmatches(x, gregexpr("(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+", x, perl=TRUE))
[[1]]
[1] "1111111111" "0000000000"
[[2]]
[1] "0000000000"

Resources