regular expression to find exact matching containing a space and a punctuation - r

I am going through a dataset containing text values (names) that are formatted like these examples:
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in them, like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuation. However, the extracted names have a space at the end. I was wondering if I can match the exact pattern " (", that is, a space followed by an opening parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "

You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
It matches
\p{Lu} - an uppercase letter
.*? - any 0+ chars other than line break chars, as few as possible, up to (but not including) the first position where the lookahead below succeeds; (?=...) is a non-consuming construct, so the text it checks is not added to the match
(?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
\\s* - 0+ whitespace chars
\\( - a literal (.
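To make the effect of the non-consuming lookahead concrete, here is a minimal base R sketch. The consuming pattern "[A-Z][.][^(]+" is only an assumed variant that reproduces the trailing space described in the question; it is not necessarily the asker's exact pattern, and the sample strings are taken from the question.

```r
# Sample strings taken from the question
x <- c("M.Joan (13-2)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)")

# Assumed consuming variant: everything up to "(" is matched, space included
consuming <- regmatches(x, gregexpr("[A-Z][.][^(]+", x))

# Lookahead variant: the whitespace and "(" are checked but not matched
lookahead <- regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl = TRUE))

consuming[[2]]  # "D.Dan Fun " (note the trailing space)
lookahead[[2]]  # "D.Dan Fun"
lookahead[[3]]  # "T.Collins" "J.Maddon"

# Trimming afterwards is another way to drop the space
trimmed <- trimws(consuming[[2]])
```

If changing the pattern is not an option, trimws() on the extracted values should give the same result for this data.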

Related

R regex - extract words beginning with # symbol

I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a word boundary: it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "#" are both non-word characters, there is no boundary between them, so \b cannot match right before a "#" that follows a space.
With this revision you match either the start of the string or values that come after spaces.
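A quick way to check where the boundary sits is to test for it directly. This is just a sketch in base R with perl=TRUE; the third test string is made up for illustration, and \B is the negation of \b (a position that is not a word boundary).

```r
x <- c("h#i", "hi #me", "#me")

# \b needs a word char on exactly one side of the position before "#"
has_b <- grepl("\\b#", x, perl = TRUE)

# \B is the opposite: both sides are word chars, or neither is
# (the start of the string counts as "not a word char")
has_B <- grepl("\\B#", x, perl = TRUE)

data.frame(x, has_b, has_B)
```

Only "h#i" has a word boundary right before the "#", which is why the original pattern picked up "#i" and nothing else.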
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+; almost all shorthand character classes have negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, not a letter, digit or underscore).
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
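The two lookbehinds differ only when the "#" follows a non-word char that is not whitespace. A small sketch, where the "(#paren)" string is a made-up example:

```r
x <- c("hi #hello", "(#paren)", "h#i")

# (?<!\w): no word char to the left, so "(" before "#" is accepted
after_nonword <- regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl = TRUE))

# (?<!\S): only whitespace or start of string to the left, so "(" is rejected
after_space <- regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl = TRUE))

after_nonword[[2]]  # "#paren)"
after_space[[2]]    # character(0)
```

Which one fits probably depends on whether a handle directly after punctuation should count as a handle in your data.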
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, it seems like the issue may be that the # symbol does not correspond to a word character, and thus matching the empty string at the beginning of a word (\\b) does not work because there is no empty string when # is preceding the word.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")

Extract both occurrences of pattern Regex

I have an input vector as follows:
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)", "fdsfs thistoo (1,1,1,1)")
And I would like to use a regex to extract the following:
> output
[1] "iwantthis iwantthisaswell" "thistoo"
I have managed to extract every word that is before an opening bracket.
I tried this to get only the first word:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1", input)
[1] "iwantthis" "thistoo"
But I cannot get it to work for multiple occurrences:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1 \\2", input)
[1] "iwantthis iwantthisaswell" "fdsfs thistoo (1,1,1,1)"
The closest I have managed is the following:
library(stringr)
> str_extract_all(input, "(\\S*)\\s\\(")
[[1]]
[1] "iwantthis (" "iwantthisaswell ("
[[2]]
[1] "thistoo ("
I am sure I am missing something in my regex (not that good at it) but what?
You may use
> sapply(str_extract_all(input, "\\S+(?=\\s*\\()"), paste, collapse=" ")
[1] "iwantthis iwantthisaswell" "thistoo"
The \\S+(?=\\s*\\() pattern extracts every chunk of 1+ non-whitespace chars that is followed by 0+ whitespace chars and a ( char. sapply with paste then joins the matches found in each string with a space (collapse=" ").
Pattern details
\S+ - 1 or more non-whitespace chars
(?=\s*\() - a positive lookahead ((?=...)) that requires the presence of 0+ whitespace chars (\s*) and then a ( char (\() immediately to the right of the current position.
Here is an option using base R
unlist(regmatches(input, gregexpr("\\w+(?= \\()", input, perl = TRUE)))
#[1] "iwantthis" "iwantthisaswell" "thistoo"
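The two lookahead patterns above are not interchangeable when the wanted token contains punctuation. Here is a rough base R comparison; the helper name extract and the second input string are made up for illustration.

```r
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
           "abc want-this (1,2)")  # made-up edge case with a hyphen

# Hypothetical helper: extract all matches and join them per string
extract <- function(pattern) {
  m <- regmatches(input, gregexpr(pattern, input, perl = TRUE))
  sapply(m, paste, collapse = " ")
}

non_space <- extract("\\S+(?=\\s*\\()")  # any non-whitespace run
word_only <- extract("\\w+(?= \\()")     # letters, digits, underscore only

non_space  # "iwantthis iwantthisaswell" "want-this"
word_only  # "iwantthis iwantthisaswell" "this"
```

\w+ stops at the hyphen, so it keeps only the part after it, while \S+ keeps the whole token.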
This works in R:
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', input, perl=TRUE)
Result:
[1] "iwantthis iwantthisaswell" "thistoo"
UPDATED to work for the general case. E.g. now finds "i_wantthisaswell2" by searching on non-spaces between the other matches.
Using other suggested general case inputs:
general_cases <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
"fdsfs thistoo (1,1,1,1) ",
"GaGa iwant_this (1,1,1,1)",
"lal2!##$%^&*()_+a i_wantthisaswell2 (2,3,4,5)")
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', general_cases, perl=TRUE)
results:
[1] "iwantthis iwantthisaswell" "thistoo "
[3] "iwant_this" "i_wantthisaswell2"

R / stringr: split string, but keep the delimiters in the output

I tried to search for the solution, but it appears that there is no clear one for R.
I am trying to split a string on a pattern of, say, a space followed by a capital letter, and I am using the stringr package for that.
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")
Normally I would get:
[[1]]
[1] "Foobar foobar," "oobar foobar"
The output I would like to get, however, should include the letter from the delimiter:
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Probably there is no out of box solution in stringr like back-referencing, so I would be happy to get any help.
You may split with 1+ whitespaces that are followed with an uppercase letter:
> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Here,
\\s+ - 1 or more whitespaces
(?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current location in string without adding it to the match value, thus, preserving it in the output.
Note that \s matches various whitespace chars, not just plain regular spaces. Also, it is safer to use [[:upper:]] rather than [:upper:] - if you plan to use the patterns with other regex engines (like PCRE, for example).
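The same idea carries over to base R if stringr is not available. A minimal sketch with a made-up string that contains several split points and a double space:

```r
x <- "Alpha beta  Gamma delta Epsilon"

# Consuming the uppercase letter removes it from the pieces
consumed <- strsplit(x, "\\s+[[:upper:]]", perl = TRUE)[[1]]

# The lookahead only checks for the letter, so it stays in the output
kept <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]

consumed  # "Alpha beta" "amma delta" "psilon"
kept      # "Alpha beta" "Gamma delta" "Epsilon"
```

Because \s+ is used rather than a single space, the double space before "Gamma" is removed as one delimiter.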
We could use regex lookarounds to split at the space between a comma and an uppercase character
str_split(x, "(?<=,) (?=[A-Z])")[[1]]
#[1] "Foobar foobar," "Foobar foobar"

R Strsplit keep delimiter in second element

I have been trying to solve this little issue for almost 2 hours, but without success. I simply want to separate a string by the delimiter: one space followed by any character. In the second element I want to keep the delimiter, whereas in the first element it shall not appear. Example:
x <- "123123 123 A123"
strsplit(x," [A-Z]")
results in:
"123123 123" "A123"
However, this does not keep the letter A in the second element.
I have tried using
strsplit(x,"(?<=[A-Z])",perl=T)
but this does not really work for my issue. It would also be okay if there is a space in the second element; it just needs to have the character in it.
If you want to follow your approach, match (and thus consume) 1+ whitespaces, and use a lookahead to require an uppercase letter after them without consuming it:
> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"
Details:
\s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
(?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
You may also match up to the last non-whitespace char followed with 1+ whitespaces and use \K match reset operator to discard the match before the whitespace:
> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"
If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".
Details:
^ - start of string
.* - any 0+ chars up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
\\S - a non-whitespace
\\K - here, drop all the text matched so far
\\s+ - 1 or more whitespaces.
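To see what \K actually leaves in the match, it may help to replace the match with a visible marker instead of splitting on it. A small sketch using sub with perl=TRUE, with the "|" marker chosen arbitrarily:

```r
x <- c("123123 123 A123", "34512 321 B521")

# Without \K the whole prefix is part of the match and gets replaced
without_k <- sub("^.*\\S\\s+", "|", x, perl = TRUE)

# With \K only the last whitespace run is left in the match
with_k <- sub("^.*\\S\\K\\s+", "|", x, perl = TRUE)

without_k  # "|A123" "|B521"
with_k     # "123123 123|A123" "34512 321|B521"
```

Since .* is greedy, the marker lands on the last whitespace run, which is the only place the split happens.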
I would go with stringi package:
library(stringi)
x <- c("123123 123 A123", "34512 321 B521")  # slightly modified input data
l1 <- stri_split(x, fixed = " ")  # one character vector of pieces per input string
l1[[1]]
[1] "123123" "123" "A123"
Then:
# re-join the first two pieces and keep the third as is
lapply(l1, function(parts) c(paste(parts[1], parts[2]), parts[3]))
[[1]]
[1] "123123 123" "A123"
[[2]]
[1] "34512 321" "B521"

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?
For example, if the following were in a single cell:
(1) Number = '1111111111, 0000000000' Text =....
(2) Number = '0000000000' Text =....
it would result in:
(1) 1111111111, 0000000000
(2) 0000000000
I tried:
x1<-str_match(x,"(?<=Number'\\s\\=\\s\\')(\\d|\\s|\\,)\\d\\'")
but that doesn't work.
We can try with str_extract_all
library(stringr)
sapply(str_extract_all(x, "[0-9]+"), toString)
#[1] "1111111111, 0000000000" "0000000000"
You may use a PCRE regex to extract the numbers after Number=' from your input text:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*)\K\d+
Pattern details:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*) - either of the two alternatives:
Number\s*=\s*' - Number, then a = enclosed with 0+ whitespaces, then a ' char
| - or
\G(?!\A)\s*,\s* - end of the previous successful match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*)
\K - omit the text matched so far
\d+ - 1+ digits (returned as a match)
See the R demo:
> x <- c("(1) Number = '1111111111, 0000000000' Text =....", "(2) Number = '0000000000' Text =....")
> regmatches(x, gregexpr("(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+", x, perl=TRUE))
[[1]]
[1] "1111111111" "0000000000"
[[2]]
[1] "0000000000"
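To get from this list to the comma-separated form shown in the question, the matches can be collapsed per string. The sketch below also shows why a plain "[0-9]+" may pick up too much, assuming the "(1)" and "(2)" labels really are part of the cell text:

```r
x <- c("(1) Number = '1111111111, 0000000000' Text =....",
       "(2) Number = '0000000000' Text =....")

# Digits anywhere in the string, including the "(1)" label
all_digits <- regmatches(x, gregexpr("[0-9]+", x))

# Only digit runs after Number = ' or after a comma that continues the list
pattern <- "(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+"
nums <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# One comma-separated string per input element
joined <- sapply(nums, toString)

all_digits[[1]]  # "1" "1111111111" "0000000000"
joined           # "1111111111, 0000000000" "0000000000"
```

If the labels are not part of the data, the simpler "[0-9]+" extraction should be enough.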
