Common sense and a sanity-check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in testString:
testString <- "text XX text"
BB <- "(?<= XX )"
FF <- "(?= XX )"
as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]])
# [1] 5
strsplit(), however, uses those match locations differently, splitting testString at one location when using the lookbehind assertion, but at two locations -- the second of which seems incorrect -- when using the lookahead assertion.
strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"
strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text" " " "XX text"
I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is ‘""’, but
if there is a match at the end of the string, the output is the
same as with the match removed.
The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:
FF <- "(?=funky)"
testString <- "take me to funky town"
gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split includes the string up to position 11 (everything to the left of the match), and that part is removed from the string together with the match, which, however, has zero length.
Now the remaining string is funky town, and the lookahead matches at position 1. However, there is nothing to remove, because there is nothing to the left of the match and the match itself has zero length, so the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when splitting with an empty regex (argument split=""). After this the remaining string is unky town, which is returned as the last piece since there is no further match.
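To make the documented loop concrete, here is a rough R sketch of it, including the single-character step described above for a zero-length match at position 1. The function name split_once() and its details are mine, not taken from the R sources:
# Rough sketch of the documented strsplit() loop for a single string.
split_once <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    len <- attr(m, "match.length")
    if (m == -1) {                        # no match: emit the rest and stop
      out <- c(out, x)
      break
    }
    if (m == 1 && len == 0) {             # zero-length match at the start:
      out <- c(out, substr(x, 1, 1))      # split off a single character
      x <- substr(x, 2, nchar(x))
    } else {
      out <- c(out, substr(x, 1, m - 1))  # keep the text left of the match
      x <- substr(x, m + len, nchar(x))   # drop the match and all to its left
    }
  }
  out
}
split_once("take me to funky town", "(?=funky)")
# [1] "take me to " "f"           "unky town"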
Lookbehinds are no problem, because each match is split and removed from the remaining string, so the algorithm is never stuck.
Admittedly this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption that lookaheads have zero length. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.
Based on Theodore Lytras' careful explication of strsplit()'s behavior, a reasonably clean workaround is to prefix the to-be-matched lookahead assertion with a positive lookbehind assertion that matches any single character:
testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"
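Applied to the question's original testString, the same lookbehind prefix should give a single split; note that the split lands just before the space, so the delimiter stays attached to the second piece:
testString <- "text XX text"
strsplit(testString, "(?<=.)(?= XX )", perl=TRUE)
# [[1]]
# [1] "text"     " XX text"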
Looks like a bug to me. This doesn't appear to be related to spaces specifically, but rather to any lone lookahead (positive or negative):
FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f" "unky take me to " "f" "unky "
# [5] "f" "unky town"
FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx" "y" "xxxxxxx"
It seems to work fine if the pattern consumes at least one character along with the zero-width assertion, such as:
FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
Perhaps something like that might function as a workaround.
How do I find the index or position of a word in a given string? The code below gives the starting character position of the word and its length. After finding the position of the word, I want to extract the preceding and succeeding words in my project.
library(stringr)
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique")
word_pos <- regexpr('termination', Output_text)
Output:
[1] 45
attr(,"match.length")
[1] 11
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
45 is the starting character position of "termination" (it counts each and every character).
11 is the length of the match.
Here, however, "termination" is the 7th word. How can I find that word position using R?
Appreciate your help.
Here it is:
library(stringr)
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique")
words <- unlist(str_split(Output_text, " "))
which(words == "termination")
[1] 7
Edit:
For multiple occurrences of the word in text and generating next and previous keywords:
# Adding a few random "termination" words to the string:
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was termination somewhat unique termination")
words <- unlist(str_split(Output_text, " "))
t1 <- which(words == "termination")
next_keyword <- words[t1+1]
previous_keywords <- words[t1-1]
> next_keyword
[1] "disputes" "somewhat" NA
> previous_keywords
[1] "contract" "was" "unique"
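If you need this repeatedly, the same idea can be wrapped in a small helper; this is only a sketch, and the function name neighbors() is mine, not from any package:
library(stringr)
# Positions of `word` among the whitespace-split tokens of `text`,
# together with the previous and next words (NA at the edges).
neighbors <- function(text, word) {
  words <- unlist(str_split(text, " "))
  pos <- which(words == word)
  data.frame(position = pos,
             previous = c(NA, words)[pos],      # NA if the word is the first token
             nxt      = c(words[-1], NA)[pos])  # NA if the word is the last token
}
neighbors("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique", "termination")
#   position previous      nxt
# 1        7 contract disputes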
You can do this, without worrying about character indices, using regular expressions and no external package.
# replace whole string by the words preceding and following 'termination'
(words <- sub("[\\S\\s]+ (\\S+) termination (\\S+) [\\S\\s]+", "\\1 \\2", Output_text, perl = T))
# [1] "contract disputes"
# Split the resulting string into two individual strings
(words <- unlist(strsplit(words, " ")))
# [1] "contract" "disputes"
The easiest way is to match termination together with the surrounding words using str_extract, and then remove "termination " with str_remove.
str_remove(str_extract(Output_text,"\\w+ termination \\w+"),"termination ")
[1] "contract disputes"
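If "termination" can occur more than once, the same idea extends with str_extract_all; this is a sketch reusing the extended example string from the answer above, and note that the final standalone "termination" is skipped because it has no following word:
library(stringr)
Output_text <- "applicable to any future potential contract termination disputes as the tepco dispute was termination somewhat unique termination"
str_remove(unlist(str_extract_all(Output_text, "\\w+ termination \\w+")), "termination ")
# [1] "contract disputes" "was somewhat"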
The problem
What the header says, basically. Given a string, I need to extract from it everything that is not a leading number followed by a space. So, given this string
"420 species of grass"
I would like to get
"species of grass"
But, given a string with a number not in the beginning, like so
"The clock says it is 420"
or a string with a number not followed by a space, like so
"It is 420 already"
I would like to get back the same string, with the number preserved
"The clock says it is 420"
"It is 420 already"
What I have tried
Matching a leading number followed by a space works as expected:
library(stringr)
str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)
But, when I try to match anything but a leading number followed by a space, it doesn't:
> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of" "grass"
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The" "clock" "says" "it" "is"
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It" "is" "already"
It seems this regex matches anything but digits AND spaces instead.
How do I fix this?
I think @Douglas's answer is more concise; however, I guess your actual case may be more complicated, and you may want to check ?regexpr, which can identify the starting position of your specific pattern.
A method using a for loop is below:
list <- list("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")

extract <- function(x) {
  y <- vector('list', length(x))
  for (i in seq_along(x)) {
    if (regexpr("420", x[[i]])[[1]] > 1) {
      # "420" starts after position 1: keep the string unchanged
      y[[i]] <- x[[i]]
    } else {
      # "420" starts the string: drop everything up to the first space
      y[[i]] <- substr(x[[i]], regexpr(" ", x[[i]])[[1]] + 1, nchar(x[[i]]))
    }
  }
  return(y)
}
> extract(list)
[[1]]
[1] "species of grass"
[[2]]
[1] "The clock says it is 420"
[[3]]
[1] "It is 420 already"
I think the easiest way to do this is by removing the numbers instead of extracting the desired pattern:
library(stringr)
strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")
[1] "species of grass" "The clock says it is 420" "It is 420 already"
An easy way out is to replace any digits followed by spaces that occur right at the start of the string, using this regex,
^\d+\s+
with an empty string.
Sample R code using sub:
sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
An alternative way to achieve the same using matching: you can use the following regex and capture the contents of group 1:
^(?:\d+\s+)?(.*)$
Also, anything you place inside a character class loses its special meaning, so the positive lookahead inside [^(^\\d+(?=\\s))]+ is treated as a set of literal characters, which is why your regex does not behave as intended.
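You can check what that character class actually matches: it is simply "one or more characters that are none of ( ) ^ + ? =, a digit, or whitespace". The sample string below is just for illustration:
regmatches("a+b(c)d3 e", gregexpr("[^(^\\d+(?=\\s))]+", "a+b(c)d3 e"))
# [[1]]
# [1] "a" "b" "c" "d" "e"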
Edit:
Although the solution using sub is better, in case you want a match-based solution in R, you need to use str_match instead of str_extract_all, and to access the contents of group 1 you need [,2]:
library(stringr)
print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite result that I was expecting and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a boundary, and it occurs "between two characters in the string, where one is a word character and the other is not a word character." Since a space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
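A quick way to confirm this is to search for \b# directly with gregexpr; it finds a boundary before the # in "h#i" but not in "hi #hello":
as.vector(gregexpr("\\b#", "h#i", perl=TRUE)[[1]])
# [1] 2
as.vector(gregexpr("\\b#", "hi #hello", perl=TRUE)[[1]])
# [1] -1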
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char: not a letter, digit or underscore).
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
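For completeness, here is a sketch combining a left and a right boundary. The assumption that a handle consists of word characters (\w+) is mine and may need adjusting for real data:
x <- c("h#i", "hi #hello #me", "#tag,")
regmatches(x, gregexpr("(?<!\\S)#\\w+(?!\\w)", x, perl=TRUE))
# [[1]]
# character(0)
#
# [[2]]
# [1] "#hello" "#me"
#
# [[3]]
# [1] "#tag"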
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, it seems like the issue may be that the # symbol is not a word character, and thus matching the empty string at the beginning of a word (\\b) does not work, because there is no word boundary where # precedes the word.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")
I'm trying to remove the "es" at the end of a string
> data <- c("phrases", "phases", "princesses","class","pass")
> data1 <- gsub("(\\w+)(s)+?es\\b", "\\1\\2", data, perl=TRUE)
> gsub("(\\w+)s\\b", "\\1", data1, perl=TRUE)
[1] "phra" "pha" "princes" "clas" "pas"
I get this result, but in reality what I need to obtain is:
[1] "phras" "phas" "princess" "clas" "pas"
You can use a word boundary (\\b) if it is guaranteed that each word is followed by a punctuation or is at the end of the string:
data <- c("phrases, phases, princesses, bases")
gsub('es\\b', '', data)
# [1] "phras, phas, princess, bas"
With your method, just wrap everything till the second + with one set of parentheses:
gsub("(\\w+s+)es\\b", "\\1", data)
# [1] "phras, phas, princess, bas"
There is also no need to make + lazy with ?, since you are trying to match as many consecutive s's as possible.
Edit:
OP changed the data and the desired output. Below is a simple solution that removes either es or s at the end of each string:
data <- c("phrases", "phases", "princesses","class","pass")
gsub('(es|s)\\b', '', data)
# [1] "phras" "phas" "princess" "clas" "pas"
Maybe you are looking for a lookbehind assertion (which is a zero-length match):
"(?<=s)es\\b"
or, since a lookbehind can't have variable length, the Perl \K construct, which keeps everything to the left of \K out of the match:
"\\ws\\Kes\\b"
After researching for a while, I didn't find exactly what I was looking for. What I'd like to do is keep only an exact pattern in a string.
So this is my example:
text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS")
How do I get exactly "THIS" in all strings, so that the result is:
res=c("THIS","THIS","THIS","","")
I tried gsub in R, but I don't know how to keep only the matched characters.
For example I tried:
gsub("(THIS).*", "\\1", text) # This deletes everything after "THIS".
gsub(".*(THIS)", "\\1", text) # This deletes everything before "THIS".
To extract THIS or THAT as whole words, you may use the following regex:
\b(THIS|THAT)\b
where \b is a word boundary and (...|...) is a capturing group with | alternation operator (that can appear more than once, more alternatives can be added).
Since regmatches with gregexpr returns a list of vectors with zero-length entries whenever no match is found, you need to convert the empty entries to NA first, then replace the NAs with "", and then unlist.
Here is some base R code:
> text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS", "THAT is something I need, too")
> matches <- regmatches(text, gregexpr("\\b(THIS|THAT)\\b", text))
> res <- lapply(matches, function(x) if (length(x) == 0) NA else x)
> res[is.na(res)] <- ""
> unlist(res)
[1] "THIS" "THIS" "THIS" "" "" "THAT"
We can use str_extract
library(stringr)
str_extract(text, "THIS")
#[1] "THIS" "THIS" "THIS" NA     NA
It is better to have NA rather than ""
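If you do prefer "" over NA (to match the expected res exactly), it is a one-line conversion afterwards:
res <- str_extract(text, "THIS")
res[is.na(res)] <- ""
res
# [1] "THIS" "THIS" "THIS" ""     ""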
This first deletes elements which don't match THIS, and then follows your original idea, storing the intermediate result in a variable. It seems that you want empty strings for elements that do not match, and the last line does that.
tmp <- text[grepl("THIS", text)]
gsub("(THIS).*", "\\1", tmp) -> tmp
gsub(".*(THIS)", "\\1", tmp) -> tmp
c(tmp, rep("", length(text) - length(tmp)))
gsub("[^THIS]", "", text) seems to do the trick? "[^THIS]" matches everything except for THIS, and gsub replaces those matches with the empty string given as the second parameter. (As noted in the comments, this doesn't work as expected: [^THIS] is a character class, so it matches any single character other than T, H, I or S, not anything other than the word THIS.)
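A quick check on the question's text vector makes the problem visible; the last two elements keep stray letters instead of becoming "":
gsub("[^THIS]", "", text)
# [1] "THIS"  "THIS"  "THIS"  "T"     "THHIS"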