Regex to find words from a list when specific words do not appear within the 3 words before

I want to find all matches of specific words from a list, but only when certain other words do not appear within the 3 words before them.
For example:
Find every occurrence of the words "good|best|better" in the text, provided none of the words "no|not|none" appears within the 3 preceding words.
I tried something like this:
(?<!\sno|\snot(\s|\s\w\s|\s\w\s\w\s))(\bgood\b|\bbest\b|\bbetter\b)
But it's not working.

You may be able to use this PCRE regex in R with perl=TRUE option:
\b(?:not?|none)(?:\s+\S+){0,2}\s+(good|best|better)\b(*SKIP)(*F)|\b(?:good|best|better)\b
In your R code use:
gregexpr("\\b(?:not?|none)(?:\\s+\\S+){0,2}\\s+(good|best|better)\\b(*SKIP)(*F)|\\b(?:good|best|better)\\b", mystr, perl=TRUE)
In PCRE, the verbs (*SKIP)(*F) fail the current match and skip past the text consumed so far, so the unwanted context is never reported as a match.
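Python's built-in re module does not support backtracking control verbs such as (*SKIP)(*F), so the following is only a rough sketch of the same idea with a different technique: the unwanted context is matched first by an alternative without a capture group, and only matches that fill the capture group are kept. The sample text and variable names are assumptions made for illustration.

```python
import re

# Alternative 1 (no capture): a negation, up to two words in between, then a target word.
# Alternative 2 (captured): a target word on its own.
pattern = re.compile(
    r"\b(?:not?|none)(?:\s+\S+){0,2}\s+(?:good|best|better)\b"
    r"|\b(good|best|better)\b"
)

# Hypothetical sample text, not taken from the original question.
text = "good food, not really good, the best day, none of the better ones"

# Keep only matches where the capture group took part, i.e. no negation in range.
hits = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(hits)
```

The negated occurrences are still consumed by the first alternative, which is what stops the second alternative from matching inside them.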

If we only wanted to reject lines containing no and its variants, we could start with a simple expression such as:
^(?!.*no).*times.*$
Then we would add word boundaries where necessary and expand that to:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b).*times.*$
and finally we would add our desired words using:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b)(?=.*\bgood\b|.*\bbest\b|.*\bbetter\b).*times.*$
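As a quick check of what the final pattern does, here is a small Python sketch; the sample lines are made up. Because every lookahead starts with .*, the pattern accepts or rejects a whole line, so it behaves as a line filter and does not look at a three-word window.

```python
import re

line_filter = re.compile(
    r"^(?!.*\bno\b|.*\bnot\b|.*\bnone\b)"
    r"(?=.*\bgood\b|.*\bbest\b|.*\bbetter\b)"
    r".*times.*$"
)

# Hypothetical sample lines.
samples = [
    "the best times of my life",      # target word, no negation, contains "times"
    "not the best times of my life",  # a negation anywhere in the line rejects it
    "these were the best days",       # no literal "times", so no match
]

accepted = [s for s in samples if line_filter.search(s)]
print(accepted)
```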

Related

Number after captured group in regex

I want to write a simple RegEx to add leading zeros to my R code. Simplest way is to find (\s)\.(\d) and replace it with \10.\2. But it doesn't work in R as it apparently thinks it's 10th captured group rather than 1st followed by a literal 0. According to this question RStudio uses PCRE but no method for PCRE (or any other engine) from those described here works in RStudio find & replace feature. Is it possible to put a number after a captured group without leaving RStudio?

As a work-around, you can use lookarounds here:
Search for: (?<=\s)\.(?=\d)
Replace with: 0.
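The same lookaround substitution can be tried outside RStudio. The sketch below uses Python's re module with a made-up input string. It also shows \g<1>, which is Python's own way of writing a group reference followed by a literal digit; it is not claimed to work in RStudio's find and replace.

```python
import re

source = "x <- .5 + .25"  # hypothetical input

# Lookarounds: only the dot is consumed, so the replacement needs no group references.
with_lookarounds = re.sub(r"(?<=\s)\.(?=\d)", "0.", source)

# Capturing groups: \g<1> keeps "group 1" separate from the literal 0 that follows.
with_groups = re.sub(r"(\s)\.(\d)", r"\g<1>0.\2", source)

print(with_lookarounds)
print(with_groups)
```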

Extract up to two more digits

This may be a very simple question but I have not much experience with regex expressions. This page is a good source of regex expressions but could not figure out how to include them into my following code:
data %>% filter(grepl("^A01H1", icl))
Question
I would like to extract the values in one column of my data frame starting with this A01H1 up to 2 more digits, for example A01H100, A01H140, A01H110. I could not find a solution despite my few attempts:
Attempts
I looked at this question, from which I used ^A01H1[0-9].{2} to select up to two more digits.
I tried with adding any character ^A01H1[0-9][0-9][x-y] to stop after two digits.
Any help would be much appreciated :)

You can use "^A01H1\\d{1,2}$".
The first part ("^A01H1"), you figured out yourself, so what are we doing in the second part ("\\d{1,2}$")?
\d matches any digit and is equivalent to [0-9]. Since we are working in R, the backslash has to be escaped, so we write \\d.
{1,2} indicates we want 1 or 2 matches of \\d.
$ specifies the end of the string, so nothing may come afterwards, which prevents matching more than 2 digits.

It looks as if you want to match a part of a string that starts with A01H1, then contains 1 or 2 digits and then is not followed with any digit.
You may use
^A01H1\d{1,2}(?!\d)
See the regex demo. If there can be no text after two digits at all, replace (?!\d) with $.
Details
^ - start of string
A01H1 - literal string
\d{1,2} - one to two digits
(?!\d) - no digit allowed immediately to the right
$ - end of string
In R, you could use it like
grepl("^A01H1\\d{1,2}(?!\\d)", icl, perl=TRUE)
Or, with the string end anchor,
grepl("^A01H1\\d{1,2}$", icl)
Note the perl=TRUE is only necessary when using PCRE specific syntax like (?!\d), a negative lookahead.
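To see how the two endings differ, here is a small sketch in Python (whose re module supports the lookahead without any extra option). The list of codes is invented for illustration.

```python
import re

# Hypothetical values for the icl column.
codes = ["A01H100", "A01H14", "A01H1", "A01H1000", "A01H140x", "B01H100"]

# End anchor: nothing at all may follow the one or two digits.
end_anchored = [c for c in codes if re.search(r"^A01H1\d{1,2}$", c)]

# Negative lookahead: only another digit is forbidden after them.
no_digit_after = [c for c in codes if re.search(r"^A01H1\d{1,2}(?!\d)", c)]

print(end_anchored)
print(no_digit_after)
```

The only value that separates the two variants here is "A01H140x", which has non-digit text after the two digits.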

sub command to extract data and split data frame column [duplicate]

Simple regex question. I have a string on the following format:
this is a [sample] string with [some] special words. [another one]
What is the regular expression to extract the words within the square brackets, ie.
sample
some
another one
Note: In my use case, brackets cannot be nested.

You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
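A minimal sketch of using this pattern globally, written in Python as an assumption about the target language (the question does not name one). re.findall returns the contents of the capturing group for every match.

```python
import re

text = "this is a [sample] string with [some] special words. [another one]"

# The lazy .*? stops at the first closing bracket after each opening one.
words = re.findall(r"\[(.*?)\]", text)
print(words)
```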

(?<=\[).+?(?=\])
Will capture content without brackets
(?<=\[) - positive lookbehind for [
.+? - non-greedy match for the content
(?=\]) - positive lookahead for ]
EDIT: for nested brackets the below regex should work:
(\[(?:\[??[^\[]*?\]))

This should work out ok:
\[([^]]+)\]

Can brackets be nested?
If not: \[([^]]+)\] matches one item, including the square brackets. Backreference \1 will contain the matched item. If your regex flavor supports lookaround, use
(?<=\[)[^]]+(?=\])
This will only match the item inside brackets.

To match a substring between the first [ and last ], you may use
\[.*\] # Including open/close brackets
\[(.*)\] # Excluding open/close brackets (using a capturing group)
(?<=\[).*(?=\]) # Excluding open/close brackets (using lookarounds)
Use the following expressions to match strings between the closest square brackets:
Including the brackets:
\[[^][]*] - PCRE, Python re/regex, .NET, Golang, POSIX (grep, sed, bash)
\[[^\][]*] - ECMAScript (JavaScript, C++ std::regex, VBA RegExp)
\[[^\]\[]*] - Java, ICU regex
\[[^\]\[]*\] - Onigmo (Ruby, requires escaping of brackets everywhere)
Excluding the brackets:
(?<=\[)[^][]*(?=]) - PCRE, Python re/regex, .NET (C#, etc.), JGSoft Software
\[([^][]*)] - Bash, Golang - capture the contents between the square brackets with a pair of unescaped parentheses, also see below
\[([^\][]*)] - JavaScript, C++ std::regex, VBA RegExp
(?<=\[)[^\]\[]*(?=]) - Java regex, ICU (R stringr)
(?<=\[)[^\]\[]*(?=\]) - Onigmo (Ruby, requires escaping of brackets everywhere)
NOTE: * matches 0 or more characters, use + to match 1 or more to avoid empty string matches in the resulting list/array.
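The difference between the first-to-last patterns, the closest-bracket patterns, and the * versus + note can be checked with a short Python sketch. The sample string is made up, and the character class is written with both brackets escaped, a form that Python's re module also accepts.

```python
import re

text = "a [b] c [] d [e f]"  # hypothetical input with one empty pair

# Greedy .* runs from the first [ to the last ].
first_to_last = re.findall(r"\[.*\]", text)

# A negated class cannot cross a bracket, so each pair is matched separately.
closest_star = re.findall(r"\[[^\]\[]*\]", text)

# With + the empty pair [] is no longer matched.
closest_plus = re.findall(r"\[[^\]\[]+\]", text)

print(first_to_last)
print(closest_star)
print(closest_plus)
```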
Whenever lookaround support is available, the above solutions rely on it to exclude the leading/trailing open/close bracket. Otherwise, rely on capturing groups (links to most common solutions in some languages have been provided).
If you need to match nested parentheses, you may see the solutions in the Regular expression to match balanced parentheses thread and replace the round brackets with the square ones to get the necessary functionality. You should use capturing groups to access the contents with open/close bracket excluded:
\[((?:[^][]++|(?R))*)] - PHP PCRE
\[((?>[^][]+|(?<o>)\[|(?<-o>]))*)] - .NET demo
\[(?:[^\]\[]++|(\g<0>))*\] - Onigmo (Ruby) demo

If you do not want to include the brackets in the match, here's the regex: (?<=\[).*?(?=\])
Let's break it down
The . matches any character except for line terminators. The ?= is a positive lookahead. A positive lookahead finds a string when a certain string comes after it. The ?<= is a positive lookbehind. A positive lookbehind finds a string when a certain string precedes it. To quote this,
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
The Alternative
If your regex engine does not support lookaheads and lookbehinds, then you can use the regex \[(.*?)\] to capture the innards of the brackets in a group and then you can manipulate the group as necessary.
How does this regex work?
The parentheses capture the characters in a group. The .*? gets all of the characters between the brackets (except for line terminators, unless you have the s flag enabled) in a way that is not greedy.
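The two approaches can be compared side by side. This is a sketch in Python, chosen here only for illustration: with lookarounds the whole match already excludes the brackets, while with the capturing group the whole match includes them and the contents sit in group 1.

```python
import re

text = "this is a [sample] string with [some] special words. [another one]"

look = re.search(r"(?<=\[).*?(?=\])", text)
group = re.search(r"\[(.*?)\]", text)

print(look.group(0))   # whole match, brackets excluded by the lookarounds
print(group.group(0))  # whole match, brackets included
print(group.group(1))  # captured contents only

# Applied globally, both give the same list of contents.
all_look = re.findall(r"(?<=\[).*?(?=\])", text)
all_group = re.findall(r"\[(.*?)\]", text)
```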

In case your input might contain nested or unbalanced brackets, you can likely design an expression with recursion, similar to
\[(([^\]\[]+)|(?R))*+\]
which, of course, depends on the language or regex engine you are using.
Other than that,
\[([^\]\[\r\n]*)\]
or,
(?<=\[)[^\]\[\r\n]*(?=\])
are good options to explore.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Test
const regex = /\[([^\]\[\r\n]*)\]/gm;
const str = `This is a [sample] string with [some] special words. [another one]
This is a [sample string with [some special words. [another one
This is a [sample[sample]] string with [[some][some]] special words. [[another one]]`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Source
Regular expression to match balanced parentheses

(?<=\[).*?(?=\]) works well, as explained above. Here's a Python example that applies the same lookaround idea to parentheses, since the sample string contains no square brackets:
import re
text = "Pagination.go('formPagination_bottom',2,'Page',true,'1',null,'2013')"
re.search(r'(?<=\().*?(?=\))', text).group()
"'formPagination_bottom',2,'Page',true,'1',null,'2013'"

@Tim Pietzcker's answer here
(?<=\[)[^]]+(?=\])
is almost what I was looking for. The one issue is that some legacy browsers do not support positive lookbehind.
So I had to work out my own solution, and managed to write this:
/([^[]+(?=]))/g
Maybe it will help someone.
console.log("this is a [sample] string with [some] special words. [another one]".match(/([^[]+(?=]))/g));

If you want to match only lowercase letters (a-z) between square brackets:
(\[[a-z]*\])
If you want lowercase and uppercase letters (a-zA-Z):
(\[[a-zA-Z]*\])
If you want lowercase letters, uppercase letters and digits (a-zA-Z0-9):
(\[[a-zA-Z0-9]*\])
If you want everything between square brackets, including text, numbers and symbols:
(\[.*\])

This code will extract the content between square brackets and parentheses
(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))
(?: non capturing group
(?<=\().+?(?=\)) positive lookbehind and lookahead to extract the text between parentheses
| or
(?<=\[).+?(?=\]) positive lookbehind and lookahead to extract the text between square brackets
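A short sketch of this pattern in Python, with an invented sample string. Each alternative only succeeds directly after its own opening delimiter, so one call collects the contents of both kinds of brackets.

```python
import re

pattern = re.compile(r"(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))")

text = "call(arg) and [item]"  # hypothetical input

contents = pattern.findall(text)
print(contents)
```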

In R, try (using the stringr package):
library(stringr)
x <- 'foo[bar]baz'
str_replace(x, ".*?\\[(.*?)\\].*", "\\1")
[1] "bar"

([[][a-z \s]+[]])
The above should work, given the following explanation:
Characters within square brackets [] define a character class, which matches any one of the characters listed inside the brackets.
\s matches a whitespace character.
+ means at least one occurrence of the element that precedes it.

I needed including newlines and including the brackets
\[[\s\S]+\]

If someone wants to match and select a string containing one or more dots inside square brackets like "[fu.bar]" use the following:
(?<=\[)(\w+\.\w+.*?)(?=\])

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove everything before SS and everything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)

We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"

Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can keep a pattern and remove everything outside it in two steps. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference that parenthesized pattern by number, since there may be several parenthesized elements. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wild cards by the way. If you'd like to learn more about using regex in R, here's a reference that can get you started.
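The same keep-one-group idea can be sketched in Python, where re.sub plays the role of gsub. The pattern below is a simplified variant of the one above (without the + after the first group), used here as an assumption for illustration.

```python
import re

targets = ["tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"]

# Two parenthesized parts: everything from SS up to T0, then T0 and the rest.
pattern = r".*(SS.*)(T0.*)"

# \1 keeps the first group, \2 keeps the second.
kept_first = [re.sub(pattern, r"\1", t) for t in targets]
kept_second = [re.sub(pattern, r"\2", t) for t in targets]

print(kept_first)
print(kept_second)
```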

Regex in R to extract words before a special character

I have a data frame of part-of-speech tagged strings
Example:
best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ
I want to remove the '_' and the tags after it, so that I have the output
best phone only issue camera sensor have mind own
I am using R and I couldn't find an appropriate regex for the gsub function.
I tried this.
sentence= c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
o1=gsub("\\_.*","",sentence, perl = T)
But this removes the entire string after the first underscore. Thanks in advance.

You may use _[A-Z]+ TRE pattern with gsub:
sentence <- c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
gsub("_[A-Z]+","",sentence)
[1] "best phone only issue camera sensor have mind own"
The _[A-Z]+ pattern matches an underscore (_, note it does not have to be escaped in a regex pattern) and one or more (+) uppercase ASCII letters ([A-Z]).
You may make the pattern more precise, say, to only match the _ if it is preceded by a word char and to match uppercase letters only when they are followed by a word boundary:
"\\B_[A-Z]+\\b"
In case you want to create a very specific regex for the POS values, you may use alternation:
"\\B_(JJ|NN|CC|[VR]B)\\b"
And continue adding |<code> to the regex pattern.
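To illustrate why the alternation needs every tag listed, here is a sketch of both patterns in Python (re.sub standing in for gsub). With only the codes shown above, the tag _JJS is left in place because JJ is followed by another word character and so there is no word boundary after it.

```python
import re

sentence = ("best_JJS phone_NN only_RB issue_NN camera_NN "
            "sensor_NN have_VB mind_NN own_JJ")

# Generic version: any run of uppercase letters after the underscore.
any_tag = re.sub(r"\B_[A-Z]+\b", "", sentence)

# Specific version: only the listed tag codes are removed.
listed_tags = re.sub(r"\B_(JJ|NN|CC|[VR]B)\b", "", sentence)

print(any_tag)
print(listed_tags)
```

Adding JJS to the alternation, as the last remark suggests, would remove the remaining tag as well.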