Negating a string while matching others [duplicate] - r

This question already has answers here:
Regular expression that both includes and excludes certain strings in R
(3 answers)
Closed 5 years ago.
I would like to match some strings using regex while negating others in R. In the example below, I would like to exclude some strings that my pattern would otherwise match. The example uses the answer from Regular expression to match a line that doesn't contain a word?.
My confusion is that when I try this, grepl throws an error:
Error in grepl(mypattern, mystring) :
invalid regular expression 'boardgames|(^((?!games).)*$)', reason 'Invalid regexp'
mypattern <- "boardgames|(^((?!games).)*$)"
mystring <- c("boardgames", "boardgames", "games")
grepl(mypattern, mystring)
Note that str_detect returns the desired result (i.e. T, T, F) with the same pattern, but I would like to use grepl.

We need perl = TRUE, because the default is perl = FALSE and the default (TRE) engine does not support lookaheads such as (?!games):
grepl(mypattern, mystring, perl = TRUE)
#[1] TRUE TRUE FALSE
This is needed whenever Perl-compatible regexps are used.
According to ?regexp:
The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and
strsplit switches to the PCRE library that implements regular
expression pattern matching using the same syntax and semantics as
Perl 5.x, with just a few differences.
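A minimal sketch of the difference, using the pattern and strings from the question. The tryCatch() wrapper and the lookahead-free variant at the end are additions for illustration; the variant assumes the intent is "contains boardgames, or does not contain games".

```r
mypattern <- "boardgames|(^((?!games).)*$)"
mystring  <- c("boardgames", "boardgames", "games")

# The default (TRE) engine rejects the lookahead, so the call errors;
# tryCatch() turns that error into NA so the script keeps running.
default_result <- suppressWarnings(
  tryCatch(grepl(mypattern, mystring), error = function(e) NA)
)

# PCRE understands (?!...), so the pattern compiles and matches as intended.
perl_result <- grepl(mypattern, mystring, perl = TRUE)

# Lookahead-free alternative (an assumption about the intent):
# keep a string if it contains "boardgames" or does not contain "games".
plain_result <- grepl("boardgames", mystring, fixed = TRUE) |
  !grepl("games", mystring, fixed = TRUE)

print(perl_result)   # TRUE TRUE FALSE
print(plain_result)  # TRUE TRUE FALSE
```

Combining two simple grepl calls is often easier to read than a single pattern with a negative lookahead, and it works with the default engine.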

Related

Identifying substring with non-letter characters in string in R data table [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 2 years ago.
I am using data table and I want to mark out observations with a substring ".." in a longer string. After looking at How to select R data.table rows based on substring match (a la SQL like) I tried
like("Hi!..", "..")
which returns TRUE and
like("Hi!..", "Bye")
returns FALSE. However, surprisingly,
like("Hi!", "..")
returns TRUE! If this is a feature, why is that? And what can I use instead if I want to check a substring for non-letter characters?
You have to escape the special character "." with a backslash, which is written as "\\" inside an R string:
like("Hi!", "\\.\\.")
The second argument to like() is a regular expression, and . has a special meaning in regex: it matches any single character. So ".." matches any two consecutive characters, which is why "Hi!" matched. If you want to look for . literally, then add the argument fixed = TRUE.
like("Hi!", "..", fixed = TRUE)
# [1] FALSE
like("Hi!..", "..", fixed = TRUE)
# [1] TRUE
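The same behaviour can be reproduced with base R's grepl, which is used here only so that the sketch runs without data.table installed. The character-class variant "[.]{2}" is an extra option not mentioned above.

```r
x <- c("Hi!", "Hi!..")

# Unescaped: each "." matches any character, so any string of length >= 2 matches.
any_two <- grepl("..", x)

# Three equivalent ways to require two literal dots.
escaped   <- grepl("\\.\\.", x)            # backslash-escaped dots
bracketed <- grepl("[.]{2}", x)            # dot inside a character class is literal
literal   <- grepl("..", x, fixed = TRUE)  # no regex interpretation at all

print(any_two)  # TRUE TRUE
print(escaped)  # FALSE TRUE
```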

regular expression in R, reuse matched string in replacement

I want to insert a '0' before the single digit month (e.g. 2020M6 to 2020M06) using regular expressions.
The one below correctly matches the string I need to replace (a single digit at the end of the string following an 'M', excluding the 'M'), but the replacement pattern '0$0' is interpreted literally in R; elsewhere (regexprep in MATLAB) I referenced the matched string, '6' in the example, by '$0'.
sub('(?<=M)([0-9]{1})$','0$0', c('2020M6','2020M10'), perl = T)
[1] "2020M0$0" "2020M10"
I cannot find how to reference and re-use matched strings in the replacement pattern.
PS: There are alternative ways to accomplish the task, but I need to use regular expressions.
Unfortunately, it is not possible to use a backreference to the whole match in base R regex functions.
You can use
sub("(M)([0-9])$", "\\10\\2", x)
With a TRE regex like this one, you do not have to worry about a digit directly after a backreference: only the nine backreferences \\1 through \\9 are allowed, so "\\10" is read as \\1 followed by a literal 0. Note that you may add perl=TRUE to the above line of code and it will yield the same result.
See the R demo online:
x <- c('2020M6','2020M10')
sub("(M)([0-9])$", "\\10\\2", x)
## => [1] "2020M06" "2020M10"
Also, see the regex demo.
I think you have to capture the digit after 'M', not 'M' itself, therefore:
sub('(?<=M)([0-9]{1})$','0\\1', c('2020M6','2020M10'), perl = T)
By the way, captured strings can be reused in the replacement with \\1, \\2, etc.
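As a quick check, the two approaches from the answers above can be run side by side. The extra input "2020M1" is added here for illustration.

```r
x <- c("2020M6", "2020M10", "2020M1")

# Capture both the "M" and the digit, then rebuild the match as M + "0" + digit.
by_groups <- sub("(M)([0-9])$", "\\10\\2", x)

# Capture only the digit; the lookbehind (?<=M) needs perl = TRUE.
by_lookbehind <- sub("(?<=M)([0-9])$", "0\\1", x, perl = TRUE)

print(by_groups)                         # "2020M06" "2020M10" "2020M01"
print(identical(by_groups, by_lookbehind))  # TRUE
```

The first form works with the default engine, while the second keeps the replacement string shorter at the cost of requiring PCRE.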

Is there an R function to replace a matched RegEx with a string of characters with the same length? [duplicate]

This question already has an answer here:
Replace every single character at the start of string that matches a regex pattern
(1 answer)
Closed 2 years ago.
I have a vector
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
and I want to replace every N at the head of each element with a "-", so that the run of dashes has the same length as the run of Ns.
When I use gsub, the whole run is replaced with only one "-".
gsub("^N+", "-", test)
# [1] "-CTCGTNNNGTCGTNN" "-CGTNNNGTCGTGN"
But I want the result to look like this
# "---CTCGTNNNGTCGTNN", "-----CGTNNNGTCGTGN"
Is there any R function that can do this? Thanks for your patience and advice.
You can write:
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN", "XNNNNNCGTNNNGTCGTGN")
gsub("\\GN", "-", perl=TRUE, test)
which returns:
"---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN" "XNNNNNCGTNNNGTCGTGN"
\G, which is supported by Perl (and by PCRE (PHP), Ruby, Python's PyPI regex engine and others), asserts that the current position is at the beginning of the string for the first match and at the end of the previous match thereafter.
If the string were "NNNCTCGTNNNGTCGTNN", the first three "N"s would each be matched (and replaced with a hyphen by gsub). The next attempt, at "C", would fail to match "N", and because \G only allows a match to start where the previous one ended, no later "N" can match, so the replacement stops there.
One approach would be to use the stringr functions, which support regex callbacks:
library(stringr)  # provides str_replace_all()
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
repl <- function(x) { gsub("N", "-", x) }
str_replace_all(test, "^N+", function(m) repl(m))
[1] "---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN"
The strategy here is to first match ^N+ to capture one or more leading N. Then, we pass that match to a callback function which replaces each N with a dash.
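If neither PCRE's \G nor stringr is wanted, the same result can be sketched in base R by measuring the leading run and rebuilding the string. This is an alternative technique, not the one used in either answer, and the variable names are illustrative.

```r
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN", "XNNNNNCGTNNNGTCGTGN")

# Length of the leading run of N (0 when the string does not start with N).
n_lead <- attr(regexpr("^N*", test), "match.length")

# Prepend that many dashes to the remainder of each string.
dashed <- paste0(strrep("-", n_lead), substring(test, n_lead + 1))

print(dashed)
# "---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN" "XNNNNNCGTNNNGTCGTGN"
```

Both strrep() and substring() are vectorised, so no explicit loop is needed.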

Regex for literal curly brackets in R [duplicate]

This question already has answers here:
Error: '\R' is an unrecognized escape in character string starting "C:\R"
(5 answers)
Closed 2 years ago.
I am not an expert on Regex in R, but I feel I have read the docs first long enough and still come up short, so I am posting here.
I am trying to replace the following string, all LITERALLY as written:
a = "\\begin{tabular}"
a = gsub("\\begin{tabular}", "\\scalebox{0.7}{
\\begin{tabular}", a)
Desired output is : cat('\\scalebox{0.7}{ \\begin{tabular}')
So I know I need to escape the first "\\" as "\\\\", but when I escape the brackets I get
Error: '\}' is an unrecognized escape in character string starting...
In your case since you're seeking to replace a fixed string, you can simply set fixed = T option to avoid regular expressions entirely.
a = "\\begin{tabular}"
a = gsub("\\begin{tabular}", "\\scalebox{0.7}{\n\\begin{tabular}", x=a, fixed= T)
and use \n for the newline.
If you did want to use regex, you need to escape each curly bracket in the pattern using two backslashes rather than one, because the R string parser consumes the first one.
e.g.,
a = "\\begin{tabular}"
gsub(pattern = "\\{|\\}", replacement = "_foo_", x=a)
[1] "\\begin_foo_tabular_foo_"
Alternatively, you can enclose the curly brackets in square brackets like so:
e.g.,
a = "\\begin{tabular}"
gsub(pattern = "[{]|[}]", replacement = "_foo_", x=a)
[1] "\\begin_foo_tabular_foo_"
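Putting the pieces together, here is a sketch of the original replacement done both ways. The regex version is not taken from the answer above: it assumes the default engine, where a literal backslash needs four backslashes in the R source, both in the pattern and in the replacement.

```r
a <- "\\begin{tabular}"  # the actual text is \begin{tabular}
expected <- "\\scalebox{0.7}{\n\\begin{tabular}"

# fixed = TRUE: pattern and replacement are both taken literally.
out_fixed <- gsub("\\begin{tabular}",
                  "\\scalebox{0.7}{\n\\begin{tabular}",
                  a, fixed = TRUE)

# Regex mode: escape the backslash and the braces in the pattern,
# and double the backslashes in the replacement as well.
out_regex <- gsub("\\\\begin\\{tabular\\}",
                  "\\\\scalebox{0.7}{\n\\\\begin{tabular}",
                  a)

cat(out_regex, "\n")
print(identical(out_fixed, out_regex))
```

For a purely literal search and replace like this one, fixed = TRUE keeps the strings much easier to read.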

How to perl regex match in R in the grepl function?

I have a function in R which uses the grepl command as follows:
function(x) grepl('\bx\b',res$label, perl=T)
This doesn't seem to work. The 'x' input is a character string (a sentence), and I'd like to put word boundaries around 'x' when matching, because I don't want the term to pull out other, similar terms from the table I am searching through.
Any suggestions?
You just need to properly escape the backslashes in your regex:
ff<-function(x) grepl('\\bx\\b',x, perl=T)
ff(c("axa","a x a", "xa", "ax","x"))
# [1] FALSE TRUE FALSE FALSE TRUE
If you just want to know whether a string is a sentence rather than a single word, you could use: function(x) grepl('\\s',x)
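Note that '\\bx\\b' searches for the literal letter x. If the goal is to search for the value held in a variable, the pattern has to be assembled from it, for example with paste0(). The sketch below uses a hypothetical helper match_whole() and made-up labels, and it assumes the search term contains no regex metacharacters.

```r
# Hypothetical helper: TRUE where `term` occurs in `labels` as a whole word.
# Assumes `term` has no regex metacharacters (., *, (, etc.).
match_whole <- function(term, labels) {
  pattern <- paste0("\\b", term, "\\b")  # wrap the term in word boundaries
  grepl(pattern, labels, perl = TRUE)
}

labels <- c("rate", "heart rate", "irate", "rates")  # illustrative data
hits <- match_whole("rate", labels)

print(hits)  # TRUE TRUE FALSE FALSE
```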

Resources