R - regex: \W metacharacter not working when within square brackets [duplicate]

Let's take the following string:
x <- " hello world"
I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).* with a back-reference to the first group.
> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"
It works as expected.
Now, let's add a digit and underscore to our string:
x <- " 0_hello world"
I replace \\W by [\\W_0-9] to match the new characters.
> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"
Now, it doesn't work and I do not understand why. It seems that the problem arises when putting \\W within [] but I am not sure why.
The regex works on an online regex tester using PCRE, though.
What am I doing wrong?

The quick solution is to use Perl-compatible regular expressions by adding the argument perl = TRUE.
By default, grep and the related functions (sub, gsub, regexpr) use POSIX extended regular expressions (see ?regex), where character classes are written in the form [:xxx:] inside a bracket expression. However, I could not find a character class that matches \W exactly.
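To make the difference concrete, here is a minimal sketch using the string from the question. It relies on base R only. The bracket expression [^[:alpha:]] is my own substitute for [\W_0-9] (read as "anything that is not a letter") and is an assumption about the intent, not something the original answer proposed.

```r
x <- " 0_hello world"

# PCRE (perl = TRUE) understands \W inside a bracket expression.
pcre_result <- gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)

# With the default engine, a negated POSIX class avoids \W altogether.
# Assumption: "skip everything that is not a letter" is what is wanted.
posix_result <- gsub("^[^[:alpha:]]*([a-zA-Z]+).*", "\\1", x)

print(pcre_result)
print(posix_result)
```

Both calls should return "hello" for this input. Which one to prefer is mostly a matter of taste: perl = TRUE keeps the pattern as written, while the POSIX class works with the default engine.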


How to treat operators like characters in R [duplicate]

I often have input dataframes where data in some columns are delimited by a "||". I would like to be able to remove all the data after the "||", but since "||" is an operator weird things happen when I treat it like a normal string, e.g.:
gsub("||.*", "", df$col) and str_replace(df$col, "||", "") do not do what I expect them to do.
Is there a simple way to force R to read operators as if they were any other character?
Thanks!
Any of these statements should work:
sub("[|][|].*", "", x)
sub("[|]{2}.*", "", x)
sub("\\|\\|.*", "", x)
sub("\\|{2}.*", "", x)
The problem is not that || is an operator in R. It is that | is a metacharacter in regular expressions. You can get around the special interpretation of metacharacters by placing them inside of character classes delimited by [] or escaping them with a backslash (and escaping that backslash with a second backslash). See ?regex for details.
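As a quick check of the two escaping styles, here is a small sketch. The vector x is hypothetical sample data (the question did not include any), and the strsplit variant is my own addition for cases where a purely literal match is preferred.

```r
# Hypothetical sample data; the question did not show its data frame.
x <- c("a||b", "c||d||e", "no delimiter")

by_class  <- sub("[|][|].*", "", x)   # pipes inside a character class
by_escape <- sub("\\|\\|.*", "", x)   # pipes escaped with backslashes

# Literal alternative: split on the fixed string "||" and keep the first piece.
by_split <- vapply(strsplit(x, "||", fixed = TRUE), `[`, character(1), 1)

print(by_class)
print(by_escape)
print(by_split)
```

All three should leave "a", "c" and "no delimiter". Elements without the delimiter pass through unchanged.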

Replace latex with r strings using gsub [duplicate]

I would like to find and replace tabular instances with tabularx. I tried gsub, but it has led me into a world of escaping pain. Following other questions and answers, I found fixed=TRUE, which is the best I have so far. The code snippet below almost works, but \B is unrecognized. If I escape it twice I get \BEGIN as output!
texText <- '\begin{tabular}{rl}\begin{tabular}{rll}'
texText <- gsub("\begin{tabular}{rl}", "\BEGIN{tabular}{rll}", texText, fixed=TRUE)
I'm using BEGIN as my test to see what is happening. This is before I get to tackling the question of what goes on in the brackets {rl} {ll} {rrl} etc. Ideally I'm looking for a regex that would output:
\begin{tabularx}{rX}\begin{tabularx}{rlX}
That is the final column is replaced by X.
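Try using proper escaping:
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"
output <- gsub("\\\\begin\\{tabular\\}", "\\\\begin{tabularx}", texText)
output
[1] "\\begin{tabularx}{rl}\\begin{tabularx}{rll}"
In an R string literal a backslash is written as \\ (a single \b would be parsed as a backspace character). A literal backslash in the regex pattern or in the replacement therefore takes four backslashes, and metacharacters such as { and } are escaped with two.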
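The second half of the question, turning the final column into X, can be sketched with a capture group. This is only an illustration under two assumptions of mine: the input contains real backslashes (written \\ in R source), and the column specification consists solely of lowercase letters such as r and l (no |, p{...} and so on).

```r
# Assumption: the input holds real backslashes, written as \\ in R source.
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"

# Group 1 captures every column letter except the last one,
# which the replacement swaps for X.
output <- gsub("\\\\begin\\{tabular\\}\\{([a-z]*)[a-z]\\}",
               "\\\\begin{tabularx}{\\1X}",
               texText)

# writeLines shows the string as it would appear in the .tex file.
writeLines(output)
```

For the sample input this should print \begin{tabularx}{rX}\begin{tabularx}{rlX}. A fuller column grammar would need a more permissive pattern than [a-z].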

How to use regex to match upto third forward slash in R using gsub? [duplicate]

This question is specifically about how R handles regex. I would like a regex to use with gsub that extracts the text before the 3rd forward slash.
Here are some string examples:
/google.com/images/video
/msn.com/bing/chat
/bbc.com/video
I would like to obtain the following strings only:
/google.com/images
/msn.com/bing
/bbc.com/video
So it is not keeping the information after the 3rd forward slash.
I cannot seem to get any regex working along with using gsub to solve this!
The closest I have got is:
gsub(pattern = "/[A-Za-z0-9_.-]/[A-Za-z0-9_.-]*$", replacement = "", x = the_data_above )
I think R has some issues regarding forward slashes and escaping them.
From the start of the string, match two instances of a slash followed by non-slash characters, then match everything after them, and replace the whole match with the two captured instances.
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
sub("^((/[^/]*){2}).*", "\\1", paths)
## [1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
You can take advantage of lazy (vs greedy) matching by adding the ? after the quantifier (+ in this case) within your capture group:
gsub("(/.+?/.+?)/.*", "\\1", text)
[1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
Data:
text <- c("/google.com/images/video",
"/msn.com/bing/chat",
"/bbc.com/video")
Try this out:
^\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+
As seen here: https://regex101.com/r/9ZYppe/1
Your problem arises from the fact that [A-Za-z0-9_.-] matches only one such character. You need the + quantifier to match one or more of them. Also, the $ at the end becomes unnecessary once ^ anchors the match at the start of the string.
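Because this pattern describes the part to keep rather than the part to remove, it pairs more naturally with regexpr and regmatches than with gsub. The sketch below is one way to apply it in base R. Note that the forward slashes are written without a backslash here, since "\/" is not a valid escape in an R string literal.

```r
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")

# Same pattern as above, minus the backslashes before the slashes.
pattern <- "^/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+"

# regexpr locates the first match in each string; regmatches extracts it.
kept <- regmatches(paths, regexpr(pattern, paths))

print(kept)
```

One caveat: regmatches silently drops elements that do not match at all, so the result can be shorter than the input if some paths have fewer than two segments.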

R regex using stringr::str_detect and grepl don't seem to be matching "\\+" when it is surrounded by "\\b" [duplicate]

I'm pretty new to regex and am trying to detect a word followed by the "+" symbol, surrounded by "\\b", in long strings of words, but both stringr and grepl are giving me the wrong result.
This is the code that I have written:
library(stringr)
str_detect("coversyl +", "\\bcoversyl(plus| plus|\\+| \\+)\\b")
The output is FALSE which is wrong.
What would be the right way to do it?
My guess is that your expression is just fine, maybe just missing a space:
\\bcoversyl\\b\\s(\\bplus\\b|\\+)
Please see the demo for additional explanation.
To allow more than one space, change \\s to \\s+:
\\bcoversyl\\b\\s+(\\bplus\\b|\\+)
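The underlying reason the original pattern fails is that \b only matches between a word character and a non-word character (or the edge of the string next to a word character). "+" is a non-word character, so there is no boundary between it and the end of the string. The sketch below illustrates this with grepl and perl = TRUE, chosen so it runs without stringr. The test strings and the variant pattern are my own assumptions about which spellings should match.

```r
# Hypothetical test strings covering the spellings from the question.
x <- c("coversyl +", "coversyl plus", "coversylplus", "coversyl")

# Pattern from the question: the trailing \b cannot match after "+".
original <- grepl("\\bcoversyl(plus| plus|\\+| \\+)\\b", x, perl = TRUE)

# Variant: require the boundary only after "plus", not after "+".
variant <- grepl("\\bcoversyl\\s*(plus\\b|\\+)", x, perl = TRUE)

print(data.frame(x, original, variant))
```

With these inputs the original pattern should miss only "coversyl +", while the variant should accept the first three strings and reject the bare "coversyl".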

How to replace "|" in R [duplicate]

I have a dataframe which contains
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE||SOA_BLOCK-FR||SOA_BLOCK-IT||SOA_BLOCK-ES|"
I want the result to be -
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"
I tried:
leadtemp$collate = gsub("||","|",leadtemp$collate)
but it is not working.
Please help me replace "||" with "|"
As MrFlick suggested, include fixed = TRUE in your gsub statement. The problem occurs because "|" is a Regular Expression operator. Using fixed = TRUE tells gsub to assume the pattern is a string and not a RegEx.
leadtemp$collate = gsub("||","|",leadtemp$collate, fixed=TRUE)
Another (although more complicated) way of doing it would be to escape all the |s:
leadtemp$collate = gsub("\\|\\|","\\|",leadtemp$collate)
Try:
gsub("[|]{2}", "|", leadtemp$collate)
I have defined a character class comprising the pipe character and forced gsub to look for exactly two occurrences. The result is:
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"
| is a regex metacharacter. Metacharacters are escaped with a \, and because \ is itself an escape character in R string literals, it has to be doubled. So whenever you want to match a literal | in a pattern, you have to write \\|. This should make your code work:
leadtemp$collate = gsub("\\|\\|","\\|",leadtemp$collate)
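The approaches suggested above can be compared side by side on the string from the question. This is a small sketch in base R. The variable names are mine, and a plain "|" is used as the replacement because the replacement text is not a regex pattern.

```r
x <- "HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE||SOA_BLOCK-FR||SOA_BLOCK-IT||SOA_BLOCK-ES|"
expected <- "HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"

fixed_result   <- gsub("||", "|", x, fixed = TRUE)  # literal string match
escaped_result <- gsub("\\|\\|", "|", x)            # escaped metacharacters
class_result   <- gsub("[|]{2}", "|", x)            # character class + quantifier

# Each comparison should be TRUE.
print(c(fixed    = identical(fixed_result, expected),
        escaped  = identical(escaped_result, expected),
        in_class = identical(class_result, expected)))
```

For a purely literal replacement like this one, fixed = TRUE is probably the least error-prone choice, since no escaping is involved.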
