How to replace "|" in R [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 4 years ago.
I have a dataframe which contains
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE||SOA_BLOCK-FR||SOA_BLOCK-IT||SOA_BLOCK-ES|"
I want the result to be -
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"
I tried:
leadtemp$collate = gsub("||","|",leadtemp$collate)
but it is not working.
Please help me replace "||" with "|"

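As MrFlick suggested, include fixed = TRUE in your gsub call. The problem occurs because | is the alternation operator in regular expressions, so the pattern "||" is read as a choice between empty alternatives rather than as two pipe characters. Using fixed = TRUE tells gsub to treat the pattern as a literal string and not as a regular expression.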
leadtemp$collate = gsub("||","|",leadtemp$collate, fixed=TRUE)
Another (slightly more involved) way is to escape each | in the pattern. The replacement is not a regular expression, so the | there needs no escaping:
leadtemp$collate = gsub("\\|\\|","|",leadtemp$collate)

Try:
gsub("[|]{2}", "|", leadtemp$collate)
I have defined a character class containing the pipe character and forced gsub to look for exactly two occurrences. The result is:
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"

| is a regex metacharacter. To match a metacharacter literally, you escape it with a \. Inside an R string literal, \ is itself an escape character, so it has to be escaped in the same way. So whenever you want to match a literal | in a pattern, you have to write \\|. This should make your code work:
leadtemp$collate = gsub("\\|\\|","\\|",leadtemp$collate)
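The three approaches suggested in the answers can be checked side by side. This is a minimal sketch, assuming a plain character vector named collate as a stand-in for leadtemp$collate (the original data frame is not available here) and using the sample string from the question:

```r
# Sample value from the question; `collate` is a stand-in for leadtemp$collate.
collate <- "HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE||SOA_BLOCK-FR||SOA_BLOCK-IT||SOA_BLOCK-ES|"

# 1. Treat the pattern as a literal string.
res_fixed <- gsub("||", "|", collate, fixed = TRUE)

# 2. Escape each pipe in the regular expression.
res_escaped <- gsub("\\|\\|", "|", collate)

# 3. Put the pipe in a character class and require exactly two of them.
res_class <- gsub("[|]{2}", "|", collate)

# Check that all three approaches agree.
all_agree <- identical(res_fixed, res_escaped) && identical(res_fixed, res_class)
print(res_fixed)
print(all_agree)
```

fixed = TRUE is usually the simplest choice when the pattern contains no regex features at all.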

Related

How to treat operators like characters in R [duplicate]

This question already has answers here:
Match string using regex which includes vertical bar
(3 answers)
Closed 12 months ago.
I often have input dataframes where data in some columns are delimited by a "||". I would like to be able to remove all the data after the "||", but since "||" is an operator weird things happen when I treat it like a normal string, e.g.:
gsub("||.*", "", df$col) and str_replace(df$col, "||", "") do not do what I expect them to do.
Is there a simple way to force R to read operators as if they were any other character?
Thanks!

Any of these statements should work:
sub("[|][|].*", "", x)
sub("[|]{2}.*", "", x)
sub("\\|\\|.*", "", x)
sub("\\|{2}.*", "", x)
The problem is not that || is an operator in R. It is that | is a metacharacter in regular expressions. You can get around the special interpretation of metacharacters by placing them inside of character classes delimited by [] or escaping them with a backslash (and escaping that backslash with a second backslash). See ?regex for details.
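As a small illustration of the point above, the sketch below removes everything from the first "||" onward. The vector x is made-up example data, and the strsplit variant is an additional literal-string alternative that is not part of the original answer:

```r
# Made-up example data; the second element has no "||" delimiter.
x <- c("keep this||drop this", "no delimiter here")

# Regex approach: escape both pipes, then match the rest of the string.
by_regex <- sub("\\|\\|.*", "", x)

# Literal approach: split on the fixed string "||" and keep the first piece.
by_split <- vapply(strsplit(x, "||", fixed = TRUE), `[`, character(1), 1)

print(by_regex)
print(by_split)
```

Strings without the delimiter pass through unchanged in both versions.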

How to replace a caret with str_replace from stringr [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 3 years ago.
I need help removing a specific character from a set of strings.
I found that str_replace_all does the job. It works on other
characters like "/n", "/t" or "A", "B", "C", but not on "^". When I try to remove
this character I get an error message:
(Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : Missing closing bracket on a bracket expression.(U_REGEX_MISSING_CLOSE_BRACKET))
Thanks for your help!
code=c("^GSPC","^FTSE","000001.SS","^HSI","^FCHI","^KS11","^TWII","^GDAXI","^STI")
str_replace_all(code, "([^])", "")

One option is to wrap the pattern in fixed(), which makes stringr treat it as a literal string:
library(stringr)
str_replace_all(code, fixed("^"), "")
#[1] "GSPC" "FTSE" "000001.SS" "HSI" "FCHI" "KS11" "TWII" "GDAXI" "STI"
Also, since we are replacing with an empty string (""), another option is str_remove:
str_remove(code, fixed("^"))
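As for why the OP's code didn't work: inside square brackets, a leading ^ is not read as a literal character. It negates the class, so the class matches any character other than the ones listed after it. In ([^]) nothing is listed after the ^, so the bracket expression is incomplete, which is what the error message reports.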
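The same result can be obtained in base R, without stringr. This is a sketch of two equivalent calls, one literal and one with the caret escaped; the variable names are illustrative:

```r
code <- c("^GSPC", "^FTSE", "000001.SS", "^HSI", "^FCHI",
          "^KS11", "^TWII", "^GDAXI", "^STI")

# Literal match: with fixed = TRUE the caret has no special meaning.
no_caret_fixed <- gsub("^", "", code, fixed = TRUE)

# Regex match: escape the caret so it is not read as the start-of-string anchor.
no_caret_regex <- gsub("\\^", "", code)

print(no_caret_fixed)
print(identical(no_caret_fixed, no_caret_regex))
```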

R regex using stringr::str_detect and grepl don't seem to be matching "\\+" when it is surrounded by "\\b" [duplicate]

This question already has answers here:
Why does is this end of line (\\b) not recognised as word boundary in stringr/ICU and Perl
(2 answers)
Closed 3 years ago.
I'm pretty new to regex and am trying to detect a word with the "+" symbol when surrounded by "\\b" in long strings of words but both stringr and grepl are giving me the wrong result.
This is the code that I have written:
library(stringr)
str_detect("coversyl +", "\\bcoversyl(plus| plus|\\+| \\+)\\b")
The output is FALSE which is wrong.
What would be the right way to do it?
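
Your expression is nearly right; the problem is the trailing \\b. A word boundary exists only between a word character and a non-word character (or the start or end of the string), and + is a non-word character, so there is no boundary after a + that ends the string. Apply \\b only around the words and match the space separately:
\\bcoversyl\\b\\s(\\bplus\\b|\\+)
If we might want more than one space, we would simply change \\s to \\s+:
\\bcoversyl\\b\\s+(\\bplus\\b|\\+)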
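The difference between the two patterns can be checked with a few test strings. This sketch uses base R's grepl with perl = TRUE rather than stringr, on the assumption that both engines treat \\b the same way for these plain ASCII inputs; the variable names are illustrative:

```r
# Pattern from the question and the revised pattern from the answer.
original <- "\\bcoversyl(plus| plus|\\+| \\+)\\b"
revised  <- "\\bcoversyl\\b\\s+(\\bplus\\b|\\+)"

# "+" at the end of the string: no word boundary follows it.
orig_plus_sign <- grepl(original, "coversyl +", perl = TRUE)
rev_plus_sign  <- grepl(revised,  "coversyl +", perl = TRUE)

# "plus" at the end of the string: a word boundary does follow it.
orig_plus_word <- grepl(original, "coversyl plus", perl = TRUE)
rev_plus_word  <- grepl(revised,  "coversyl plus", perl = TRUE)

print(c(orig_plus_sign, rev_plus_sign, orig_plus_word, rev_plus_word))
```

Only the original pattern applied to "coversyl +" fails to match, because it demands a word boundary directly after the +.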

R - regex: W metacharacter not working when within square brackets [duplicate]

This question already has answers here:
regular expressions in base R: 'perl=TRUE' vs. the default (PCRE vs. TRE)
(3 answers)
Closed 3 years ago.
Let's take the following string:
x <- " hello world"
I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).* with a back-reference to the first group.
> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"
It works as expected.
Now, let's add a digit and underscore to our string:
x <- " 0_hello world"
I replace \\W by [\\W_0-9] to match the new characters.
> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"
Now, it doesn't work and I do not understand why. It seems that the problem arises when putting \\W within [] but I am not sure why.
The regex works on online regex tester using PCRE though.
What am I doing wrong?

The quick solution is to use Perl-compatible regular expressions by adding the argument perl = TRUE.
By default, grep and its relatives such as gsub use extended regular expressions (see ?regex), where character classes are written in the form [:xxx:]. However, I could not find a character class that matches \W exactly.
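Applied to the string from the question, the perl = TRUE fix looks like the sketch below. The second call is an additional alternative that is not in the original answer: it avoids \\W altogether by skipping every leading non-letter, so it also works with the default engine.

```r
x <- " 0_hello world"

# With perl = TRUE, \W is understood inside the bracket expression.
with_perl <- gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)

# Alternative for the default engine: skip any leading non-letters instead.
negated <- gsub("^[^a-zA-Z]*([a-zA-Z]+).*", "\\1", x)

print(with_perl)
print(negated)
```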

Using Gsub in R to remove a string containing brackets [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 6 years ago.
I'm trying to use gsub to remove certain parts of a string. However, I can't get it to work, and I think it's because the string to be removed contains brackets. Is there any way around this? Thanks for any help.
The command I want to use:
gsub('(4:4aCO)_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)')
Returns:
#"(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)"
Expected output:
#"(5:3)_(4:4)_(5:3)_(4:4)_(6:2)_(4:4a)"
A quick test to see if brackets were the problem:
gsub('te','', 'test')
#[1] "st"
gsub('(te)','', '(te)st')
#[1] "()st"

We can do this by placing each parenthesis inside square brackets, since ( and ) are regex metacharacters:
gsub('[(]4:4aCO[)]_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)')
Or use fixed = TRUE so that the pattern is matched as a literal string:
gsub('(4:4aCO)_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)', fixed = TRUE)
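The sketch below checks these options against the expected output from the question, using the question's original string and pattern (including the trailing underscore). The backslash-escaped variant is a third option not given in this answer, and the variable names are illustrative:

```r
s <- "(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)"
expected <- "(5:3)_(4:4)_(5:3)_(4:4)_(6:2)_(4:4a)"

# Parentheses inside character classes are matched literally.
by_class <- gsub("[(]4:4aCO[)]_", "", s)

# fixed = TRUE matches the whole pattern as a literal string.
by_fixed <- gsub("(4:4aCO)_", "", s, fixed = TRUE)

# Escaping each parenthesis with a doubled backslash also works.
by_escape <- gsub("\\(4:4aCO\\)_", "", s)

print(c(by_class == expected, by_fixed == expected, by_escape == expected))
```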
