This question already has answers here:
Match string using regex which includes vertical bar
(3 answers)
Closed 12 months ago.
I often have input dataframes where data in some columns are delimited by a "||". I would like to be able to remove all the data after the "||", but since "||" is an operator weird things happen when I treat it like a normal string, e.g.:
gsub("||.*", "", df$col) and str_replace(df$col, "||", "") do not do what I expect them to do.
Is there a simple way to force R to read operators as if they were any other character?
Thanks!
Any of these statements should work:
sub("[|][|].*", "", x)
sub("[|]{2}.*", "", x)
sub("\\|\\|.*", "", x)
sub("\\|{2}.*", "", x)
The problem is not that || is an operator in R. It is that | is a metacharacter in regular expressions. You can get around the special interpretation of metacharacters by placing them inside of character classes delimited by [] or escaping them with a backslash (and escaping that backslash with a second backslash). See ?regex for details.
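As a quick check, the sketch below runs the four patterns on a made-up vector `x` (the sample values are assumptions for illustration, not data from the question). It also shows a regex-free alternative using `strsplit` with `fixed = TRUE`, which is not one of the statements listed above.

```r
# Hypothetical sample data; the values are made up for illustration.
x <- c("alpha||beta", "gamma||delta||epsilon", "no delimiter")

# A character class or a backslash escape both make | literal.
r1 <- sub("[|][|].*", "", x)
r2 <- sub("[|]{2}.*", "", x)
r3 <- sub("\\|\\|.*", "", x)
r4 <- sub("\\|{2}.*", "", x)

# Regex-free alternative: split on the literal "||" and keep the first piece.
r5 <- vapply(strsplit(x, "||", fixed = TRUE), `[`, character(1), 1)

print(r1)
# [1] "alpha"        "gamma"        "no delimiter"
```

All five results should be identical for this input; strings without the delimiter are returned unchanged.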
This question already has answers here:
How to remove single quote from a string in R?
(3 answers)
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 11 months ago.
For example for a string like this
NANYANG-GIRLS'-HIGH-SCHOOL
how do I use gsub to replace ' to empty and make it
NANYANG-GIRLS-HIGH-SCHOOL
when I do it in R, it shows error
You can use either of the following two approaches:
sec_name <- gsub('\'', '', sec_name, fixed=TRUE)
sec_name <- gsub("'", "", sec_name, fixed=TRUE)
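The first approach is a corrected version of what you were doing: it uses single quotes for the strings and escapes the inner single quote with a backslash, so that it is treated as a literal character. The second approach wraps the pattern in double quotes, so the single quote needs no escaping at all.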
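A minimal sketch using the string from the question, to confirm that both quoting styles produce the same result (the names `a` and `b` are only illustrative):

```r
sec_name <- "NANYANG-GIRLS'-HIGH-SCHOOL"

# Single-quoted strings: the inner single quote must be escaped.
a <- gsub('\'', '', sec_name, fixed = TRUE)

# Double-quoted strings: the single quote needs no escape.
b <- gsub("'", "", sec_name, fixed = TRUE)

print(a)
# [1] "NANYANG-GIRLS-HIGH-SCHOOL"
```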
This question already has an answer here:
"'\w' is an unrecognized escape" in grep
(1 answer)
Closed 1 year ago.
I would like to find and replace tabular instances by tabularx. I tried with gsub but it seems to enter me into a world of escaping pain. Following other questions and answers I find fixed=TRUE which is the best I so far have. The code snippet below almost works, \B is unrecognized. If I escape it twice I get \BEGIN as output!
texText <- '\begin{tabular}{rl}\begin{tabular}{rll}'
texText <- gsub("\begin{tabular}{rl}", "\BEGIN{tabular}{rll}", texText, fixed=TRUE)
I'm using BEGIN as my test to see what is happening. This is before I get to tackling the question of what goes on in the brackets {rl} {ll} {rrl} etc. Ideally I'm looking for a regex that would output:
\begin{tabularx}{rX}\begin{tabularx}{rlX}
That is the final column is replaced by X.
Try using proper escaping:
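texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"
output <- gsub("\\\\begin\\{tabular\\}", "\\\\begin{tabularx}", texText)
output
[1] "\\begin{tabularx}{rl}\\begin{tabularx}{rll}"
In an R string, a literal backslash is written as two backslashes. In a regex pattern (and in the replacement) the backslash must itself be escaped again, which is why four backslashes appear in the gsub call. Metacharacters such as { and } are escaped in the pattern with two backslashes.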
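For the second part of the question (replacing the final column with X), one possible sketch is shown below. It assumes that each column specification consists only of lowercase letters, which holds for the examples in the question but not for LaTeX column specs in general. The names `pattern` and `replacement` are illustrative.

```r
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"

# Capture every column letter except the last one in group 1,
# then match the last letter separately so it can be dropped.
pattern <- "\\\\begin\\{tabular\\}\\{([a-z]*)[a-z]\\}"

# "\\\\" yields one literal backslash in the result; "\\1" is the captured group.
replacement <- "\\\\begin{tabularx}{\\1X}"

output <- gsub(pattern, replacement, texText)
cat(output, "\n")
```

Using `cat` rather than `print` shows the string as it would be written to a file, with single backslashes.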
This question already has answers here:
regular expressions in base R: 'perl=TRUE' vs. the default (PCRE vs. TRE)
(3 answers)
Closed 3 years ago.
Let's take the following string:
x <- " hello world"
I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).* with a back-reference to the first group.
> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"
It works as expected.
Now, let's add a digit and underscore to our string:
x <- " 0_hello world"
I replace \\W by [\\W_0-9] to match the new characters.
> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"
Now, it doesn't work and I do not understand why. It seems that the problem arises when putting \\W within [] but I am not sure why.
The regex works on online regex tester using PCRE though.
What am I doing wrong?
The quick solution is to use Perl-like Regular Expressions by adding an additional argument perl = TRUE.
By default, grep uses Extended Regular Expressions (see ?regex), where character classes are written in the form [:xxx:]. However, I could not find a character class that matches \W exactly.
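A short sketch of both options on the string from the question. The second call is an alternative that avoids `\W` altogether by negating the letter class; it is an assumption that skipping every non-letter is what is wanted here, which matches the intent of `[\W_0-9]` for ASCII input.

```r
x <- " 0_hello world"

# Option 1: switch to PCRE, where \W is allowed inside a bracket expression.
first_perl <- gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)

# Option 2: stay with the default engine and skip anything that is not a letter.
first_default <- gsub("^[^a-zA-Z]*([a-zA-Z]+).*", "\\1", x)

print(first_perl)
# [1] "hello"
```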
This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 4 years ago.
I have a dataframe which contains
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE||SOA_BLOCK-FR||SOA_BLOCK-IT||SOA_BLOCK-ES|"
I want the result to be -
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"
I tried:
leadtemp$collate = gsub("||","|",leadtemp$collate)
but it is not working.
Please help me replace "||" with "|"
As MrFlick suggested, include fixed = TRUE in your gsub statement. The problem occurs because "|" is a Regular Expression operator. Using fixed = TRUE tells gsub to assume the pattern is a string and not a RegEx.
leadtemp$collate = gsub("||","|",leadtemp$collate, fixed=TRUE)
Another (although more complicated) way of doing it would be to escape each | in the pattern. The replacement is not a regex, so the | there needs no escaping:
leadtemp$collate = gsub("\\|\\|","|",leadtemp$collate)
Try:
gsub("[|]{2}", "|", leadtemp$collate)
I have defined a character class containing the pipe character and forced gsub to look for exactly two occurrences. The result is:
"HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"
| is a metacharacter in regular expressions. Metacharacters need to be escaped with a \. Because \ is also the escape character in R strings, it must itself be escaped in the same way. So whenever you want to match a literal | in a pattern, you have to write \\|. The replacement is not a regex, so a plain | is enough there. This should make your code work:
leadtemp$collate = gsub("\\|\\|","|",leadtemp$collate)
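The three suggestions from the answers can be compared side by side. This is a sketch on a standalone string (the variable `collate` stands in for the `leadtemp$collate` column, and `expected` is the target from the question):

```r
collate <- "HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE||SOA_BLOCK-FR||SOA_BLOCK-IT||SOA_BLOCK-ES|"
expected <- "HYD_SOA_UNBLOCK~SOA_BLOCK-UK|SOA_BLOCK-DE|SOA_BLOCK-FR|SOA_BLOCK-IT|SOA_BLOCK-ES|"

# Literal matching: no regex interpretation at all.
by_fixed <- gsub("||", "|", collate, fixed = TRUE)

# Regex with each pipe escaped in the pattern.
by_escape <- gsub("\\|\\|", "|", collate)

# Regex with a character class and a repetition count.
by_class <- gsub("[|]{2}", "|", collate)

print(identical(by_fixed, expected))
# [1] TRUE
```

For a plain substring replacement like this one, `fixed = TRUE` is probably the simplest choice because nothing has to be escaped.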
This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 6 years ago.
I'm trying to use gsub to remove certain parts of a string. However, I can't get it to work, and I think it's because the string to be removed contains brackets. Is there any way around this? Thanks for any help.
The command I want to use:
gsub('(4:4aCO)_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)')
Returns:
#"(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)"
Expected output:
#"(5:3)_(4:4)_(5:3)_(4:4)_(6:2)_(4:4a)"
A quick test to see if brackets were the problem:
gsub('te','', 'test')
#[1] "st"
gsub('(te)','', '(te)st')
#[1] "()st"
We can do this by placing each bracket inside square brackets, because ( and ) are regex metacharacters
gsub('[(]4:4aCO[)]_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)')
Or use fixed = TRUE so that the pattern is matched literally
gsub('(4:4aCO)_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)', fixed = TRUE)
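A small sketch that checks these approaches against the expected output from the question. It also adds a third variant with backslash escapes, which is not part of the answer above; the variable names are illustrative.

```r
s <- "(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)"
expected <- "(5:3)_(4:4)_(5:3)_(4:4)_(6:2)_(4:4a)"

# Character classes make the parentheses literal.
by_class <- gsub("[(]4:4aCO[)]_", "", s)

# fixed = TRUE treats the whole pattern as a plain string.
by_fixed <- gsub("(4:4aCO)_", "", s, fixed = TRUE)

# Backslash escapes are a third option (doubled inside an R string).
by_escape <- gsub("\\(4:4aCO\\)_", "", s)

print(by_class)
# [1] "(5:3)_(4:4)_(5:3)_(4:4)_(6:2)_(4:4a)"
```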