Trying to figure out regular expression in R for sub() [duplicate]

I'm trying to use a regular expression in sub() in order to replace all the "\" in a vector.
I've tried a number of different ways to get R to recognize the "\":
I've tried "\\\" but I keep getting errors.
I've tried "\.*"
I've tried "\\\.*"
data.frame1$vector4 <- sub(pattern = "\\\", replace = ", data.frame1$vector4)
The \ that I am trying to get rid of only appears occasionally in the vector and always in the middle of the string. I want to get rid of it and all the characters that follow it.
The error that I am getting:
Error: '\.' is an unrecognized escape in character string starting "\."
Also I'm struggling to get Stack to print the "\" that I am typing above. It keeps deleting them.

1) 4 backslashes To insert a backslash into an R string literal, write a double backslash. A backslash is also a regular expression metacharacter, so in a pattern it must itself be escaped with another backslash, which again has to be doubled in the literal. Matching one literal backslash therefore takes 4 backslashes in the pattern string.
s <- "a\\b\\c"
nchar(s)
## [1] 5
gsub("\\\\", "", s)
## [1] "abc"
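2) character class Another way to effectively escape it is to surround it with [...]. This works with the default regex engine; with perl = TRUE the backslash still has to be escaped inside the class.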
gsub("[\\]", "", s)
## [1] "abc"
3) fixed argument Perhaps the simplest way is to use fixed=TRUE in which case special characters will not be regarded as regular expression metacharacters.
gsub("\\", "", s, fixed = TRUE)
## [1] "abc"
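The question also asks to drop the backslash together with everything that follows it. A minimal sketch, assuming a made-up vector x (not from the question): appending .* to the 4-backslash pattern extends the match to the end of the string, and sub() is enough because only one match per element is needed.

```r
# Hypothetical sample data; the second element contains one literal backslash.
x <- c("plain", "keep\\drop this")

# As a regex the pattern reads \\.* : a literal backslash, then the rest of the string.
cleaned <- sub("\\\\.*", "", x)
print(cleaned)
## [1] "plain" "keep"
```

Elements without a backslash are returned unchanged, so the call can be applied to the whole column.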

Related

strsplit returning nested list with backslashes and quotes added \"

I'm using R to split a messy string of gene names and as a first step am simply attempting to break the string into a list by spaces between characters using strsplit and regex but have been coming across this weird bug:
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
ccGenes <- strsplit(string, split = '\\s+')[[1]]
This returns a length-1 nested list containing an object of type "character [8]" (I'm not sure what type of object this indicates) that places a backslash in front of each double quote (" -> \"). It looks like this when printed:
"" "\"" "\"KPNA2\"" "\"UBE2C\"" "\"CENPF\"" "##" "[4]" "\"HMGB2\""
What I want is a list that looks like this:
" "KPNA2" "UBE2C" "KPNA2" "UBE2C" etc...
Afterwards I will clean up the quotes and non-gene items. I realize this is probably not the most efficient way to go about cleaning up this string. I'm still relatively new to programming and am more curious why the strsplit line I'm using is returning such weird output.
Thanks!
You can use a base R approach with
regmatches(string, gregexpr('(?<=")\\w+(?=")', string, perl=TRUE))[[1]]
# => [1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Mind the perl=TRUE argument: it enables PCRE regex syntax, which the lookbehind and lookahead in this pattern require.
Details:
(?<=") - a positive lookbehind that requires a " char to occur immediately to the left of the current position
\w+ - one or more letters, digits or underscores
(?=") - a positive lookahead that requires a " char to occur immediately to the right of the current position.
If you want to avoid matching underscores and lowercase letters, replace \\w+ with [A-Z0-9]+.
We may use str_extract_all to extract the alphanumeric characters after each ": match one or more alphanumeric characters ([[:alnum:]]+) that follow a " (checked with a regex lookbehind, (?<=")).
library(stringr)
str_extract_all(string, '(?<=")[[:alnum:]]+')[[1]]
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Also, if we want to use strsplit from base R, split not only on whitespace (\\s+), but also on the double quotes and the other characters that are not needed (#, [4]), then drop the empty strings with setdiff:
setdiff(strsplit(string, split = '["# ]+|\\[\\d+\\]')[[1]], "")
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
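As for why the original strsplit output shows \" at all: the backslashes are not stored in the strings. print() displays character vectors as quoted literals and therefore escapes any embedded double quote. A small sketch on the question's own string makes this visible (the name parts is only illustrative):

```r
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
parts <- strsplit(string, split = "\\s+")[[1]]

# print() shows the element as an R literal, so the quotes appear escaped.
print(parts[3])
## [1] "\"KPNA2\""

# writeLines() shows the raw contents: two quote characters, no backslashes.
writeLines(parts[3])
## "KPNA2"

# 5 letters/digits plus 2 quote characters.
nchar(parts[3])
## [1] 7
```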

Replace "$" in a string in R [duplicate]

I would like to replace $ in my R strings. I have tried:
mystring <- "file.tree.id$HASHd15962267-44c21f1cee1057d95d6840$HASHe92451fece3b3341962516acfa962b2f$checked"
stringr::str_replace(mystring, pattern="$",
replacement="!")
However, it fails and my replacement character is put as the last character in my original string:
[1] "file.tree.id$HASHd15962267-44c21f1cee1057d95d6840$HASHe92451fece3b3341962516acfa962b2f$checked!"
I tried some variations such as pattern="/$", but they fail as well. Can someone point me to a strategy for doing this?
In base R, you could use:
chartr("$","!", mystring)
[1] "file.tree.id!HASHd15962267-44c21f1cee1057d95d6840!HASHe92451fece3b3341962516acfa962b2f!checked"
Or even
gsub("$","!", mystring, fixed = TRUE)
With stringr, the pattern needs to be wrapped in fixed(), because by default it is treated as a regular expression, and in a regex $ means the end of the string:
stringr::str_replace_all(mystring, pattern = stringr::fixed("$"),
replacement = "!")
Alternatively, you could escape it (\\$) or place it in square brackets ([$]), but fixed() would be faster.
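The escaped and bracketed forms can be checked side by side with the unescaped pattern. This is a sketch in base R using a shortened, made-up string rather than the question's full value:

```r
# Shortened stand-in for the string in the question (an assumption for brevity).
mystring <- "id$HASHa$checked"

escaped   <- gsub("\\$", "!", mystring)  # regex \$ : a literal dollar sign
bracketed <- gsub("[$]", "!", mystring)  # $ is literal inside a character class
anchored  <- gsub("$", "!", mystring)    # bare $ : matches only the end of the string

print(escaped)
## [1] "id!HASHa!checked"
print(bracketed)
## [1] "id!HASHa!checked"
print(anchored)
## [1] "id$HASHa$checked!"
```

The last call reproduces the behaviour from the question: the replacement is appended because the empty match at the end of the string is the only match.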

Regex for literal curly brackets in R [duplicate]

I am not an expert on Regex in R, but I feel I have read the docs first long enough and still come up short, so I am posting here.
I am trying to replace the following string, all LITERALLY as written:
a = "\\begin{tabular}"
a = gsub("\\begin{tabular}", "\\scalebox{0.7}{
\\begin{tabular}", a)
Desired output is : cat('\\scalebox{0.7}{ \\begin{tabular}')
So I know I need to escape the leading backslash, but when I escape the brackets I get
Error: '\}' is an unrecognized escape in character string starting...
In your case, since you're replacing a fixed string, you can simply set fixed = TRUE to avoid regular expressions entirely, and use \n for the newline:
a = "\\begin{tabular}"
a = gsub("\\begin{tabular}", "\\scalebox{0.7}{\n\\begin{tabular}", x = a, fixed = TRUE)
If you did want to use regex, you need to escape each curly bracket in the pattern using two backslashes rather than one.
e.g.,
a = "\\begin{tabular}"
gsub(pattern = "\\{|\\}", replacement = "_foo_", x=a)
[1] "\\begin_foo_tabular_foo_"
Alternatively, you can enclose the curly brackets in square brackets like so:
e.g.,
a = "\\begin{tabular}"
gsub(pattern = "[{]|[}]", replacement = "_foo_", x=a)
[1] "\\begin_foo_tabular_foo_"
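For comparison, the full replacement from the question can also be written in regex mode. This is a sketch rather than the recommended route: the backslash needs 4 backslashes in the pattern, each brace is escaped, and a literal backslash in the replacement also needs 4 backslashes, because the replacement string treats \ specially when fixed = FALSE. The variable names are only illustrative.

```r
a <- "\\begin{tabular}"

# Regex mode: escape the backslash and both braces in the pattern,
# and double the backslashes again in the replacement.
regex_result <- gsub("\\\\begin\\{tabular\\}",
                     "\\\\scalebox{0.7}{\n\\\\begin{tabular}",
                     a)

# Fixed mode: pattern and replacement are taken literally.
fixed_result <- gsub("\\begin{tabular}",
                     "\\scalebox{0.7}{\n\\begin{tabular}",
                     a, fixed = TRUE)

identical(regex_result, fixed_result)
## [1] TRUE
cat(regex_result)
```

The amount of escaping in the regex version is the main argument for fixed = TRUE here.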

Trying to replace a () in a string in R using str_replace [duplicate]

I am trying to replace a () in a string using the str_replace function in R, but it appears that the function is overlooking the (). I am pretty new to coding and R, so I imagine that it has something to do with () being treated as part of a regular expression.
I just don't know how to make the code treat the () as regular characters.
example string:
tBodyAcc-mean()-X
Here is the function I am using:
mutate(feature,feature=str_replace(feature$feature,(),""))
Appreciate the help
Use sub() or gsub(), and escape special characters with \\ so that they are matched literally.
If you want to replace ONLY parentheses that are in the middle of the string (that is, not at the start or at the end), capture the neighbouring characters and put them back:
text <- "tBodyAcc-mean()-X"
sub("(.)\\(\\)(.)", "\\1\\2", text)
[1] "tBodyAcc-mean-X"
text <- "tBodyAcc-mean-X()"
sub("(.)\\(\\)(.)", "\\1\\2", text)
[1] "tBodyAcc-mean-X()"
If you want to replace ANY parenthesis (including those at the end and at the start of the string)
text <- "tBodyAcc-mean()-X"
sub("\\(\\)", "", text)
EDIT: as pointed out in several comments, gsub replaces every "()" in a string, while sub replaces only the first "()":
text <- "t()BodyAcc-mean()-X"
sub("\\(\\)", "", text)
[1] "tBodyAcc-mean()-X"
gsub("\\(\\)", "", text)
[1] "tBodyAcc-mean-X"
You can do better using gsub. It will replace all occurrences.
# First argument is the pattern to find. Instead of () you specify \\(\\) because it is a regular expression and you want the literal ()
# Second argument is the replacement string
# Third argument is the string in which the replacement takes place
gsub("\\(\\)", "REPLACE", "tBodyAcc-mean()-X")
Output:
[1] "tBodyAcc-meanREPLACE-X"
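Applied to a whole column, as the question intends, a base R sketch could look like the following. The data frame feature and its contents are stand-ins based on the question, and fixed = TRUE is used so that "()" needs no escaping at all:

```r
# Hypothetical stand-in for the data frame in the question.
feature <- data.frame(
  feature = c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y"),
  stringsAsFactors = FALSE
)

# With fixed = TRUE the pattern "()" is matched literally.
feature$feature <- gsub("()", "", feature$feature, fixed = TRUE)
print(feature$feature)
## [1] "tBodyAcc-mean-X" "tBodyAcc-std-Y"
```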

In R replace punctuation "." within a string [duplicate]

I have looked around the web and found the page "In R, replace text within a string", which explains how to replace text within a string.
I tried the same method to replace the punctuation "." with another punctuation mark, "-", but it did not work.
group <- c("12357.", "12575.", "197.18", ".18947")
gsub(".", "-", group)
gives this output
[1] "------" "------" "------" "------"
instead of
[1] "12357-" "12575-" "197-18" "-18947"
Is there an alternative way to do this?
"." in regex language means "any character". To match an actual dot, you need to escape it, so:
gsub("\\.", "-", group)
#[1] "12357-" "12575-" "197-18" "-18947"
As mentioned by @akrun in the comments, if you prefer, you can also enclose it in square brackets, in which case you don't need to escape it:
gsub('[.]', '-', group)
[1] "12357-" "12575-" "197-18" "-18947"
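Two further base R options avoid regular expressions altogether. This sketch reuses the question's group vector; fixed = TRUE treats "." as a plain character, and chartr() translates single characters one for one:

```r
group <- c("12357.", "12575.", "197.18", ".18947")

fixed_result  <- gsub(".", "-", group, fixed = TRUE)  # literal match, no regex
chartr_result <- chartr(".", "-", group)              # character-for-character translation

print(fixed_result)
## [1] "12357-" "12575-" "197-18" "-18947"
identical(fixed_result, chartr_result)
## [1] TRUE
```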
