Pasting double quotes " " in R - r

I'm trying to paste together a series of character strings like this:
paste0("//*[@id=",'"set_',1,'_div"]/a')
[1] "//*[@id=\"set_1_div\"]/a"
How can I get rid of the "\"? This is my expected outcome:
[1] "//*[@id="set_1_div"]/a"
Thanks a lot

The backslash indicates that the next character is 'escaped', i.e., it is to be read as a literal character rather than as part of the syntax (here, as the quote that would otherwise end the string). When a character string is printed, it is shown in quotes, so any double quotes inside it are displayed with the escape sign (backslash). Using cat, however, you can easily see that the backslashes are not actually part of the character string:
> x <- paste0("//*[@id=",'"set_',1,'_div"]/a')
> x
[1] "//*[@id=\"set_1_div\"]/a"
> cat(x)
//*[@id="set_1_div"]/a
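To double-check that the backslashes exist only in the printed representation, one can count the characters and compare the result against a single-quoted literal. This is a minimal sketch, assuming the XPath attribute selector is written as @id; the names n_chars and same are only illustrative.

```r
# Single-quoted pieces can contain double quotes without any escaping
x <- paste0("//*[@id=", '"set_', 1, '_div"]/a')

print(x)        # printed form: inner quotes are shown escaped
writeLines(x)   # raw contents: no backslashes appear

# nchar() counts the characters actually stored in the string
n_chars <- nchar(x)   # 22

# The string equals the literal written with single quotes
same <- identical(x, '//*[@id="set_1_div"]/a')   # TRUE
```

So the value can be passed on as it is; only the console display adds the escapes.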

Related

How to create a regex expression to get a substring between 2 pipes

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ beginning of string;
[^\\|]* not the pipe character zero or more times;
\\| the pipe character needs to be escaped since it's a meta-character;
^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
([^\\|]+) group match anything but the pipe character at least once;
\\|.*$ the second pipe plus anything until the end of the string.
Then replace the 1st (and only) group with itself, "\\1", thus removing everything else.
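One thing to keep in mind with the sub() approach is that sub() returns its input unchanged when the pattern does not match. As a hedged alternative sketch (not part of the answer above), regexec() together with regmatches() returns the capture group directly. Inside a bracket expression the pipe does not need escaping, so [^|] is used here; the variable names are illustrative.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
pattern <- "^[^|]*\\|([^|]+)\\|.*$"

# sub() silently hands back the original string if nothing matches
no_match <- sub(pattern, "\\1", "no pipes here")   # "no pipes here"

# regexec() records the full match and each capture group;
# element 1 is the whole match, element 2 is the first group
m <- regmatches(x, regexec(pattern, x))[[1]]
second_field <- m[2]   # "ENSG00000004059.11"
```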
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
Maybe use the regex for look ahead and look behind to extract strings that are surrounded by two "|".
The regex literally means: match one or more characters, as few as possible (.+?), that are preceded by "|" ((?<=\\|)) and followed by "|" ((?=\\|)).
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
[1] "ENSG00000004059.11"
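Try this: \|.*\| or, in R, \\|.*\\|, since you need to escape the escape characters. (It's just an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe.) Note that .* is greedy, so if the string contains more than two pipes the match runs to the last one; .*? stops at the second pipe.
Then wrap the extracted match (e.g. the result of str_extract) in str_sub(MyString, 2, -2) to get rid of the pipes if you don't want them.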
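The difference between the greedy and the lazy quantifier only shows up once there are more than two pipes. The sketch below uses base R (regexpr, regmatches, substr) in place of the stringr calls mentioned above; extract_between and the string y are hypothetical, introduced only for illustration.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
y <- "a|b|c|d"   # hypothetical string with three pipes

# Extract the first match of `pattern`, then drop its first and last character
extract_between <- function(s, pattern) {
  m <- regmatches(s, regexpr(pattern, s, perl = TRUE))
  substr(m, 2, nchar(m) - 1)
}

greedy <- extract_between(y, "\\|.*\\|")    # "b|c": runs to the last pipe
lazy   <- extract_between(y, "\\|.*?\\|")   # "b": stops at the second pipe
field  <- extract_between(x, "\\|.*?\\|")   # "ENSG00000004059.11"
```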

strsplit returning nested list with backslashes and quotes added \"

I'm using R to split a messy string of gene names and as a first step am simply attempting to break the string into a list by spaces between characters using strsplit and regex but have been coming across this weird bug:
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
ccGenes <- strsplit(string, split = '\\s+')[[1]]
This returns a length-1 nested list containing an object of type "character [8]" (not sure what type of object this indicates) that places a backslash in front of each double quote (" -> \"). It looks like this when printed:
"" "\"" "\"KPNA2\"" "\"UBE2C\"" "\"CENPF\"" "##" "[4]" "\"HMGB2\""
what I want is a list that looks like this:
" "KPNA2" "UBE2C" "KPNA2" "UBE2C" etc...
After I will clean up the quotes and non gene items. I realize this is probably not the most efficient way to go about cleaning up this string, I'm still relatively new to programming and am more curious why the strsplit line I'm using is returning such weird output.
Thanks!
You can use a base R approach with
regmatches(string, gregexpr('(?<=")\\w+(?=")', string, perl=TRUE))[[1]]
# => [1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Mind the perl=TRUE argument: it is necessary because it enables the PCRE regex syntax that the lookarounds require.
Details:
(?<=") - a positive lookbehind that requires a " char to occur immediately to the left of the current position
\w+ - one or more letters, digits or underscores
(?=") - a positive lookahead that requires a " char to occur immediately to the right of the current position.
If you want to avoid matching underscores and lowercase letters, replace \\w+ with [A-Z0-9]+.
We may use str_extract_all to extract the alphanumeric characters after the " - match one or more alphanumeric characters ([[:alnum:]]+) that follow a " (using a regex lookbehind, (?<=")).
library(stringr)
str_extract_all(string, '(?<=")[[:alnum:]]+')[[1]]
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Also, if we want to use strsplit from base R, split not only on whitespace (\\s+), but also on the double quotes and the other characters that are not needed (# and the bracketed index such as [4]):
setdiff(strsplit(string, split = '["# ]+|\\[\\d+\\]')[[1]], "")
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
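As for why the original strsplit output looks odd: the split itself works, and the backslashes are only how print() displays a double quote inside a quoted string. With [[1]] the result is a plain character vector of length 8, which is most likely what "character [8]" refers to. A small sketch to confirm this, reusing the [A-Z0-9]+ pattern suggested above (the names parts, cleaned and genes are illustrative):

```r
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
parts <- strsplit(string, split = "\\s+")[[1]]

is_vec <- is.character(parts)   # TRUE: an ordinary character vector
n <- length(parts)              # 8

print(parts[3])        # displayed with escaped quotes
writeLines(parts[3])   # raw contents: the gene name wrapped in plain quotes
n_chars <- nchar(parts[3])   # 7: five letters/digits plus two quote characters

# Strip the literal quotes, then keep only gene-like tokens
cleaned <- gsub('"', "", parts, fixed = TRUE)
genes <- cleaned[grepl("^[A-Z0-9]+$", cleaned)]
```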

Trying to figure out regular expression in R for sub() [duplicate]

I'm trying to use regular expression in a sub() function in order to replace all the "\" in a Vector
I've tried a number of different ways to get R to recognize the "\":
I've tried "\\\" but I keep getting errors.
I've tried "\.*"
I've tried "\\\.*"
data.frame1$vector4 <- sub(pattern = "\\\", replace = ", data.frame1$vector4)
The \ that I am trying to get rid of only appears occasionally in the vector and always in the middle of the string. I want to get rid of it and all the characters that follow it.
The error that I am getting
Error: '\.' is an unrecognized escape in character string starting "\."
Also I'm struggling to get Stack to print the "\" that I am typing above. It keeps deleting them.
1) 4 backslashes. To insert a backslash into an R string literal, use a double backslash; however, a backslash is also a metacharacter in a regular expression, so it must be escaped by prefacing it with another backslash, which also has to be doubled. Thus 4 backslashes are needed in the regular expression.
s <- "a\\b\\c"
nchar(s)
## [1] 5
gsub("\\\\", "", s)
## [1] "abc"
2) character class. Another way to effectively escape it is to surround it with [...]:
gsub("[\\]", "", s)
## [1] "abc"
3) fixed argument. Perhaps the simplest way is to use fixed = TRUE, in which case special characters are not treated as regular expression metacharacters.
gsub("\\", "", s, fixed = TRUE)
## [1] "abc"
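The question also asks to drop the backslash together with everything that follows it. Building on option 1, a minimal sketch could look like this; the vector v is hypothetical example data standing in for data.frame1$vector4.

```r
# Hypothetical data: one element contains a single backslash, one does not
v <- c("keep this\\drop this", "no backslash here")

# "\\\\" matches one literal backslash; ".*$" consumes the rest of the string
trimmed <- sub("\\\\.*$", "", v)
print(trimmed)
```

Elements without a backslash are returned unchanged, which fits the case where the character only appears occasionally.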

R gsub/str_replace to return a backslash

I need to insert a data frame into a SQL database. I've built the script (using loops, str_c, RODBC) to transform my data frame into a SQL INSERT command, but I've run into a problem with a single "'" breaking the SQL.
Here is an example of the problem:
The Data Frame looks like this:
pk b
1 o'keefe
The desired SQL output is: INSERT INTO table (pk, b) VALUES (1, 'o\'keefe')
gsub("'", "\'", str_replace_na(df$b[1], ""))
[1] "o'keefe"
gsub("'", "\\\\'", str_replace_na(df$b[1], ""))
[1] "o\\'keefe"
I've tried str_replace, str_replace_all, gsub w/ fixed = TRUE and perl = TRUE and I get the same result.
I am aware of the comment on How to give Backslash as replacement in R string replace, which states cat() shows the slash. But this doesn't carry over to my data frame or SQL query.
Any help on this problem would be greatly appreciated!
Additional note: I am aware that R prints a double backslash even though only one slash really exists, as referenced in http://r.789695.n4.nabble.com/gsub-replacing-double-backslashes-with-single-backslash-td4453328.html and R: How to replace space (' ') in string with a *single* backslash and space ('\ '). However, my SQL statement still won't work when zero or two backslashes are present.
"o\\'keefe" is in fact what you want: the double backslash is the printed representation of a single backslash.
For instance:
\U005C is the unicode character for the backslash. Yet:
"\U005C"
[1] "\\"
While \U002F is the unicode character for the forward slash and:
"\U002F"
[1] "/"
So your second solution already gave you what you want. Removing the unnecessary str_replace_na():
gsub("'", "\\\\'", df$b[1])
[1] "o\\'keefe"
Note: credit in fact goes to @Rui Barradas, who showed that the double backslash represents a single backslash with:
nchar("\\")
[1] 1
Try putting the single quote inside ['].
x <- "o'keefe"
y <- gsub("[']", "\\\\'", x)
y
#[1] "o\\'keefe"
This seems to have added two characters to the string but no, there is just one \.
nchar(x)
#[1] 7
nchar(y)
#[1] 8
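To see what would actually be sent to the database, it helps to build the whole statement and inspect it with writeLines rather than print. This is only a sketch: the table and column names are taken from the question, the variable names are illustrative, and whether \' is accepted as an escape depends on the SQL dialect, which the question does not state.

```r
b <- "o'keefe"

# Four backslashes in the replacement yield one literal backslash
escaped <- gsub("'", "\\\\'", b)

sql <- paste0("INSERT INTO table (pk, b) VALUES (1, '", escaped, "')")

print(sql)        # console display doubles the backslash
writeLines(sql)   # raw text: exactly one backslash before the quote

# Count the literal backslashes stored in the statement
n_backslashes <- lengths(regmatches(sql, gregexpr("\\", sql, fixed = TRUE)))   # 1
```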

print backslash in R strings

GNU R 3.02
> bib <- "\cite"
Error: '\c' is an unrecognized escape in character string starting ""\c"
> bib <- "\\cite"
> print(bib)
[1] "\\cite"
> sprintf(bib)
[1] "\\cite"
>
how can I print out the string variable bib with just one "\"?
(I've tried everything conceivable, and discover that R treats the "\\" as one character.)
I see that in many cases this is not a problem, since this is usually handled internally by R, say, if the string were to be used as text for a plot.
But I need to send it to LaTeX. So I really have to remove it.
I see cat does the trick. If cat could only be made to send its result to a string.
You should use cat.
bib <- "\\cite"
cat(bib)
# \cite
You can remove the ## and [1] by setting a few options in knitr. Here is an example chunk:
<<newChunk,echo=FALSE,comment=NA,background=NA>>=
bib <- "\\cite"
cat(bib)
@
which gets you \cite. Note as well that you can set these options globally.
There is no backslash in the character element "\cite". The backslash is interpreted as the start of an escape sequence, and "\c" is not a recognized escape, hence the error. See ?Quotes. The second version has only one backslash followed by 4 alphabetic characters. Count the characters to see this:
nchar("\\cite")
[1] 5
OK,
<<echo=FALSE,results='asis'>>=
result <- cat(rbib)
@
does the trick (without the result <- bit, [1] is added). It just feels kludgy.
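If the goal is to hand the text to LaTeX outside of knitr, the string can be written out as it is, since it already holds a single backslash. A minimal sketch, assuming a temporary .tex file is an acceptable stand-in for the real target (tex_file and the other names are illustrative):

```r
bib <- "\\cite"

# Write the raw characters to a file, as LaTeX would read them
tex_file <- tempfile(fileext = ".tex")
writeLines(bib, tex_file)

# Reading the file back gives a 5-character string: one backslash plus "cite"
on_disk <- readLines(tex_file)
n_chars <- nchar(on_disk)   # 5

# capture.output() turns what cat() prints into a string again;
# it is the same value as bib, so nothing needed removing
captured <- capture.output(cat(bib))
same <- identical(captured, bib)   # TRUE
```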

Resources