Searching for a backslash in a string received from an external source in R

I have a string I received from my DB, so in R it looks like:
a <- c("www", "x", "yes", "\303\243")
> a
[1] "www" "x" "yes" "ã"
What I want to do is find which of the elements has a backslash in it.
I tried:
grepl('\\',a[4])
But I keep getting the error
invalid regular expression '\', reason 'Trailing backslash'
no matter whether I use cat or fixed=T.
How do I find that backslash in the list?

You need to escape the backslash twice: once for the string literal in R and once for the regular expression. grepl("\\", a[4]) applies the regexp \ (an incomplete escape, hence the 'Trailing backslash' error), while grepl("\\\\", a[4]) applies the regexp \\, which matches one literal backslash. To see the characters a string literal actually contains, you can use cat("\\").
But I think your string does not contain any backslash at all: in the definition, the backslash occurs as part of an escape sequence (\303\243 is a pair of octal escapes for the two bytes of ã), not as a character itself.
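A minimal sketch of both ways to test for a literal backslash, in base R. The vector b is a made-up example that really contains a backslash, the helper names has_bs_fixed and has_bs_regex are hypothetical, and useBytes = TRUE is an assumption added so the non-ASCII element is compared byte-wise regardless of locale.

```r
# Vector from the question: "\303\243" is a pair of octal escapes for two bytes
# (the UTF-8 encoding of "ã"), so no backslash character is stored in it.
a <- c("www", "x", "yes", "\303\243")

# Hypothetical vector whose second element holds one real backslash.
b <- c("www", "C:\\temp")

# Fixed-string search: the string literal "\\" is a single backslash character.
has_bs_fixed <- function(x) grepl("\\", x, fixed = TRUE, useBytes = TRUE)

# Regex search: the regex \\ (typed as "\\\\") matches one literal backslash.
has_bs_regex <- function(x) grepl("\\\\", x, useBytes = TRUE)

print(has_bs_fixed(a))  # [1] FALSE FALSE FALSE FALSE
print(has_bs_fixed(b))  # [1] FALSE  TRUE
print(has_bs_regex(b))  # [1] FALSE  TRUE
print(nchar("\\"))      # [1] 1
```

If the data from the database really contained a backslash, print() would show it doubled (as in "C:\\temp"), while cat() or writeLines() would show it once.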

Related

Applying a regular expression to a string in R

I'm just getting to know R; I previously worked with Python. The challenge is to replace the last character of each word in a string with *.
How it should look: for the input example text in string, the result should be exampl* tex* i* strin*
My code:
library(tidyverse)
library(stringr)
string_example = readline("Enter our text:")
string_example = unlist(strsplit(string_example, ' '))
string_example
result = str_replace(string_example, pattern = "*\b", replacement = "*")
result
I get an error:
> result = str_replace(string_example, pattern = "*\b", replacement = "*")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=``)
Please help me solve this task.
Oh, I noticed an error: the pattern should be .\b. With that pattern the code runs, but nothing in the string is replaced.
If you mean words consisting of letters only, you can use
string_example <- "example text in string"
library(stringr)
str_replace_all(string_example, "\\p{L}\\b", "*")
## => [1] "exampl* tex* i* strin*"
Details:
\p{L} - a Unicode category (property) class matching any Unicode letter
\b - a word boundary; in this case, it makes sure there is no other word character immediately on the right. It fails the match if the letter matched with \p{L} is immediately followed by a letter, digit or _ (these are all word chars). If you want to limit this to a letter check, replace \b with (?!\p{L}).
Note the backslashes are doubled because in regular string literals backslashes are used to form string escape sequences, and thus need escaping themselves to introduce literal backslashes in string literals.
Some more things to consider
If you do not want to change one-letter words, add a non-word boundary at the start, "\\B\\p{L}\\b"
If you want to avoid matching letters that are followed with - + another letter (i.e. some compound words), you can add a lookahead check: "\\p{L}\\b(?!-)".
You may combine the lookarounds and (non-)word boundaries as you need.
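The variants above can be tried without any packages. This is a sketch that swaps in base R's gsub() with perl = TRUE (PCRE) for str_replace_all(); the helper name star_last and the two extra input strings are made up for illustration.

```r
# Base R stand-in for the stringr call: perl = TRUE enables PCRE,
# which understands \p{L}, \b, \B and lookaheads.
star_last <- function(x, pattern = "\\p{L}\\b") {
  gsub(pattern, "*", x, perl = TRUE)
}

# Last letter of every word.
print(star_last("example text in string"))
# [1] "exampl* tex* i* strin*"

# Leave one-letter words alone: \B requires a word character on the left.
print(star_last("a cat in hat", "\\B\\p{L}\\b"))
# [1] "a ca* i* ha*"

# Skip a letter that is directly followed by a hyphen.
print(star_last("well-known fact", "\\p{L}\\b(?!-)"))
# [1] "well-know* fac*"

# Why a single backslash fails: in an R string literal, "\b" is the
# backspace character, not the regex word boundary.
print("\b" == "\x08")
# [1] TRUE
```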

How to replace "&" with "\&"

I find it difficult to replace "&" with "\&" using R's base gsub() function -
gsub("&", "\&", "A&B")
Gives below error -
Error: '\&' is an unrecognized escape in character string starting ""\&"
Is there any way to achieve this substitution?
You may use
gsub("&", "\\&", "A&B",fixed=TRUE) # Fixed string replacement
gsub("(&)", "\\\\\\1", "A&B") # Regex replacement
The fixed string replacement is clear: every & is replaced with a \&. The double \ is used in the string literal to denote a literal \.
In the regex replacement, the & is matched and captured into Group 1. Since a backslash is a special character in the regex replacement pattern, it must be doubled, and - keeping in mind a literal backslash is defined with \\ inside a string literal - we need to use \\\\ in the replacement. The \1 is the backreference to the Group 1 value, but again, the \ must be doubled in the string literal; hence, we use \\1 there. That is why there are 6 backslashes in a row.
The result only contains a single backslash, you can easily check that using cat or saving the contents to a text file:
cat(gsub("&", "\\&", "A&B", fixed=TRUE), "\n")
cat(gsub("(&)", "\\\\\\1", "A&B"))
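A short check, as a sketch, that the two calls produce the same four-character string and that the doubled backslash only appears in the printed (escaped) representation. The variable names fixed_result and regex_result are made up.

```r
fixed_result <- gsub("&", "\\&", "A&B", fixed = TRUE)  # fixed-string replacement
regex_result <- gsub("(&)", "\\\\\\1", "A&B")          # regex replacement

# Both approaches give the same string.
print(identical(fixed_result, regex_result))  # [1] TRUE

# A, one backslash, &, B: four characters in total.
print(nchar(fixed_result))                    # [1] 4

# print() shows the escaped form; writeLines() shows the raw characters.
print(fixed_result)                           # [1] "A\\&B"
writeLines(fixed_result)                      # A\&B
```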

Split a character string in R on a single backslash [duplicate]

I am trying to extract the part of the string before the first backslash but I can't seem to get it to work properly.
I have tried multiple ways of getting it to work, based on the manual page for strsplit and after searching online.
In my actual situation the strings are in a dataframe which I get from a database connection but I can simplify the situation with the following:
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=FALSE)
Error in strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3", "\\", fixed = FALSE) :
invalid regular expression '\', reason 'Trailing backslash'
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:\\BLAAT2\\BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=FALSE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
The expected output would also split on the \ between BLAAT1 and 022E:
Thanks in advance
If you use a regex with strsplit function, a literal backslash can be coded as two literal backslashes (as a literal \ is a special regex metacharacter that is used to form regex escapes, like \d, \w, etc.), but since R string literals support string escape sequences (like "\r" for carriage return, "\n" for a newline char) a literal backslash needs to be defined with a double backslash.
So, "\\" is a literal \, and a regex pattern to match a literal backslash char, being \\, should be coded with 4 backslashes, "\\\\".
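Here is a regex that you can use: it splits at a \ or at any non-printable character (the \022 in your string is an octal escape for a single control character, not a backslash followed by digits):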
strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\|[^[:print:]]",fixed=FALSE)
# [1] "BLAAT1" "E:" "BLAAT2" "BLAAT3"
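To make the behaviour of \022 visible and to get the part before the first separator, something like the following sketch may help. The variable names are made up, and sub() is used here as one possible way to cut at the first real backslash.

```r
x <- "BLAAT1\022E:\\BLAAT2\\BLAAT3"

# "\022" is stored as one control character (code 18), so there is
# no backslash between "BLAAT1" and "E:".
print(utf8ToInt(substr(x, 7, 7)))  # [1] 18

# Split at a literal backslash OR at any non-printable character.
parts <- strsplit(x, "\\\\|[^[:print:]]")[[1]]
print(parts)     # [1] "BLAAT1" "E:"     "BLAAT2" "BLAAT3"
print(parts[1])  # [1] "BLAAT1"

# Part before the first real backslash only: the control character stays.
before_bs <- sub("\\\\.*", "", x)
print(identical(before_bs, "BLAAT1\022E:"))  # [1] TRUE
```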

Warning on regex string in Python

So, I am writing a small function to strip all the weird chars from a string, e.g. each of #$& will be replaced with a " "
The chars I am trying to remove are the following, defined into a string:
xChars = r"#$%()'^*\;:/|+_.–°ªº"
However, I keep getting the warning:
Anomalous backslash in string: '\;'. String constant might be missing an r prefix
However, when I use the r prefix, e.g. r"\", Python skips some of the special chars I want to replace. It doesn't produce an error; it just seems to treat those chars as fine and leaves them out.
Any ideas on how to fix this?
Normally a backslash starts an escape sequence, and \; is not a recognized one, so the linter cannot tell whether you meant a literal backslash. Either double the backslash in a regular string, like xChars = "#$%()'^*\\;:/|+_.–°ªº", or keep a single backslash in a raw string, like xChars = r"#$%()'^*\;:/|+_.–°ªº". Do not combine the two: r"...\\..." puts two backslash characters into the string.

Escaping backslash (\) in string or paths in R

Windows copies paths with backslashes (\), which R does not accept as typed. So, I wanted to write a function which would convert \ to /. For example:
chartr0 <- function(foo) chartr('\','\\/',foo)
Then use chartr0 as...
source(chartr0('E:\RStuff\test.r'))
But chartr0 is not working. I guess I am unable to escape /. I guess escaping / may be important on many other occasions.
Also, is it possible to avoid calling chartr0 every time and instead convert all paths automatically, by creating an environment in R that calls chartr0 or by some temporary setting such as options?
From R 4.0.0 you can use r"(...)" to write a path as raw string constant, which avoids the need for escaping:
r"(E:\RStuff\test.r)"
# [1] "E:\\RStuff\\test.r"
There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.
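As a sketch of the raw-string syntax (it assumes R 4.0.0 or later; the variable names and the second example string are made up), the following shows that a raw constant and the escaped constant are the same string, and how the alternative delimiters described in ?Quotes work:

```r
# Requires R >= 4.0.0.
raw_path     <- r"(E:\RStuff\test.r)"
escaped_path <- "E:\\RStuff\\test.r"

# Same characters; only the way of typing them differs.
print(identical(raw_path, escaped_path))  # [1] TRUE
writeLines(raw_path)                      # E:\RStuff\test.r

# Square brackets (or braces) may be used instead of parentheses.
bracket_path <- r"[E:\RStuff\test.r]"
print(identical(bracket_path, escaped_path))  # [1] TRUE

# Dashes around the delimiters allow the sequence )" inside the string.
tricky <- r"-(He said )" and left)-"
print(identical(tricky, "He said )\" and left"))  # [1] TRUE
```

Raw constants only change how a literal is typed in source code; a path read from a file or the clipboard already contains single backslashes and needs no such treatment.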
Your fundamental problem is that R will signal an error condition as soon as it sees a single back-slash before any character other than a few lower-case letters, backslashes themselves, quotes or some conventions for entering octal, hex or Unicode sequences. That is because the interpreter sees the back-slash as a message to "escape" the usual translation of characters and do something else. If you want a single back-slash in your character element you need to type 2 backslashes. That will create one backslash:
nchar("\\")
#[1] 1
The "Character vectors" section of An Introduction to R says:
"Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \\ is entered and printed as \\, and inside double quotes " is entered as \". Other useful escape sequences are \n, newline, \t, tab and \b, backspace; see ?Quotes for a full list."
?Quotes
chartr0 <- function(foo) chartr('\\','/',foo)
chartr0('E:\\RStuff\\test.r')
You cannot write E:\Rxxxx in a regular string literal, because R tries to interpret \R as an escape sequence.
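A small sketch confirming that the corrected chartr() call works and that fixed-string and regex gsub() calls are equivalent alternatives. The function name to_forward_slashes is made up, and the path is the one from the question; nothing is sourced here.

```r
# Convert every backslash to a forward slash.
# "\\" in the source code is ONE backslash character in the string.
to_forward_slashes <- function(path) chartr("\\", "/", path)

win_path <- "E:\\RStuff\\test.r"
print(nchar(win_path))               # [1] 16
print(to_forward_slashes(win_path))  # [1] "E:/RStuff/test.r"

# Equivalent alternatives with gsub():
print(gsub("\\", "/", win_path, fixed = TRUE))  # fixed string: one backslash
print(gsub("\\\\", "/", win_path))              # regex \\ : one literal backslash
```

The forward slash needs no escaping anywhere in these calls; only the backslash does.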
The problem is that every single forward slash and backslash in your code is escaped incorrectly, resulting in either an invalid string or the wrong string being used. You need to read up on which characters need to be escaped and how. Take a look at the list of escape sequences in the link below. Anything not listed there (such as the forward slash) is treated literally and does not require any escaping.
http://cran.r-project.org/doc/manuals/R-lang.html#Literal-constants
