Split a character string in R on a single backslash [duplicate]

I am trying to extract the part of the string before the first backslash, but I can't seem to get it to work properly.
I have tried multiple ways of getting it to work, based on the manual page for strsplit and after searching online.
In my actual situation the strings are in a dataframe which I get from a database connection but I can simplify the situation with the following:
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\",fixed=FALSE)
Error in strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3", "\\", fixed = FALSE) :
invalid regular expression '\', reason 'Trailing backslash'
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=TRUE)
[[1]]
[1] "BLAAT1\022E:\\BLAAT2\\BLAAT3"
> strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\",fixed=FALSE)
[[1]]
[1] "BLAAT1\022E:" "BLAAT2" "BLAAT3"
The expected output would also split on the \ between BLAAT1 and 022E:
Thanks in advance

When you use a regex with strsplit, a literal backslash must be written as two backslashes in the pattern, because \ is a regex metacharacter used to form regex escapes such as \d and \w. R string literals also treat the backslash as the start of an escape sequence (such as "\r" for a carriage return or "\n" for a newline), so each backslash in the pattern has to be doubled again in the source code.
So "\\" is a single literal \, and the regex that matches one literal backslash, \\, must be written with four backslashes: "\\\\".
Here is a regex that you can use. It splits at a \ or at any non-printable character (the "\022" in your string is an octal escape for a single control character, not a backslash followed by digits):
strsplit("BLAAT1\022E:\\BLAAT2\\BLAAT3","\\\\|[^[:print:]]",fixed=FALSE)
# [1] "BLAAT1" "E:" "BLAAT2" "BLAAT3"
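The two escaping layers can be checked directly. Below is a minimal sketch in base R that reuses the string from the question; the variable names (x, by_backslash, by_fixed, parts) are only illustrative.

```r
x <- "BLAAT1\022E:\\BLAAT2\\BLAAT3"

# "\\" in source code is ONE backslash in memory;
# "\022" is ONE control character (octal 022 = hex 12), not "\" + "022"
nchar("\\")          # 1
charToRaw("\022")    # 12

# Regex split: four backslashes in source -> regex \\ -> one literal backslash
by_backslash <- strsplit(x, "\\\\")[[1]]

# Fixed split: two backslashes in source -> one literal backslash, no regex layer
by_fixed <- strsplit(x, "\\", fixed = TRUE)[[1]]

# Split on a backslash OR any non-printable character, as in the answer
parts <- strsplit(x, "\\\\|[^[:print:]]")[[1]]
print(parts)
```

Both backslash-only splits return three pieces, because the string holds only two real backslashes. The third split point comes from the control character.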

Related

Applying a regular expression to a string in R

I'm just getting to know the language R, previously worked with python. The challenge is to replace the last character of each word in the string with *.
How it should look: for the input example text in string, the result should be exampl* tex* i* strin*
My code:
library(tidyverse)
library(stringr)
string_example = readline("Enter our text:")
string_example = unlist(strsplit(string_example, ' '))
string_example
result = str_replace(string_example, pattern = "*\b", replacement = "*")
result
I get an error:
> result = str_replace(string_example, pattern = "*\b", replacement = "*")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=``)
Please help me solve this task.
Oh, I noticed an error: the pattern should be .\b. With that change the code runs, but nothing in the string gets replaced.
If you mean words consisting of letters only, you can use
string_example <- "example text in string"
library(stringr)
str_replace_all(string_example, "\\p{L}\\b", "*")
## => [1] "exampl* tex* i* strin*"
Details:
\p{L} - a Unicode category (property) class matching any Unicode letter
\b - a word boundary; in this case, it makes sure there is no other word character immediately to the right. It fails the match if the letter matched with \p{L} is immediately followed by a letter, digit or _ (these are all word chars). If you want to limit this to a letter check, replace \b with (?!\p{L}).
Note the backslashes are doubled because in regular string literals backslashes are used to form string escape sequences, and thus need escaping themselves to introduce literal backslashes in string literals.
Some more things to consider
If you do not want to change one-letter words, add a non-word boundary at the start, "\\B\\p{L}\\b"
If you want to avoid matching letters that are followed by a hyphen and another letter (i.e. in some compound words), you can add a lookahead check: "\\p{L}\\b(?!-)".
You may combine the lookarounds and (non-)word boundaries as you need.
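The three patterns above can be tried without stringr. This is a sketch that swaps in base gsub() with perl = TRUE, assuming your R build's PCRE supports Unicode properties such as \p{L} (the usual case); the inputs "a cat" and "well-known fact" are made up for illustration.

```r
x <- "example text in string"

# Replace the last letter of every word
all_words <- gsub("\\p{L}\\b", "*", x, perl = TRUE)
print(all_words)       # "exampl* tex* i* strin*"

# \B at the start: one-letter words are left alone
skip_single <- gsub("\\B\\p{L}\\b", "*", "a cat", perl = TRUE)
print(skip_single)     # "a ca*"

# (?!-) lookahead: a letter directly followed by a hyphen is not replaced
skip_hyphen <- gsub("\\p{L}\\b(?!-)", "*", "well-known fact", perl = TRUE)
print(skip_hyphen)     # "well-know* fac*"
```

perl = TRUE is needed here because lookaheads such as (?!-) are not supported by R's default regex engine.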

How to replace "&" with "\&"

I find it difficult to replace "&" with "\&" using R's base gsub() function -
gsub("&", "\&", "A&B")
Gives below error -
Error: '\&' is an unrecognized escape in character string starting ""\&"
Is there any way to achieve this substitution?
You may use
gsub("&", "\\&", "A&B",fixed=TRUE) # Fixed string replacement
gsub("(&)", "\\\\\\1", "A&B") # Regex replacement
The fixed string replacement is clear: every & is replaced with a \&. The double \ is used in the string literal to denote a literal \.
In the regex replacement, the & is matched and captured into Group 1. Since a backslash is a special character in the regex replacement pattern, it must be doubled, and, keeping in mind that a literal backslash is written as \\ inside a string literal, we need to use \\\\ in the replacement. The \1 is the backreference to the Group 1 value, but again the \ must be doubled in the string literal, hence we use \\1 there. That is why there are 6 backslashes in a row.
The result contains only a single backslash. You can check that with cat or by saving the contents to a text file:
cat(gsub("&", "\\&", "A&B", fixed=TRUE), "\n")
cat(gsub("(&)", "\\\\\\1", "A&B"), "\n")
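To confirm that both calls produce the same four-character string, a short check along these lines may help (fixed_result and regex_result are illustrative names only):

```r
fixed_result <- gsub("&", "\\&", "A&B", fixed = TRUE)   # fixed string replacement
regex_result <- gsub("(&)", "\\\\\\1", "A&B")           # regex replacement

identical(fixed_result, regex_result)   # TRUE
nchar(fixed_result)                     # 4: A, \, &, B

print(fixed_result)       # [1] "A\\&B"  (print escapes the backslash for display)
writeLines(fixed_result)  # A\&B         (the actual content)
```

The doubled backslash shown by print() is only the display form. nchar() and writeLines() show what is really stored.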

Replace punctuation in string

I want to replace the punctuation in a string by adding '\\' before the punctuation. The reason is I will be using regex on the string afterwards and it fails if there is a question mark without '\\' in front of it.
So basically, I would like to do something like this:
gsub("\\?","\\\\?", x)
Which converts a string "How are you?" to "How are you\\?" But I would like to do this for all punctuation. Is this possible?
You can use gsub with the [[:punct:]] regular expression alias as follows:
> x <- "Hi! How are you today?"
> gsub('([[:punct:]])', '\\\\\\1', x)
[1] "Hi\\! How are you today\\?"
Note the replacement starts with '\\\\', which inserts one literal backslash into the result (print() displays it as \\, as you requested), while the '\\1' portion preserves the punctuation mark.
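As a rough illustration of why the escaping matters when the string is later used as a regex, here is a small sketch. The test strings are invented for the example, and only the ? case from the question is exercised.

```r
x <- "How are you?"
escaped <- gsub("([[:punct:]])", "\\\\\\1", x)

writeLines(escaped)   # How are you\?
nchar(x)              # 12
nchar(escaped)        # 13: exactly one backslash was added

# Unescaped, "?" makes the preceding "u" optional, so this matches by accident
unescaped_hit <- grepl(x, "How are yo")        # TRUE

# Escaped, the "?" must be present literally
escaped_miss <- grepl(escaped, "How are yo")                  # FALSE
escaped_hit  <- grepl(escaped, "Well. How are you? Fine.")    # TRUE

# If the whole string is meant literally, fixed = TRUE avoids escaping altogether
literal_hit <- grepl(x, "Well. How are you? Fine.", fixed = TRUE)   # TRUE
```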

using gsub function in r to remove slash

suppose I have a string that has the following characters
"\"------------080209060700030309080805\""
And now I want to use gsub function in r to remove the "\ and \" part, and only keep the following characters:
"------------080209060700030309080805\"
Could anyone help me to figure out how should I do it properly ?
Edit 1: Fixed bug (two backslashes required to create a backslash in a string):
s <- '\\"------------080209060700030309080805\\"'
s
gsub('\\"', "", s, fixed = TRUE)
results in
> s <- '\\"------------080209060700030309080805\\"'
> s
[1] "\\\"------------080209060700030309080805\\\""
> gsub('\\"', "", s, fixed = TRUE)
[1] "------------080209060700030309080805"
Please note that a single backslash in R is the escape code which is NOT part of the string:
> charToRaw('\\"')
[1] 5c 22
> charToRaw('\"')
[1] 22
Therefore you have to use two backslashes in the quoted string to create one backslash internally. If you print this string, the backslash is escaped again, which looks confusing:
> print('\\"')
[1] "\\\""
If you want to print the unescaped content of the string use cat instead of print:
> cat('\\"')
\"
For more see help in R: ?"'":
Character constants
Single and double quotes delimit character constants. They can be used
interchangeably but double quotes are preferred (and character
constants are printed using double quotes), so single quotes are
normally only used to delimit character constants containing double
quotes.
Backslash is used to start an escape sequence inside character
constants. Escaping a character not in the following table is an
error.
Single quotes need to be escaped by backslash in single-quoted
strings, and double quotes in double-quoted strings.
\n          newline
\r          carriage return
\t          tab
\b          backspace
\a          alert (bell)
\f          form feed
\v          vertical tab
\\          backslash \
\'          ASCII apostrophe '
\"          ASCII quotation mark "
\`          ASCII grave accent (backtick) `
\nnn        character with given octal code (1, 2 or 3 digits)
\xnn        character with given hex code (1 or 2 hex digits)
\unnnn      Unicode character with given code (1--4 hex digits)
\Unnnnnnnn  Unicode character with given code (1--8 hex digits)
string <- "\\------------080209060700030309080805\\"
string <- gsub("^\\\\(.*)\\\\$", "\\1", string)
Notes: The pattern I used was ^\\(.*)\\$, which matches everything between a leading and a trailing backslash, so it only matches strings that both begin and end with a backslash. Also, we use four backslashes (\\\\) in the R source to represent one literal backslash in the gsub() pattern, because we need to escape twice: once for R, and a second time for the regex engine.
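A quick way to check the pattern, and to see the double escaping spelled out, is sketched below. The raw string syntax r"(...)" assumes R 4.0.0 or later. The names s, stripped and pattern_raw are illustrative.

```r
s <- "\\------------080209060700030309080805\\"
writeLines(s)          # \------------080209060700030309080805\

# Four backslashes in source -> regex \\ -> one literal backslash
stripped <- gsub("^\\\\(.*)\\\\$", "\\1", s)
writeLines(stripped)   # ------------080209060700030309080805

# A string without the surrounding backslashes does not match and is returned unchanged
untouched <- gsub("^\\\\(.*)\\\\$", "\\1", "abc")

# Raw strings (R >= 4.0.0) remove the R-level escaping, leaving only the regex-level one
pattern_raw <- r"(^\\(.*)\\$)"
identical(pattern_raw, "^\\\\(.*)\\\\$")   # TRUE
```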

Searching a backslash in a string received from external source

I have a string I received from my DB, so in R it looks like:
a <- c("www", "x", "yes", "\303\243")
> a
[1] "www" "x" "yes" "ã"
What I want to do is to find which of the elements has backslash in it.
I tried:
grepl('\\',a[4])
But I keep getting the error
invalid regular expression '\', reason 'Trailing backslash'
no matter whether I use cat or fixed=T.
How do I find that backslash in the list?
You need to escape the backslash twice, once for the string literal in R and once for the regular expression. grepl("\\", a[4]) applies the regexp \, which is invalid, while grepl("\\\\", a[4]) applies the regexp \\, which matches one literal backslash. To view the contents of the string literal you can use cat("\\").
But I think your string does not contain any backslash at all, because in the definition the backslash occurs in an escape sequence, not as a character itself.
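This can be verified by looking at the raw bytes. The following is a small sketch using the vector from the question, plus a made-up string b that does contain a real backslash:

```r
a <- c("www", "x", "yes", "\303\243")

# The octal escapes produce two bytes (c3 a3); neither is 5c, the backslash
charToRaw(a[4])

# Fixed search for one literal backslash: none of the elements contain it
has_backslash <- grepl("\\", a, fixed = TRUE)
print(has_backslash)   # FALSE FALSE FALSE FALSE

# For comparison, a string with a real backslash (hypothetical example)
b <- "C:\\temp"
fixed_hit <- grepl("\\", b, fixed = TRUE)   # TRUE
regex_hit <- grepl("\\\\", b)               # TRUE
```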
