I have a string that looks like:
str<-"a\f\r"
I'm trying to remove the backslashes but nothing works:
gsub("\","",str, fixed=TRUE)
gsub("\\","",str)
gsub("(\)","",str)
gsub("([\])","",str)
...basically all the variations you can imagine. I have even tried the string_replace_all function. ANY HELP??
I'm using R version 3.1.1; Mac OSX 10.7; the dput for a single string in my vector of strings gives:
dput(line)
"ud83d\ude21\ud83d\udd2b"
I imported the file using
readLines from a standard
.txt file. The content of the file looks something like:
got an engineer booked for this afternoon \ud83d\udc4d all now hopefully sorted\ud83d\ude0a I m going to go insane ud83d\ude21\ud83d\udd2b in utf8towcs …
Thanks.
One quite universal solution is
gsub("\\\\", "", str)
Thanks to the comment above.
When inputting backslashes from the keyboard, always escape them.
str <-"this\\is\\my\\string" # note doubled backslashes -> 'this\is\my\string'
gsub("\\", "", str, fixed=TRUE) # ditto
str2 <- "a\\f\\r" # ditto -> 'a\f\r'
gsub("\\", "", str2, fixed=TRUE)# ditto
Note that if you do
str <- "a\f\r"
then str contains no backslashes. It consists of the 3 characters a, \f (which is not normally printable, except as \f, and \r (same).
And just to head off a possible question. If your data was read from a file, the file doesn't have to have doubled backslashes. For example, if you have a file test.txt containing
a\b\c\d\e\f
and you do
str <- readLines("test.txt")
then str will contain the string a\b\c\d\e\f as you'd expect: 6 letters separated by 5 single backslashes. But you still have to type doubled backslashes if you want to work with it.
str <- gsub("\\", "", str, fixed=TRUE) # now contains abcdef
From the dput, it looks like what you've got there is UTF-16 encoded text, which probably came from a Windows machine. According to
https://en.wikipedia.org/wiki/Unicode#Character_General_Category
https://en.wikipedia.org/wiki/UTF-16
it encodes glyphs in the Supplementary Multilingual Plane, which is pretty obscure. I'll guess that you need to supply the argument encoding="UTF-16" to readLines when you read in the file.
Since there isn't any direct ways to dealing with single backslashes, here's the closest solution to the problem as provided by David Arenburg in the comments section
gsub("[^A-Za-z0-9]", "", str) #remove all besides the alphabets & numbers
This might be helpful :)
require(stringi)
stri_escape_unicode("ala\\ma\\kota")
## [1] "ala\\\\ma\\\\kota"
stri_unescape_unicode("ala\\ ma\\ kota")
## [1] "ala ma kota"
As of R 4.0.0, you can now use raw strings to avoid confusion with backlashes, just use the following syntax: r"(your_raw_expression)" (parentheses included):
str<-r"(ud83d\ude21\ud83d\udd2b)" #Equivalent of "ud83d\\ude21\\ud83d\\udd2b"
gsub(r"(\\)", "", str)
# [1] "ud83dude21ud83dudd2b"
Related
I have an R tibble with UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one two three. There are, however, problems when I try to use this string in a RDBMS query which I construct in R and send to the database.
If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one â€two‬ three.
A partial solution that works would be conversion to ASCII:
iconv(my_string, "UTF-8", "ASCII", sub = "")
, but all non-ASCII characters are lost here.
Conversion from UTF-8 to UTF-8 doesn't solve my problem:
iconv(my_string, "UTF-8", "UTF-8", sub = "").
Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding?
That is:
how can I convert my string to the form that I see when I print it out in R (without hidden parts)?
Not sure I completely understand what you are trying to do, but you can use stringi or stringr to explicitly specify what characters you want to retain. For your example, it could look something like this. You may have to expand the characters you want to retain, but this approach is one option:
library(stringr)
my_string <- "one â€two‬ three"
# Specifying that you only want upper and lowercase letters,
# numbers, punctuation, and whitespace.
str_remove_all(my_string, "[^A-z|0-9|[:punct:]|\\s]")
[1] "one two three"
# Just checking
stringi::stri_enc_isutf8(str_remove_all(my_string, "[^A-z|0-9|[:punct:]|\\s]"))
[1] TRUE
EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.
badString <- "one \u200Btwo\u200B three"
chars <- strsplit(badString, "")[[1]] # Assume badString has one entry; if not, add a loop
chars <- chars[nchar(chars, type = "width") > 0]
goodString <- paste(chars, collapse = "")
Both badString and goodString look the same when printed:
> badString
[1] "one two three"
> goodString
[1] "one two three"
but they have different numbers of characters:
> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
I need to insert a data frame into a SQL database. I've build the script (using loops, str_c, RODBC) to transform my data frame into a SQL Insert command, but I've run into the problem with a single "'" breaking the SQL.
Here is an example of the problem:
The Data Frame looks like this:
pk b
1 o'keefe
The desired SQL output is: INSERT INTO table (pk, b) (1, 'o\'keefe')
gsub("'", "\'", str_replace_na(df$b[1], ""))
[1] "o'keefe"
gsub("'", "\\\\'", str_replace_na(df$b[1], ""))
[1] "o\\'keefe"
I've tried str_replace, str_replace_all, gsub w/ fixed = TRUE and perl = TRUE and I get the same result.
I am aware of the comment on How to give Backslash as replacement in R string replace, which states cat() shows the slash. But this doesn't carry over to my data frame or SQL query.
Any help on this problem would be greatly appreciated!
Additional note, I am aware the R prints a double backslash as referenced http://r.789695.n4.nabble.com/gsub-replacing-double-backslashes-with-single-backslash-td4453328.html and R: How to replace space (' ') in string with a *single* backslash and space ('\ ') even though only one slash really exists. However, my SQL statement still won't work when zero or two backslashes are present.
"o\\'keefe" is in fact what you want: the double blackslash is in fact a representation of a single backslash.
For instance:
\U005C is the unicode character for the backslash. Yet:
"\U005C"
[1] "\\"
While \U002F is the unicode character for the forward slash and:
"\U002F"
[1] "/"
So your second solution already gave you what you want. Removing the unnecessary str_replace_na():
gsub("'", "\\\\'", df$b[1])
[1] "o\\'keefe"
Note: credit in fact goes to #Rui Barradas who showed that the double backslash represents a single backslash with:
nchar("\\")
[1] 1
Try putting the single quote inside ['].
x <- "o'keefe"
y <- gsub("[']", "\\\\'", s)
y
#[1] "o\\'keefe"
This seems to have added two characters to the string but no, there is just one \.
nchar(x)
#[1] 7
nchar(y)
#[1] 8
I want to escape single quotation marks in a vector (for example to execute a SQL query), but I'm unable to do so.
Example code:
strings <- c("ef'gh")
strings2 <- gsub("'", "\'", strings)
cat(strings2)
print(strings2)
Expected output of either cat or print:
ef\'gh
"ef\'gh"
But none of the above. I tried multiple other combinations already, with different ways of escaping quotation, without success.
You may use the following code:
strings <- c("ef'gh")
strings2 <- sub("\'", "\\\\'", strings)
cat(strings2)
[1] ef\'gh
print(strings2)
[1] "ef\\'gh"
I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to go here
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired out put. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind, it says that any match must be preceded by either a semi-colon or the beginning of the string, but what is matched is not included in the part to be replaced. The * will match 0 or more spaces (since your example string had one case where there was space between the semi-colon and the lab. It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is by default greedy, this will match everything up to, but not including' the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful that you only match the delimiter on one side (the first in my example) since the match consumes the delimiter (not with the look-ahead or look-behind), if you consume both delimiters, then the next will be skipped and only every other field will be considered for replacement.
I'd recommend doing this in two steps:
Split the string by the delimiters
Do the replacements
(optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = T))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = T))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probabably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste0(fixedApts, collapse = "") to collapse the vector into a single string at the end if that's what you need to do.
GNU R 3.02
> bib <- "\cite"
Error: '\c' is an unrecognized escape in character string starting ""\c"
> bib <- "\\cite"
> print(bib)
[1] "\\cite"
> sprintf(bib)
[1] "\\cite"
>
how can I print out the string variable bib with just one "\"?
(I've tried everything conceivable, and discover that R treats the "\\" as one character.)
I see that in many cases this is not a problem, since this is usually handled internally by R, say, if the string were to be used as text for a plot.
But I need to send it to LaTeX. So I really have to remove it.
I see cat does the trick. If cat could only be made to send its result to a string.
You should use cat.
bib <- "\\cite"
cat(bib)
# \cite
You can remove the ## and [1] by setting a few options in knitr. Here is an example chunk:
<<newChunk,echo=FALSE,comment=NA,background=NA>>=
bib <- "\\cite"
cat(bib)
#
which gets you \cite. Note as well that you can set these options globally.
There is no backslash in the character element "\cite". The backslash is being interpreted as an escape and the two character "\c" is being interpreted as a cntrl-c. Except that is not a recognized character. See ?Quotes. The second version has only one backslash followed by 4 alpha characters. Count the characters to see this:
nchar("\\cite")
[1] 5
OK,
<<echo=FALSE,result='asis'>>
result <- cat(rbib)
#
does the trick (without the result <- bit, [1] is added). It just feels kludgy.