GNU R 3.0.2
> bib <- "\cite"
Error: '\c' is an unrecognized escape in character string starting ""\c"
> bib <- "\\cite"
> print(bib)
[1] "\\cite"
> sprintf(bib)
[1] "\\cite"
>
how can I print out the string variable bib with just one "\"?
(I've tried everything conceivable, and discovered that R treats the "\\" as one character.)
In many cases this is not a problem, since the escaping is handled internally by R, for example when the string is used as text in a plot.
But I need to send it to LaTeX. So I really have to remove it.
I see cat does the trick. If only cat could be made to send its result to a string.
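(In fact, cat's output can be collected into a string: a minimal sketch using capture.output, which returns whatever was printed as a character vector.)
bib <- "\\cite"
s <- capture.output(cat(bib))  # s now holds the 5-character string \cite
s
[1] "\\cite"
nchar(s)
[1] 5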
You should use cat.
bib <- "\\cite"
cat(bib)
# \cite
You can remove the ## and [1] by setting a few options in knitr. Here is an example chunk:
<<newChunk, echo=FALSE, comment=NA, background=NA>>=
bib <- "\\cite"
cat(bib)
@
which gets you \cite. Note as well that you can set these options globally.
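For instance, to set them globally, something like this in a setup chunk should work (a sketch, assuming the knitr package):
knitr::opts_chunk$set(echo = FALSE, comment = NA, background = NA)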
There is no backslash in the character element "\cite". The backslash is being interpreted as introducing an escape: the two-character sequence "\c" is read as an escape sequence (like "\n" for newline), except "\c" is not a recognized escape, which is why you get the error. See ?Quotes. The second version has only one backslash followed by 4 alpha characters. Count the characters to see this:
nchar("\\cite")
[1] 5
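You can also split the string into its individual characters to see them:
strsplit("\\cite", "")[[1]]
[1] "\\" "c"  "i"  "t"  "e"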
OK,
<<echo=FALSE, results='asis'>>=
result <- cat(bib)
@
does the trick (without the result <- bit, [1] is added). It just feels kludgy.
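A possibly less kludgy alternative, assuming a reasonably recent knitr, is knitr::asis_output, which marks the value to be written into the document verbatim:
<<echo=FALSE>>=
knitr::asis_output(bib)
@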
Related
I have a string that looks like:
str<-"a\f\r"
I'm trying to remove the backslashes but nothing works:
gsub("\","",str, fixed=TRUE)
gsub("\\","",str)
gsub("(\)","",str)
gsub("([\])","",str)
...basically all the variations you can imagine. I have even tried the string_replace_all function. ANY HELP??
I'm using R version 3.1.1; Mac OSX 10.7; the dput for a single string in my vector of strings gives:
dput(line)
"ud83d\ude21\ud83d\udd2b"
I imported the file using readLines from a standard .txt file. The content of the file looks something like:
got an engineer booked for this afternoon \ud83d\udc4d all now hopefully sorted\ud83d\ude0a I m going to go insane ud83d\ude21\ud83d\udd2b in utf8towcs …
Thanks.
One quite universal solution is
gsub("\\\\", "", str)
Thanks to the comment above.
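For example:
str <- "a\\f\\r"       # the 5-character string a\f\r
gsub("\\\\", "", str)  # in the regex, \\\\ matches one literal backslash
[1] "afr"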
When inputting backslashes from the keyboard, always escape them.
str <-"this\\is\\my\\string" # note doubled backslashes -> 'this\is\my\string'
gsub("\\", "", str, fixed=TRUE) # ditto
str2 <- "a\\f\\r" # ditto -> 'a\f\r'
gsub("\\", "", str2, fixed=TRUE)# ditto
Note that if you do
str <- "a\f\r"
then str contains no backslashes. It consists of the 3 characters a, \f (which is not normally printable, except as \f), and \r (likewise).
And just to head off a possible question. If your data was read from a file, the file doesn't have to have doubled backslashes. For example, if you have a file test.txt containing
a\b\c\d\e\f
and you do
str <- readLines("test.txt")
then str will contain the string a\b\c\d\e\f as you'd expect: 6 letters separated by 5 single backslashes. But you still have to type doubled backslashes if you want to work with it.
str <- gsub("\\", "", str, fixed=TRUE) # now contains abcdef
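A fully reproducible version of that example, writing the file from R first:
writeLines("a\\b\\c\\d\\e\\f", "test.txt")  # the file now contains: a\b\c\d\e\f
str <- readLines("test.txt")
nchar(str)  # 6 letters plus 5 single backslashes
[1] 11
gsub("\\", "", str, fixed = TRUE)
[1] "abcdef"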
From the dput, it looks like what you've got there is UTF-16 encoded text, which probably came from a Windows machine. According to
https://en.wikipedia.org/wiki/Unicode#Character_General_Category
https://en.wikipedia.org/wiki/UTF-16
it encodes glyphs in the Supplementary Multilingual Plane, which is pretty obscure. I'll guess that you need to supply the argument encoding="UTF-16" to readLines when you read in the file.
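Untested without the original file, but one way to do that is via a connection; a sketch (the file name is a placeholder):
con <- file("myfile.txt", encoding = "UTF-16")  # or "UTF-16LE"/"UTF-16BE", depending on the source
line <- readLines(con)
close(con)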
Since there isn't any direct way of dealing with single backslashes, here is the closest solution to the problem, as provided by David Arenburg in the comments section:
gsub("[^A-Za-z0-9]", "", str) # remove everything besides letters & numbers
This might be helpful :)
require(stringi)
stri_escape_unicode("ala\\ma\\kota")
## [1] "ala\\\\ma\\\\kota"
stri_unescape_unicode("ala\\ ma\\ kota")
## [1] "ala ma kota"
As of R 4.0.0, you can use raw strings to avoid confusion with backslashes; the syntax is r"(your_raw_expression)" (parentheses included):
str <- r"(ud83d\ude21\ud83d\udd2b)" # equivalent of "ud83d\\ude21\\ud83d\\udd2b"
gsub(r"(\\)", "", str)
# [1] "ud83dude21ud83dudd2b"
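Raw strings also accept alternative delimiters, such as r"{...}" and r"[...]" (or dashes, as in r"---(...)---"), for when the text itself contains )":
r"(C:\Users\me\Documents)"
[1] "C:\\Users\\me\\Documents"
r"{a\f\r}"
[1] "a\\f\\r"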
I have a data set called event_table that has a column titled "c.Comments" which contains strings mostly in english, but has some arabic in some of the comment entries. I want to filter out rows in which the comments entry contains arabic characters.
I read the data into R from an xlsx file and the arabic characters show as UTF-8 "< U+4903 >< U+483d >" (with no spaces) etc.
I've tried using regular expressions to achieve what I want, but the strings I'm trying to match refuse to be filtered out. I've tried all kinds of different regular expressions, but none of them seem to do the trick. I even tried filtering out the literal "<U+":
event_table <- event_table %>%
filter(!grepl("<U+", c.Comments, fixed = TRUE))
event_table <- event_table %>%
filter(!grepl("<U\\+", c.Comments))
"\x", "\d\d\d\d", and all sorts of other combinations have done nothing for me
I'm starting to suspect that my method of filtering may be the issue rather than the regular expression, so any suggestions would be greatly appreciated.
Arabic chars can be detected with grep/grepl using a PCRE regex like \p{Arabic}:
> df <- data.frame(x=c("123", "abc", "ﺏ"))
> df
x
1 123
2 abc
3 <U+FE8F>
> grepl("\\p{Arabic}", df$x, perl=TRUE)
[1] FALSE FALSE TRUE
In your case, the code will look like
event_table <- event_table %>%
filter(!grepl("\\p{Arabic}", c.Comments, perl=TRUE))
Look at the ?Quotes help page. The Unicode character associated with <U+4903> may vary with the assumed codepage. On my machine the R character would be created with the string "\u4903", but it prints as a Chinese glyph. R also offers the PCRE regex engine (via perl=TRUE), as documented in the ?regex help page, which you should refer to now.
The pattern in this grepl expression will filter out the printing non-ASCII characters:
grepl("[[:alnum:]]|[[:punct:]]", "\u4903")
[1] FALSE
And I don't think you should be negating that grepl result:
dplyr::filter(data.frame("\u4903"), grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
[1] X.䤃.
<0 rows> (or 0-length row.names)
dplyr::filter(data.frame("\u4903"), !grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
X.䤃.
1 䤃
I have an R tibble with UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one two three. There are, however, problems when I try to use this string in a RDBMS query which I construct in R and send to the database.
If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one â€two‬ three.
A partial solution that works would be conversion to ASCII:
iconv(my_string, "UTF-8", "ASCII", sub = "")
, but all non-ASCII characters are lost here.
Conversion from UTF-8 to UTF-8 doesn't solve my problem:
iconv(my_string, "UTF-8", "UTF-8", sub = "").
Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding?
That is:
how can I convert my string to the form that I see when I print it out in R (without hidden parts)?
Not sure I completely understand what you are trying to do, but you can use stringi or stringr to explicitly specify which characters you want to retain. For your example, it could look something like this. You may have to expand the set of retained characters, but this approach is one option:
library(stringr)
my_string <- "one â€two‬ three"
# Specifying that you only want upper and lowercase letters,
# numbers, punctuation, and whitespace.
str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]")
[1] "one two three"
# Just checking
stringi::stri_enc_isutf8(str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]"))
[1] TRUE
EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
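Another angle worth testing, rather than whitelisting what to keep: many invisible characters (zero-width spaces, directional marks and the like) belong to the Unicode "format" category Cf, which PCRE can match directly. Whether it catches everything in your particular data needs checking:
gsub("\\p{Cf}", "", my_string, perl = TRUE)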
You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.
badString <- "one \u200Btwo\u200B three"
chars <- strsplit(badString, "")[[1]] # Assume badString has one entry; if not, add a loop
chars <- chars[nchar(chars, type = "width") > 0]
goodString <- paste(chars, collapse = "")
Both badString and goodString look the same when printed:
> badString
[1] "one two three"
> goodString
[1] "one two three"
but they have different numbers of characters:
> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
I scraped data and received some character variables containing a narrow no break space (unicode U+202F). The resulting character variable shows up fine in R if it is in a vector. For example, the return of test shows up with a narrow space in the console:
test <- "variable1 variable2"
<br>
test
(html code here because the code environment does not show the narrow space)
However, if I add the vector to a list/data frame/tibble, it shows up as variable1<U+202F>variable2 . If I save this data frame as a csv file with fileEncoding = "UTF-8" and open it with the corresponding encoding, still shows up in the observations. My workaround right now is to use gsub but I am wondering what I am doing wrong.
The offender is format.default:
test <- "variable1\u202Fvariable2"
print(test)
[1] "variable1 variable2"
format(test)
#[1] "variable1<U+202F>variable2"
format gets called by format.data.frame which in turn is called by print.data.frame.
A solution might be to define a character method:
format.character <- function(x, ...) x
DF <- data.frame(x = 1:5) #beware of stringsAsFactors
DF$test <- test
DF #spaces are actually thin spaces in R console
# x test
#1 1 variable1 variable2
#2 2 variable1 variable2
#3 3 variable1 variable2
#4 4 variable1 variable2
#5 5 variable1 variable2
Obviously, such a simple method will break functions relying on other format arguments.
OTOH, why do you care how thin spaces are printed?
Anybody having the same problem: there is a package called textclean which replaces or removes non-ASCII characters via replace_non_ascii().
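A sketch, assuming textclean is installed (the exact replacement may depend on the transliteration tables available on your system):
library(textclean)
replace_non_ascii("variable1\u202Fvariable2")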
One method is to convert all non-ASCII characters to spaces using gsub:
text <- "variable1\u202Fvariable2"
new_text <- gsub("[^\x20-\x7E]", " ", text)
Here I match the negation of all commonly used ASCII characters, ranging from hex code 20 (SPACE) to 7E (~). The disadvantage of this method is that you might unintentionally remove more than what you wish, but you can always add exclusions to the character class.
Output:
> format(text)
[1] "variable1<U+202F>variable2"
> format(new_text)
[1] "variable1 variable2"
I'm trying to paste a series of string of characters like this:
paste0("//*[#id=",'"set_',1,'_div"]/a')
[1] "//*[#id=\"set_1_div\"]/a"
How can I get rid of the "\"? This is my expected outcome
[1] "//*[#id="set_1_div"]/a"
Thanks a lot
The backslash designates that the next character is 'escaped', i.e., it is to be interpreted as a literal character rather than as part of an expression. When a character string is printed with print, it is quoted, and therefore the escape sign (backslash) is included. However, using cat you can easily see that the backslashes are not actually part of the character string:
> x <- paste0("//*[@id=",'"set_',1,'_div"]/a')
> x
[1] "//*[@id=\"set_1_div\"]/a"
> cat(x)
//*[@id="set_1_div"]/a
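writeLines behaves the same way and is handy when the string has to go to a file or connection rather than the console:
> writeLines(x)
//*[@id="set_1_div"]/a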