I work with R and I would like to use the Unicode symbol "dot above" as the thousands separator for data contained in a data frame (not for plotting), for example: 1˙000˙000 instead of 1,000,000.
The code point of "dot above" is U+02D9 (taken from Microsoft Word); when I type the command:
"\u02D9"
the result is the symbol "dot above":
"˙"
I changed the option "scipen":
options(scipen = 10)
and then I tried three different solutions:
1. format(1000000, big.mark = "\u02D9")
2. format(1000000, big.mark = intToUtf8("0x02D9"))
3. library(Unicode)
   format(1000000, big.mark = intToUtf8(as.u_char("02D9")))
but the result is always:
"1™Ë000™Ë000"
Maybe it's an encoding issue (I live in Italy and use Microsoft Windows 7), or maybe the solution is simpler than the ones I tried, but I don't know how to deal with it.
Does anybody know how to do it?
Thanks in advance for any suggestion.
As @Roland mentions, gsub and sub (used by prettyNum) mangle the encoding. Unfortunately you cannot simply reset the encoding afterwards to recover your data, since the mangling converts the separator to two characters, and prettyNum reverses their order:
`Encoding<-`(format(1000, big.mark = "\u02D9"),"UTF-8")
[1] "1\u0099\xcb000"
The best workaround is to format with a safe character, then swap it out and set the encoding afterwards:
`Encoding<-`(gsub(",","\u02d9",format(1e6,big.mark=",")),"UTF-8")
[1] "1˙000˙000"
I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \, so that the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did search SO to see if there were any questions asking "what is an escape character in R", but I only reviewed the first 20 of the 160 answers.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to appear in input or printed output. (And sometimes regex patterns need doubled escapes.) R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double-quotes. If you intended to include single quotes inside a character value that was delimited by single quotes at creation time, they would need to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single-quotes nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review: How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double-escape a special character and the other shows how to use the fixed argument to get around that requirement.
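A minimal sketch of those two approaches:
gsub("\\+", "-", "a+b")              # escape the regex metacharacter
# [1] "a-b"
gsub("+", "-", "a+b", fixed = TRUE)  # or match it literally
# [1] "a-b"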
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
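The short version of that issue, as a sketch: a backslash followed by certain letters starts an escape sequence, so unescaped Windows paths fail to parse.
# "C:\Users\me" fails: \U starts a Unicode escape ("hex digits" error)
p1 <- "C:\\Users\\me"  # double the backslashes
p2 <- "C:/Users/me"    # or use forward slashes, which Windows accepts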
And some further useful reading suggestions: look at the series of help pages whose names start with capital letters. (Since I can never remember which one has which nugget of essential information, I tried ?Syntax first; it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved.) I then realized that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape-sequence letters should be listed.
I have to query an API with URL encoding according to RFC 3986, knowing that I have accented characters in my query.
For instance, this argument:
quel écrivain ?
should be encoded like this:
quel%20%C3%A9crivain%20%3F%0D%0A
Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I get the following encoding:
URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"
The problem is with accented letters: for instance, "é" is converted into "%E9" instead of "%C3%A9"...
I'm struggling with this URL encoding without finding a solution... As I don't control the API, I don't know how it handles the encoding.
A weird thing is that using POST instead of GET leads to a response in which words with accents are cut into two different lines:
"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\
As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, shown where "é" could not be decoded) and "crivain".
This encoding problem is driving me mad; if a brilliant mind could help me I would be very grateful!
Set reserved = TRUE
i.e.
your_string <- "quel écrivain ?"
URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
I do not think I am a brilliant mind, but I still have a possible solution for you. After using URLencode(), it seems that your accented characters are converted into the trailing byte of their Unicode representation, preceded by a %. To make them readable again, you can turn them into "real" Unicode escapes and use the stringi package to unescape them. For your single string the solution worked, at least on my machine; I hope it also works for you.
Please note that I have introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.
You might have to adapt the replacement pattern \\u00 if you also need to cover code points that use more than the last two hex digits, should that be relevant in your case.
library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
#"quel%20%E9crivain%20?"
# replacing % with a single backslash \ would directly give the correct Unicode
# representation, but that does not work since \ is an escape character, hence "\\"
str <- gsub("%", paste0("\\", "u00"), str , fixed = T)
#[1] "quel\\u0020\\u00E9crivain\\u0020?"
#since we have double escapes, we need the unescape function from stringi
#which recognizes double backslash as single backslash for the conversion
str <- stri_unescape_unicode(str)
#[1] "quel écrivain ?"
When doing some textual data cleaning in R, I come across special characters. In order to get rid of them, I have to know their Unicode code points; for example, € is \u20AC. Is it possible to "see" the code point with a function that takes the special character (as a string) as input?
Referring to Cath's comment, iconv can do the job:
iconv("é", toRaw = TRUE)
Then, you may want to unlist and paste with \u00.
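A minimal sketch of that step (assuming a single-byte target encoding such as latin1, so each character maps to one byte and the \u00 prefix fits):
bytes <- unlist(iconv("é", to = "latin1", toRaw = TRUE))
paste0("\\u00", toupper(as.character(bytes)))
# [1] "\\u00E9"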
special_char <- "%"
Unicode::as.u_char(utf8ToInt(special_char))
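With the € example from the question this returns the code point directly (output as printed by the Unicode package):
Unicode::as.u_char(utf8ToInt("€"))
# [1] U+20AC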
I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> " ") to maintain the intended line width of the file (which is crucial for reading these as fixed-width files further down the road).
However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of the R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary; then you can substitute the NULs, e.g., to replace them with spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g., to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
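Since there are about ten such files, the two snippets can be combined into a reusable helper (a sketch; the function name is mine) that does the [NUL][NUL] -> space replacement and writes the cleaned text back out:
fix_nuls <- function(infile, outfile) {
  r <- readBin(infile, raw(), file.info(infile)$size)
  r[r == as.raw(0)] <- as.raw(1)               # stand-in byte for NUL
  txt <- gsub("\01\01", " ", rawToChar(r), fixed = TRUE)
  writeLines(strsplit(txt, "\n", TRUE)[[1]], outfile)
}
fix_nuls("00staff.dat", "00staff.txt")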
I want a character variable in R that takes the value of, let's say, a, and appends " \%" in order to produce a %-sign later in LaTeX.
Usually I'd do something like:
a <- 5
paste(a,"\%")
but this fails:
Error: '\%' is an unrecognized escape in character string starting "\%"
Any ideas? A workaround would be to define another command giving the %-sign in LaTeX, but I'd prefer a solution within R.
As in many other languages, certain characters in strings have a different meaning when they're escaped. One example is \n, which means newline instead of n. When you write \%, R tries to interpret % as a special character and fails to do so. Try escaping the backslash, so that it is treated as a literal backslash:
paste(a, "\\%")
You can read on escape sequences here.
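To convince yourself that the string holds a single backslash, inspect it with nchar() and cat() rather than the default print:
a <- 5
s <- paste(a, "\\%")
nchar(s)     # 4 characters: "5", " ", "\", "%"
# [1] 4
cat(s, "\n")
# 5 \%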
You can also look at the latexTranslate function from the Hmisc package, which will escape special characters in strings to make them LaTeX-compatible:
R> latexTranslate("You want to give me 100$ ? I agree 100% !")
[1] "You want to give me 100\\$ ? I agree 100\\% !"