I want to show a block ASCII character █ (its ASCII code is 219).
How can I show it in the terminal?
I am using RGui on WinXP.
You can use backslash to escape otherwise unprintable characters:
print("\245")
displays the Yen character (¥) on my gui. The 245 is in octal format, so the above expression is printing out ASCII (or whatever encoding the GUI is using) character 165.
219 is 333 in octal, but
print("\333")
prints out the Û character on my gui.
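To double-check the octal arithmetic above, here is a small sketch using base R's strtoi() and as.octmode(). The variable name octal_219 is only illustrative.

```r
# Octal escape digits -> decimal character code
strtoi("245", base = 8L)   # 165
strtoi("333", base = 8L)   # 219

# Decimal character code -> octal digits to put after the backslash
octal_219 <- format(as.octmode(219L))
octal_219                  # "333"
```

Which glyph a given byte maps to still depends on the encoding the GUI uses, which is why "\333" shows Û rather than a block here.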
A few (but by no means all) Unicode characters are also supported on the R GUI:
cyrillic_d <- "\u0414"
print(cyrillic_d)
outputs Д.
Following mobrule, the following works on R running in a UTF-8 locale on Linux:
> "\u258A"
[1] "▊"
This works on Windows
> "\u2588"
[1] "█"
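If you would rather build the character from its code point than type an escape, something like the following should work. It is only a sketch: the names block, block_bytes and bar are made up for illustration, and whether the glyph actually renders depends on your console font and locale.

```r
# U+2588 FULL BLOCK, built from its code point (same string as "\u2588")
block <- intToUtf8(0x2588)

# Its UTF-8 representation is three bytes: e2 96 88
block_bytes <- charToRaw(block)
block_bytes

# Repeat it to draw a simple bar (strrep() needs R >= 3.3.0)
bar <- strrep(block, 10)
cat(bar, "\n")
```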
Related
Running R CMD check --as-cran gives
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.
Use \uxxxx escapes for other characters.
What are \uxxxx, and more importantly, how can I convert non ASCII characters into them?
What I know so far
?iconv is very informative, and looks powerful, but I see nothing of the form \u
this python documentation indicates \uxxxx are
Character with 16-bit hex value xxxx (Unicode only)
Question
How can I convert non-ASCII characters into character representations of the form \uxxxx
Some examples c("¤", "£", "€", "¢", "¥", "₧", "ƒ")
You have stri_escape_unicode from stringi to escape unicode:
stringi::stri_escape_unicode(c("¤", "£", "€", "¢", "¥", "₧", "ƒ"))
## [1] "\\u00a4" "\\u00a3" "\\u20ac" "\\u00a2" "\\u00a5" "P" "\\u0192"
I have an addin based on that function which escapes non-ASCII characters between quotes; it is available here: https://github.com/dreamRs/prefixer
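If you don't want the stringi dependency, a rough base-R equivalent can be put together from utf8ToInt() and sprintf(). This is a sketch, not a replacement for stri_escape_unicode: to_u_escape is a hypothetical helper name, it does not escape backslashes or quotes, it does not handle NA, and code points above 0xFFFF would need the \U form instead of \u.

```r
# Hypothetical helper: replace every non-ASCII character with a \uxxxx escape.
to_u_escape <- function(x) {
  vapply(x, function(s) {
    cps <- utf8ToInt(enc2utf8(s))                  # code points of the string
    out <- ifelse(cps < 128L,
                  intToUtf8(cps, multiple = TRUE), # keep ASCII as-is
                  sprintf("\\u%04x", cps))         # escape everything else
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

to_u_escape(c("\u00a3", "\u20ac 5", "plain"))
# print() shows the backslash doubled; cat() shows the single-backslash form
cat(to_u_escape("\u20ac 5"), "\n")
```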
My object in R contains the following UTF-8 byte escapes, which were extracted from Twitter:
\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d
\xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe
\xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf
\xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95
\xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!'
- \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d
\xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d
I need to convert them to human readable strings. If I just put this in a string, e.g.
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe \xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf \xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95 \xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!' - \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d \xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d"
it displays as an unreadable mess. How can I get it to display using the actual characters?
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding for the current locale on your computer. On most modern Unix-based systems these days, that would be UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
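To make the distinction concrete: declaring an encoding only changes the mark on the string, not its bytes. A minimal sketch, using just the first three bytes of the Tamil string (the names before and after are illustrative):

```r
x <- "\xe0\xae\xa8"        # three bytes: UTF-8 for U+0BA8 (Tamil letter NA)
before <- charToRaw(x)

Encoding(x) <- "UTF-8"     # declare how the bytes should be interpreted
after <- charToRaw(x)

identical(before, after)   # TRUE: the bytes themselves are untouched
Encoding(x)                # "UTF-8"
utf8ToInt(x)               # 2984, i.e. 0x0BA8
```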
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
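One way to check the conversion without relying on how the console renders it is to look at the resulting code points. This sketch assumes your iconv build accepts the name "ISO8859-5" (the spelling varies between platforms), so the call is wrapped in tryCatch(); the variable code_points is just for illustration.

```r
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"

# iconv() raises an error for an unsupported encoding name and returns NA
# for input it cannot convert; treat both cases as "no result" here.
y <- tryCatch(iconv(x, from = "ISO8859-5", to = "UTF-8"),
              error = function(e) NA_character_)

if (!is.na(y)) {
  code_points <- sprintf("U+%04X", utf8ToInt(y))
  print(code_points)   # the first entry should be U+0414 (Cyrillic capital DE)
}
```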
From ?Quotes:
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1--4 hex digits)
In the case where the Unicode character has only one or two digits, I would expect these characters to be the same. In fact, one of the examples on the ?Quotes help page shows:
"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
## [1] "Hello World!"
"\u48\u65\u6c\u6c\u6f\u20\u57\u6f\u72\u6c\u64\u21"
## [1] "Hello World!"
However, under Linux, when trying to print a pound sign, I see
cat("\ua3")
## £
cat("\xa3")
## �
That is, the \x hex code fails to display correctly. (This behaviour persisted with any locale that I tried.) Under Windows 7 both versions show a pound sign.
If I convert to integer and back then the pound sign displays correctly under Linux.
cat(intToUtf8(utf8ToInt("\xa3")))
## £
Incidentally, this doesn't work under Windows, since utf8ToInt("\xa3") returns NA.
Some \x characters return NA under Windows but throw an error under Linux. For example:
utf8ToInt("\xf0")
## Error in utf8ToInt("\xf0") : invalid UTF-8 string
("\uf0" is a valid character.)
These examples show that there are some differences between \x and \u forms of characters, which seem to be OS-specific, but I can't see any logic in how they are defined.
What is the difference between these two character forms?
The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:
> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
These two types of escape sequence cannot be mixed in the same string:
> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed
This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (aka. native) encoding:
> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"
This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:
On Linux, your console is probably expecting UTF-8 text, and as 0xA3 is not a valid UTF-8 sequence, it gives you "�".
On Windows, your console is probably expecting Windows-1252 text, and as 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)
If the encoding is set explicitly, the appropriate conversion will take place on Linux:
> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
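Alternatively, you can convert the bytes rather than just declaring their encoding. As a sketch, iconv() from latin1 to UTF-8 should produce exactly the bytes that the \u form gives you (the name u is illustrative):

```r
s <- "\xa3"                                   # one raw byte, encoding "unknown"
u <- iconv(s, from = "latin1", to = "UTF-8")  # re-encode the byte as UTF-8

charToRaw(u)                                  # c2 a3
identical(charToRaw(u), charToRaw("\ua3"))    # TRUE
Encoding(u)                                   # "UTF-8"
```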
I want to write to a file with UTF-8 encoding containing the character
10001100 which is Œ the Latin capital ligature OE in extended ASCII table,
zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)
When I open the file in Office (encoding = UTF-8), I can see Œ. Why can't I read it back with readBin?
zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)
There are multiple difficulties here.
Firstly, there are actually several "Extended ASCII" tables. Since you are on Windows you are probably using CP1252 which is one of them, also called Windows-1252 or ANSI, and the Win default "latin" encoding. However the code for Œ varies within this family of tables. In CP1252, "Œ" is represented by 10001100 or "\x8c", as you wrote. However it does not exist in ISO-8859-1. And in UTF-8 it corresponds to "\xc5\x92" or "\u0152", as rlegendi indicated.
So, to write UTF-8 from CP1252-as-binary-as-string, you have to convert your string into a "raw" number (the R class for bytes) and then into a character, change its "encoding" from CP1252 to UTF-8 (in fact, convert its byte value to the corresponding one for the same character in UTF-8), then convert it back to raw, and finally write it to the file:
char_bin_str <- '10001100'
char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))),
#               "\x8c"    8c     140    '10001100'
                from="CP1252",
                to="UTF-8")
test.file <- "~/test-unicode-bytes.txt"
zz <- file(test.file, 'wb')
writeBin(charToRaw(char_u), zz)
close(zz)
Secondly, when you readBin(), do not forget to give a number of bytes to read which is big enough (n=file.info(test.file)$size here), otherwise it reads only the first byte (see below):
zz <- file(test.file, 'rb')
x <- readBin(zz, 'raw', n=file.info(test.file)$size)
close(zz)
x
[1] c5 92
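Here is a small self-contained sketch of both points, using temporary files so it can be run anywhere. The names f_text, f_raw, first_only and all_bytes are illustrative. It also shows that writeBin() on a character string writes the characters of the string (plus a trailing nul), not the byte that the digits spell out.

```r
f_text <- tempfile()
f_raw  <- tempfile()

# A character string is written as its characters plus a terminating nul byte.
writeBin("10001100", f_text)
file.size(f_text)                     # 9: eight digit characters and the nul

# A raw vector is written as exactly those bytes.
writeBin(as.raw(c(0xc5, 0x92)), f_raw)

first_only <- readBin(f_raw, "raw")   # n defaults to 1, so only c5 comes back
all_bytes  <- readBin(f_raw, "raw", n = file.size(f_raw))
first_only
all_bytes                             # c5 92

unlink(c(f_text, f_raw))
```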
Thirdly, if in the end you want to turn it back into a character, correctly understood and displayed by R, you have first to convert it into a string with rawToChar(). Now, the way it will be displayed depends on your default encoding, see Sys.getlocale() to see what it is (probably something ending with 1252 on Windows). The best is probably to specify that your character should be read as UTF-8 – otherwise it will be understood with your default encoding.
xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"
xx
[1] "Œ"
This should keep things under control, write the correct bytes in UTF-8, and be the same on every OS. Hope it helps.
PS: I am not exactly sure why in your code x returned c5, and I guess it would have returned c5 92 if you had set n=2 (or more) as a parameter to readBin(). On my machine (Mac OS X 10.7, R 3.0.2 and Win XP, R 2.15) it returns 31, the hex ASCII representation of '1' (the first char in '10001100', which makes sense), with your code. Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there, before coming back to R?
Try this instead (I replaced the binary value with the UTF encoding because I think it is better when you want such an output):
writeBin(charToRaw("\u0152"), zz)
I'm working on reading transcripts of dialogue into R. However, I hit a bump with special characters like curly quotes, en and em dashes, etc. Typically I replace these special characters first, using find-and-replace in a Microsoft product. Usually I replace them with plain text, but on some occasions I want to replace them with other characters (i.e. I replace “ ” with { }). This is tedious and not always thorough. If I could read the transcripts into R as is and then use Encoding to switch their encoding to a recognizable Unicode format, I could gsub them out and replace them with plain text versions. However, the file is read in in some way I don't understand.
Here's an xlsx of what my data may look like:
http://dl.dropbox.com/u/61803503/test.xlsx
This is what is in the .xlsx file
text num
“ ” curly quotes 1
en dash (–) and the em dash (—) 2
‘ ’ curly apostrophe-ugg 3
… ellipsis are uck in R 4
This can be read into R with:
URL <- "http://dl.dropbox.com/u/61803503/test.xlsx"
library(gdata)
z <- read.xls(URL, stringsAsFactors = FALSE)
The result is:
text num
1 “ †curly quotes 1
2 en dash (–) and the em dash (—) 2
3 ‘ ’ curly apostrophe-ugg 3
4 … ellipsis are uck in R 4
So I tried to use iconv to convert to UTF-8:
iconv(z[, 1], "latin1", "UTF-8")
This gives:
[1] "â\u0080\u009c â\u0080\u009d curly quotes" "en dash (â\u0080\u0093) and the em dash (â\u0080\u0094)"
[3] "â\u0080\u0098 â\u0080\u0099 curly apostrophe-ugg" "â\u0080¦ ellipsis are uck in R"
Which makes gsubing less useful.
What can I do to convert these special characters to distinguishable unicode so I can gsub them out appropriately? To be more explicit I was hoping to have z[1, 1] read:
\u201C \u201D curly quotes
To make it even more clear my desired outcome I will webscrape the tables from a page like wikipedia's: http://en.wikipedia.org/wiki/Quotation_mark_glyphs and use the unicode reference chart to replace characters appropriately. So I need the characters to be in unicode or some standard format that I can systematically go through and replace the characters. Maybe it already is and I'm missing it.
PS: I don't save the files as .csv or plain text because the special characters are replaced with ?, hence the use of read.xls. I'm not attached to any particular method of reading in the file (i.e. read.xls) if you've got a better alternative.
Maybe this will help (I'll have access to a Windows machine tomorrow and can probably play with it more at that point if SO doesn't get you the answer first).
On my Linux system, when I do the following:
iconv(z$text, "", "cp1252")
I get:
[1] "\x93 \x94 curly quotes" "en dash (\x96) and the em dash (\x97)"
[3] "\x91 \x92 curly apostrophe-ugg" "\x85 ellipsis are uck in R"
This is not UTF-8, but (I believe) the Windows-1252 (CP1252) byte values shown as hex escapes. Still, if you are able to get to this point also, then you should be able to use gsub the way you intend to.
See this page (reserved section in particular) for conversions.
Update
You can also try converting to an encoding that doesn't have those characters, like ASCII and set sub to "byte". On my machine, that gives me:
iconv(z$text, "", "ASCII", "byte")
# [1] "<e2><80><9c> <e2><80><9d> curly quotes"
# [2] "en dash (<e2><80><93>) and the em dash (<e2><80><94>)"
# [3] "<e2><80><98> <e2><80><99> curly apostrophe-ugg"
# [4] "<e2><80><a6> ellipsis are uck in R"
It's ugly, but UTF-8(e2, 80, 9c) is a left curly double quote, U+201C (each character, I believe, is a set of three values in angled brackets). You can find conversions at this site where you can search by punctuation mark name.
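Rather than looking the byte triples up by hand, you can let R decode them, and once the text is correctly marked as UTF-8 you can replace the characters using \u escapes. This is a sketch under assumptions: the replacement table (from, to) and the helper name clean are invented for illustration, with { } for the double quotes as in the question.

```r
# Decode one byte triple to its Unicode code point
bytes <- as.raw(c(0xe2, 0x80, 0x9c))
ch <- rawToChar(bytes)
Encoding(ch) <- "UTF-8"
sprintf("U+%04X", utf8ToInt(ch))   # "U+201C"

# Hypothetical replacement table: curly quotes, dashes, ellipsis
from <- c("\u201c", "\u201d", "\u2018", "\u2019", "\u2013", "\u2014", "\u2026")
to   <- c("{",      "}",      "'",      "'",      "-",      "--",     "...")

# Hypothetical helper: apply each literal replacement in turn
clean <- function(x) {
  x <- enc2utf8(x)
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

clean("\u201cquoted\u201d \u2013 done\u2026")   # "{quoted} - done..."
```

This only helps once the strings really are UTF-8; if they were read in as mojibake, the bytes need to be fixed first.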
Try
> iconv(z, "UTF-8", "UTF-8")
[1] "c(\"“—” curly quotes\", \"en dash (–) and the em dash (—)\", \"‘—’ curly apostrophe-ugg\", \"… ellipsis are uck in R\")"
[2] "c(1, 2, 3, 4)"
Windows is very problematic with encodings. Maybe you can look at http://www.vmware.com/products/player/ and run Linux.
This works on my windows box. Initial input was as you had. You may have a different experience.