R-package text encoding - special characters encoded incorrectly - r

I've got an R-function along these lines:
swedish.weekday <- function(date = Sys.Date()) {
require(lubridate)
c("Sön", "Mån", "Tis", "Ons", "Tor", "Fre", "Lör")[wday(date)]
}
This returns the three-letter Swedish equivalent of Sun, Mon, Tue, etc.
It works fine until I include the function in a package; during the build the function is transformed into:
swedish.weekday <- function(date = Sys.Date()) {
require(lubridate)
c("SÃ¶n", "MÃ¥n", "Tis", "Ons", "Tor", "Fre", "LÃ¶r")[wday(date)]
}
I've tried setting the encoding options in the project settings to either ISO8859-1 or WINDOWS-1252 but neither works. Using 64 bit R 3.1.2 under Windows 7.
Suspect I'd need to change something in the build config but I'm lost as to what - any help/direction much appreciated!

As per the link posted in the comments above I solved the issue by merely using Unicode escapes as such:
day <- c("S\u00F6n", "M\u00E5n", "Tis", "Ons", "Tor", "Fre", "L\u00F6r")[wday(date)]
Edit: While passing these results to an external system (OLAP), I discovered that it is also necessary to convert them to ISO 8859-15 ("latin-9"), so that they are correct not only on screen but also as far as the external system is concerned: day <- iconv(day, "UTF-8", "latin-9")
For ref...
There is a portable way to have arbitrary text in character strings (only) in your R code, which is to supply them in Unicode as \uxxxx escapes. If there are any characters not in the current encoding the parser will encode the character string as UTF-8 and mark it as such. This applies also to character strings in datasets: they can be prepared using \uxxxx escapes or encoded in UTF-8 in a UTF-8 locale, or even converted to UTF-8 via ‘iconv()’. If you do this, make sure you have ‘R (>= 2.10)’ (or later) in the ‘Depends’ field of the DESCRIPTION file.
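As a minimal sketch of the escape approach inside a package function: the name swedish_weekday and the use of base R's as.POSIXlt() in place of lubridate::wday() are assumptions made here only to keep the example self-contained.

```r
# Hypothetical variant of the function above. The source file stays pure
# ASCII because every non-ASCII letter is written as a \uxxxx escape.
swedish_weekday <- function(date = Sys.Date()) {
  days <- c("S\u00F6n", "M\u00E5n", "Tis", "Ons", "Tor", "Fre", "L\u00F6r")
  # as.POSIXlt()$wday counts 0 = Sunday ... 6 = Saturday, so add 1 to index
  days[as.POSIXlt(date)$wday + 1L]
}

# 2024-01-01 was a Monday
monday <- swedish_weekday(as.Date("2024-01-01"))

# Inspect the bytes: the letter written as \u00E5 should be stored as the
# UTF-8 pair c3 a5
charToRaw(enc2utf8(monday))
```

To find leftover non-ASCII characters in a package source file, tools::showNonASCIIfile() can be run on it before building.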

Related

Bold math characters into moodle with mathml converter

When I have $$\mathbf{x}$$ in my .Rmd file, and use exams2moodle with the pandoc-mathml converter, the xml file contains an "𝐱" character, which needs to be replaced with an "x" character before moodle will import the quiz question (because moodle will give an error saying the file is not UTF-8 without BOM.)
I wonder what are the most practical workarounds? Is this a bug? Thanks!
Minimal example: Here is minimal_example.Rmd
Question
========
Stare hard at the variable.
$$\mathbf{x}$$
What is its value?
Solution
========
If you think hard enough, you will know it is 12.
Meta-information
================
extype: num
exsolution: 12
exname: minimal_example
extol: 0
Here is the minimal_example.r
library("exams")
exams2moodle("minimal_example.Rmd", converter="pandoc-mathml")
And... here is a snippet of the resulting .xml file.
...
<questiontext format="html">
<text><![CDATA[<p>
<p>Stare hard at the variable. <math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mstyle mathvariant="bold"><mi>𝐱</mi></mstyle><annotation encoding="application/x-tex">\mathbf{x}</annotation></semantics></math> What is its value?</p>
</p>]]></text>
</questiontext>
...
If I try importing the XML into my school's Moodle, I get a dmlwriteexception error. If I replace the "𝐱" with "x", the XML imports fine.
I am fairly certain my moodlequiz.xml file does not contain a BOM.
$ file moodlequiz.xml
moodlequiz.xml: XML 1.0 document, UTF-8 Unicode text, with very long lines
$ hexdump -n 3 -C moodlequiz.xml
00000000 3c 3f 78 |<?x|
00000003
I consider this question resolved. Hopefully nobody else has this issue, and I will use one of the proposed workarounds for my own files. Thanks!
TL;DR
exams2moodle(..., converter = "pandoc-mathml") seems to work correctly and produces a UTF-8 encoded XML file moodlequiz.xml. The problem on your end appears to be caused by a BOM (byte order mark) in your XML file. It is unclear to me whether this is introduced through exams2moodle() or through an editor on your end.
Either you can remove the BOM manually or you can avoid the UTF-8 encoding altogether by using exams2moodle(..., converter = "pandoc-mathml-ascii"). The latter requires at least version 2.4-0 of the package.
Replication
Thanks for providing a reproducible example. I ran your example code, both on a Linux machine running in a UTF-8 locale and on a Windows 10 machine, and can confirm that I get exactly the same XML code containing the UTF-8 encoded bold x: 𝐱. However, I have no problem importing that into my Moodle system.
Possible sources of the problem
So I looked up what the Moodle error message is about. Moodle does not accept UTF-8-encoded files with a BOM (byte order mark) at the beginning. Some systems use a BOM at the beginning of a file to declare how the file is encoded. See:
Moodle documentation: https://docs.moodle.org/39/en/UTF-8_and_BOM
Wikipedia with general information: https://en.wikipedia.org/wiki/Byte_order_mark
Neither of the moodlequiz.xml files I produced on the two systems mentioned above has a BOM. So I suspect that either your R setup produces a file with a BOM or the BOM is inserted later, e.g., after opening the XML file with an editor. The Moodle documentation above has some information on what you can do to detect the BOM and get rid of it. Hopefully, this lets you debug the problem on your end. If the BOM was produced by exams2moodle() (as opposed to your editor, for example) and you find out how to avoid that, please let me know.
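If you want to check for a BOM from within R, a rough sketch could look like the following. The helper names has_utf8_bom() and strip_utf8_bom() are hypothetical and not part of the exams package; the code only relies on the fact that a UTF-8 BOM is the byte sequence EF BB BF at the start of the file.

```r
# The three bytes that make up a UTF-8 byte order mark
utf8_bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# TRUE if the file starts with a UTF-8 BOM (hypothetical helper)
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, utf8_bom)
}

# Rewrite the file without its leading BOM, if it has one (hypothetical helper)
strip_utf8_bom <- function(path, out = path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) >= 3L && identical(bytes[1:3], utf8_bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, out)
  invisible(out)
}

# Demonstration on a temporary file that deliberately starts with a BOM
tmp <- tempfile(fileext = ".xml")
writeBin(c(utf8_bom, charToRaw("<?xml version=\"1.0\"?>")), tmp)
before <- has_utf8_bom(tmp)
strip_utf8_bom(tmp)
after <- has_utf8_bom(tmp)
```

Working on raw bytes avoids any re-encoding by R, so the rest of the file is left exactly as it was.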
Alternative solution
In principle it is possible to replace the UTF-8 encoded characters by the corresponding HTML entities. For example, in this particular case we have a "MATHEMATICAL BOLD SMALL X" with Unicode U+1D431 (see https://www.w3.org/Math/characters/bold.html). Thus, we can also represent it as &#x1D431; (hexadecimal) or &#119857; (decimal). Then the XML file can be in ASCII while still leading to the same output in HTML.
While pandoc is generally designed to work with UTF-8 throughout, it also has support for (hexa)decimal escapes in certain conversions, see https://pandoc.org/MANUAL.html#option--ascii. And luckily it is possible to combine the --mathml with the --ascii option. There was only a small bug in how R/exams passed on the option to the rmarkdown::pandoc_convert() function, which I just fixed. So you need at least version 2.4-0 of exams and can then do:
exams2moodle(..., converter = "pandoc-mathml-ascii")
which yields a moodlequiz.xml in ASCII instead of UTF-8.
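For readers stuck on an older version, the same idea can be sketched as a post-processing step in base R. The function to_html_entities() below is a hypothetical helper, not something provided by exams or pandoc; it replaces every code point above 127 in a single string with a hexadecimal character reference.

```r
# Replace each non-ASCII character of one string by &#x...; (hypothetical helper)
to_html_entities <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))                  # Unicode code points
  chars <- intToUtf8(cps, multiple = TRUE)       # one string per character
  out <- ifelse(cps > 127L, sprintf("&#x%X;", cps), chars)
  paste(out, collapse = "")
}

# The bold x from the example above, written as an escape (U+1D431)
escaped <- to_html_entities("<mi>\U0001D431</mi>")
escaped
```

Since the function works on one string at a time, a whole file would need something like vapply(lines, to_html_entities, character(1)).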

Convert unicode to a readable string

My object in R contains the following UTF-8 byte escapes, which were extracted from Twitter.
\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d
\xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe
\xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf
\xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95
\xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!'
- \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d
\xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d
I need to convert them to human readable strings. If I just put this in a string, e.g.
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe \xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf \xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95 \xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!' - \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d \xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d"
it displays as an unreadable mess. How can I get it to display using the actual characters?
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding for the current locale on your computer. On most modern Unix-based systems these days, that would be UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
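A short sketch of what the declaration does, using only the first three characters of the string above: Encoding<- does not change any bytes, it only changes how R interprets them.

```r
# First three characters of the tweet, given as raw UTF-8 bytes
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf"
bytes_before <- charToRaw(x)

# Declare (not convert) the encoding
Encoding(x) <- "UTF-8"
bytes_after <- charToRaw(x)

# The bytes are unchanged, but R now sees three Tamil code points
identical(bytes_before, bytes_after)
validUTF8(x)
utf8ToInt(x)   # U+0BA8, U+0B9F, U+0BBF as decimal numbers
```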
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
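The conversion step can be wrapped so that a failed conversion does not go unnoticed, since iconv() returns NA for elements it cannot convert. The helper to_utf8() is hypothetical, and the spelling "ISO-8859-5" is assumed to be accepted by the iconv implementation on your platform.

```r
# Convert from a known source encoding to UTF-8, failing loudly (hypothetical helper)
to_utf8 <- function(x, from) {
  y <- iconv(x, from = from, to = "UTF-8")
  if (anyNA(y[!is.na(x)])) {
    stop("conversion from ", from, " failed for some elements")
  }
  y
}

# The ISO 8859-5 bytes from the example above
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
y <- to_utf8(x, from = "ISO-8859-5")

Encoding(y)    # the result is marked as UTF-8
utf8ToInt(y)   # Cyrillic code points plus one space (32)
```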

problems replacing €-symbol in strings

I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a csv-file and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?), but the € sign is not recognized.
grep("€",file.column.name)
also returns 0 and if I extract the last letter it prints "€" but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why that happens and maybe an idea to solve it? I checked the class of "file.column.name" and it returns "character", also tried to convert it into a character again and stuff like that but didn't help.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
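To illustrate the mismatch, here is a sketch that simulates a CSV header saved as windows-1252, where the euro sign is the single byte 0x80 (that the real file uses this encoding is only an assumption). It converts the header explicitly with iconv() and writes the pattern as a Unicode escape, so the match no longer depends on the encoding of the script file.

```r
# Simulated CSV: header "id,total" followed by byte 0x80 (euro in windows-1252)
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,total"), as.raw(0x80), charToRaw("\n1,9.5\n")), tmp)

# Read the header line as it is stored and convert it explicitly to UTF-8
header_raw  <- readLines(tmp, n = 1L, warn = FALSE)
header_utf8 <- iconv(header_raw, from = "windows-1252", to = "UTF-8")
col_names   <- strsplit(header_utf8, ",", fixed = TRUE)[[1]]

# \u20AC is the euro sign
col_names <- gsub("\u20AC", "[euro]", col_names, fixed = TRUE)
col_names
```

Passing fileEncoding = "windows-1252" to read.csv(), as suggested above, performs the same conversion during import.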

How to specify encoding while creating file?

I am using an R script to create and append to a file. But I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How can I ensure ANSI encoding?
newfile <- '/home/user/abc.ttl'
file.create(newfile)
text3 <- readLines('/home/user/init.ttl')
sprintf('readlines %d', length(text3))
# Append each line of the input to the new file
for (k in seq_along(text3)) {
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}
Encoding can be tricky, since you need to detect your encoding upon input, and then you need to convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you are probably going to lose some non-translatable characters, since there is no mapping from UTF-8 characters to ASCII outside of the lower 7-bit range (the first 128 code points). (Within this range UTF-8 and ASCII are identical.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
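The effect of the sub argument can be sketched on a small made-up vector (the sample strings and the temporary output file are assumptions for illustration, not the poster's data):

```r
# Sample input, marked as UTF-8 through the \u escapes
x <- c("plain text", "caf\u00E9", "\u201Csmart quotes\u201D")

# Default: any element with an untranslatable character becomes NA
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# sub = "" drops the untranslatable bytes instead
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Write the result and confirm that every byte in the file is below 128
out <- tempfile(fileext = ".ttl")
writeLines(dropped, out)
bytes <- readBin(out, what = "raw", n = file.size(out))
all_ascii <- all(as.integer(bytes) < 128L)
all_ascii
```

If silently dropping characters is too lossy, sub = "byte" keeps a visible hex marker for each removed byte.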

Convert a file encoding using R? (ANSI to UTF-8)

I wish to convert an HTML file encoded in ANSI to UTF-8, using R.
Is there a tool, or a combination of tools, that can make this work?
Thanks.
Edit: o.k, I've narrowed my problem to another one. It is re-posted here: Using "cat" to write non-English characters into a .html file (in R)
you can use iconv:
writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF8"), "tmp2.html")
tmp2.html should be utf-8.
Edit by Henrik in June 2015:
A working solution for Windows distilled from the comments is as follows:
writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF8"),
file("tmp2.html", encoding="UTF-8"))
Update 2021: And if ANSI is the current locale, the following works as well (i.e., uses the local encoding as from source):
writeLines(iconv(readLines("tmp.html"), from = "", to = "UTF8"),
file("tmp2.html", encoding="UTF-8"))
I had some problems with the solutions proposed above, especially with the TAB character. This alternative never disappointed me. Unfortunately it only works on UNIX-like systems.
system('iconv -f CP1252 -t UTF-8 < tmp.html > tmp2.html')
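A pure R sketch that should also work on Windows is given below. The function convert_to_utf8() is a hypothetical helper, and treating "ANSI" as windows-1252 is an assumption that depends on where the file came from. The connection is opened in binary mode and written with useBytes = TRUE so that R does not re-encode the already converted text.

```r
# Convert a text file from a given encoding to UTF-8 (hypothetical helper)
convert_to_utf8 <- function(infile, outfile, from = "windows-1252") {
  lines <- readLines(infile, warn = FALSE)
  converted <- iconv(lines, from = from, to = "UTF-8")
  if (anyNA(converted)) {
    stop("some lines could not be converted from ", from)
  }
  con <- file(outfile, open = "wb")
  on.exit(close(con))
  # useBytes = TRUE writes the UTF-8 bytes as they are
  writeLines(converted, con, sep = "\n", useBytes = TRUE)
  invisible(outfile)
}

# Demonstration: "caf<e9><TAB>ok" in windows-1252, where 0xE9 is e-acute
src <- tempfile(fileext = ".html")
dst <- tempfile(fileext = ".html")
writeBin(c(charToRaw("caf"), as.raw(0xE9), charToRaw("\tok\n")), src)
convert_to_utf8(src, dst)
result <- readBin(dst, what = "raw", n = file.size(dst))
result
```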
