How to specify an encoding when creating a file in R?

I am using an R script to create and append to a file, but I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How can I ensure ANSI encoding?
newfile <- '/home/user/abc.ttl'  # use forward slashes: backslash escapes such as \h are invalid in R strings
file.create(newfile)
text3 <- readLines('/home/user/init.ttl')
sprintf('readlines %d', length(text3))
# append each input line to the new file
for (k in 1:length(text3))
{
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}

Encoding can be tricky: you need to detect the encoding on input, and then convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you will probably lose some untranslatable characters, since there is no mapping from UTF-8 characters to ASCII outside the lower 128 code points. (Within that range, UTF-8 and ASCII are identical.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
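For example, a minimal sketch using the path from the question:
text3 <- readLines('/home/user/init.ttl', encoding = "UTF-8")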
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
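Putting the steps together, a minimal sketch of the whole pipeline (paths taken from the question; note that cat() handles the whole character vector at once, so the explicit loop is not needed):
newfile <- '/home/user/abc.ttl'
text3 <- readLines('/home/user/init.ttl', encoding = "UTF-8")  # step 1: read, marking input as UTF-8
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")              # step 2: convert, dropping untranslatables
cat(text3, file = newfile, sep = "\n", append = TRUE)          # step 3: write; output is now ASCII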

Related

Convert Unicode to a readable string

My object in R contains the following Unicode escape sequences, which were extracted from Twitter.
\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d
\xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe
\xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf
\xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95
\xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!'
- \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d
\xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d
I need to convert them to human-readable strings. If I just put this in a string, e.g.
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe \xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf \xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95 \xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!' - \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d \xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d"
it displays as an unreadable mess. How can I get it to display using the actual characters?
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding for the current locale on your computer. On most modern Unix-based systems these days, that would be UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
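A minimal before-and-after sketch (assuming x holds the bytes shown in the question):
Encoding(x)             # typically "unknown" on Windows: R assumes the native encoding
Encoding(x) <- "UTF-8"  # declares the encoding; the underlying bytes are unchanged
x                       # now prints the Tamil text correctly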
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
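For example (the available names vary by platform, so your output may differ):
head(iconvlist())                          # first few encodings iconv() supports here
grep("8859-5", iconvlist(), value = TRUE)  # aliases this platform has for ISO 8859-5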

Problems replacing the €-symbol in strings

I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a CSV file, and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?), but the € sign doesn't get recognized.
grep("€",file.column.name)
also returns 0, and if I extract the last letter it prints "€", but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why that happens, and maybe an idea how to solve it? I checked the class of file.column.name and it is "character"; I also tried converting it to character again and things like that, but it didn't help.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
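A sketch of the workflow; the file name columns.csv is hypothetical, and the fileEncoding value must match what you found for your file. Writing the euro sign as the escape \u20AC keeps the script itself immune to source-encoding problems:
d <- read.csv("columns.csv", fileEncoding = "windows-1252")
names(d) <- gsub("\u20AC", "[euro]", names(d), fixed = TRUE)  # \u20AC is the euro sign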

R-package text encoding - special characters encoded incorrectly

I've got an R-function along these lines:
swedish.weekday <- function(date = Sys.Date()) {
  require(lubridate)
  c("Sön", "Mån", "Tis", "Ons", "Tor", "Fre", "Lör")[wday(date)]
}
This returns the three-letter Swedish equivalent of Sun, Mon, Tue, etc.
It works absolutely fine until I include the function in a package, where during the build it is transformed into:
swedish.weekday <- function(date = Sys.Date()) {
  require(lubridate)
  c("SÃ¶n", "MÃ¥n", "Tis", "Ons", "Tor", "Fre", "LÃ¶r")[wday(date)]
}
I've tried setting the encoding options in the project settings to either ISO8859-1 or WINDOWS-1252 but neither works. Using 64 bit R 3.1.2 under Windows 7.
Suspect I'd need to change something in the build config but I'm lost as to what - any help/direction much appreciated!
As per the link posted in the comments above, I solved the issue simply by using Unicode escapes, like so:
day <- c("S\u00F6n", "M\u00E5n", "Tis", "Ons", "Tor", "Fre", "L\u00F6r")[wday(date)]
Edit: While passing these results to an external system (OLAP), I discovered it is also necessary to convert the encoding of the results to ISO ("latin-9") to ensure they are correct not only on screen but also as far as that system is concerned: day <- iconv(day, "UTF-8", "latin-9")
For ref...
There is a portable way to have arbitrary text in character strings (only) in your R code, which is to supply them in Unicode as \uxxxx escapes. If there are any characters not in the current encoding the parser will encode the character string as UTF-8 and mark it as such. This applies also to character strings in datasets: they can be prepared using \uxxxx escapes or encoded in UTF-8 in a UTF-8 locale, or even converted to UTF-8 via ‘iconv()’. If you do this, make sure you have ‘R (>= 2.10)’ (or later) in the ‘Depends’ field of the DESCRIPTION file.
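A small sketch of that mechanism; how the resulting string is marked depends on your locale:
x <- "S\u00F6n"
x            # prints "Sön"
Encoding(x)  # "UTF-8" in a UTF-8 locale, "latin1" where ö is in the native encoding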

Accentuation in R

I was wondering if there is a short way to recover the accents of a character string in R. For instance, when the accented string is "Université d'Aix-Marseille", my script gives me "Universit%C3%A9 d%27Aix-Marseille". Is there a function or algorithm to get the former directly?
I should point out that the file from which I get all my character strings is encoded in UTF-8.
Sincerely yours.
You can get and set the encoding of a character vector like this:
s <- "Université d'Aix-Marseille"
Encoding(s)
# set encoding to utf-8
Encoding(s) <- "UTF-8"
s
If that fixes it, you could change your default encoding to UTF-8.
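To check what encoding your session currently assumes, a sketch (output depends on your system):
Sys.getlocale("LC_CTYPE")  # the locale category that governs character handling
localeToCharset()          # R's guess at the corresponding character set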

CSS character encoding

According to the W3C, a CSS file can declare its character encoding with an @charset rule on the first line. Is it valid to say that I should put @charset "UTF-8"; in every CSS file I make, even if it only contains ASCII characters?
Will there be any performance penalty after I declare UTF-8?
P.S. I can't think of a way to test this.
No, it is not valid to say so as an unqualified statement. If your file contains only ASCII characters, it is very likely that its character encoding is ASCII-compatible (EBCDIC is not much used these days), so the rule would be harmless, but also pointless for as long as the file stays ASCII-only.
What matters is what happens when a non-ASCII character gets inserted into the CSS file, for whatever reason. It could be, for example, an innocent-looking smart quote (”) inserted when editing the file with a program that produces smart quotes. Such a quote is more likely to be inserted in windows-1252 encoding than in UTF-8 encoding, so if the file has the @charset "UTF-8"; rule, the problem probably becomes a bit more difficult to analyze.
If, on the other hand, you know that your CSS file will be edited using software that uses UTF-8 encoding by default, then it is OK to declare it as UTF-8 encoded even while it contains only ASCII characters. For example, if you some day edit the file and add a declaration like content: "“foo”", you might otherwise forget to add the @charset rule at that point.
There is no overhead in declaring the encoding as UTF-8. If the data contains only ASCII characters, any decent routine that reads UTF-8 will process them as fast as a plain ASCII reader: it just checks whether each byte is in the ASCII range and, if so, takes it as standing for the corresponding ASCII character.
