R somehow ignores diacritics when I enter a character string. I am trying to set up UTF-8 everywhere possible, but the problem persists. How do I make R keep the diacritics?
Example:
string <- "Marie Skłodowska-Curie"
string
Returns:
"Marie Sklodowska-Curie"
What I need (note the "ł"):
"Marie Skłodowska-Curie"
The same happens when I load a CSV file in UTF-8 encoding and then try to use strings containing the "ł" character. R replaces it with "l".
For context, I am using RStudio.
Trying to import a database of texts for analysis with RQDA. The database consists of word to text converted files with UTF-8 encoding. RQDA is supposed to read UTF-8, however UTF-8 characters like (ą, č, ę, ė, į, š, ų, ū) are not recognized after import to RQDA.
I'm using the write.FileList function for the import. Its documentation states:
"The file content will be converted to UTF-8 character before being written to *.rqda. The original content can be in any suitable encoding, so you can inspect the content correctly; In other words, the better practices is to used the corresponding encoding (you can get a hint by localeToCharset function) to save the imported files."
write.FileList(FileList, encoding = .rqda$encoding, con = .rqda$qdacon)
addFilesFromDir("C:\\output", pattern = "*.txt$")
write.FileList imports the database of text to RQDA, but UTF-8 characters are not recognized.
It shows this warning:
"In rsqlite_fetch(res@ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries"
My object in R contains the following escaped byte sequences, which were extracted from Twitter:
\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d
\xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe
\xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf
\xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95
\xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!'
- \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d
\xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d
I need to convert them to human readable strings. If I just put this in a string, e.g.
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe \xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf \xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95 \xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!' - \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d \xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d"
it displays as an unreadable mess. How can I get it to display using the actual characters?
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding for the current locale on your computer. On most modern Unix-based systems these days, that would be UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
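The difference between declaring an encoding and converting to one can be sketched in a few lines. This is a minimal illustration, and the `"caf\xe9"` byte string is an invented example (0xe9 is e-acute in latin1):

```r
# Declaring vs. converting: a minimal sketch (the byte string is invented).
x <- "caf\xe9"                 # raw bytes; 0xe9 is e-acute in latin1
Encoding(x) <- "latin1"        # declare: bytes unchanged, now read as latin1
y <- iconv(x, from = "latin1", to = "UTF-8")  # convert: bytes rewritten

Encoding(y)                    # "UTF-8"
nchar(x, type = "bytes")       # 4 bytes in latin1
nchar(y, type = "bytes")       # 5 bytes in UTF-8 (e-acute takes two)
```

Declaring is free (no bytes change), but it is only correct if the bytes really are in the declared encoding; converting actually rewrites the bytes.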
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
I want to replace every "€" in a string with "[euro]". This works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a CSV file, and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?), but the € sign doesn't get recognized.
grep("€",file.column.name)
also returns 0 and if I extract the last letter it prints "€" but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why this happens, and how to solve it? I checked the class of file.column.name and it returns "character"; I also tried converting it to character again, but that didn't help.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
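If the euro sign itself still refuses to match, writing it as its Unicode escape `"\u20ac"` sidesteps the script-file encoding entirely. A minimal sketch (the column name is an invented stand-in for one read from the CSV):

```r
# "\u20ac" is the euro sign written portably, independent of the
# encoding of the script file itself.
col <- "total\u20ac"                         # invented stand-in for a csv column name
grepl("\u20ac", col)                         # TRUE
gsub("\u20ac", "[euro]", col, fixed = TRUE)  # "total[euro]"
```

This way a mismatch between the encoding of your .R source file and the encoding of the data cannot silently turn the literal "€" into different bytes than the ones in the column names.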
I am using an R script to create and append a file. But I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How do I ensure ANSI encoding?
newfile <- "/home/user/abc.ttl"
file.create(newfile)
text3 <- readLines("/home/user/init.ttl")
sprintf("readLines %d", length(text3))
for (k in seq_along(text3)) {
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}
Encoding can be tricky: you need to detect the encoding on input, and then convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you will probably lose some untranslatable characters, since there is no mapping to ASCII for UTF-8 characters outside the lower 128-code-point range. (Within that range, UTF-8 and ASCII coincide.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
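Putting the steps together, here is a hedged end-to-end sketch. It uses temp files and an invented sample line so it runs anywhere; in your script the paths would be the init.ttl and abc.ttl files from the question:

```r
# End-to-end sketch: read as UTF-8, strip non-ASCII, write out.
infile  <- tempfile(fileext = ".ttl")
outfile <- tempfile(fileext = ".ttl")
# Invented sample line containing an e-acute and an em dash;
# useBytes = TRUE writes the UTF-8 bytes verbatim regardless of locale.
writeLines(enc2utf8("pr\u00e9fixe \u2014 suffix"), infile, useBytes = TRUE)

text3 <- readLines(infile, encoding = "UTF-8")
text3 <- iconv(text3, from = "UTF-8", to = "ASCII", sub = "")
writeLines(text3, outfile)   # one vectorised call replaces the cat() loop
readLines(outfile)           # "prfixe  suffix": non-ASCII dropped, not translated
```

Note that with `sub = ""` the accented characters are silently dropped rather than transliterated; if you want "é" to become "e", look at `iconv(..., to = "ASCII//TRANSLIT")`, whose behaviour is platform-dependent.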
I was wondering if there is a short way to keep the accents in a character string in R. For instance, when the accented string is "Université d'Aix-Marseille", my script gives me "Universit%C3A9 d%27Aix-Marseille". Is there a function or algorithm to get the former directly?
I should add that the file from which I get all my character strings is encoded in UTF-8.
Sincerely yours.
You can get and set the encoding of a character vector like this:
s <- "Université d'Aix-Marseille"
Encoding(s)
# set encoding to utf-8
Encoding(s) <- "UTF-8"
s
If that fixes it, you could change your default encoding to UTF-8.