Cyrillic characters in R

I am importing some data from a SQL database using the RODBC package. All Cyrillic characters unfortunately get converted to ? marks upon importing the data into R. I tried setting DBMSencoding='UTF-8' while making the connection, but that did not work. I have also tried Sys.setlocale("LC_CTYPE", "ukrainian") but no joy so far. How can I get rid of the ? mark and see the actual Cyrillic character?
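For reference, a minimal sketch of the kind of connection described above (the DSN, table, and column names are placeholders, not from the original post):

library(RODBC)
# DBMSencoding declares the encoding the driver returns; RODBC then
# converts incoming strings to the session encoding
ch <- odbcConnect("myDSN", DBMSencoding = "UTF-8")
dat <- sqlQuery(ch, "SELECT * FROM my_table", stringsAsFactors = FALSE)
odbcClose(ch)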

Related

Storing special characters with R in csv

I need a solution for storing special characters like emojis, Arabic, or Chinese characters in a CSV. I tried base write.csv and write.csv2 with the fileEncoding = "UTF-8" parameter, and the readr function write_csv, but nothing worked properly. The special characters display fine in R, so I assume there is a way to store them.
Example code:
df <- data.frame("x" = c("ö", "ä"),
                 "y" = c("مضر السامرائي", "🐇"))
write.csv(df, "~/TubeWork/data/test2.csv", fileEncoding = "UTF-8")
To check the results I open the file in Excel, where the special characters appear garbled.
Maybe it's a problem with Excel, which can't display the characters correctly? If so, how should I check whether the characters are stored correctly?
Is there maybe a way to convert the characters to Unicode escapes and save those? That would be fine for me as well, but the best solution would be a CSV with the special characters displayed.
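As an aside, one hedged way to get Unicode escapes rather than raw glyphs is the stringi package (an assumption; it is not mentioned in the original post):

library(stringi)
# escapes non-ASCII characters as \uXXXX / \UXXXXXXXX sequences
stri_escape_unicode(c("ö", "🐇"))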
Thank you in advance!
Windows 10 64-bit; R 4.2.1; RStudio 2022.12.0+353
Update!
If I read the exported CSV back into R, all the emojis are displayed correctly. So, as you all wrote, Excel can't display the emojis and special characters correctly. If you want the special characters displayed in Excel, use readr::write_excel_csv() (big thanks to Ritchie Scramenta for the useful comment).
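A minimal sketch of that fix (reusing the df from the example above; the path is a placeholder):

library(readr)
# write_excel_csv() writes UTF-8 with a byte-order mark, which lets
# Excel detect the encoding correctly
write_excel_csv(df, "~/TubeWork/data/test2.csv")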
Once again: Problem solved!
Thank you!

R version 4.2.0 and Swedish letters (ä ö å) not working in newest R release. Anyone found a solution?

I have updated to the latest R release (R version 4.2.0), but I am now facing the problem that the Swedish special letters cannot be read anymore. I am working with a database that has many Swedish letters in its factor labels, and even if I read them in as strings R doesn't recognise them, with the consequence that all summary tables based on these factors as groups are no longer calculated correctly. The code worked fine under the previous release (but I had issues with knitting R Markdown files, hence the need to update).
I have set the encoding to ISO-8859-4 (which covers Northern European languages) after UTF-8 did not work. Is there anything else I could try? Or has anyone found a fix other than renaming all labels before reading in the .csv files? (I would really like to avoid that, since I often work with similar data.)
I have used read.csv(), and it produces cryptic output, replacing the special letters with, for example, <d6> instead of ö and <c4> instead of ä.
I hope that someone has an idea for a fix. Thanks.
Edit: I use Windows.
Sys.getlocale("LC_CTYPE")
[1] "Swedish_Sweden.utf8"
Use the encoding parameter
I have been able to detect failed loads by attempting to apply toupper to strings, which gives me errors such as
Error in toupper(dataset$column) :
invalid multibyte string 999751
This is resolved, and the expected outcome obtained, by using
read.csv(..., encoding = 'latin1')
or
data.table::fread(..., encoding = 'Latin-1')
I believe this solution should apply to Swedish characters as they are also covered by the Latin-1 encoding.
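Putting the two pieces together, a minimal sketch (the file and column names are placeholders):

# read with an explicit single-byte encoding; Latin-1 covers å, ä and ö
dataset <- read.csv("data.csv", encoding = "latin1")
# sanity check: toupper() errors on invalid multibyte strings,
# so a clean pass suggests the load succeeded
invisible(toupper(dataset$column))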
I have the same problem; what worked for me was what the answer above said, but with the encoding ISO-8859-1 instead. It works for both reading from and saving to file for the Swedish characters å, ä, ö, Å, Ä, Ö, i.e.:
read.csv("~/test.csv", fileEncoding = "ISO-8859-1")
and
write.csv2(x, file="test.csv", row.names = FALSE, na = "", fileEncoding = "ISO-8859-1")
It's tedious, but it works for now. Another tip, if you use RStudio, is to go to Global Options -> Code -> Saving, set your default text encoding to ISO-8859-1, and restart RStudio. If I understand correctly, it will then save and read your scripts in that encoding by default. I had the problem that when I opened my scripts containing Swedish characters, they would display the wrong characters; this fixed that.

R: write_csv without messing up the language

I am working with a dataset that contains data in multiple languages.
Is there a way to export my work as a CSV file and have R maintain the use of characters in a foreign language instead of replacing them with gibberish English symbols?
Update for anyone who reaches this by Google:
It looks like R only pretends to screw up foreign languages. When you use write_csv, it actually does create a .csv that uses the correct foreign characters.
However, you'll only see them if you open the file in Notepad. If you open it in Excel, Excel will screw it up, and if you open it with read_csv, R will screw it up (but will still export it correctly when you use write_csv again).
write_excel_csv() from the readr package seems to work without messing up the foreign characters when opening with Excel.
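A quick way to confirm that the file, rather than the viewer, is at fault is to inspect it outside Excel; a hedged sketch (the data frame and file names are placeholders):

library(readr)
df <- data.frame(x = "中華民族")
write_csv(df, "out.csv")                  # writes UTF-8
readLines("out.csv", encoding = "UTF-8")  # raw lines should show the original characters
write_excel_csv(df, "out_excel.csv")      # adds a BOM so Excel reads it as UTF-8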

Getting Japanese characters to display in R Shiny

Our users have RStudio installed on their local machines and are using Shiny to filter data and export dataframes to an .xlsx file.
This works really well for most characters, but not when it comes to the Japanese and Mandarin ones. For those, they get to see ??????? instead of the actual text.
Data is residing in a SQL DB and we're using RODBC to connect to DB.
RODBC doesn't seem to like reading these Japanese and Mandarin characters. Is there a way to get around this?
Any help is much appreciated!
Thanks
I had a similar problem with the French language the other day. Maybe these options can help you:
In RStudio, try going to Tools > Global Options > Code > Saving and then choose the right encoding for Japanese and Mandarin. The UTF-8 encoding might work for you.
The blog post Escaping from character encoding hell in R on Windows explains how to set the encoding to import external documents. It should work with data imported with RODBC as well. The author uses Japanese characters in his examples.
In the odbcDriverConnect() function of the RODBC package, the argument DBMSencoding="UTF-8" might work for you.
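For the third option, a minimal sketch (the connection string is a placeholder):

library(RODBC)
# tell RODBC the DBMS returns UTF-8 so it can convert to the session encoding
ch <- odbcDriverConnect("DSN=mydb;UID=user;PWD=pass", DBMSencoding = "UTF-8")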

Displaying UTF-8 encoded Chinese characters in R

I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R displays the information sometimes as Chinese characters and sometimes as Unicode escape codes.
For instance:
data <- read.csv("mydata.csv", encoding = "UTF-8")
data
will produce Unicode escape codes, while:
data <- read.csv("mydata.csv", encoding = "UTF-8")
data[,1]
will actually display Chinese characters.
If I turn it into a matrix, it will also display Chinese characters, but if I try to look at the data (via View(data) or fix(data)) it is shown as Unicode escapes again.
I've asked for advice from people who use a Mac (I'm using a PC with Windows 7), and some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I tried to adjust the locale (e.g. to Chinese), but either R didn't let me change it or else the result was gibberish instead of the Unicode escapes.
My current locale is:
"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
Any help to get R to consistently display Chinese characters would be greatly appreciated...
Not a bug, more a misunderstanding of the underlying type system conversions (the character type and the factor type) when constructing a data.frame.
You could start with data <- read.csv("mydata.csv", encoding = "UTF-8", stringsAsFactors = FALSE), which will make your Chinese characters be of the character type, so by printing them out you should see what you are expecting.
@nograpes: similarly, x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors = FALSE) and everything should be OK.
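A small self-contained illustration of the type difference described above (printing behaviour may vary by locale, especially on Windows):

x <- c('中華民族')
x                                        # character vector: glyphs print directly
f <- factor(x)
f                                        # factor levels may print as <U+...> escapes
y <- data.frame(x, stringsAsFactors = FALSE)
y$x                                      # stays character, so it prints as expected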
In my case, UTF-8 encoding does not work in my R, but the GB* encodings do; UTF-8 works on Ubuntu. First you need to figure out the default encoding of your OS and encode the file accordingly. Excel cannot encode a file as UTF-8 properly, even though it claims to save as UTF-8.
(1) Download the 'Open Sheet' software.
(2) Open the file with it. You can scroll through the encoding methods until you see the Chinese characters displayed in the preview window.
(3) Save it as UTF-8 (if you want UTF-8). (UTF-8 is not the solution to every problem; you HAVE TO know the default encoding of your system first.)
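If the file does turn out to be in a GB encoding, a hedged sketch of handling it in R (the file name and the choice of GB18030 are assumptions):

# report the encoding implied by the current locale
localeToCharset()
# declare the file's encoding when reading so R converts it internally
dat <- read.csv("mydata.csv", fileEncoding = "GB18030", stringsAsFactors = FALSE)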
