How to read Unicode file in R

I have a Unicode (UTF-8) file whose column delimiter is 'þ', and I'm trying to read it with R's CSV reader as follows:
data <- read.csv(file_name, sep="þ", encoding="UTF-8")
In my data frame I'm getting everything in a single column. Can someone tell me what I'm doing wrong here?

I think your script needs to be encoded as UTF-8 too if you're using non-ASCII characters.
Save your code in, for example, myfile.r and then try this:
source("myfile.r", encoding="UTF-8")
Hopefully your error will go away.
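If the separator still isn't recognised (a multi-byte character like 'þ' can trip up read.csv depending on the platform's native encoding), a workaround not given in this answer is to split the lines yourself. This is a minimal sketch, not the accepted fix; file_name is the path from the question, and it assumes the first line is a header row:
# Read the raw lines as UTF-8, then split on the multi-byte delimiter manually.
lines  <- readLines(file_name, encoding = "UTF-8")
fields <- strsplit(lines, "þ", fixed = TRUE)

# First line becomes the header; the remaining lines become the data rows.
data <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(data) <- fields[[1]]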

Related

Storing special characters with R in csv

I need a solution for storing special characters like emojis, Arabic or Chinese characters in a CSV. I tried base write.csv and write.csv2 with the fileEncoding="UTF-8" parameter, and the readr function write_csv, but nothing worked properly. The special characters are shown correctly in R, so I guess there is a solution for storing them.
Example code:
df <- data.frame("x" = c("ö", "ä"),
                 "y" = c("مضر السامرائي", "🐇"))
write.csv(df, "~/TubeWork/data/test2.csv", fileEncoding = "UTF-8")
To check the results I open the file in Excel, where the special characters come out wrong (screenshot omitted).
Maybe it's just a problem of Excel not displaying the results correctly? If so, how should I check whether the characters were stored correctly?
Is there maybe a solution that converts the characters to Unicode escapes and saves those? That would be fine for me as well. But the best solution would be a CSV with the special characters displayed.
Thank you in advance!
Windows 10 64-bit; R 4.2.1; RStudio 2022.12.0+353
Update!
If I read the exported CSV back into R, all the emojis are displayed correctly. So, as you all wrote, Excel can't display the emojis and special characters correctly. If you want the special characters displayed in Excel, you should use readr::write_excel_csv() (big thanks to Ritchie Scramenta for the useful comment).
Once again: Problem solved!
Thank you!
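For reference, write_excel_csv() writes UTF-8 with a byte-order mark, which is what Excel uses to detect the encoding. A minimal sketch using the example data from the question:
library(readr)

df <- data.frame("x" = c("ö", "ä"),
                 "y" = c("مضر السامرائي", "🐇"))

# Unlike write.csv(..., fileEncoding = "UTF-8"), this prepends a UTF-8 BOM
# so Excel opens the file with the right encoding.
write_excel_csv(df, "test2.csv")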

Weird black paint-like symbol appears when reading a csv file in R

I am trying to read a csv file in R but when I run read_csv(), I get this weird paint-like symbol for some rows, even though it is displayed correctly in the raw csv file. I have tried reading it through read.csv() and also converting the file to excel and reading it through read_xlsx() but I get the same weird symbol. I am guessing it has something to do with the encoding but I am not sure what to do. Any suggestions?
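No answer is recorded here, but the symptom described is typical of the Unicode replacement character, which appears when the file's real encoding doesn't match the one read_csv() assumes. A sketch of the usual diagnosis, with a placeholder file name:
library(readr)

# Ask readr (via stringi) to guess the encoding from a sample of the bytes.
guess_encoding("myfile.csv")

# Then pass the most likely candidate explicitly, e.g. if it reports windows-1252:
data <- read_csv("myfile.csv", locale = locale(encoding = "windows-1252"))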

How to read csv file with unknown formatting and unknown encoding in R Program? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file at the following Gdrive link.
Data
By opening it in a text editor, I found that it is a tab-delimited file. Excel reads the file without issues, but when I try to read it in R using the "readr" package or the base R functions, it fails, and I'm not sure why. I have tried different encodings like UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently I am converting this file to a comma-delimited Excel export in order to read it in R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R. In your case this seems to do the trick:
con <- file("temp.csv", encoding = "UCS-2LE")
data <- read.delim(con)
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
As for your second question: could we build the same logic in R to guess the encoding and delimiter and read in many types of file without us explicitly saying what they are? Yes, this would certainly be possible. Whether it is desirable I'm not sure.
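A sketch of one way to approximate that logic, using a package not mentioned in this answer: data.table's fread() sniffs the delimiter automatically, much as Excel does, but it only accepts UTF-8 or Latin-1 input, so the file has to be re-encoded first.
library(data.table)

# Re-encode the UCS-2LE file into R's internal encoding ...
con <- file("temp.csv", encoding = "UCS-2LE")
txt <- readLines(con)
close(con)

# ... then let fread() detect the separator and column types by itself.
dt <- fread(text = txt)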

How to output a file with Chinese Character into .csv file that's compatible with excel?

I want to export a data.frame containing a character vector in Chinese.
Outputting it to a text file works perfectly with the following code:
Sys.setlocale(category = "LC_ALL", locale = "zh_cn.utf-8")
data<-data.frame(ID=c('小李','小王','小宗'),number=c(1:3))
write.table(data,'test.txt',quote=F,row.names=F,sep='\t')
But when I tried write.csv and opened the resulting test.csv in Excel, the Chinese part of the data came out wrong (figure omitted):
write.csv(data,'test.csv',row.names=F)
I have found a similar post on Stack Overflow ("How to export a csv in utf-8 format?") but failed to figure out how to apply it to my case.
Is there any solution that can output a data file compatible with Excel?
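No answer is recorded here, but the standard fix is the same as in the emoji question above: give Excel a UTF-8 byte-order mark. readr::write_excel_csv(data, "test.csv") does this in one call; below is a base-R sketch of the same idea, assuming a recent R (4.x) where writing the BOM string through a UTF-8 connection behaves as expected.
# Open a UTF-8 connection, write the BOM first, then append the CSV body.
con <- file("test.csv", open = "w", encoding = "UTF-8")
writeLines("\ufeff", con, sep = "")   # the byte-order mark Excel looks for
write.csv(data, con, row.names = FALSE)
close(con)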

Displaying UTF-8 encoded Chinese characters in R

I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R sometimes displays the information as Chinese characters and sometimes as Unicode escape codes.
For instance:
data <-read.csv("mydata.csv", encoding="UTF-8")
data
will print Unicode escape codes, while:
data <-read.csv("mydata.csv", encoding="UTF-8")
data[,1]
will actually display Chinese characters.
If I turn it into a matrix, it will also display Chinese characters, but if I try to look at the data (via View(data) or fix(data)) it is in Unicode escapes again.
I've asked for advice from people who use a Mac (I'm using a PC, Windows 7), and some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I tried to adjust the locale (e.g. to Chinese), but either R didn't let me change it or else the result was gibberish instead of Unicode escapes.
My current locale is:
"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
Any help to get R to consistently display Chinese characters would be greatly appreciated...
Not a bug, more a misunderstanding of the underlying type conversions (the character type and the factor type) when constructing a data.frame.
You could start with
data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE)
which will make your Chinese characters the character type, so printing them should show what you are expecting.
@nograpes: similarly, x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE) and everything should be OK.
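A minimal sketch of the distinction this answer is drawing (relevant on R versions before 4.0, where stringsAsFactors defaulted to TRUE):
x <- "中華民族"

df_factor <- data.frame(x)                            # factor column under the old default
df_char   <- data.frame(x, stringsAsFactors = FALSE)  # character column

str(df_factor$x)  # Factor w/ 1 level: printed via the factor's level table
str(df_char$x)    # chr: printed directly, which is what displays the characters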
In my case, the UTF-8 encoding does not work in my R, but the GB* encodings do. (UTF-8 works in Ubuntu.) First you need to figure out the default encoding in your OS and encode the file to match it. Excel cannot encode a file as UTF-8 properly, even when it claims to save it as UTF-8.
(1) Download the 'Open Sheet' software.
(2) Open the file in it. You can scroll through the encoding methods until you see the Chinese characters displayed in the preview window.
(3) Save it as UTF-8 (if you want UTF-8). UTF-8 is not the solution to every problem; you HAVE TO know the default encoding of your system first.
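A sketch of the "figure out the default encoding first" step in R, with a placeholder file name; GB18030 stands in for whichever GB* encoding the file actually uses:
# Check the native encoding of the current locale ...
Sys.getlocale("LC_CTYPE")

# ... and, if the file is in a Chinese GB* encoding, name it explicitly:
data <- read.csv("mydata.csv", fileEncoding = "GB18030")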
