I notice that it is easy to load CSV files containing Chinese characters in R with read.csv("mydata.csv", encoding="UTF-8").
I have text files (.txt) which I want to read using readChar(). However, readChar() does not allow me to specify encoding.
What can I do?
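readChar() indeed has no encoding argument. A minimal sketch of one workaround (the file name mydata.txt is hypothetical): open a connection with the encoding declared and read the lines, or use readr's read_file() with a locale.

con <- file("mydata.txt", encoding = "UTF-8")   # declare the file's encoding on the connection
txt <- paste(readLines(con), collapse = "\n")   # collapse the lines back into one string
close(con)
# or, with readr:
# txt <- readr::read_file("mydata.txt", locale = readr::locale(encoding = "UTF-8"))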
I have text that contains a Unicode character, and I export this text to CSV.
I don't want to see the Unicode escape in the CSV files created in R.
This is the character in question:
<U+00A0>
This character appears blank when exported to an .xlsx file.
However, when exporting to CSV, it comes out as the literal escape: it looks like <U+00A0> in the CSV file.
How can I solve this problem? I would also like to know whether it is possible at all.
I tried changing the encoding option of the write.table function.
I tried using the iconv function.
But neither resolved the issue.
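One possible approach (a sketch only; the data frame and column names here are hypothetical) is to replace the non-breaking space U+00A0 before writing, so the literal escape never reaches the CSV:

df$text <- gsub("\u00a0", " ", df$text)                              # swap NBSP for a regular space
write.csv(df, "out.csv", fileEncoding = "UTF-8", row.names = FALSE)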
I have tried my best to read a CSV file in R but failed. I have provided a sample of the file in the following Gdrive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file opens in Excel without issues, but when I try to read it in R using the "readr" package or the base R functions, it fails. I am not sure why. I have tried different encodings such as UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting the file in Excel to a comma-delimited file in order to read it in R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R. In your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
In terms of your second question, could we build the same logic in R to guess the encoding and delimiter and read in many types of file without us explicitly saying what they are? Yes, this would certainly be possible; whether it is desirable I'm not sure.
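As a hedged illustration of that idea, readr can at least guess a file's encoding for you; here I assume its guess points at UTF-16LE, a superset of the UCS-2LE mentioned above:

library(readr)
guess_encoding("temp.csv")                                   # lists likely encodings with confidence scores
read_tsv("temp.csv", locale = locale(encoding = "UTF-16LE"))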
I have a file in Excel which has a column with simplified Chinese characters. When I open the corresponding CSV file in R, I only get ?'s.
I'm afraid the problem arises when exporting from Excel to CSV, because when I open the CSV file in a text editor I also get ?'s.
How can I get around this?
The best way to preserve your Chinese/Unicode characters is to read the file from .xlsx:
library(readxl)
read_xlsx("yourfilepath.xlsx", col_types = "text")
If your file is too big to read from .xlsx, then the best way is to open it in Excel and split it manually into multiple files.
(My experience with a laptop with 8 GB of RAM is to split files into 250,000 rows x 106 columns.)
If you need to read from .csv, all your Windows settings/localization need to match those of the file, and even that does not guarantee the integrity of all your Unicode characters (e.g. emojis).
(If you also need a .csv for something else, you can use the R function write.csv after you read the data from .xlsx into R.)
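If you do end up needing a CSV that Excel reopens with the characters intact, one option (my suggestion, not part of the original answer) is readr::write_excel_csv(), which writes UTF-8 with a byte-order mark that Excel recognises:

library(readxl)
library(readr)
dat <- read_xlsx("yourfilepath.xlsx", col_types = "text")
write_excel_csv(dat, "yourfile_utf8.csv")   # UTF-8 with a BOM so Excel detects the encoding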
I scraped a site that includes the names of many different cities from around the world using R's rvest package. Some of these names contain German umlauts and characters from almost every other major language, which are not displayed properly in the .csv file I used to output the text. Is there a way to make Excel display these names properly? I'm using Excel 2011 on Mac. Here are some examples of what the names appear as in my CSV file.
"MÔÈhldorf am Inn" instead of "Mühldorf am Inn"
"PÔ_rnu" instead of "Pärnu"
I did not specify any encoding when outputting the text as a CSV, and I don't have access to the original scraped object in R.
write.csv(data, "master_me_data.csv")
Any help would be appreciated.
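Since the original object is gone, any fix here can only apply to future runs. A sketch under that assumption: write the CSV as UTF-8, ideally with a byte-order mark, so Excel for Mac does not fall back to its default legacy encoding when it opens the file.

write.csv(data, "master_me_data.csv", fileEncoding = "UTF-8")   # plain UTF-8, no BOM
# or, with readr, which adds the BOM that Excel looks for:
readr::write_excel_csv(data, "master_me_data.csv")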
I've scraped Japanese content from the web to conduct content analysis. Now I am preparing the text data, starting with creating a term-document matrix. The package I am using to clean and parse things is "RMeCab". I've been told that this package requires text data to be in ANSI encoding. But my data is in UTF-8 encoding, as is the setting of RMeCab and the global setting within R itself.
Is it necessary that I change the encoding of my text files in order to run RMeCab? In that case, how do I convert the encoding of tens of thousands of separate text files quickly?
I tried encoding conversion websites, which gave me gibberish as ANSI output. I do not understand the mechanism behind inputting something that looks like a bunch of question marks into RMeCab. If I successfully convert the encoding to ANSI and my text data looks like a bunch of symbols, would RMeCab still be able to read it as Japanese text?
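A sketch under stated assumptions: "ANSI" for Japanese text usually means the Windows code page CP932 (Shift_JIS), and the folder names below are hypothetical. The loop converts every .txt file into a new folder rather than overwriting the originals.

dir.create("texts_sjis", showWarnings = FALSE)
files <- list.files("texts", pattern = "\\.txt$", full.names = TRUE)
for (f in files) {
  txt <- readLines(f, encoding = "UTF-8")            # read, declaring the input as UTF-8
  out <- iconv(txt, from = "UTF-8", to = "CP932")    # convert to Shift_JIS bytes
  con <- file(file.path("texts_sjis", basename(f)), open = "wb")
  writeLines(out, con, useBytes = TRUE)              # write the bytes without re-encoding
  close(con)
}

The "gibberish" you saw is most likely just a UTF-8 viewer rendering Shift_JIS bytes; a tool that expects CP932 should still read the text as Japanese.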