I have an Excel file with a column of simplified Chinese characters. When I read the corresponding CSV file into R, I only get ?'s.
I suspect the problem occurs when exporting from Excel to CSV, because when I open the CSV file in a text editor I also see ?'s.
How can I get around this?
The best way to preserve your Chinese/Unicode characters is to read the file directly from .xlsx:
library(readxl)
read_xlsx("yourfilepath.xlsx", col_types = "text")
If your file is too big to read from .xlsx, then the best option is to open it in Excel and split it manually into multiple files.
(In my experience, on a laptop with 8 GB of RAM, splitting into files of about 250,000 rows x 106 columns works well.)
If you need to read from .csv, all of your Windows locale/localization settings need to match the file's encoding, and even that does not guarantee the integrity of all your Unicode characters (e.g. emojis).
(If you also need a .csv for something else, you can use the R function write.csv after you have read the data from .xlsx into R; see the sketch below.)
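A minimal sketch of that round trip, assuming the hypothetical paths below; writing the CSV with an explicit UTF-8 encoding keeps the Chinese characters intact in the exported file:
library(readxl)
# Read directly from the workbook, forcing every column to text (hypothetical path).
dat <- read_xlsx("yourfilepath.xlsx", col_types = "text")
# Export a UTF-8 encoded CSV for other tools (hypothetical output name).
write.csv(dat, "yourfilepath_utf8.csv", row.names = FALSE, fileEncoding = "UTF-8")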
I wish to open and read the following text file in Scilab (version 6.0.2).
The original file is an .xlsx that I have converted to both .txt and .csv through Excel to facilitate opening & working with it in Scilab.
Using both fscanfMat and csvRead, Scilab only reads the first column, as Nan. I understand why the first column is considered Nan, but I do not see why the rest of the document isn't read. Columns 2 and 3 are of particular interest to me.
For csvRead, I used:
M=csvRead(chemin+filename," ",",",[],[],[],[],7);
to skip the 7-row header.
Could it be something to do with the way in which the file has been formatted?
For anyone able to help, I will try to upload an example .txt file and also the original .xlsx file.
Files available for download here: Excel and Text files
If you convert your xlsx file into an xls one with Excel, you can read it with the readxls function.
Your separator is a tab character (ASCII code 9). Use the following command:
M=csvRead("Probe1_350N_2S.txt",ascii(9),",",[],[],[],[],7);
I have tried my best to read a CSV file in R but failed. I have provided a sample of the file in the following Google Drive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file opens in Excel without issues, but when I try to read it in R using the "readr" package or the base R functions, it fails, and I am not sure why. I have tried different encodings like UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting the file in Excel to a comma-delimited file to read it into R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++, it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R; in your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
As for your second question, could we build the same logic in R to guess the encoding and delimiter and read many types of file without explicitly saying what they are? Yes, this would certainly be possible. Whether it is desirable, I'm not sure.
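For example, the readr package can already make an educated guess at the encoding; a minimal sketch, assuming the file is saved as "temp.csv" as above (how cleanly readr handles UTF-16 input may depend on the version):
library(readr)
# Ask readr to guess the encoding from the raw bytes of the file.
guess_encoding("temp.csv")
# Then read the tab-separated file with the guessed encoding
# (UCS-2 LE BOM will usually be reported as UTF-16LE).
read_tsv("temp.csv", locale = locale(encoding = "UTF-16LE"))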
I'm exporting CSV files from R using write.csv.
I've also tried various encoding options.
The CSV files open fine and can be read manually.
However, I need to process these files in another program (METAL https://genome.sph.umich.edu/wiki/METAL) and the program is pretty much unable to recognize the text in the files.
But when I open the CSV files in Excel and save them again manually with Excel (same name, same place, same encoding, without changing anything except clicking "Save as CSV"), then METAL is able to recognize the text in the files.
I was wondering if anyone has any suggestions on how to fix this? It's very cumbersome to go into each file and re-save it manually.
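A sketch of forcing, directly from write.csv, the two things Excel typically changes on a re-save (line endings and encoding), with a hypothetical data frame df and output path; whether this is what METAL needs is an assumption:
# df and "results.csv" are placeholders for your own data and file name.
write.csv(df, "results.csv", row.names = FALSE,
          eol = "\r\n",             # Windows-style CRLF line endings, as Excel writes
          fileEncoding = "UTF-8")   # explicit encoding instead of the locale default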
I'm trying to read a .csv downloaded from the iShares website containing all holdings of an ETF (and additional info).
Here is the csv.
My issue is that I can't get a neat data frame containing tickers, company names, asset classes, etc. using the read.csv function.
When I opened the csv file in Notepad, I saw that all rows appear on the same line and no special character seems to separate rows from each other.
Do you have any clues on how I should deal with this file?
That csv file uses the "LF" line-break convention found on Linux and other Unixes, while Windows usually expects text files to use the "CRLF" convention, which is why the line breaks are not showing in Notepad. Another issue with that file is that there are 10 header lines before the csv content. Does this work in R?
read.csv(<fname>, skip=10)
If the line breaks are still an issue, there must be a Windows tool to convert LF line breaks to CRLF; on Linux it's "unix2dos", but I'm not sure what the equivalent would be on Windows. The conversion can also be done from within R, as sketched below.
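A minimal sketch of doing the conversion in R itself, assuming the downloaded file is called "holdings.csv" (a hypothetical name):
# readLines copes with LF-only files; writeLines then rewrites them with CRLF endings.
txt <- readLines("holdings.csv")
writeLines(txt, "holdings_crlf.csv", sep = "\r\n")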
I always convert Excel files into CSV files to import into R, as follows:
myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors = FALSE)
But I ran into a serious problem when converting an xlsx file written in Chinese. Most of the characters (not all of them) show up as '??' because of the encoding.
So I decided to use the xlsx package to import it directly, but the problem is that the size of the Excel file exceeds 10 MB.
It gave me an error message because of the JVM's memory limit. (I assume that xlsx uses Java internally.)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I import a Chinese Excel file into R? I tried 'Save As...' to a CSV file, opened it in Notepad, and saved it with the 'UTF-8' option, but the result was the same (it still shows '??').
FYI, I can see the full Chinese characters in the original Excel file.
Your question is a mixed one. Let's assume that you have converted the xlsx file into a csv. If you haven't, please refer to other threads like this one; I think this step is best carried out in an external tool rather than in R.
Now that we've got a csv, two problems remain: size and encoding. For encoding, as you mentioned in the comment, you can use the encoding-related arguments of several R functions such as read.csv. For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot decide, the open-file dialog of LibreOffice Calc may give you some clue.
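A minimal sketch, assuming the exported file is "mydatafile.csv" and that Excel wrote it in GB18030; for read.csv, fileEncoding is the argument that re-encodes the file as it is read:
# Hypothetical file name; swap in whatever encoding your file actually uses.
myDataFrame <- read.csv("mydatafile.csv",
                        fileEncoding = "GB18030",
                        stringsAsFactors = FALSE)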
If the file size is large, you may first convert the encoding using the Linux command iconv, and then further process it in R.
Now for the size part. A 50 MB or even 500 MB csv can easily be handled by read.csv, although not necessarily quickly, provided that you have enough memory. If the file is larger than 1 GB, there are two options:
Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
Process the csv line by line: first use file() to create a connection, then use readLines() to process it line by line, and finally combine the results manually into a data.frame or other appropriate structure (a sketch follows below).
The first is simpler; the second can handle really large files.
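A rough sketch of the line-by-line approach, assuming a GB18030-encoded "mydatafile.csv" with a header row and simple comma-separated fields; real CSVs with quoted fields would need something like read.csv(text = ...) on each chunk instead of strsplit:
con <- file("mydatafile.csv", open = "r", encoding = "GB18030")
header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
chunks <- list()
repeat {
  lines <- readLines(con, n = 100000)           # read 100,000 lines at a time
  if (length(lines) == 0) break
  fields <- strsplit(lines, ",", fixed = TRUE)  # naive split on commas
  chunks[[length(chunks) + 1]] <- as.data.frame(do.call(rbind, fields),
                                                stringsAsFactors = FALSE)
}
close(con)
result <- do.call(rbind, chunks)
names(result) <- header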
Hope it helps.