Fast method to read a csv with UTF-16LE encoding in R

I'm dealing with .csv files in UTF-16LE encoding. The following call works to read the files, but base read.csv2 is very slow compared to readr's read_csv:
read.csv2(path, dec = ",", skip = 1, header = TRUE, fileEncoding = "UTF-16LE", sep = "\t")
Unfortunately I can't make read_csv work: I only get empty rows, and I can't find a way to even specify the encoding in the function.
I can't share my data, but if anyone has dealt with this encoding, any help would be appreciated.

You can specify file encodings with readr functions like read_csv via the locale option: locale = locale(encoding = "UTF-16LE"). However, I haven't successfully read a UTF-16LE file with read_csv; I get an "Incomplete multibyte sequence" error. There's a related issue filed, but I still have problems with my file -- hopefully others will have more success.
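For reference, a minimal sketch of how that call might look, mirroring the read.csv2 line in the question (the path is a placeholder, and read_delim with a tab delimiter and comma decimal mark is an assumption based on sep = "\t" and dec = ","):
library(readr)

df <- read_delim(
  "your_file.csv",          # placeholder path
  delim = "\t",
  skip = 1,
  locale = locale(encoding = "UTF-16LE", decimal_mark = ",")
)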

Related

Read HyperTerminal file (.ht file) in R

Is it possible for R to read/import a HyperTerminal file (.ht)?
I have tried read.csv, read.table, read_tsv and read.delim, but none of them work. I think this is because the file contains some special characters (maybe due to the ANSI encoding, though I don't have a very deep understanding of encodings), so R is unable to read it.
Is there any way to remove the rows with special characters first, before reading/importing the file?
Or can the file be converted to txt, or its encoding converted to a more common form (for example UTF-8)?
This is what I see when I open the .ht file in Notepad:
(screenshot: htfile_notepad)
This is what I see when I open the .ht file in Excel:
(screenshot: htfile_excel)
Please help. Thanks a lot!
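One workaround in the spirit of the question's last idea (convert the encoding to something standard before importing) is sketched below. It assumes the .ht file is plain text in a Windows "ANSI" codepage such as CP1252/latin1; the file names are placeholders.
# read the raw lines, declaring the assumed legacy encoding
lines <- readLines("capture.ht", encoding = "latin1", warn = FALSE)

# drop any lines that still contain non-printable control characters
clean <- lines[!grepl("[[:cntrl:]]", lines)]

# re-save as UTF-8 so the usual import functions can read it
writeLines(enc2utf8(clean), "capture_utf8.txt", useBytes = TRUE)
df <- read.delim("capture_utf8.txt", stringsAsFactors = FALSE)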

R write.csv not handling characters like é correctly

When I look at data in R, it has characters like "é" displayed correctly.
I export it to Excel using write.csv. When I open the csv file, "é" is displayed as "√©". Is the problem with write.csv or with Excel? What can I do to fix it?
Thanks
Try the write_excel_csv() function from the readr package
readr::write_excel_csv(your_dataframe, "file_path")
It's a problem with Excel. Try importing the data instead of opening the file.
Go to: 'Data' --> 'From Text/CSV' and then select '65001:Unicode (UTF-8)'. That will match the encoding from R.
Try experimenting with the parameter fileEncoding of write.csv:
write.csv(..., fileEncoding="UTF-16LE")
From the write.csv documentation:
fileEncoding character string: if non-empty declares the encoding to
be used on a file (not a connection) so the character data can be
re-encoded as they are written. See file.
CSV files do not record an encoding, and this causes problems if they
are not ASCII for many other applications. Windows Excel 2007/10 will
open files (e.g., by the file association mechanism) correctly if they
are ASCII or UTF-16 (use fileEncoding = "UTF-16LE") or perhaps in the
current Windows codepage (e.g., "CP1252"), but the ‘Text Import
Wizard’ (from the ‘Data’ tab) allows far more choice of encodings.
Excel:mac 2004/8 can import only ‘Macintosh’ (which seems to mean Mac
Roman), ‘Windows’ (perhaps Latin-1) and ‘PC-8’ files. OpenOffice 3.x
asks for the character set when opening the file.
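Putting the two suggestions together, a small sketch with a toy data frame (the data frame and file paths are placeholders):
df <- data.frame(name = c("café", "Beyoncé"), stringsAsFactors = FALSE)

# Option 1: readr writes UTF-8 with a byte order mark, which Excel detects on open
readr::write_excel_csv(df, "names.csv")

# Option 2: base R, re-encoding to UTF-16LE as the documentation above suggests
write.csv(df, "names_utf16.csv", fileEncoding = "UTF-16LE", row.names = FALSE)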

How to read a csv file with unknown formatting and unknown encoding in R? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file in the following Google Drive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file is read in Excel without issues, but when I try to read it in R using the "readr" package or base R functions, it fails. Not sure why. I have tried different encodings like UTF-8, UTF-16 and UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting this file in Excel to a comma-delimited file to read it in R, but I am sure there must be something that I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R. In your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
In terms of your second question - could we build the same logic in R to guess the encoding and delimiter and read in many types of files without us explicitly saying what they are? Yes, this would certainly be possible; whether it is desirable I'm not sure.
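Some of that guessing machinery already exists; a rough sketch, assuming the readr and data.table packages are installed (the path is a placeholder):
# report the most likely encodings with confidence scores
readr::guess_encoding("temp.csv")

# for files in an ordinary encoding, data.table::fread also auto-detects the separator
# dt <- data.table::fread("temp.csv")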

Is there any way to use the letter Å in R?

Is there any way to deal with this letter in R: Å?
In some configurations I'm able to read this letter from SQL via RODBC, but I haven't found any solution for saving this letter to csv or txt. It always gets converted to a plain A or to Ĺ.
Also, how can I read this letter correctly from an Excel file?
I understand from your question that the letter displays properly inside R but you have problems writing it to files.
R's writing functions usually have an encoding parameter (for example, for write.csv and write.table it's called fileEncoding).
When you don't set it explicitly, the function will encode the file using your OS's (or R installation's) native encoding, which can sometimes cause problems with special characters. What exactly goes wrong and how to fix it depends heavily on your system setup - especially if you're also interacting with databases, as you describe.
But very often, an easy fix is writing files in UTF-8 encoding, i.e.
write.csv(your_df, your_path, fileEncoding='UTF-8')
as most external programs (such as Excel) are able to automatically detect and properly read UTF-8 encoded files.
Set the fileEncoding argument on write.table to fit your needs (e.g., if your text is encoded as UTF-8, try write.table(my_tab, file = "my_tab.txt", fileEncoding = "UTF-8")).
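For the Excel part of the question, which the answers above don't cover, one option is the readxl package; a hedged sketch (path and sheet are placeholders), relying on readxl returning strings as UTF-8:
library(readxl)
df <- read_excel("data.xlsx", sheet = 1)   # "Å" normally survives the import unchanged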

How to read an Excel file with Chinese characters in R?

I always convert an Excel file into a CSV file to import into R, as follows:
myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors=F)
But I got a serious problem when I converted an xlsx file written in Chinese: most of the characters (not all of them) show as '??' because of the encoding.
So I decided to use the xlsx package to import it directly. But the problem is that the size of the Excel file exceeds 10 MB.
It gave me an error message because of the JVM's memory limit. (I assume that xlsx uses Java internally.)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I import a Chinese Excel file into R? I tried 'Save As...' to a CSV file, opened it in Notepad, and saved it with the 'UTF-8' option, but the result was the same (shows '??').
FYI, I can see the full Chinese characters in the original Excel file.
Your question is a mixed one. Let's assume that you have converted the xlsx file into csv. If you haven't, please refer to other threads like this one. I think this step is best carried out in some external tool rather than in R.
Now that we've got a csv, there remain two problems: size and encoding. For encoding, as you have mentioned in the comment, you can use the encoding= option of several R functions like read.csv. For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot decide, the open-file dialog of LibreOffice Calc may give you some clue.
If the file size is large, you may first convert the encoding using the Linux command iconv, and then further process it in R.
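A minimal sketch of the encoding part, assuming the exported csv really is GB18030 (the file name is a placeholder):
df <- read.csv("mydatafile.csv",
               fileEncoding = "GB18030",
               stringsAsFactors = FALSE)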
Now for the size part. A 50 MB or even 500 MB csv can easily be handled by read.csv, although not necessarily fast, provided that you have enough memory. If the file is larger than 1 GB, there are two options:
Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
Process the csv line by line. First use file() to create a connection, then use readLines() to process it line by line. Finally manually combine the result into a data.frame or other appropriate structure.
The first one is simpler; the second one can handle really large files (a sketch of the second is below).
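A rough sketch of the second option; the file name, encoding, separator and chunk size are placeholders and have not been tested against real data.
con <- file("big_file.csv", open = "r", encoding = "GB18030")
header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]

chunks <- list()
repeat {
  lines <- readLines(con, n = 10000)               # 10,000 lines per chunk
  if (length(lines) == 0) break
  fields <- strsplit(lines, ",", fixed = TRUE)
  chunks[[length(chunks) + 1]] <- do.call(rbind, fields)
}
close(con)

# combine the chunks and restore the header
result <- as.data.frame(do.call(rbind, chunks), stringsAsFactors = FALSE)
names(result) <- header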
Hope it helps.
