Read HyperTerminal file (.ht file) in R

Is it possible for R to read/import a HyperTerminal file (.ht)?
I have tried read.csv, read.table, read_tsv and read.delim, but none of them works. I think the file contains some special characters (maybe because of the ANSI encoding, but I don't have a deep understanding of encodings), so R is unable to read it.
Is there any way to remove the rows with special characters before reading/importing the file?
Or can the file be converted to .txt, or its encoding converted to a more common one (for example UTF-8)?
This is how the .ht file looks when opened in Notepad: (screenshot: htfile_notepad)
This is how it looks when opened in Excel: (screenshot: htfile_excel)
Please help. Thanks a lot!
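One possible approach, as a rough and untested sketch (capture.ht and capture.txt are placeholder file names): read the file line by line, drop the lines that contain non-printable characters, and write the rest back out as UTF-8 text before importing it.
# read the file as ANSI/latin1 text, skipping embedded nul bytes
lines <- readLines("capture.ht", encoding = "latin1", skipNul = TRUE, warn = FALSE)
# keep only lines made up entirely of printable/whitespace characters
clean <- lines[grepl("^[[:print:][:space:]]*$", lines)]
# write the surviving lines out as UTF-8 plain text
writeLines(enc2utf8(clean), "capture.txt", useBytes = TRUE)
The resulting capture.txt could then be read with read.table() or read.delim().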

Related

R write.csv not handling characters like é correctly

When I look at the data in R, characters like "é" are displayed correctly.
I export it to Excel using write.csv. When I open the csv file, "é" is displayed as "√©". Is the problem with write.csv or with Excel? What can I do to fix it?
Thanks
Try the write_excel_csv() function from the readr package:
readr::write_excel_csv(your_dataframe, "file_path")
It's a problem with Excel. Try importing the data instead of opening the file directly.
Go to 'Data' --> 'From Text/CSV' and then select '65001: Unicode (UTF-8)'. That will match the encoding from R.
Try experimenting with the fileEncoding parameter of write.csv:
write.csv(..., fileEncoding="UTF-16LE")
From write.csv documentation:
fileEncoding character string: if non-empty declares the encoding to
be used on a file (not a connection) so the character data can be
re-encoded as they are written. See file.
CSV files do not record an encoding, and this causes problems if they
are not ASCII for many other applications. Windows Excel 2007/10 will
open files (e.g., by the file association mechanism) correctly if they
are ASCII or UTF-16 (use fileEncoding = "UTF-16LE") or perhaps in the
current Windows codepage (e.g., "CP1252"), but the ‘Text Import
Wizard’ (from the ‘Data’ tab) allows far more choice of encodings.
Excel:mac 2004/8 can import only ‘Macintosh’ (which seems to mean Mac
Roman), ‘Windows’ (perhaps Latin-1) and ‘PC-8’ files. OpenOffice 3.x
asks for the character set when opening the file.
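For completeness, a small sketch combining the suggestions above (df and the output file names are placeholders):
# readr's write_excel_csv() writes UTF-8 with a byte order mark,
# which lets Excel detect the encoding when the file is opened directly
readr::write_excel_csv(df, "out_utf8.csv")
# base R alternative: UTF-16LE is one of the encodings Excel opens correctly
write.csv(df, "out_utf16.csv", fileEncoding = "UTF-16LE", row.names = FALSE)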

How to read a csv file with unknown formatting and unknown encoding in R? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file in the following Google Drive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file opens in Excel without issues, but when I try to read it in R using the readr package or the base R functions, it fails, and I am not sure why. I have tried different encodings such as UTF-8, UTF-16 and UTF-16LE. Could you please help me write the correct script to read this file? Currently I am converting the file to a comma-delimited file in Excel in order to read it in R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R; in your case this seems to do the trick:
con <- file("temp.csv", encoding = "UCS-2LE")
read.delim(con)
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
As for your second question (could we build the same logic in R to guess the encoding and delimiter and read in many types of file without explicitly specifying them), yes, this would certainly be possible. Whether it is desirable, I'm not sure.
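Not from the original answer, but as a sketch of that "guessing" idea: readr::guess_encoding(), which is backed by stringi, can suggest candidate encodings before you commit to one, assuming the sample is saved as temp.csv:
library(readr)
# returns a small table of likely encodings with confidence scores;
# for this file one would expect UTF-16/UCS-2 variants near the top
guess_encoding("temp.csv")
# then read with the detected encoding, as in the answer above
df <- read.delim(file("temp.csv", encoding = "UCS-2LE"))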

How to save a dataframe as a .csv file with UTF-8 encoding and LF line endings in R using RStudio?

I came across this weird situation:
I need to save a dataframe to a .csv file with UTF-8 encoding and LF line endings. I'm using the latest versions of R and RStudio on a Windows 10 machine.
My first attempt was, naively:
write.csv(df, fileEncoding="UTF-8", eol="\n")
Checking with Notepad++, the encoding appears to be UTF-8; however, the line ending is CRLF, not LF. OK, let's double check with Notepad: surprise, surprise, according to Notepad the encoding is ANSI. At this point I'm confused.
After looking at the docs for the function write.csv I read that:
CSV files do not record an encoding
I'm not an expert on the topic, so I decided to revert and save the file as a simple .txt using write.table as follows:
write.table(df, fileEncoding="UTF-8", eol="\n")
Again, the same result as above. No change whatsoever. I tried the combinations
write.csv(df)
write.table(df)
without specifying encodings, but nothing changed. Then I set the default text encoding in RStudio to UTF-8 and the line ending to LF in the global options (screenshot not shown)
and ran the tests again. No change. What am I missing?
This is an odd one, at least for me. Nonetheless, reading the docs for write.table revealed the solution. Apparently, on Windows, to save a file Unix-style you have to open a binary connection to it and then write with the desired eol:
# open a *binary* connection so Windows does not translate "\n" into "\r\n"
f <- file("filename.csv", "wb")
write.csv(df, file = f, eol = "\n")
close(f)
As far as UTF-8 is concerned, the global settings should work fine.
You can check that the eol is LF using Notepad++. UTF-8 is harder to verify: on Linux, isutf8 (from moreutils) says the files are indeed UTF-8, but Windows' Notepad disagrees and reports them as ANSI when saving.
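As an alternative sketch (not from the original answer): in recent versions of readr, write_csv() always encodes output as UTF-8 and takes an eol argument, so it may sidestep the connection handling entirely.
library(readr)
# write_csv() writes UTF-8; with eol = "\n" it should produce LF endings even on Windows
write_csv(df, "filename.csv", eol = "\n")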

Fast method to read csv with UTF-16LE encoding

I'm dealing with .csv files with UTF-16LE encoding. The following call works to read the files, but read.csv is very slow compared to read_csv:
read.csv2(path, dec = ",", skip = 1, header = TRUE, fileEncoding = "UTF-16LE", sep = "\t")
Unfortunately I can't make read_csv work: I only get empty rows, and I can't find a way to even specify the encoding in the function.
I can't share my data, but if anyone has dealt with this encoding, any help would be appreciated.
You can specify file encodings with readr functions like read_csv via the locale option: locale = locale(encoding = "UTF-16LE"). However, I haven't successfully read in a UTF-16LE file with read_csv; I get an "Incomplete multibyte sequence" error. There's a related issue filed, but I still have problems with my file. Hopefully others will have more success.
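For reference, a sketch of the kind of call the answer describes (the file name is a placeholder, and it may still hit the "Incomplete multibyte sequence" error mentioned above):
library(readr)
# tab-delimited, comma as decimal mark, UTF-16LE encoding,
# matching the read.csv2 call from the question
df <- read_delim("data.csv",
                 delim = "\t",
                 skip = 1,
                 locale = locale(decimal_mark = ",", encoding = "UTF-16LE"))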

'Embedded nul in string' when importing large CSV (8 GB) with fread()

I have a large CSV file (8.1 GB) that I'm trying to wrangle into R. I created the CSV with Python csvkit's in2csv, converting from a .txt file, but somehow the conversion introduced null characters into the file. I'm now getting this error when importing:
Error in fread("file.csv", nrows = 100) :
embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0'
I am able to import small chunks just fine with read.csv, though, because it allows UTF-16 encoding via the fileEncoding argument.
test <- read.csv("file.csv", nrows=100, fileEncoding="UTF-16LE")
I don't dare try to import an 8 GB file with read.csv, though.
So I then tried the solution offered here, in which you use sed s/\\0//g file.csv > file2.csv to strip out the nulls. The command ran fine and produced a new 8 GB CSV file, but I received a nearly identical error:
Error in fread("file2.csv", nrows = 100) :
embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0,\0p\0o\0s\0t\0_\0z\0i
So that didn't work. I'm stumped at this point. Given the size of the file, I can't use read.csv on the whole thing, and I'm not sure how to get rid of the nulls in the original CSV. I'm not even sure how the file ended up encoded as UTF-16. Any suggestions or advice would be greatly appreciated.
Edit: I'm on a Windows machine.
If you're on Linux/Mac, try this:
library(data.table)

file <- "file.csv"
tt <- tempfile()  # or tempfile(tmpdir = "/dev/shm") to keep the copy in memory
# strip the embedded nul bytes with tr and write the result to the temp file
system(paste0("tr < ", file, " -d '\\000' > ", tt))
fread(tt)
A possible option would be to install a bash emulator on your machine from http://win-bash.sourceforge.net/ and remove the null characters using Linux tools, as described, for example, here: Identifying and removing null characters in UNIX, or here: 'Embedded nul in string' error when importing csv with fread.
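If installing extra tools isn't an option, here is an untested base-R sketch (file names are placeholders): since read.csv can already decode the file as UTF-16LE, the same connection machinery can re-encode it to UTF-8 in chunks, after which fread should be able to read it.
library(data.table)

# re-encode the UTF-16LE file to UTF-8 in chunks, without loading it all at once
src <- file("file.csv", open = "r", encoding = "UTF-16LE")
dst <- file("file_utf8.csv", open = "w", encoding = "UTF-8")
while (length(lines <- readLines(src, n = 100000)) > 0) {
  # drop a leading byte-order-mark character if present (only the first chunk has one)
  lines[1] <- sub("^\ufeff", "", lines[1])
  writeLines(lines, dst)
}
close(src)
close(dst)

dt <- fread("file_utf8.csv")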
I think the nonsensical characters happen because the file is compressed. This is what I found when trying to read vcf.gz files: fread does not seem to support reading compressed files. See e.g. https://github.com/Rdatatable/data.table/issues/717
readLines() and read.table() do support compressed files, but they are slower.
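For example, a minimal illustration of that last point (the .vcf.gz file name is a placeholder):
# base R reads gzip-compressed text transparently through a gzfile() connection,
# so readLines() (and read.table()) can take one directly
readLines(gzfile("file.vcf.gz"), n = 5)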
