fread changes the name of the first column in large CSV files - r

I have a 2.3 GB CSV file. When I read it with the fread function from R's data.table library, stray characters are prepended to the name of the first column.
So my data's first column was 'HistoryID'; after reading it through fread, the name becomes 'HistoryID' with extra leading characters (apparently a byte order mark). The other columns remain unaffected.
Is there a specific encoding that should be used to solve this problem?
When I read the data with read.csv, the problem goes away if I use the 'UTF-8-BOM' encoding, but the same doesn't seem to work for fread.

According to the R Data Import/Export manual on CRAN (R-data.html#Variations-on-read_002etable), byte order marks can still cause problems with encoding and can be dealt with like this:
it can be read on Windows by
read.table("intro.dat", fileEncoding = "UTF-8")
but on a Unix-alike might need
read.table("intro.dat", fileEncoding = "UTF-8-BOM")
See section 2.1, Variations on read.table.
It also seems to suggest that read.csv uses this trick.
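A minimal base-R sketch of the read.csv behaviour described above. The temporary file and its contents are made up for illustration (only the 'HistoryID' name comes from the question), and whether fread needs an equivalent option may depend on the data.table version, so only the base function is shown.

```r
# Build a tiny CSV that starts with the UTF-8 byte order mark (EF BB BF).
# The file and its values are hypothetical stand-ins for the real data.
path <- tempfile(fileext = ".csv")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("HistoryID,Value\n1,a\n2,b\n")), path)

# Declaring the encoding as UTF-8-BOM asks R to drop the mark on input,
# so the first column name comes through without a stray prefix.
clean <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(clean)
```

If the prefix still shows up with another reader, renaming the first column after import is a simple fallback.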

Related

openxlsx package, read.xlsx check.names=false still placing a . in column names

Usually I use the tidyverse to read in Excel files with the read_excel command. However, I encountered the dreaded "Unknown or uninitialised column" bug, which refers to a nonexistent column and then keeps warning about that column throughout the rest of the workflow.
So I decided to use openxlsx instead to read in the excel files. All was going well until I realised that openxlsx sees column names with white space as not syntactically correct and it adds a . to replace the whitespace. So 'Customer Name' becomes 'Customer.Name'.
I tried using the check.names=FALSE argument to leave the headers intact, but the package seems to ignore it.
Many of the headers might have more than a single space between the words and the format has to stay the same. I cannot use an excel package that relies on Java as our company has blocked it.
How can I force openxlsx to leave the header alone?
Example of the code I am using is here: IMACS <- read.xlsx("//zfsstdscun001a.rz.ch.com/UKGI_Pricing/Bus_Insights/R_Scripts/IMACS.xlsx",check.names=FALSE, sheet = "IMACS")
All credit to @Matt on this.
Using readxl and read_excel together worked a treat.
IMACS <- readxl::read_excel("//zfsstdscun001a.rz.com/UKGI_Pricing/Bus_Insights/R_Scripts/CAT Risks/IMACSV2.xlsx",
sheet = "IMACS")

Unable to read in csv with read.csv function

I've used the read.csv function for years and never seen this error.
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 10
I have a fairly standard .csv file I am trying to read in (download a copy here). Any ideas on what is going on?
It means that something is strange in your column names. Try using the argument check.names = FALSE in your call. Also make sure you are passing the right sep argument.
Some of your columns might have special characters.
read_csv from readr package should be able to deal with that well and it is very fast too.
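To make the check.names suggestion concrete, here is a small sketch with base read.csv. The inline CSV text and its column names are hypothetical; the point is only to show what the argument changes about the header.

```r
# Hypothetical two-column data whose headers contain spaces
# (the second header has two spaces between the words).
csv_text <- "Customer Name,Total  Sales\nAcme,10\nBolt,20\n"

# Default behaviour: the header is passed through make.names(), which
# replaces every character that is invalid in an R name with a dot.
checked <- read.csv(text = csv_text)

# check.names = FALSE skips make.names() and keeps the header as written.
unchecked <- read.csv(text = csv_text, check.names = FALSE)

names(checked)
names(unchecked)
```

Because make.names() is skipped, this also sidesteps errors raised inside that function, though it does not repair a file whose encoding is declared wrongly.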

Missing lines when reading a CSV file into R

I'm having trouble reading a CSV file into R. The file contains more than 10000 lines, but only 4977 lines are read into R, and there are no missing values in the file. My code is below:
mydata = read.csv("12260101.csv", quote = "\"", skipNul = TRUE)
write.csv(mydata, "check.csv")
It's hard to say without seeing the CSV file. You might want to compare rows that aren't being imported with the imported ones.
I would try using the function read_csv() from the package readr or fread() from data.table.
As other posters pointed out, hard to reproduce without an example. I had a similar issue with read.csv but fread worked without any problems. Might be worth a shot to give it a try.
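One way to compare what was imported against what is in the file is to count raw lines and parsed rows side by side. The sketch below assumes, purely for illustration, that an unbalanced double quote is what swallows the rows; the real cause in the question's file is unknown, and the sample file here is made up.

```r
# Hypothetical file: the quote opened on the third line only closes on
# the fifth, so a quote-aware parser treats those lines as one field.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,a", "2,\"b", "3,c", "4,d\"", "5,e"), path)

# Number of data lines in the raw file (header excluded).
n_lines <- length(readLines(path)) - 1

# Parsed row counts with quote handling on and off.
with_quotes <- read.csv(path, quote = "\"")
no_quotes   <- read.csv(path, quote = "")  # quotes treated as plain text

c(lines = n_lines, quoted = nrow(with_quotes), unquoted = nrow(no_quotes))
```

If the unquoted count matches the line count while the quoted one falls short, stray quote characters are a likely suspect.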

Fast method to read csv with UTF-16LE encoding

I'm dealing with .csv files in UTF-16LE encoding. The call below reads the files correctly, but read.csv is very slow compared to read_csv.
read.csv2(path,dec=",",skip=1,header=T,fileEncoding="UTF-16LE",sep="\t")
Unfortunately I can't make read_csv work: I only get empty rows, and I can't find a way to even specify the encoding in the function.
I can't share my data, but if anyone dealt with this encoding any help would be appreciated.
You can specify file encodings with readr functions like read_csv with the locale option: locale=locale(encoding="UTF-16LE"). However, I haven't successfully read in a utf-16le file with read_csv. I get an "Incomplete multibyte sequence" error. There's a related issue filed, but I still have issues with my file -- hopefully others will have more success.
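A possible workaround, sketched here with base R only, is to re-encode the file to UTF-8 first and then hand the copy to whichever reader is fastest. The sample file, its column names, and the temporary paths are assumptions for illustration; this has not been tried on the asker's data, and it relies on the platform's iconv supporting UTF-16LE.

```r
# Create a small tab-separated UTF-16LE file as a stand-in for the real data.
src <- tempfile(fileext = ".csv")
con <- file(src, open = "w", encoding = "UTF-16LE")
writeLines(c("name\tvalue", "a\t1,5", "b\t2,25"), con)
close(con)

# Re-encode: read through a UTF-16LE connection, then write the lines
# back out as UTF-8 bytes.
con <- file(src, open = "r", encoding = "UTF-16LE")
lines <- readLines(con)
close(con)
utf8_copy <- tempfile(fileext = ".tsv")
writeLines(enc2utf8(lines), utf8_copy, useBytes = TRUE)

# The UTF-8 copy can now go to any reader. Base read.delim2 is used here
# because it defaults to tab separators and decimal commas.
converted <- read.delim2(utf8_copy)
converted
```

The extra pass over the file costs time of its own, so whether this beats read.csv2 with fileEncoding on large inputs would need measuring.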

how to have fread perform like read.delim

I've got a large tab-delimited data table that I am trying to read into R using the data.table package fread function. However, fread encounters an error. If I use read.delim, the table is read in properly, but I can't figure out how to configure fread such that it handles the data properly.
In an attempt to find a solution, I've installed the development version of data.table, so I am currently running data.table v1.9.7, under R v3.2.2, running on Ubuntu 15.10.
I've isolated the problem to a few lines from my large table, and you can download it here.
When I used fread:
> fread('problemRows.txt')
Error in fread("problemRows.txt") :
Expecting 8 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
I tried using the parameters used by read.delim:
fread('problemRows.txt', sep="\t", quote="\"")
but I get the same error.
Any thoughts on how to get this to read in properly? I'm not sure what exactly the problem is.
Thanks!
With the recent commit c1b7cda, fread's quote logic got a bit cleverer at handling such tricky cases. With it, the following should just work:
require(data.table) # v1.9.7+
fread("my_file.txt")
The error message is now also more informative when fread is unable to handle the file. See #1462.
As explained in the comments, specifying the quote argument did the trick.
fread("my_file.txt", quote="")
