Unable to read in csv with read.csv function in R

I've used the read.csv function for years and have never seen this error:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 10
I have a fairly standard .csv file I am trying to read in (download a copy here). Any ideas on what is going on?

It means that something is strange in your column names. Try passing the argument check.names = FALSE in your call, and make sure you are giving the right sep argument.
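A minimal sketch of that call, assuming the downloaded file has been saved as "data.csv" (the file name is a placeholder):

# check.names = FALSE stops make.names() from touching the header,
# which is where the "invalid multibyte string" error is thrown
df <- read.csv("data.csv", check.names = FALSE)

# if the header contains non-UTF-8 bytes, declaring an encoding can also help
df <- read.csv("data.csv", fileEncoding = "latin1")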

Some of your columns might have special characters. read_csv from the readr package should be able to deal with that well, and it is very fast too.
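A minimal sketch, again with a placeholder file name:

library(readr)
# read_csv() leaves column names untouched by default, so odd characters
# in the header never reach make.names()
df <- read_csv("data.csv")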

Related

Fread unusual line ending causing error

I am attempting to download a large database of NYC taxi data, publicly available at the NYC TLC website.
library(data.table)
feb14 <- fread('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv', header = T)
Executing the above code successfully downloads the data (which takes a few minutes), but then fails to parse due to an internal error. I have tried removing header = T as well.
Is there a workaround to deal with the "unusual line endings" in fread?
Error in fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Internal error. No eol2 immediately before line 3 after sep detection.
In addition: Warning message:
In fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.
It seems that the issue might be caused by the presence of a blank line between the header and the data in the original .csv file. Deleting that line from the .csv using Notepad++ fixed it for me.
Sometimes other options like read.csv/read.table behave differently, so you can always try those. (Maybe the source code explains why; I haven't looked into it.)
Another option is to use readLines() to read in such a file. As far as I know, no parsing or formatting is done there, so it is the most basic way to read a file.
Lastly, a quick fix: use the option skip = ... in fread, or control where reading ends with nrows = ....
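Combining the readLines() idea with fread, a sketch assuming the file has already been saved locally (the file name is a placeholder):

library(data.table)

# read raw lines, drop the blank line between the header and the data,
# then hand the cleaned text straight to fread()
lines <- readLines("yellow_tripdata_2014-02.csv")
lines <- lines[nzchar(trimws(lines))]  # keep only non-blank lines
feb14 <- fread(paste(lines, collapse = "\n"))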
There is something fishy with fread here. data.table is the faster, more performance-oriented option for reading large files, but in this case its behavior is not optimal. You may want to raise this issue on GitHub.
I am able to reproduce the issue on the downloaded file, even with nrows = 5 or even nrows = 1, but only if I stick to the original file. If I copy-paste the first few rows into a new file and try that, the issue is gone. The issue also goes away if I read directly from the web with a small nrows. This is not even an encoding issue, hence my recommendation to raise an issue.
I tried reading the file using read.csv with 100,000 rows, without an issue and in under 6 seconds.
feb14_2 <- read.csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", header = T, nrows = 100000)
header = T is a redundant argument, so it would not make a difference for fread, but it is needed for read.csv.

Missing lines when reading a csv file into R

I am having trouble reading a csv file into R. The file contains more than 10,000 lines, but only 4,977 lines are read into R, and there are no missing values in the file. My code is below:
mydata = read.csv("12260101.csv", quote = "\"", skipNul = TRUE)
write.csv(mydata, "check.csv")
It's hard to say without seeing the CSV file. You might want to compare rows that aren't being imported with the imported ones.
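A quick diagnostic sketch for that comparison (a stray quote character is a common culprit here, since it can make read.csv swallow several physical lines into a single field):

length(readLines("12260101.csv"))           # physical lines in the file
nrow(read.csv("12260101.csv"))              # rows read.csv actually parses
nrow(read.csv("12260101.csv", quote = ""))  # retry with quoting disabled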
I would try using the function read_csv() from the package readr or fread() from data.table.
As other posters pointed out, this is hard to reproduce without an example. I had a similar issue with read.csv, but fread worked without any problems, so it might be worth a try.

fread function changes the name of the first column in large csv files

I have a 2.3 GB csv file. When I read it using the fread function from the data.table library in R, it adds a stray symbol (the UTF-8 byte order mark, typically displayed as 'ï»¿') to the first column name.
So my data's first column was 'HistoryID'; after reading it through fread, it becomes 'ï»¿HistoryID'. Other columns remain unaffected.
Is there a specific encoding that should be used to solve this problem?
When I read the data with read.csv, the problem is solved by using the 'UTF-8-BOM' encoding, but the same doesn't seem to work for fread.
According to the R Data Import/Export manual on CRAN (R-data.html#Variations-on-read_002etable), byte order marks still cause problems with encoding and can be dealt with like this:
it can be read on Windows by
read.table("intro.dat", fileEncoding = "UTF-8")
but on a Unix-alike might need
read.table("intro.dat", fileEncoding = "UTF-8-BOM")
Check section 2.1, 'Variations on read.table'.
It also seems to suggest that read.csv uses this trick.
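For fread itself, a hedged workaround is to strip the BOM from the first column name after reading (recent data.table versions deal with the BOM automatically, so this should only matter on older ones):

library(data.table)

dt <- fread("data.csv")  # "data.csv" stands in for the 2.3 GB file
# remove a leading UTF-8 byte order mark from the first column name, if present
setnames(dt, 1, sub("^\ufeff", "", names(dt)[1]))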

What does skipNul = TRUE in read.table() and read.csv() do (beyond skipping/ignoring embedded nulls)?

I realize setting skipNul = TRUE in read.csv() and read.table() skips over/ignores "embedded nulls" (see ?read.csv and Get "embedded nul(s) found in input" when reading a csv using read.csv()).
What does skipping/ignoring embedded nulls mean for the resulting data in R? My guess is that "skipping" or "ignoring" them means the affected values are kept as text strings, when they would ideally show up as NA values; the na.strings argument wasn't sufficient to catch them.
This is very late, but as someone who ran into R's character-encoding hell and found the documentation sorely lacking: embedded nuls aren't NAs; they're invisible characters embedded in strings. You don't lose anything from the data by getting rid of them.
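A small self-contained sketch of that behavior (the file name is just for illustration):

# write a CSV whose second field contains an embedded nul byte
raw_bytes <- c(charToRaw("x,y\n1,a"), as.raw(0), charToRaw("b\n2,cd\n"))
writeBin(raw_bytes, "nul_demo.csv")

# without skipNul = TRUE, read.csv warns "embedded nul(s) found in input";
# with it, the nul is silently dropped and the field comes through as "ab"
df <- read.csv("nul_demo.csv", skipNul = TRUE)
df$y[1]  # "ab" -- no NA is introduced, the invisible byte just disappears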

alternatives to read.csv(textConnection())

I am downloading a 120 MB csv file from a web server using read.csv(textConnection(binarydata1)), and this is painfully slow. I tried pipe(), like this: read.csv(pipe(binarydata1)), but I get an error: Error in pipe(binarydata1) : invalid 'description' argument. Any help regarding this issue is much appreciated.
@jeremycg, @hrbrmstr
Suggestion
fread from the data.table package; or save the file to local storage via download.file (or functions in curl or httr) and then use data.table::fread as @jeremycg suggested, or readr::read_csv.
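A sketch of the local-storage route; url stands in for the authenticated endpoint from the question (if the server needs credentials, the curl or httr packages can pass them along):

library(data.table)

tmp <- tempfile(fileext = ".csv")
download.file(url, tmp, mode = "wb")  # "wb" keeps the bytes intact on Windows
dt <- fread(tmp)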
Response
The csv file I am dealing with is in binary format, so I am converting it to standard format using these functions:
library(RCurl)  # provides getURLContent()
t1 = getURLContent(url, userpwd = userpwd, httpauth = 1L, binary = TRUE)
t2 = readBin(t1, what = 'character', n = length(t1)/4)
When I try fread(t2) after converting from binary to standard format, I get an error:
Error in fread(t61) :
'input' must be a single character string containing a
file name, a command, full path to a file, a URL starting
'http://' or 'file://', or the input data itself
If I try fread directly, without converting from binary to standard format, it works with no problem; if I convert from binary to standard format first, it does not work.
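The error message hints at what is going on: readBin(what = 'character') returns a character vector (it splits the bytes at embedded nuls), while fread's input argument must be a single string. A hedged sketch of two ways around that:

library(data.table)

# collapse the vector into one string before handing it to fread
dt <- fread(paste(t2, collapse = "\n"))

# on data.table >= 1.11.0, the text argument accepts a character vector directly
dt <- fread(text = t2)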
Even though the question is 4 years old, it helped me with my current problem, where I also have a ~300 MB connection for which read.csv took ages.
I found the vroom function from the vroom package helpful here. It loaded my data like a charm: it took one minute, whereas I don't even know if read.csv(textConnection(...)) would ever have produced a result (I usually terminated R after 30 minutes with nothing).
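A minimal sketch of that approach (the file name is a placeholder):

library(vroom)

# vroom indexes the file lazily instead of parsing everything up front,
# which is why it feels so much faster on large inputs
df <- vroom("large_file.csv")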
