Reading large csv file in R - r

I have a number of csv-files of different size, but all somewhat big. Using read.csv to read them into R takes longer than I've been patient to wait so far (several hours). I managed to read the biggest file (2.6 gb) very fast (less than a minute) with data.table's fread.
My problem occurs when I try to read a file of half the size. I get the following error message:
Error in fread("C:/Users/Jesper/OneDrive/UdbudsVagten/BBR/CO11700T.csv",:
Expecting 21 cols, but line 2557 contains text after processing all
cols. It is very likely that this is due to one or more fields having
embedded sep=';' and/or (unescaped) '\n' characters within unbalanced
unescaped quotes.
fread cannot handle such ambiguous cases and those
lines may not have been read in as expected. Please read the section
on quotes in ?fread.
Through research I've found suggestions to add quote = "" to the code, but it doesn't help me. I've tried using the bigmemory package, but R crashes when I try. I'm on a 64 bit system with 8 gb of ram.
I know there are quite a few threads on this subject, but I haven't been able to solve the problem with any of the solutions. I would really like to use fread (given my good experience with the bigger file), and it seems like there should be some way to make it work - just can't figure it out.

Solved this by installing SlickEdit and using it to edit the lines that caused the trouble. A few characters like ampersand, quotation marks, and apostrophes were consistently encoded to include semicolon - e.g. & instead of just &. As semicolon was the seperator in the text document, this caused the problem in reading with fread.

Related

How can I load a large (3.96 gb) .tsv file in R studio

I want to load a 3.96 gigabyte tab separated value file to R and I have 8 ram in my system. How can I load this file to R to do some manipulation on it.
I tried library(data.table) to load my data
but I´ve got this error message (Error: cannot allocate vector of size 965.7 Mb)
I also tried fread with this code but it was not working either: it took a lot of time and at last it showed an error.
as.data.frame(fread(file name))
If I were you, I probably would
1) try your fread code once more without the typo (closing parenthesis was initially missing):
as.data.frame(fread(file name))
2) try to read the file in parts by specifying number of rows to read. This can be done in read.csv and fread with nrow arguments. By reading a small number of rows one could check and confirm that the file is actually readable before doing anything else. Sometimes files are malformed, there could be some special characters, wrong end-of-line characters, escaping or something else which needs to be addressed first.
3) have a look at bigmemory package which have read.big.matrix function. Also ff package has the desired functionalities.
Alternatively, I probably would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file for example with cut or awk to remove unnecessary columns. Do I absolutely need to read it as one file and have all data simultaneously in memory? If not, I could split the file or maybe use readLines..
ps. This topic is covered quite nicely in this post.
pps. Thanks to #Yuriy Barvinchenko for comment on fread
You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread(file name, data.table=FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()
From my experience and in addition to #Oka answer:
fread() have nrows= argument, so you can read first 10 lines.
If you found out that you don't need all lines and/or all columns, so you can set condition and list of fields just after fread()[]
You can use data.table as dataframe in many cases, so you can try to read without as.data.frame()
This way I worked with 5GB csv file.

data.table v.1.11.0+ no longer freads data file that was fread by v.1.10.4-3

I've encountered a possible bug in the new version of data.table. I have a 2GB .csv file with c. 3 million rows and 67 columns. I can use fread() to read it all fine from data.table v.1.10.4-3, but v.1.11.0+ terminates at a row somewhere down the middle. The base read.csv() also hits the same problem. I really like data.table and want to create a bug report on Github, but obviously I can't upload the 2GB data file anywhere.
I need a way of splicing maybe ~10 rows around the problematic point (the row number is known) in order to create a portable reproducible example. Any ideas how I can do that without reading in the .csv file?
Also, is there a program I can use to open the raw file to look at the problematic point and see what causes the issue? Notepad/Excel won't open a file this big.
EDIT: the verbose output.
EDIT2: this is the problematic line. It shows that what is supposed to be one line is somehow split into 3 lines. I can only assume it is due to an export bug in an ancient software (SAP Business Objects) that was used to create the CSV. It is unsurprising that it causes an issue. However, it surprising that data.table v.1.10.4-3 was able to handle it in a smart way and read it correctly, whereas v.1.11.0+ could not. Could it do something with encoding or technical hidden characters?
EDIT3: proof that this is what really happens.
Thanks for including the output. It shows that fread is issuing a warning. Did you miss this warning before?
Warning message:
In fread("Data/FP17s with TCD in March 2018.csv", na.strings = c("#EMPTY", :
Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,>>
This is very helpful, surely. It tells you the line number: 138986. It says that this line is 22 fields but it expects 67. Could the warning be better by stating why it is expecting 67 fields at that point (e.g. by saying there are 67 column names and it has seen 67 columns up to that point?) It gives you a hint of what to try (fill=TRUE) which would fill that too-short line with NA in columns 23:67. Then it includes the data from the line, too.
Does it work with fill=TRUE, as the warning message suggests?
You say it worked in 1.10.4-3 but I suspect it's more likely it stopped early there too, but without warning. If so, that was a bug not to warn, now fixed.
Using Powershell on Windows:
Get-Content YourFile.csv | Select -Index (0,19,20,21,22) > OutputFileName.csv
would dump the header and lines 20-23 into a new file.
Use a combination of skip and nrow:
You mentioned that you have no problem reading the file with v.1.10.4-3, right?. So use that to skip most of the .csv and set nrow to the number of rows you want. Once you have that data.table, you can write that portion of the file and you have a portable reproducible example.
For example:
DT <- fread(my_file.csv, skip=138981, nrow=10)

Fread unusual line ending causing error

I am attempting to download a large database of NYC taxi data, publicly available at the NYC TLC website.
library(data.table)
feb14 <- fread('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv', header = T)
Executing the above code successfully downloads the data (which takes a few minutes), but then fails to parse due to an internal error. I have tried removing header = T as well.
Is there a workaround in order to deal with the "unusual line endings" in fread ?
Error in fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Internal error. No eol2 immediately before line 3 after sep detection.
In addition: Warning message:
In fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.
It seems that the issues might be caused due the presence of a blank line between the header and data in the original .csv file. Deleting the line from the .csv using notepad++ seemed to fix it for me.
Sometimes other options like read.csv/read.table can behave differently... so you can always try that. (Maybe the source code tells why, havent looked into that).
Another option is to use readLines() to read in such a file. As far as I know, no parsing/formatting is done here. So this is, as far as I know, the most basic way to read a file
At last, a quick fix: use the option 'skip = ...' in fread, or control the end by saying 'nrows = ...'.
There is something fishy with fread. data.table is the faster, more performance oriented for reading large files, however in this case the behavior is not optimal. You may want to raise this issue on github
I am able to reproduce the issue on downloaded file even with nrows = 5 or even with nrows = 1 but only if stick to the original file. If I copy paste the first few rows and then try, the issue is gone. The issue also goes away if I read directly from the web with small nrows. This is not even an encoding issue, hence my recommendation to raise an issue.
I tried reading the file using read.csv and 100,000 rows without an issue and under 6 seconds.
feb14_2 <- read.csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", header = T, nrows = 100000)
header = T is a redundant argument so would not make a difference for fread but is needed for read.csv.

Deal with escaped commas in CSV file?

I'm reading in a file in R using fread as such
test.set = fread("file.csv", header=FALSE, fill=TRUE, blank.lines.skip=TRUE)
Where my csv consists of 6 columns. An example of a row in this file is
"2014-07-03 11:25:56","61073a09d113d3d3a2af6474c92e7d1e2f7e2855","Securenet Systems Radio Playlist Update","Your Love","Fred Hammond & Radical for Christ","50fcfb08424fe1e2c653a87a64ee92d7"
However, certain rows are formatted in a particular way when there is a comma inside one of the cells. For instance,
"2014-07-03 11:25:59","37780f2e40f3af8752e0d66d50c9363279c55be6","Spotify","\"Hello\", He Lied","Red Box","b226ff30a0b83006e5e06582fbb0afd3"
produces an error of the sort
Expecting 6 cols, but line 5395818 contains text after processing all
cols. Try again with fill=TRUE. Another reason could be that fread's
logic in distinguishing one or more fields having embedded sep=','
and/or (unescaped) '\n' characters within unbalanced unescaped quotes
has failed. If quote='' doesn't help, please file an issue to figure
out if the logic could be improved.
As you can see, the value that is causing the error is "\"Hello\", He Lied", which I want to be read by fread as "Hello, He Lied". I'm not sure how to account for this, though - I've tried using fill=TRUE and quote="" as suggested, but the error still keeps coming up. It's probably just a matter of finding the right parameter(s) for fread; anyone know what those might be?
In read.table() from base R this issue is solvable.
Using Import data into R with an unknown number of columns?
In fread from data.table this is not possible.
Issue logged for this : https://github.com/Rdatatable/data.table/issues/2669

how to have fread perform like read.delim

I've got a large tab-delimited data table that I am trying to read into R using the data.table package fread function. However, fread encounters an error. If I use read.delim, the table is read in properly, but I can't figure out how to configure fread such that it handles the data properly.
In an attempt to find a solution, I've installed the development version of data.table, so I am currently running data.table v1.9.7, under R v3.2.2, running on Ubuntu 15.10.
I've isolated the problem to a few lines from my large table, and you can download it here.
When I used fread:
> fread('problemRows.txt')
Error in fread("problemRows.txt") :
Expecting 8 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
I tried using the parameters used by read.delim:
fread('problemRows.txt', sep="\t", quote="\"")
but I get the same error.
Any thoughts on how to get this to read in properly? I'm not sure what exactly the problem is.
Thanks!
With this recent commit c1b7cda, fread's quote logic got a bit cleverer in handling such tricky cases. With this:
require(data.table) # v1.9.7+
fread("my_file.txt")
should just work. The error message is now more informative as well if it is unable to handle. See #1462.
As explained in the comments, specifying the quotes argument did the trick.
fread("my_file.txt", quote="")

Resources