Comparing speed of fread vs. read.table for reading the first 1M rows out of 100M - r

I have a 14GB data.txt file. I was comparing the speed of fread and read.table by reading the first 1M rows. It looks like fread is much slower although it is not supposed to be. It takes some time until the percentage counts show up.
What could be the reason? I thought it was supposed to be super fast... I am using a Windows OS computer.

fread mmaps the file. This takes some time, and will map the whole file. This means subsequent "read-ins" will be faster.
read.table does not mmap the whole file. It can read in the file line by line [and stop at line 1000000].
You can see some background on mmap at mmap() vs. reading blocks
The examples in the help from fread highlight this behaiviour

Related

Reading large numeric TSV file into memory in R

I am trying to read a file representing a numeric matrix with 4.5e5 rows and 2e3 columns. First line is the header with ncol+1 words, while each row begins with a row name. In txt format it is around 17G in size.
I tried using:
read.table(fname, header=TRUE)
but the operation ate all 64G of RAM available. I assume it loaded it in a wrong structure.
Usually people discuss speed, is there a way to import it so it fits properly? Performance is not a primary issue.
EDIT: I managed to read it with read.table:
colclasses = c("character",rep("numeric",2000))
betas = read.table(beta_fname, header=TRUE, colClasses=colclasses, row.names=1)
But documentation still recommends "scan" for memory usage. What would be the "scan" alternative?
There are several things you might try. Google about reading large files and they might point you to using 'fread' in data.table. You can also try 'read_delim_chunked' that might help. Also break the file into smaller pieces, read each one in, write out an RDS file. When complete you might be able to read in the RDS files and combine using less space.

How can I load a large (3.96 gb) .tsv file in R studio

I want to load a 3.96 gigabyte tab separated value file to R and I have 8 ram in my system. How can I load this file to R to do some manipulation on it.
I tried library(data.table) to load my data
but I´ve got this error message (Error: cannot allocate vector of size 965.7 Mb)
I also tried fread with this code but it was not working either: it took a lot of time and at last it showed an error.
as.data.frame(fread(file name))
If I were you, I probably would
1) try your fread code once more without the typo (closing parenthesis was initially missing):
as.data.frame(fread(file name))
2) try to read the file in parts by specifying number of rows to read. This can be done in read.csv and fread with nrow arguments. By reading a small number of rows one could check and confirm that the file is actually readable before doing anything else. Sometimes files are malformed, there could be some special characters, wrong end-of-line characters, escaping or something else which needs to be addressed first.
3) have a look at bigmemory package which have read.big.matrix function. Also ff package has the desired functionalities.
Alternatively, I probably would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file for example with cut or awk to remove unnecessary columns. Do I absolutely need to read it as one file and have all data simultaneously in memory? If not, I could split the file or maybe use readLines..
ps. This topic is covered quite nicely in this post.
pps. Thanks to #Yuriy Barvinchenko for comment on fread
You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread(file name, data.table=FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()
From my experience and in addition to #Oka answer:
fread() have nrows= argument, so you can read first 10 lines.
If you found out that you don't need all lines and/or all columns, so you can set condition and list of fields just after fread()[]
You can use data.table as dataframe in many cases, so you can try to read without as.data.frame()
This way I worked with 5GB csv file.

How to get data into h2o fast

What my question isnt:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB Ram
~65 GB of data to upload. (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using "fread" to get the text into the space and then I make a few row/col transformations (diff's, lags) and try to import.
The total R memory is ~56GB before trying any sort of "as.h2o" so the 128 allocated shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping ram up to 128 GB in 'h2o.init'
using slam, data.table, and options( ...
convert to "as.data.frame" before "as.h2o"
write to csv file (r write.csv chokes and takes forever. It is writing a lot of GB though, so I understand).
write to sqlite3, too many columns for a table, which is weird.
Checked drive cache/swap to make sure there are enough GB there. Perhaps java is using cache. (still working)
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22GB (~2.4Mrows, ~2300 cols). For what it was worth, it took from 12:53pm until 2:44PM to write the csv file. Importing it was substantially faster, after it was written.
Think of as.h2o() as a convenience function, that does these steps:
converts your R data to a data.frame, if not already one.
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
call h2o.uploadFile() on that temp file
delete the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The decision of which to use is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.

fread memory usage is much larger than the file

I am on a 512gb ram server. I have a 84gig CSV (hefty, I know). I am reading only 31 columns of 79, where the excluded are all floats/decimals.
After comparing many methods, it seems the highest performance way to do what I want would be to fread the file. The file size is 84gb, but watching "top" the process uses 160 gigs of memory (RES), even though the size of the eventual data.table is about 20gigs.
I know fread preallocates memory which is why it's so fast. Just wondering - is this normal and is there a way to curb the memory consumption?
Edit: it seems like, if I just ask fread to read 10000 rows (of 300MM), fread will still preallocate 84 gigs of memory.
See R FAQ 7.42. If you want to minimize the resources you use on the server, read the csv using fread once, then save the resulting object using save or saveRDS. Then read that binary file when you need the data.
Or you can use a command line tool like cut, awk, sed, etc to only select the columns you want and write the output to another file. Then you can use fread on that smaller file.
Try to see http://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ or Reading 40 GB csv file into R using bigmemory.
May be bigmemory library helps you.

Reading large files into R

I am a newbie to R, but I am aware that it chokes on "big" files. I am trying to read a 200MB data file. I have tried it in csv format and also converting it to tab delimited txt but in both cases I use up my 4GB of RAM before the file loads.
Is it normal that R would use 4GB or memory to load a 200MB file, or could there be something wrong with the file and it is causing R to keep reading a bunch of nothingness in addition to the data?
From ?read.table
Less memory will be used if colClasses is specified as one of the six atomic vector classes.
...
Using nrows, even as a mild over-estimate, will help memory usage.
Use both of these arguments.
Ensure that you properly specify numeric for your numeric data. See here: Specifying colClasses in the read.csv
And do not under-estimate nrows.
If you're running 64-bit R, you might try the 32-bit version. It will use less memory to hold the same data.
See here also: Extend memory size limit in R

Resources