fread memory usage is much larger than the file - r

I am on a server with 512 GB of RAM. I have an 84 GB CSV (hefty, I know). I am reading only 31 of its 79 columns; the excluded columns are all floats/decimals.
After comparing many methods, it seems the fastest way to do what I want is to fread the file. The file is 84 GB, but watching "top", the process uses 160 GB of memory (RES), even though the eventual data.table is only about 20 GB.
I know fread preallocates memory, which is why it's so fast. Just wondering: is this normal, and is there a way to curb the memory consumption?
Edit: it seems that even if I ask fread to read only 10,000 rows (of 300 million), it still preallocates 84 GB of memory.

See R FAQ 7.42. If you want to minimize the resources you use on the server, read the csv using fread once, then save the resulting object using save or saveRDS. Then read that binary file when you need the data.
Or you can use a command line tool like cut, awk, sed, etc to only select the columns you want and write the output to another file. Then you can use fread on that smaller file.
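As a rough illustration of both suggestions, the sketch below reads only selected columns and then caches the result as an RDS file. The tiny temporary CSV and its column names are made up for the example. `select` and `cmd` are existing fread arguments; the `cut` call assumes a Unix-like shell and a CSV with no quoted commas.

```r
library(data.table)

# Stand-in for the large CSV (hypothetical columns, illustrative data only).
csv_path <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:5,
                  price = c(1.5, 2.5, 3.5, 4.5, 5.5),
                  city = letters[1:5]),
       csv_path)

# Option A: let fread skip the unwanted columns itself.
dt <- fread(csv_path, select = c("id", "city"))

# Option B: pre-filter with a command line tool and read its output.
# Assumes `cut` is on the PATH and no field contains a quoted comma.
dt_cut <- fread(cmd = paste("cut -d, -f1,3", shQuote(csv_path)))

# Parse the text once, then reuse the binary copy in later sessions.
rds_path <- tempfile(fileext = ".rds")
saveRDS(dt, rds_path)
dt_again <- readRDS(rds_path)
```

Whether this lowers peak memory during the first read is not something the sketch demonstrates; it only shows the mechanics of reading fewer columns and avoiding repeated CSV parsing.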

See http://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ or the question "Reading 40 GB csv file into R using bigmemory".
The bigmemory library may help you.

Related

Partially read really large csv.gz in R using vroom

I have a csv.gz file that (from what I've been told) before compression was 70GB in size. My machine has 50GB of RAM, so anyway I will never be able to open it as a whole in R.
I can load for example the first 10m rows as follows:
library(vroom)
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7)
For what I have to do, it is fine to load 10m rows at the time, do my operations, and continue with the next 10m rows. I could do this in a loop.
I was therefore trying the skip argument.
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7, skip = 10^7)
This results in an error:
Error: The size of the connection buffer (131072) was not large enough
to fit a complete line:
* Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`
I increased this with Sys.setenv("VROOM_CONNECTION_SIZE" = 131072*1000), however, the error persists.
Is there a solution to this?
Edit: I found out that random access to a gzip compressed csv (csv.gz) is not possible. We have to start from top. Probably the easiest is to decompress and save, then skip should work.
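Because a gzip stream can only be read from the top, one way to process it in pieces without decompressing to disk is to open a single connection and keep reading from it, so each chunk resumes where the previous one ended. The following is a minimal base R sketch of that idea, not a vroom solution; the small generated file and the chunk size of 10 are placeholders, and it assumes no field contains an embedded newline.

```r
# Build a tiny gzipped CSV to stand in for HUGE.csv.gz (illustrative data only).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "wt")
write.csv(data.frame(id = 1:25, value = (1:25) * 2), con, row.names = FALSE)
close(con)

# Open the compressed file once; successive reads continue from the
# current position instead of restarting at the top.
con <- gzfile(gz_path, "rt")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunk_size <- 10            # would be something like 1e7 for the real file
row_counts <- integer(0)
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  # ... do the per-chunk work on `chunk` here ...
  row_counts <- c(row_counts, nrow(chunk))
}
close(con)
```

Parsing text line by line like this is slower than fread or vroom, so decompressing once and then using `skip` may still be the better trade-off when disk space allows.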
I haven't been able to figure out a vroom solution for very large, more-than-RAM (gzipped) csv files. However, the following approach has worked well for me, and I'd be grateful to know about approaches with better querying speed that also save disk space.
Use the split sub-command in xsv from https://github.com/BurntSushi/xsv to split the large csv file into comfortably-within-RAM chunks of, say, 10^5 lines and save them in a folder.
Read all chunks using data.table::fread one-by-one (to avoid low-memory error) using a for loop and save all of them into a folder as compressed parquet files using arrow package which saves space and prepares the large table for fast querying. For even faster operations, it is advisable to re-save the parquet files partitioned by the fields by which you need to frequently filter.
Now you can use arrow::open_dataset and query that multi-file parquet folder using dplyr commands. It takes minimum disk space and gives the fastest results in my experience.
I use data.table::fread with explicit definition of column classes of each field for fastest and most reliable parsing of csv files. readr::read_csv has also been accurate but slower. However, auto-assignment of column classes by read_csv as well as the ways in which you can custom-define column classes by read_csv is actually the best - so less human-time but more machine-time - which means that it may be faster overall depending on scenario. Other csv parsers have thrown errors for the kind of csv files that I work with and waste time.
You may now delete the folder containing chunked csv files to save space, unless you want to experiment loop-reading them with other csv parsers.
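The chunk-to-parquet workflow described above might look roughly like the sketch below. It assumes the csv chunks already exist in a folder (here they are generated as small stand-ins), and the column names `id` and `grp` are hypothetical. The default parquet compression is used; a `compression` argument can be passed to `write_parquet()` if a specific codec is wanted.

```r
library(data.table)
library(arrow)
library(dplyr)

chunk_dir <- tempfile("csv_chunks_")   # folder holding the split csv files
pq_dir    <- tempfile("pq_chunks_")    # folder for the parquet output
dir.create(chunk_dir)
dir.create(pq_dir)

# Stand-in for the output of the splitting step (illustrative data only).
for (i in 1:3) {
  fwrite(data.table(id = ((i - 1) * 4 + 1):(i * 4),
                    grp = rep(c("a", "b"), 2)),
         file.path(chunk_dir, paste0(i, ".csv")))
}

# Read one chunk at a time with explicit column classes and save as parquet.
for (csv_file in list.files(chunk_dir, pattern = "\\.csv$", full.names = TRUE)) {
  chunk <- fread(csv_file, colClasses = c(id = "integer", grp = "character"))
  out <- file.path(pq_dir, sub("\\.csv$", ".parquet", basename(csv_file)))
  write_parquet(chunk, out)
}

# Query the multi-file parquet folder lazily; only the result is collected.
ds <- open_dataset(pq_dir)
result <- ds %>%
  filter(grp == "a") %>%
  select(id) %>%
  collect()
```

Only the filtered rows reach R memory at `collect()`, which is what makes the folder usable even when the full table would not fit in RAM.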
Other previously successful approaches: loop-read all csv chunks as mentioned above and save them into:
a folder using disk.frame package. Then that folder may be queried using dplyr or data.table commands explained in the documentation. It has facility to save in compressed fst files which saves space, though not as much as parquet files.
a table in DuckDB database which allows querying with SQL or dplyr commands. Using database-tables approach won't save you disk space. But DuckDB also allows querying partitioned/un-partitioned parquet files (which saves disk space) with SQL commands.
EDIT: - Improved Method Below
I experimented a little and found a much better way to do the above operations. Using the code below, the large (compressed) csv file will be chunked automatically within R environment (no need to use any external tool like xsv) and all chunks will be written in parquet format in a folder ready for querying.
library(readr)
library(arrow)
fyl <- "...path_to_big_data_file.csv.gz"
pqFolder <- "...path_to_folder_where_chunked_parquet_files_are_to_be_saved"
f <- function(x, pos){
  write_parquet(x,
                file.path(pqFolder, paste0(pos, ".parquet")),
                compression = "gzip",
                compression_level = 9)
}
read_csv_chunked(
  fyl,
  col_types = list(Column1="f", Column2="c", Column3="T", ...), # all column specifications
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 10^6)
If, instead of parquet, you want to use -
disk.frame, the callback function may be used to create chunked compressed fst files for dplyr or data.table style querying.
DuckDB, the callback function may be used to append the chunks into a database table for SQL or dplyr style querying.
By judiciously choosing the chunk_size parameter of readr::read_csv_chunked command, the computer should never run out of RAM while running queries.
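For the DuckDB variant, the callback could append each chunk to a database table, along the lines of the sketch below. The table name `big_table`, the two columns, and the in-memory database are assumptions made for the example; a real run would point `dbConnect()` at a database file and use a much larger `chunk_size`.

```r
library(readr)
library(DBI)
library(duckdb)

# Small stand-in for the big csv file (illustrative data only).
csv_path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:25, grp = rep(c("a", "b", "c", "d", "e"), 5)),
          csv_path)

con <- dbConnect(duckdb::duckdb())   # in-memory database for the example

# Callback: create the table on the first chunk, append afterwards.
append_chunk <- function(x, pos) {
  if (dbExistsTable(con, "big_table")) {
    dbAppendTable(con, "big_table", x)
  } else {
    dbWriteTable(con, "big_table", x)
  }
}

read_csv_chunked(
  csv_path,
  callback = SideEffectChunkCallback$new(append_chunk),
  chunk_size = 10,
  col_types = cols(id = col_integer(), grp = col_character()))

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM big_table")$n
dbDisconnect(con, shutdown = TRUE)
```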
PS: I use gzip compression for parquet files since they can then be previewed with ParquetViewer from https://github.com/mukunku/ParquetViewer. Otherwise, zstd (not currently supported by ParquetViewer) decompresses faster and hence improves reading speed.
EDIT 2:
I got a csv file which was really big for my machine: 20 GB gzipped and expands to about 83 GB, whereas my home laptop has only 16 GB. Turns out that the read_csv_chunked method I mentioned in earlier EDIT fails to complete. It always stops working after some time and does not create all parquet chunks. Using my previous method of splitting the csv file with xsv and then looping over them creating parquet chunks worked. To be fair, I must mention it took multiple attempts this way too and I had programmed a check to create only additional parquet chunks when running the program on successive attempts.
EDIT 3:
VROOM does have difficulty when dealing with huge files since it needs to store the index in memory as well as any data you read from the file. See development thread https://github.com/r-lib/vroom/issues/203
EDIT 4:
Additional tip: The chunked parquet files created by the above mentioned method may be very conveniently queried using SQL with DuckDB method mentioned at
https://duckdb.org/docs/data/parquet
and
https://duckdb.org/2021/06/25/querying-parquet.html
DuckDB method is significant because R Arrow method currently suffers from a very serious limitation which is mentioned in the official documentation page https://arrow.apache.org/docs/r/articles/dataset.html.
Specifically, and I quote: "In the current release, arrow supports the dplyr verbs mutate(), transmute(), select(), rename(), relocate(), filter(), and arrange(). Aggregation is not yet supported, so before you call summarise() or other verbs with aggregate functions, use collect() to pull the selected subset of the data into an in-memory R data frame."
The problem is that if you use collect() on a very big dataset, the RAM usage spikes and the system crashes. Whereas, using SQL statements to do the same aggregation job on the same big-dataset with DuckDB does not cause RAM usage spikes and does not cause system crash. So until Arrow fixes itself for aggregation queries for big-data, SQL from DuckDB provides a nice solution to querying big datasets in chunked parquet format.
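A minimal sketch of such an aggregation query through the DuckDB R client is shown below. The parquet chunks are generated by DuckDB itself so the example stands alone, and the column names `id` and `grp` are hypothetical; `read_parquet()` with a glob pattern is the DuckDB SQL function for scanning several files at once.

```r
library(DBI)
library(duckdb)

pq_dir <- tempfile("pq_")
dir.create(pq_dir)
con <- dbConnect(duckdb::duckdb())

# Stand-in parquet chunks (illustrative data only): ids 0..11 in three files.
for (i in 1:3) {
  dbExecute(con, sprintf(
    "COPY (SELECT range AS id, range %% 2 AS grp FROM range(%d, %d)) TO '%s' (FORMAT PARQUET)",
    (i - 1) * 4, i * 4, file.path(pq_dir, paste0(i, ".parquet"))))
}

# Aggregate across all chunks in SQL; only the small result is returned to R.
agg <- dbGetQuery(con, sprintf(
  "SELECT grp, COUNT(*) AS n, SUM(id) AS total
   FROM read_parquet('%s')
   GROUP BY grp
   ORDER BY grp",
  file.path(pq_dir, "*.parquet")))

dbDisconnect(con, shutdown = TRUE)
```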

How can I load a large (3.96 gb) .tsv file in R studio

I want to load a 3.96 gigabyte tab-separated value file into R, and I have 8 GB of RAM in my system. How can I load this file into R to do some manipulation on it?
I tried library(data.table) to load my data
but I got this error message (Error: cannot allocate vector of size 965.7 Mb)
I also tried fread with this code, but it did not work either: it took a long time and finally showed an error.
as.data.frame(fread(file name))
If I were you, I probably would
1) try your fread code once more without the typo (closing parenthesis was initially missing):
as.data.frame(fread(file name))
2) try to read the file in parts by specifying the number of rows to read. This can be done in read.csv and fread with the nrows argument. By reading a small number of rows, one can check and confirm that the file is actually readable before doing anything else. Sometimes files are malformed: there could be special characters, wrong end-of-line characters, escaping, or something else that needs to be addressed first.
3) have a look at the bigmemory package, which has the read.big.matrix function. The ff package also has the desired functionality.
Alternatively, I probably would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file for example with cut or awk to remove unnecessary columns. Do I absolutely need to read it as one file and have all data simultaneously in memory? If not, I could split the file or maybe use readLines..
ps. This topic is covered quite nicely in this post.
pps. Thanks to @Yuriy Barvinchenko for the comment on fread
You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread(file name, data.table=FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()
From my experience, and in addition to @Oka's answer:
fread() has an nrows= argument, so you can read just the first 10 lines.
If you find that you don't need all rows and/or all columns, you can put a filter condition and a list of fields in the brackets right after the call: fread()[]
You can use a data.table as a data frame in many cases, so you can try reading without as.data.frame()
This is how I worked with a 5 GB csv file.
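The suggestions in these answers can be sketched as follows. The temporary tab-separated file and its columns (`id`, `score`, `label`) are invented for the example; `nrows`, `select`, and `data.table` are existing fread arguments.

```r
library(data.table)

# Small stand-in for the 3.96 GB .tsv file (illustrative data only).
tsv_path <- tempfile(fileext = ".tsv")
fwrite(data.table(id = 1:100,
                  score = (1:100) / 10,
                  label = rep(c("x", "y"), 50)),
       tsv_path, sep = "\t")

# 1. Peek at the first rows to confirm the file parses as expected.
preview <- fread(tsv_path, nrows = 10)

# 2. Read only the needed columns, then filter rows right after the call.
subset_dt <- fread(tsv_path, select = c("id", "score"))[score > 9.5]

# 3. If a plain data.frame is required, ask for it directly instead of
#    wrapping the call in as.data.frame().
plain_df <- fread(tsv_path, data.table = FALSE)

# Free memory held by objects that are no longer referenced.
invisible(gc())
```

Note that the row filter in step 2 is applied after the selected columns have been read, so it reduces what is kept rather than what is parsed.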

How to get data into h2o fast

What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB Ram
~65 GB of data to upload. (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using "fread" to get the text into the space and then I make a few row/col transformations (diff's, lags) and try to import.
The total R memory is ~56GB before trying any sort of "as.h2o" so the 128 allocated shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping ram up to 128 GB in 'h2o.init'
using slam, data.table, and options( ...
convert to "as.data.frame" before "as.h2o"
write to csv file (r write.csv chokes and takes forever. It is writing a lot of GB though, so I understand).
write to sqlite3, too many columns for a table, which is weird.
Checked drive cache/swap to make sure there are enough GB there. Perhaps java is using cache. (still working)
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22GB (~2.4Mrows, ~2300 cols). For what it was worth, it took from 12:53pm until 2:44PM to write the csv file. Importing it was substantially faster, after it was written.
Think of as.h2o() as a convenience function, that does these steps:
converts your R data to a data.frame, if not already one.
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The decision of which to use is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
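Doing the steps by hand, with fwrite() for the export and h2o.importFile() for the load, might look like the sketch below. The data and column names are placeholders, and `import_to_h2o()` is a hypothetical wrapper that is only defined here, not called, because it assumes the h2o package is installed, h2o.init() has been run, and the cluster can see the file path.

```r
library(data.table)

# Stand-in for the transformed data held in R (illustrative data only).
dt <- data.table(id = 1:6, x1 = rnorm(6), x2 = rnorm(6), x3 = rnorm(6))

# Write to disk with the multi-threaded fwrite() rather than write.csv().
csv_path <- tempfile(fileext = ".csv")
fwrite(dt, csv_path)

# Hypothetical helper: ask the cluster to read the file directly.
# Assumes h2o.init() was called and `path` is visible to the cluster,
# for example when client and cluster run on the same machine.
import_to_h2o <- function(path) {
  h2o::h2o.importFile(path)
}
```

With a running cluster, `hf <- import_to_h2o(csv_path)` would then replace the `as.h2o(dt)` call.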

Comparing speed of fread vs. read.table for reading the first 1M rows out of 100M

I have a 14GB data.txt file. I was comparing the speed of fread and read.table by reading the first 1M rows. It looks like fread is much slower although it is not supposed to be. It takes some time until the percentage counts show up.
What could be the reason? I thought it was supposed to be super fast... I am using a Windows OS computer.
fread mmaps the file. This takes some time, and will map the whole file. This means subsequent "read-ins" will be faster.
read.table does not mmap the whole file. It can read in the file line by line [and stop at line 1000000].
You can see some background on mmap at mmap() vs. reading blocks
The examples in the help for fread highlight this behaviour.
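To check the behaviour on your own machine, a small timing harness like the one below can be pointed at the real file. The generated file here is far too small to reproduce the effect described in the question, so the sketch only shows how to measure; the row counts and column names are arbitrary.

```r
library(data.table)

# Stand-in for the 14 GB data.txt (illustrative data only).
txt_path <- tempfile(fileext = ".txt")
fwrite(data.table(a = 1:200000, b = runif(200000)), txt_path, sep = "\t")

n <- 1000   # number of leading rows to read

# Time both readers on the same request.
t_fread <- system.time(x1 <- fread(txt_path, nrows = n))
t_table <- system.time(
  x2 <- read.table(txt_path, header = TRUE, sep = "\t", nrows = n))

# Elapsed seconds for each reader; values vary by machine and file size.
print(rbind(fread = t_fread, read.table = t_table)[, "elapsed"])
```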

Reading large files into R

I am a newbie to R, but I am aware that it chokes on "big" files. I am trying to read a 200MB data file. I have tried it in csv format and also converting it to tab delimited txt but in both cases I use up my 4GB of RAM before the file loads.
Is it normal that R would use 4GB of memory to load a 200MB file, or could there be something wrong with the file that is causing R to keep reading a bunch of nothingness in addition to the data?
From ?read.table
Less memory will be used if colClasses is specified as one of the six atomic vector classes.
...
Using nrows, even as a mild over-estimate, will help memory usage.
Use both of these arguments.
Ensure that you properly specify numeric for your numeric data. See here: Specifying colClasses in the read.csv
And do not under-estimate nrows.
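A short sketch of using both arguments together is given below. The file and its three columns are made up for the example, and the `nrows` value is a deliberate mild over-estimate of the 1000 rows actually present.

```r
# Small stand-in for the 200MB file (illustrative data only).
set.seed(1)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000,
                     amount = runif(1000),
                     code = sample(letters, 1000, replace = TRUE)),
          csv_path, row.names = FALSE)

# Declare each column's atomic class up front and give a slight
# over-estimate of the row count, as ?read.table recommends.
df <- read.csv(csv_path,
               colClasses = c("integer", "numeric", "character"),
               nrows = 1100)

# Inspect the resulting column classes and the in-memory size.
col_classes <- vapply(df, class, character(1))
print(col_classes)
print(object.size(df), units = "Kb")
```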
If you're running 64-bit R, you might try the 32-bit version. It will use less memory to hold the same data.
See here also: Extend memory size limit in R

Resources