I have a few hundred thousand very small .dat.gz files that I want to read into R in the most efficient way possible. I read in the file and then immediately aggregate and discard the data, so I am not worried about managing memory as I get near the end of the process. I just really want to speed up the bottleneck, which happens to be unzipping and reading in the data.
Each dataset consists of 366 rows and 17 columns. Here is a reproducible example of what I am doing so far:
Building reproducible data:
require(data.table)
# Make dir
system("mkdir practice")
# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day = 0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rnorm(n = 366))]
  write.table(dt, paste0("./practice/", file.nm), row.names = FALSE, sep = "\t", quote = FALSE)
  system(paste0("gzip ./practice/", file.nm))
}
And here is the code that applies it:
# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))
And here is my most efficient code so far to read in the data:
# Function to read in files as fast as possible
read_Fast <- function(path.gz) {
  system(paste0("gzip -d ", path.gz))                 # unzip the file (removes the .gz on disk)
  path.dat <- gsub(".gz", "", path.gz, fixed = TRUE)  # path of the decompressed .dat file
  fread(path.dat)                                     # read and return the data.table
}
# Apply above function
dat.files <- list.files(path="./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_Fast), fill=TRUE))
dat.list
I have wrapped this up in a function and applied it in parallel, but it is still far too slow for what I need it for.
I have already tried h2o.importFolder from the wonderful h2o package, but it is actually much slower than using plain R with data.table. Maybe there is a way to speed up the unzipping of the files, but I am unsure. From the few times that I have run this, I have noticed that unzipping the files usually takes about two-thirds of the function's runtime.
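Roughly, the parallel version mentioned above looks like this (a sketch; parallel::mclapply forks, so it assumes a Unix-like OS, and the core count is arbitrary):
library(parallel)
# read all files with the read_Fast() helper above, 4 workers at a time
dat.files <- list.files(path = "./practice", full.names = TRUE)
dat.list  <- rbindlist(mclapply(dat.files, read_Fast, mc.cores = 4), fill = TRUE)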
I'm somewhat surprised that the approach below actually worked; hopefully it works for your case too. I'm quite curious how its speed compares to reading the compressed data from disk directly in R (albeit with a penalty for non-vectorization).
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
R has the ability to read gzipped files natively, using the gzfile function. See if this works.
rbindlist(lapply(dat.files, function(f) {
read.delim(gzfile(f))
}))
The bottleneck might be caused by the use of the system() call to an external application.
You should try using the built-in functions to decompress the archive.
This answer explains how: Decompress gz file using R
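A minimal sketch of that idea (not taken from the linked answer; read_gz_native is just an illustrative name): decompress inside R through a gzfile() connection and hand the lines to fread(), so no external gzip process is spawned and the .gz files stay untouched on disk.
library(data.table)
read_gz_native <- function(path.gz) {
  # readLines() opens the gzfile() connection, decompresses in-process, and closes it
  fread(text = readLines(gzfile(path.gz)))
}
dat.files <- list.files(path = "./practice", pattern = "\\.dat\\.gz$", full.names = TRUE)
dat.list  <- rbindlist(lapply(dat.files, read_gz_native), fill = TRUE)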
I have a total of 757 files of about 80 MB each, totaling 57.4 GB of .csv data.
I am trying to combine all of these files into one file. The code I used worked on a small sample of the data (52 files), but it is running into an error when I try to do it on all of the data.
This is the code I used for the sample:
# Reading in multiple files of genetic data at once
library(data.table)
setwd()  # working-directory path omitted
files <- list.files(pattern = "\\.csv$")
gene1 <- do.call(rbind, lapply(files, fread))
rm(files)
gene1 <- as.data.frame(unclass(gene1))
The problem I'm having is that RStudio crashes after a few minutes of combining the data. The sample took about a minute to do.
Anyway, I'm guessing that it's a memory issue or something. Is there a way I can keep the memory from getting full? I know Python has a chunking option of some sort.
If you think the issue is something else, please let me know what you think.
Thank you!
To reframe the question, I propose the command-line CSV Tool Kit (csvkit), in particular the csvstack tool:
https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html
Joining a set of homogeneous files
$ csvstack folder/*.csv > all_csvs.csv
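If you prefer to stay inside R, a hedged sketch of the same idea (assuming csvkit is installed and on the PATH, and that the stacked file fits on disk):
library(data.table)
system("csvstack *.csv > all_csvs.csv")   # the shell does the globbing and redirection
gene1 <- fread("all_csvs.csv")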
Here is a brute force "chunk by 50" solution
library(data.table)
files  <- list.files(pattern = "\\.csv$")
chunks <- data.frame(x = seq(from = 1,  to = 751, by = 50),
                     y = c(seq(from = 50, to = 750, by = 50), 757))
all_genes <- data.table()
for (i in 1:NROW(chunks)) {
  gene1 <- do.call(rbind, lapply(files[chunks[i, "x"]:chunks[i, "y"]], fread))
  all_genes <- rbind(gene1, all_genes)
}
all_genes_unclass <- as.data.frame(unclass(all_genes))
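A sketch of the same chunked loop using rbindlist(), which is typically faster than do.call(rbind, ...) and, with fill = TRUE, tolerates files whose columns differ slightly:
library(data.table)
files  <- list.files(pattern = "\\.csv$")
chunks <- split(files, ceiling(seq_along(files) / 50))   # 50 files per chunk
all_genes <- rbindlist(
  lapply(chunks, function(chunk) rbindlist(lapply(chunk, fread), fill = TRUE)),
  fill = TRUE
)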
I have a database of about 500 GB. It comprises 16 tables, each containing 2 or 3 columns (the first column can be discarded) and 1,375,328,760 rows. I need all the tables joined into one dataframe in h2o, as they are needed for running a prediction with an XGB model. I have tried to convert the individual sql tables into the h2o environment using as.h2o, and to h2o.cbind them 2 or 3 tables at a time, until they form one dataset. However, I get "GC overhead limit exceeded: java.lang.OutOfMemoryError" after converting 4 tables.
Is there a way around this?
My machine specs are 124 GB RAM, RHEL 7.8, a 1 TB root partition, a 600 GB home partition, and a 2 TB external HDD.
The model is run on this local machine and the max_mem_size is set at 100G. The details of the code are below.
library(data.table)
library(h2o)
h2o.init(
nthreads=14,
max_mem_size = "100G")
h2o.removeAll()
setwd("/home/stan/Documents/LUR/era_aq")
l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <-h2o.cbind(l1.hex,l2.hex[,-1])
h2o.rm (l1.hex,l2.hex)
l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <-h2o.cbind(l3.hex,l4.hex[,-1])
h2o.rm(l3.hex,l4.hex)
l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <-h2o.cbind(l5.hex,l6.hex[,-1])
h2o.rm(l5.hex,l6.hex)
l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <-h2o.cbind(l7.hex,l8.hex[,-1])
h2o.rm(ll7.hex,l8.hex)
test.hex <-h2o.cbind(test_l1.hex,test_l2.hex[,-1],test_l3.hex[,-1],test_l4.hex[,-1])
test <- test.hex[,-1]
test[1:3,]
First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.
(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.
So the better way is to export directly from sqlite to a csv file, and then use h2o.importFile() to load that csv file into H2O.
h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files in advance of import, it might be more efficient. A quick search found csvkit, but I'm not sure if it needs to load the files into memory, or can do work with the files completely on disk.
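A sketch of that route (the database file and table names below are placeholders; the h2o.init() settings are copied from the question): dump each table to csv with the sqlite3 command-line shell, then import the csv straight into H2O without it passing through R's memory.
# dump one table to csv from the shell (repeat per table)
system('sqlite3 mydb.sqlite ".headers on" ".mode csv" ".output d2.csv" "SELECT * FROM d2;"')

library(h2o)
h2o.init(nthreads = 14, max_mem_size = "100G")
l1.hex <- h2o.importFile("d2.csv")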
Since memory is at a premium and all of R's work happens in RAM, avoid storing large helper data.table and H2O objects in your global environment. Consider wrapping the work in a function that builds the list, so that temporary objects are removed when the function goes out of scope. Ideally, build your H2O objects directly from the file source:
# BUILD LIST OF H2O OBJECTS WITHOUT HELPER COPIES
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[-1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[-1])
# CBIND ALL H2O OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)
Or combine both steps with a named function (as opposed to an anonymous one), so that only the final object remains after processing.
build_h2o <- function(f) as.h2o(data.table::fread(f))[-1]
# build_h2o <- function(f) h2o.importFile(f)[-1]
test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))
Extend the function with an if clause when only some datasets should have their first column dropped:
build_h2o <- function(f) {
  if (grepl("lai|lu1000|lu250|msl", f)) {
    tmp <- fread(f, drop = 1)    # drop the first column for these datasets
  } else {
    tmp <- fread(f)
  }
  return(as.h2o(tmp))
}
Finally, if possible, do the column binding in R first and convert once at the end. Note that data.table has no cbindlist function, so use do.call(cbind, ...):
final_dt <- do.call(cbind, lapply(list_of_files, function(f) fread(f, drop = 1)))
test.h2o <- as.h2o(final_dt)
rm(final_dt)
gc()
I am working with very large data layers for an SDM class, and because of this I ended up breaking some of my layers into a bunch of blocks to avoid memory constraints. These blocks were written out as .grd files, and now I need to read them back into R and merge them together. I am extremely new to R and programming in general, so any help would be appreciated. What I have been trying so far looks like this:
library(raster)
merge.coarse <- raster("coarseBlock1.grd")
for (i in 2:nBlocks) {
  merge.coarse <- merge(merge.coarse, raster(paste0("coarseBlock", i, ".grd")))
}
where my files are named coarseBlock1.grd through coarseBlock259.grd (nBlocks = 259).
Any feedback would be greatly appreciated.
Using for loops is generally slow in R. Also, growing an object with merge or rbind inside a for loop eats up a lot of memory, because of the way R copies the values passed to these functions.
A more efficient way to do this task is to call lapply (see this tutorial on apply functions for details) to load the files into R. This results in a list, which can then be collapsed in a single call to raster's merge function (rbind does not combine Raster* objects):
rasters <- lapply(list.files(GRDFolder, full.names = TRUE), FUN = raster)
merge.coarse <- do.call(merge, rasters)
I'm not too familiar with .grd files, but this overall process should at least get you going in the right direction. Assuming all your .grd files (1 through 259) are stored in the same folder (which I will refer to as GRDFolder), then you can try this:
merge.coarse <- raster("coarseBlock1.grd")
for (filename in list.files(GRDFolder, full.names = TRUE))
{
  temp <- raster(filename)
  merge.coarse <- merge(merge.coarse, temp)   # merge(), not rbind(), combines Raster* objects
}
In R, I am trying to combine and convert several sets of time-series data from http://www.truefx.com/?page=downloads into an xts object; however, the files are large and there are many of them, and this is causing me issues on my laptop. They are stored as csv files that have been compressed into zip files.
Downloading them and unzipping them is easy enough (although takes up a lot of space on a hard drive).
Loading the 350 MB+ files for one month's worth of data into R is reasonably straightforward with the new fread() function in the data.table package.
Some data.table transformations are done (inside a function) so that the timestamps can be read easily and a mid column is produced. Then the data.table is saved as an RData file on the hard drive, all references to the data.table object are removed from the workspace, and gc() is run after the removal. However, when I look at the R session in Activity Monitor (on a Mac), it still appears to be using almost 1 GB of RAM, and things seem a bit laggy. I was intending to load several years' worth of csv files at the same time, convert them into usable data.tables, combine them, and then create a single xts object, which seems infeasible if just one month takes up 1 GB of RAM.
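Roughly, per file, that process looks like this (a sketch; the file and column names are placeholders, and the real transformation also reformats the timestamps):
library(data.table)
dt <- fread("EURUSD-2013-01.csv")        # one month's worth of tick data
dt[, mid := (bid + ask) / 2]             # produce the mid column
save(dt, file = "EURUSD-2013-01.RData")  # write the converted table to disk
rm(dt)
gc()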
I know I can sequentially download each file, convert it, save it, shut down R, and repeat until I have a bunch of RData files that I can just load and bind, but I was hoping there might be a more efficient way to do this, so that after removing all references to a data.table the RAM usage drops back to "normal" (startup) levels. Are there better ways of clearing memory than gc()? Any suggestions would be greatly appreciated.
In my project I had to deal with many large files. I organized the routine on the following principles:
Isolate memory-hungry operations in separate R scripts.
Run each script in a new process, which is destroyed after execution. Thus the system gets the used memory back.
Pass parameters to the scripts via text file.
Consider the toy example below.
Data generation:
setwd("/path/to")
write.table(matrix(1:5e7, ncol=10), "temp.csv") # 465.2 Mb file
slave.R - memory consuming part
setwd("/path/to")
library(data.table)
# simple processing
f <- function(dt){
  dt <- dt[1:nrow(dt),]
  dt[, new.row := 1]
  return (dt)
}
# reads parameters from file
csv <- read.table("io.csv")
infile <- as.character(csv[1,1])
outfile <- as.character(csv[2,1])
# memory-hungry operations
dt <- as.data.table(read.csv(infile))
dt <- f(dt)
write.table(dt, outfile)
master.R - executes slaves in separate processes
setwd("/path/to")
# 3 files processing
for (i in 1:3) {
  # sets iteration-specific parameters
  csv <- c("temp.csv", paste("temp", i, ".csv", sep=""))
  write.table(csv, "io.csv")
  # executes slave process
  system("R -f slave.R")
}
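A variant of the same idea (a sketch): pass the file names to the slave as command-line arguments via Rscript instead of writing them to io.csv.
# master.R
for (i in 1:3) {
  outfile <- paste("temp", i, ".csv", sep="")
  system(paste("Rscript slave.R temp.csv", outfile))
}

# slave.R then reads the parameters with:
args    <- commandArgs(trailingOnly = TRUE)
infile  <- args[1]
outfile <- args[2]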
I have a .csv file, example.csv, with 8000 columns and 40000 rows. The csv file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv it is extremely slow, even when I add the parameter nrow=100. I wonder if there is a way to accelerate read.csv, or to use some other function to load the file into memory as a matrix or data.frame?
Thanks in advance.
If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:
‘read.table’ is not the right tool for reading large matrices,
especially those with many columns: it is designed to read _data
frames_ which may have columns of very different classes. Use
‘scan’ instead for matrices.
Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
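A sketch of the scan() route for this particular file (assuming comma-separated values with the header on the first line): read the header separately, then read the body as one integer vector and reshape it.
hdr <- scan("example.csv", what = character(), sep = ",", nlines = 1, quiet = TRUE)
m   <- matrix(scan("example.csv", what = integer(), sep = ",", skip = 1, quiet = TRUE),
              ncol = length(hdr), byrow = TRUE)
colnames(m) <- hdr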
Try using data.table::fread(). It is by far one of the fastest ways to read .csv files into R. There is a good benchmark here.
library(data.table)
data <- fread("c:/data.csv")
If you want to make it even faster, you can also read only the subset of columns you want to use:
data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))
Also try Hadley Wickham's readr package:
library(readr)
data <- read_csv("file.csv")
If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.
...You can then load it in with the (surprise!) load function.
d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)
# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec
# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','),
ncol=1000, byrow=TRUE) ) # 0.55 sec
# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec
It might be worth trying the new vroom package:
vroom is a new approach to reading delimited and fixed width data into R.
It stems from the observation that when parsing files reading data from disk and finding the delimiters is generally not the main bottle neck. Instead (re)-allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.
Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.
This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
df <- vroom('example.csv')
Benchmark: readr vs data.table vs vroom for a 1.57GB file