Importing very large dataset into h2o from sqlite - r

I have a database of about 500G. It comprises of 16 tables, each containing 2 or 3 column (first column can be discarded) and 1,375,328,760 rows. I need all the tables to be joined as one dataframe in h2o as they are needed for running a prediction in an XGB model. I have tried to convert the individual sql tables into the h2o environment using as.h2o, and h2o.cbind them 2 or 3 tables at a time, until they are one dataset. However, I get this "GC overhead limit exceeded: java.lang.OutOfMemoryError", after converting 4 tables.
Is there a way around this?
My machine specs are 124G RAM, OS (Rhel 7.8), Root(1tb), Home(600G) and 2TB external HDD.
The model is run on this local machine and the max_mem_size is set at 100G. The details of the code are below.
library(data.table)
library(h2o)
h2o.init(
nthreads=14,
max_mem_size = "100G")
h2o.removeAll()
setwd("/home/stan/Documents/LUR/era_aq")
l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <-h2o.cbind(l1.hex,l2.hex[,-1])
h2o.rm (l1.hex,l2.hex)
l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <-h2o.cbind(l3.hex,l4.hex[,-1])
h2o.rm(l3.hex,l4.hex)
l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <-h2o.cbind(l5.hex,l6.hex[,-1])
h2o.rm(l5.hex,l6.hex)
l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <-h2o.cbind(l7.hex,l8.hex[,-1])
h2o.rm(ll7.hex,l8.hex)
test.hex <-h2o.cbind(test_l1.hex,test_l2.hex[,-1],test_l3.hex[,-1],test_l4.hex[,-1])
test <- test.hex[,-1]
test[1:3,]```

First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.
(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.
So, the better way is to export directly from sqlite to a csv file. And then use h2o.importFile() to load that csv file into H2O.
h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files in advance of import, it might be more efficient. A quick search found csvkit, but I'm not sure if it needs to load the files into memory, or can do work with the files completely on disk.

Since memory is a premium and all R runs in RAM, avoid storing large helper data.table andh20 objects in your global environment. Consider setting up a function to build a list for compilation that temporary objects are removed when function is out of scope. Ideally, you build your h2o objects directly from file source:
# BUILD LIST OF H20 OBJECTS WITHOUT HELPER COPIES
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[-1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[-1])
# CBIND ALL H20 OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)
Or even combine both lines with named function as opposed to anonymous function. Then, only final object remains after processing.
build_h2o <- function(f) as.h2o(data.table::fread(f))[-1])
# build_h2o <- function(f) h2o.importFile(f)[-1]
test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))
Extend function with if for some datasets that need to retain first column or not.
build_h2o <- function(f) {
if (grepl("lai|lu1000|lu250|msl", f)) { tmp <- fread(f)[-1] }
else { tmp <- fread(f) }
return(as.h2o(tmp))
}
Finally, if possible, leverage data.table methods like cbindlist:
final_dt <- cbindlist(lapply(list_of_files, function(f) fread(f)[-1]))
test.h2o <- as.h2o(final_dt)
rm(final_dt)
gc()

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")
cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)
m <- foreach(col=ncol(grid)) %:% foreach(row=nrow(grid)) %dopar% {
# get extent of subset
cell <- raster::cellFromRowCol(grid, row, col)
ext <- raster::extentFromCells(grid, cell)
# crop main raster to subset extent
subset <- raster::crop(data, ext)
# ...
# perform some processing steps on the raster subset
# ...
# save results to a separate file
saveRDS(subset, paste0("output_folder/", row, "_", col)
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file everytime it is called. This seems to be standard behavior of the raster package, but it becomes a problem, because these temp files are only deleted after the whole code has been executed, and take up way too much disk space in the meantime (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset#file#name). However, this does not work anymore when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this removeTmpFiles.
You should be able to use f <- filename(subset), avoid reading from slots (#). I do not see why you would not be able to remove it. But perhaps it needs some fiddling with the path?
temp files are only created when the raster package deems it necessary, based on RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory)
Another approach is to provide a filename argument to crop. Then you know what the filename is, and you can delete it. Of course you need to take care of not overwriting data from different tasks, so you may need to use some unique id associated with it.
saveRDS( ) won't work if the raster is backed up by a tempfile (as it will disappear).

How to batch read 2.8 GB gzipped (40 GB TSVs) files into R?

I have a directory with 31 gzipped TSVs (2.8 GB compressed / 40 GB uncompressed). I would like to conditionally import all matching rows based on the value of 1 column, and combine into one data frame.
I've read through several answers here, but none seem to work—I suspect that they are not meant to handle that much data.
In short, how can I:
Read 3 GB of gzipped files
Import only rows whose column matches a certain value
Combine matching rows into one data frame.
The data is tidy, with only 4 columns of interest: date, ip, type (str), category (str).
The first thing I tried using read_tsv_chunked():
library(purrr)
library(IPtoCountry)
library(lubridate)
library(scales)
library(plotly)
library(tidyquant)
library(tidyverse)
library(R.utils)
library(data.table)
#Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
str_c(dir(import_path))
#Define a function to filter data as it comes in.
call_back <- function(x, pos){
unique(dplyr::filter(x, .data[["type"]] == "purchase"))
}
raw_data <- files %>%
map(~ read_tsv_chunked(., DataFrameCallback$new(call_back),
chunk_size = 5000)) %>%
reduce(rbind) %>%
as_tibble() # %>%
This first approach worked with 9 GB of uncompressed data, but not with 40 GB.
The second approach using fread() (same loaded packages):
#Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
str_c(dir(import_path))
bind_rows(map(str_c("gunzip - c", files), fread))
That looked like it started working, but then locked up. I couldn't figure out how to pass the select = c(colnames) argument to fread() inside the map()/str_c() calls, let alone the filter criteria for the one column.
This is more of a strategy answer.
R loads all data into memory for processing, so you'll run into issues with the amount of data you're looking at.
What I suggest you do, which is what I do, is to use Apache Spark for the data processing, and use the R package sparklyr to interface to it. You can then load your data into Spark, process it there, then retrieve the summarised set of data back into R for further visualisation and analysis.
You can install Spark locally in your R Studio instance and do a lot there. If you need further computing capacity have a look at a hosted option such as AWS.
Have a read of this https://spark.rstudio.com/
One technical point, there is a sparklyr function spark_read_text which will read delimited text files directly into the Spark instance. It's very useful.
From there you can use dplyr to manipulate your data. Good luck!
First, if the base read.table is used, you don't need to gunzip anything, as it uses Zlib to read these directly. read.table also works much faster if the colClasses parameter is specified.
Y'might need to write some custom R code to produce a melted data frame directly from each of the 31 TSVs, and then accumulate them by rbinding.
Still it will help to have a machine with lots of fast virtual memory. I often work with datasets on this order, and I sometimes find an Ubuntu system wanting on memory, even if it has 32 cores. I have an alternative system where I have convinced the OS that an SSD is more of its memory, giving me an effective 64 GB RAM. I find this very useful for some of these problems. It's Windows, so I need to set memory.limit(size=...) appropriately.
Note that once a TSV is read using read.table, it's pretty compressed, approaching what gzip delivers. You may not need a big system if you do it this way.
If it turns out to take a long time (I doubt it), be sure to checkpoint and save.image at spots in between.

Iteratively resave a directory tree of Excel files

I am regularly receiving data from a source that is producing a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply The files can, however, be opened just fine by XLConnect::loadWorkbook. But unfortunately, even with allocating huge amounts of memory for the Java virtual machine, it always crashes after reading a few files. I tried adding these three lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel and then they read in just fine from readxl::read_excel. I'm hoping I could also resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, using Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process.
Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern="xlsx$")
filesPerChunk <- 5
clustSize <- 4 # or how ever many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files)%/%runSize + (length(files)%%runSize != 0)
library(parallel)
sheets <- lapply(seq(runs), function(i) {
runStart <- (i - 1) * runSize + 1
runEnd <- min(length(files), runStart + runSize - 1)
runFiles <- files[runStart:runEnd]
# periodically restart and stop the cluster to deal with memory leaks
cl <- makeCluster(clustSize)
on.exit(stopCluster(cl))
parLapply(cl, runFiles, function(f) {
require(XLConnect)
loadWorkbook(f, ...)
})
})
sheets <- unlist(sheets, recursive=FALSE) # convert list of lists to a simple list

Package a large data set

Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think, I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using: system.file("extdata", "yourfile", package = "yourpackage"). (As on the page you linked to).
The question then is in what format you store your data and how do you obtain selections from it without reading the data in memory. For that, there are a large number of options. To name some:
sqlite: Store your data in a sqlite database. You can then perform queries on this data using the rsqlite package.
ff: store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors). And in theory the files are not cross platform although as long as you stay on intel platforms you should be ok.
CSV: store your data in a plain old csv file. You can then make selections from this file using the LaF package. The performance will probably be less than with ff but might be good enough.
RDS: store each of your columns in a seperate RDS file (using saveRDS) and load them using readRDS the advantage is that you do not depend on any R-packages. This is fast. The disadvantage is that you cannot do row selections (but that does not seem to be the case).
If you only want to select columns, I would go with RDS.
A rough example using RDS
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
result <- vector("list", length(columns));
for (i in seq_along(columns)) {
col <- columns[i]
fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazydata")
result[[i]] <- readRDS(fn)
}
names(result) <- columns
as.data.frame(result)
}
store_data <- function(package, name, data) {
dir <- file.path(package, "inst", "exdata", name)
dir.create(dir, recursive = TRUE)
for (col in names(data)) {
saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
}
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. delete it) you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width column of the iris data set.
Of course this is a very simple implementation of load_data: no error handling, it assumes all column exist, it does not know which columns exist, it does not know which data sets exist.

freeing up memory in R

In R, I am trying to combine and convert several sets of timeseries data as an xts from http://www.truefx.com/?page=downloads however, the files are large and there many files so this is causing me issues on my laptop. They are stored as a csv file which have been compressed as a zip file.
Downloading them and unzipping them is easy enough (although takes up a lot of space on a hard drive).
Loading the 350MB+ files for one month's worth of data into the R is reasonably straight forward with the new fread() function in the data.table package.
Some datatable transformations are done (inside a function) so that the timestamps can be read easily and a mid column is produced. Then the datatable is saved as an RData file on the hard drive, and all references are to the datatable object are removed from the workspace, and a gc() is run after removal...however when looking at the R session in my Activity Monitor (run from a Mac)...it still looks like it is taking up almost 1GB of RAM...and things seem a bit laggy...I was intending to load several years worth of the csv files at the same time, convert them to useable datatables, combine them and then create a single xts object, which seems infeasible if just one month uses 1GB of RAM.
I know I can sequentially download each file, convert it, save it shut down R and repeat until i have a bunch of RData files that i can just load and bind, but was hopeing there might be a more efficient manner to do this so that after removing all references to a datatable you get back not "normal" or at startup levels of RAM usage. Are there better ways of clearing memory than gc()? Any suggestions would be greatly appreciated.
In my project I had to deal with many large files. I organized the routine on the following principles:
Isolate memory-hungry operations in separate R scripts.
Run each script in new process which is destroyed after execution. Thus system gives used memory back.
Pass parameters to the scripts via text file.
Consider the toy example below.
Data generation:
setwd("/path/to")
write.table(matrix(1:5e7, ncol=10), "temp.csv") # 465.2 Mb file
slave.R - memory consuming part
setwd("/path/to")
library(data.table)
# simple processing
f <- function(dt){
dt <- dt[1:nrow(dt),]
dt[,new.row:=1]
return (dt)
}
# reads parameters from file
csv <- read.table("io.csv")
infile <- as.character(csv[1,1])
outfile <- as.character(csv[2,1])
# memory-hungry operations
dt <- as.data.table(read.csv(infile))
dt <- f(dt)
write.table(dt, outfile)
master.R - executes slaves in separate processes
setwd("/path/to")
# 3 files processing
for(i in 1:3){
# sets iteration-specific parameters
csv <- c("temp.csv", paste("temp", i, ".csv", sep=""))
write.table(csv, "io.csv")
# executes slave process
system("R -f slave.R")
}

Resources