Working with a Large Number of csv files in R

I have a directory containing ~40,000 csv files, each ranging in size from ~400 bytes to ~11 MB. I have written a function that reads in a csv file and calculates some basic numbers for it (e.g. how many values of "female" are in the file). This code ran successfully on the same number of csv files when the files were smaller.
I'm using the packages parallel and doParallel to run this on my machine and receive the following error:
Error in unserialize(node$con) : error reading from connection
I suspect I'm running out of memory, but I am not sure how best to handle the increased size of the files I'm working with. My code is as follows:
Say that 'fpath' is the path to my directory where all these csvs live:
fpath <- "../Desktop/bigdirectory"

library(data.table)
library(parallel)
library(doParallel)

filelist <- as.vector(list.files(fpath, pattern = "site.csv"))
(f <- file.path(fpath, filelist))

# demographics csv in specified directory
dlist <- as.vector(list.files(fpath, pattern = "demographics.csv"))
(d <- file.path(fpath, dlist))
demos <- fread(d, header = TRUE, sep = ",")

cl <- makeCluster(4)
registerDoParallel(cl)
setDefaultCluster(cl)
clusterExport(NULL, c('transit.grab'))
clusterEvalQ(NULL, library(data.table))

sdemos <- demo.grab(dl[[1]], demos)
stmp <- parLapply(cl, f, FUN = transit.grab, sdemos)
And the function 'transit.grab' is the following:
transit.grab <- function(sitefile, selected.demos){
  require(sqldf)
  demos <- selected.demos
  # Renaming sitefile columns
  sf <- fread(sitefile, header = TRUE, sep = ",")
  names(sf)[1] <- "id"
  # Selecting transits from site file using list of selected users
  sdat <- sqldf('select sf.* from sf inner join demos on sf.id = demos.id')
  return(sdat)
}
I'm not looking for someone to debug the code, as I know it runs properly for a smaller amount of data; rather, I desperately need suggestions on how to implement it for ~6.7 GB of data. Any and all feedback is welcome, thanks!
UPDATE:
As suggested, I replaced sqldf() with merge(), which cut my computation time in half when tested on a smaller directory. Watching my memory usage in Activity Monitor, the trend is pretty flat. BUT now, when I try running my code on the large directory, my R session crashes.
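For reference, a minimal sketch of what a merge()-based transit.grab() might look like, assuming a data.table join on the "id" column used in the sqldf query above (this is not the poster's exact code):
library(data.table)
transit.grab <- function(sitefile, selected.demos){
  sf <- fread(sitefile, header = TRUE, sep = ",")
  setnames(sf, 1, "id")
  # inner join replacing the sqldf call: keep sf rows whose id appears in selected.demos
  merge(sf, selected.demos[, .(id)], by = "id")
}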

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
library(foreach)

grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")

cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)

m <- foreach(col = seq_len(ncol(grid))) %:% foreach(row = seq_len(nrow(grid))) %dopar% {
  # get extent of subset
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)
  # crop main raster to subset extent
  subset <- raster::crop(data, ext)
  # ...
  # perform some processing steps on the raster subset
  # ...
  # save results to a separate file
  saveRDS(subset, paste0("output_folder/", row, "_", col))
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file every time it is called. This seems to be standard behavior of the raster package, but it becomes a problem because these temp files are only deleted after the whole code has finished executing, and in the meantime they take up far too much disk space (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset@file@name). However, this no longer works when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this: removeTmpFiles().
You should be able to use f <- filename(subset) and avoid reading from slots (@). I do not see why you would not be able to remove it, but perhaps it needs some fiddling with the path?
Temp files are only created when the raster package deems it necessary, based on the RAM available and required; see canProcessInMemory(x, verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory).
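A minimal sketch of checking and loosening those defaults (the specific values are illustrative assumptions, not recommendations):
library(raster)
r <- raster("data.tif")
# report whether raster would keep this object in RAM, and why
canProcessInMemory(r, verbose = TRUE)
# let raster use a larger share of RAM before spilling to temp files
rasterOptions(memfrac = 0.8, maxmemory = 1e10)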
Another approach is to provide a filename argument to crop(). Then you know what the filename is, and you can delete it yourself. Of course, you need to take care not to overwrite data from different tasks, so you may need to use some unique id associated with each one.
saveRDS() won't work if the raster is backed by a temp file (as that file will disappear).
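Putting those suggestions together, a rough sketch of the worker body with an explicit filename plus cleanup (the output paths and the native .grd format are assumptions):
# write the crop to a known file instead of an anonymous temp file
out_grd <- paste0("output_folder/", row, "_", col, ".grd")
subset <- raster::crop(data, ext, filename = out_grd, overwrite = TRUE)
# ... processing steps ...
# pull the values into memory before saving, so the RDS no longer depends on the file
saveRDS(raster::readAll(subset), paste0("output_folder/", row, "_", col, ".rds"))
# delete the on-disk crop (and its .gri companion) once it is no longer needed
file.remove(out_grd, sub("\\.grd$", ".gri", out_grd))
# or, alternatively, purge this worker's raster temp files older than 0 hours
raster::removeTmpFiles(h = 0)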

Importing very large dataset into h2o from sqlite

I have a database of about 500 GB. It comprises 16 tables, each containing 2 or 3 columns (the first column can be discarded) and 1,375,328,760 rows. I need all the tables joined into one dataframe in h2o, as they are needed for running a prediction with an XGB model. I have tried to convert the individual sql tables into the h2o environment using as.h2o, and to h2o.cbind them 2 or 3 tables at a time, until they are one dataset. However, I get "GC overhead limit exceeded: java.lang.OutOfMemoryError" after converting 4 tables.
Is there a way around this?
My machine specs are 124 GB RAM, RHEL 7.8, a 1 TB root partition, a 600 GB home partition, and a 2 TB external HDD.
The model is run on this local machine and max_mem_size is set to 100G. The details of the code are below.
library(data.table)
library(h2o)

h2o.init(
  nthreads = 14,
  max_mem_size = "100G")
h2o.removeAll()

setwd("/home/stan/Documents/LUR/era_aq")

l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <- h2o.cbind(l1.hex, l2.hex[,-1])
h2o.rm(l1.hex, l2.hex)

l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <- h2o.cbind(l3.hex, l4.hex[,-1])
h2o.rm(l3.hex, l4.hex)

l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <- h2o.cbind(l5.hex, l6.hex[,-1])
h2o.rm(l5.hex, l6.hex)

l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <- h2o.cbind(l7.hex, l8.hex[,-1])
h2o.rm(l7.hex, l8.hex)

test.hex <- h2o.cbind(test_l1.hex, test_l2.hex[,-1], test_l3.hex[,-1], test_l4.hex[,-1])
test <- test.hex[,-1]
test[1:3,]
First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.
(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.
So the better way is to export directly from sqlite to a csv file, and then use h2o.importFile() to load that csv file into H2O.
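A rough sketch of that route, assuming an sqlite file called database.sqlite and a table called table1 (both names are placeholders); fetching in chunks keeps R's memory use bounded:
library(DBI)
library(RSQLite)
library(data.table)
library(h2o)

con <- dbConnect(SQLite(), "database.sqlite")    # placeholder path
res <- dbSendQuery(con, "SELECT * FROM table1")  # placeholder table
repeat {
  chunk <- dbFetch(res, n = 1e6)                 # stream out ~1M rows at a time
  if (nrow(chunk) == 0) break
  fwrite(chunk, "table1.csv", append = file.exists("table1.csv"))
}
dbClearResult(res)
dbDisconnect(con)

# H2O then reads the csv itself, bypassing R's memory space entirely
h2o.init(max_mem_size = "100G")
t1.hex <- h2o.importFile("table1.csv")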
h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files in advance of importing them, it might be more efficient. A quick search found csvkit, but I'm not sure whether it needs to load the files into memory or can work with the files entirely on disk.
Since memory is at a premium and all of R runs in RAM, avoid storing large helper data.table and h2o objects in your global environment. Consider wrapping the build step in a function so that temporary objects are removed when the function goes out of scope. Ideally, build your h2o objects directly from the file source:
# BUILD LIST OF H2O OBJECTS WITHOUT HELPER COPIES
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[, -1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[, -1])

# CBIND ALL H2O OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)
Or even combine both lines, using a named function instead of an anonymous one; then only the final object remains after processing.
build_h2o <- function(f) as.h2o(data.table::fread(f))[, -1]
# build_h2o <- function(f) h2o.importFile(f)[, -1]

test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))
Extend the function with an if/else for the datasets that need to drop their first column versus those that retain it:
build_h2o <- function(f) {
  if (grepl("lai|lu1000|lu250|msl", f)) {
    tmp <- fread(f)[, -1]    # these tables drop their first column
  } else {
    tmp <- fread(f)
  }
  return(as.h2o(tmp))
}
Finally, if possible, column-bind everything in data.table first and convert to h2o once (data.table does not export a cbindlist(), but do.call(cbind, ...) on a list of data.tables works):
final_dt <- do.call(cbind, lapply(list_of_files, function(f) fread(f)[, -1]))

test.h2o <- as.h2o(final_dt)

rm(final_dt)
gc()

Running system() inside foreach()

I am running R version 3.4.1 on Windows 7.
I need to run a certain program using N = 500 bootstrapped data files that I have stored within a working directory. The executable produces 3 more data files for each of the N runs, all of which are stored in one working directory. I'd like to use parallel computing to speed this process up.
The loop requires one starting file ('starter') that gets modified at every iteration. This is a really important part that I can't get around. Also, at every iteration any pre-existing data files are removed and then the executable is called using system(). The executable produces three data files called Report, CompReport, and covar.
In a normal loop the process looks something like this:
starter_strt <- SS_readstarter(file = "starter.ss") # read starter file
for(iboot in 1:N){
  # point the starter file at this iteration's bootstrap data file
  starter_strt$datfile <- paste("BootData", iboot, ".ss", sep = "")
  # replace the starter file on disk with the modified version
  SS_writestarter(starter_strt, overwrite = TRUE)
  # any old files are removed
  file.remove("Report.sso", paste("Report_", iboot, ".sso", sep = ""))
  file.remove("CompReport.sso", paste("CompReport_", iboot, ".sso", sep = ""))
  file.remove("covar.sso", paste("covar_", iboot, ".sso", sep = ""))
  # then the program is called
  system("ss3")
  # the files produced get copied to iteration-specific names in the working directory
  file.copy("Report.sso", paste("Report_", iboot, ".sso", sep = ""))
  file.copy("CompReport.sso", paste("CompReport_", iboot, ".sso", sep = ""))
  file.copy("covar.sso", paste("covar_", iboot, ".sso", sep = ""))
}
The process above takes 5 days to complete, so I've been exploring options for parallelization. What I've tried so far doesn't work, so any help or suggestions are appreciated. Here's what I was thinking: use a foreach loop and make sure I'm not reading and overwriting the starter file. This is what I have:
# read starter file
starter_strt <- SS_readstarter(file = "starter.ss")
iter <- 2 # just for an example here even though I said N=500 above
cores <- parallel::detectCores() - 1
cl <- makeCluster(cores)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())

foreach(iboot = 1:iter, .packages = 'r4ss', .export = "starter_strt", system("ss3")) %dopar% {
  # code to create the starter file so I don't overwrite it
  datfile <- data.frame(rep(paste("BootData", iboot, ".ss", sep = ""), 2))
  colnames(datfile) <- "datfile"
  starter_list <- cbind(starter_strt, datfile)
  # replace starter file with modified version
  SS_writestarter(starter_list, overwrite = TRUE)
  # delete files as above
  # run program
  system("ss3")
  # the files produced get pushed to the working directory
  file.copy("Report.sso", paste("Report_", iboot, ".sso", sep = ""))
  file.copy("CompReport.sso", paste("CompReport_", iboot, ".sso", sep = ""))
  file.copy("covar.sso", paste("covar_", iboot, ".sso", sep = ""))
}
The program runs through once but then I get a warning (see below) and the expected files are missing from the directory.
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): starter_strt
I'm pretty stumped at this point. I saw this example, Running multiple R scripts using the system() command, where the system command is called inside a parLapply function. Maybe this is what I need? Help and/or suggestions are greatly appreciated.
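Not a full answer, but a common pattern for this kind of workflow is to give each bootstrap run its own copy of the model folder, so the starter file and the Report/CompReport/covar outputs never collide between workers. A rough sketch with parLapply (the r4ss calls mirror the question; the directory layout, run_boot() and the template folder are assumptions):
library(parallel)
library(r4ss)

run_boot <- function(iboot, template_dir) {
  # each iteration gets its own private working copy of the model directory
  run_dir <- file.path(dirname(template_dir), paste0("run_", iboot))
  dir.create(run_dir, showWarnings = FALSE)
  file.copy(list.files(template_dir, full.names = TRUE), run_dir, overwrite = TRUE)

  # work inside that directory so nothing is shared between runs
  owd <- setwd(run_dir)
  on.exit(setwd(owd))

  starter <- SS_readstarter(file = "starter.ss")
  starter$datfile <- paste0("BootData", iboot, ".ss")
  SS_writestarter(starter, overwrite = TRUE)

  system("ss3")
  normalizePath("Report.sso")   # return the path to this run's report
}

template <- normalizePath("model_template")  # assumed folder holding starter.ss, ss3 and the BootData files
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(r4ss))
reports <- parLapply(cl, 1:500, run_boot, template_dir = template)
stopCluster(cl)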

Iteratively resave a directory tree of Excel files

I am regularly receiving data from a source that produces a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently, I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply. The files can, however, be opened just fine by XLConnect::loadWorkbook. But unfortunately, even when allocating huge amounts of memory to the Java virtual machine, it always crashes after reading a few files. I tried adding these lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel, and then they read in just fine with readxl::read_excel. I'm hoping I could resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, since I'm on Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process. (A sketch of this appears after the example below.)
Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern = "xlsx$")
filesPerChunk <- 5
clustSize <- 4  # or however many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files) %/% runSize + (length(files) %% runSize != 0)

library(parallel)

sheets <- lapply(seq(runs), function(i) {
  runStart <- (i - 1) * runSize + 1
  runEnd <- min(length(files), runStart + runSize - 1)
  runFiles <- files[runStart:runEnd]

  # periodically restart and stop the cluster to deal with memory leaks
  cl <- makeCluster(clustSize)
  on.exit(stopCluster(cl))
  parLapply(cl, runFiles, function(f) {
    require(XLConnect)
    loadWorkbook(f, ...)
  })
})

sheets <- unlist(sheets, recursive = FALSE)  # convert list of lists to a simple list
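And a rough sketch of the first approach, the Rscript route (the script name read_one.R, the readWorksheet() call, and the .RData naming are assumptions):
## read_one.R -- run as: Rscript read_one.R <spreadsheet.xlsx>
args <- commandArgs(trailingOnly = TRUE)
library(XLConnect)
wb <- loadWorkbook(args[1])
dat <- readWorksheet(wb, sheet = 1)          # adjust sheet/region as needed
save(dat, file = paste0(args[1], ".RData"))

## in the master R session: one throwaway process per file, then collect the results
files <- list.files(pattern = "xlsx$")
for (f in files) system2("Rscript", c("read_one.R", shQuote(f)))
sheets <- lapply(paste0(files, ".RData"), function(rd) { load(rd); dat })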

freeing up memory in R

In R, I am trying to combine and convert several sets of timeseries data into an xts from http://www.truefx.com/?page=downloads; however, the files are large and there are many of them, so this is causing me issues on my laptop. They are stored as csv files which have been compressed as zip files.
Downloading them and unzipping them is easy enough (although takes up a lot of space on a hard drive).
Loading the 350 MB+ files for one month's worth of data into R is reasonably straightforward with the new fread() function in the data.table package.
Some data.table transformations are done (inside a function) so that the timestamps can be read easily and a mid column is produced. Then the data.table is saved as an RData file on the hard drive, all references to the data.table object are removed from the workspace, and gc() is run after removal. However, when looking at the R session in my Activity Monitor (run from a Mac), it still looks like it is taking up almost 1 GB of RAM, and things seem a bit laggy. I was intending to load several years' worth of the csv files at the same time, convert them to usable data.tables, combine them and then create a single xts object, which seems infeasible if just one month uses 1 GB of RAM.
I know I can sequentially download each file, convert it, save it, shut down R and repeat until I have a bunch of RData files that I can just load and bind, but I was hoping there might be a more efficient way to do this, so that after removing all references to a data.table the RAM usage goes back to "normal", or startup, levels. Are there better ways of clearing memory than gc()? Any suggestions would be greatly appreciated.
In my project I had to deal with many large files. I organized the routine around the following principles:
Isolate memory-hungry operations in separate R scripts.
Run each script in a new process, which is destroyed after execution. Thus the system gives the used memory back.
Pass parameters to the scripts via text file.
Consider the toy example below.
Data generation:
setwd("/path/to")
write.table(matrix(1:5e7, ncol=10), "temp.csv") # 465.2 Mb file
slave.R - memory consuming part
setwd("/path/to")
library(data.table)
# simple processing
f <- function(dt){
dt <- dt[1:nrow(dt),]
dt[,new.row:=1]
return (dt)
}
# reads parameters from file
csv <- read.table("io.csv")
infile <- as.character(csv[1,1])
outfile <- as.character(csv[2,1])
# memory-hungry operations
dt <- as.data.table(read.csv(infile))
dt <- f(dt)
write.table(dt, outfile)
master.R - executes slaves in separate processes
setwd("/path/to")
# 3 files processing
for(i in 1:3){
# sets iteration-specific parameters
csv <- c("temp.csv", paste("temp", i, ".csv", sep=""))
write.table(csv, "io.csv")
# executes slave process
system("R -f slave.R")
}
