Freeing up memory in R

In R, I am trying to combine several sets of time-series data from http://www.truefx.com/?page=downloads and convert them to an xts object. However, the files are large and there are many of them, which is causing problems on my laptop. They are stored as csv files compressed into zip archives.
Downloading and unzipping them is easy enough (although it takes up a lot of space on the hard drive).
Loading the 350MB+ files for one month's worth of data into R is reasonably straightforward with the new fread() function in the data.table package.
Some data.table transformations are done (inside a function) so that the timestamps can be read easily and a mid column is produced. Then the data.table is saved as an RData file on the hard drive, all references to the data.table object are removed from the workspace, and gc() is run after removal. However, when I look at the R session in Activity Monitor (I'm on a Mac), it still appears to be taking up almost 1GB of RAM, and things seem a bit laggy. I was intending to load several years' worth of the csv files at the same time, convert them to usable data.tables, combine them and then create a single xts object, which seems infeasible if just one month uses 1GB of RAM.
I know I can sequentially download each file, convert it, save it, shut down R and repeat until I have a bunch of RData files that I can just load and bind, but I was hoping there might be a more efficient way to do this, so that after removing all references to a data.table, RAM usage returns to "normal" or startup levels. Are there better ways of clearing memory than gc()? Any suggestions would be greatly appreciated.

In my project I had to deal with many large files. I organized the routine on the following principles:
Isolate memory-hungry operations in separate R scripts.
Run each script in a new process that is destroyed after execution, so the system gets the used memory back.
Pass parameters to the scripts via text file.
Consider the toy example below.
Data generation:
setwd("/path/to")
write.table(matrix(1:5e7, ncol=10), "temp.csv") # 465.2 Mb file
slave.R - memory consuming part
setwd("/path/to")
library(data.table)
# simple processing
f <- function(dt){
  dt <- dt[1:nrow(dt),]
  dt[,new.row:=1]
  return (dt)
}
# reads parameters from file
csv <- read.table("io.csv")
infile <- as.character(csv[1,1])
outfile <- as.character(csv[2,1])
# memory-hungry operations
dt <- as.data.table(read.csv(infile))
dt <- f(dt)
write.table(dt, outfile)
master.R - executes slaves in separate processes
setwd("/path/to")
# 3 files processing
for(i in 1:3){
  # sets iteration-specific parameters
  csv <- c("temp.csv", paste("temp", i, ".csv", sep=""))
  write.table(csv, "io.csv")
  # executes slave process
  system("R -f slave.R")
}
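As a variation on the text-file hand-off above, the parameters can also be passed on the command line and read with commandArgs(). This is only a minimal sketch under assumptions, not the method used in the answer: the names worker.R, in.csv and out.csv are hypothetical, a tiny data frame stands in for the large file, and Rscript is taken from the running R installation.

```r
# Hypothetical worker script, written to a temp directory for illustration.
work_dir <- tempfile("demo")
dir.create(work_dir)
worker <- file.path(work_dir, "worker.R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)  # infile, outfile",
  "d <- read.csv(args[1])",
  "d$new.col <- 1",
  "write.csv(d, args[2], row.names = FALSE)"
), worker)

# Small input file standing in for the large csv.
infile <- file.path(work_dir, "in.csv")
outfile <- file.path(work_dir, "out.csv")
write.csv(data.frame(a = 1:3, b = 4:6), infile, row.names = FALSE)

# Run the worker in a fresh process; its memory goes back to the
# operating system when the process exits.
rscript <- file.path(R.home("bin"), "Rscript")
status <- system2(rscript, shQuote(c(worker, infile, outfile)))

result <- read.csv(outfile)
```

A non-zero status signals that the worker failed, which the master loop can check before moving on to the next file.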

Related

Inefficient memory usage in R when turning xml into dataframes

I'm trying to figure out how to stop R from consuming an unreasonable amount of RAM while my script runs. I need to convert each of several 100MB-ish xml files (parts of a large 10GB xml) to a data frame and save it as a csv file to merge later. I tried converting the large xml directly, without success: it crashed not only R but the whole system, which needed a reboot.
library(XML)
library(plyr)
library(data.table)
filelist<-list.files(pattern="origfile\\.\\d*\\.xml")
#vector with xml names in my working directory
for (i in 1:length(filelist)) {
data <- xmlParse(filelist[i])
xml_data <- xmlToList(data)
df<-ldply(xml_data, rbind)
fwrite(df, file = paste("newfile",i,".csv",sep=""),row.names = FALSE)
rm(data,xml_data,df)
gc()
}
The problem is that on every pass through the for loop, memory use grows: variables that are no longer needed are not discarded, even though I delete them explicitly. Sometimes RAM usage drops to 60%, but then it climbs again. After it reaches some limit, the exception "cannot allocate vector of size x MB" is thrown. The PC has 16GB of RAM, but R doesn't seem to need more than 1.5-2GB for one iteration.
What should be changed in the code, or done in general, to avoid hitting the RAM limit?

Importing very large dataset into h2o from sqlite

I have a database of about 500G. It comprises 16 tables, each containing 2 or 3 columns (the first column can be discarded) and 1,375,328,760 rows. I need all the tables to be joined as one dataframe in h2o, as they are needed for running a prediction in an XGB model. I have tried to convert the individual sql tables into the h2o environment using as.h2o, and to h2o.cbind them 2 or 3 tables at a time until they are one dataset. However, I get "GC overhead limit exceeded: java.lang.OutOfMemoryError" after converting 4 tables.
Is there a way around this?
My machine specs are 124G RAM, OS (Rhel 7.8), Root(1tb), Home(600G) and 2TB external HDD.
The model is run on this local machine and the max_mem_size is set at 100G. The details of the code are below.
library(data.table)
library(h2o)
h2o.init(
nthreads=14,
max_mem_size = "100G")
h2o.removeAll()
setwd("/home/stan/Documents/LUR/era_aq")
l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <-h2o.cbind(l1.hex,l2.hex[,-1])
h2o.rm (l1.hex,l2.hex)
l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <-h2o.cbind(l3.hex,l4.hex[,-1])
h2o.rm(l3.hex,l4.hex)
l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <-h2o.cbind(l5.hex,l6.hex[,-1])
h2o.rm(l5.hex,l6.hex)
l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <-h2o.cbind(l7.hex,l8.hex[,-1])
h2o.rm(ll7.hex,l8.hex)
test.hex <-h2o.cbind(test_l1.hex,test_l2.hex[,-1],test_l3.hex[,-1],test_l4.hex[,-1])
test <- test.hex[,-1]
test[1:3,]
First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.
(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
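The sizing rule above works out as follows. This is a minimal sketch that uses only the figures quoted in the question and the answer (500GB of data, a 3 to 4x multiplier, 124GB of RAM); the variable names are made up for illustration.

```r
data_gb <- 500                        # dataset size from the question
multiplier <- c(low = 3, high = 4)    # rule of thumb from the answer
needed_tb <- data_gb * multiplier / 1000
needed_tb                             # 1.5 (low) and 2.0 (high) TB

available_gb <- 124                   # RAM of the machine in the question
fits <- data_gb * multiplier <= available_gb
fits                                  # FALSE for both bounds
```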
Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.
So, the better way is to export directly from sqlite to a csv file. And then use h2o.importFile() to load that csv file into H2O.
h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files in advance of import, it might be more efficient. A quick search found csvkit, but I'm not sure if it needs to load the files into memory, or can work with the files completely on disk.
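One way such a column-binding script could look in base R is sketched below. It streams the files in fixed-size chunks of lines, so memory use depends on the chunk size rather than the file size. The function name cbind_csv_on_disk is hypothetical, and the sketch assumes that every file has the same rows in the same order, that the first column of each file is an index to discard, and that no field contains an embedded comma or newline.

```r
# Column-bind csv files chunk by chunk without loading whole files.
cbind_csv_on_disk <- function(infiles, outfile, chunk_lines = 10000L) {
  cons <- lapply(infiles, file, open = "r")
  on.exit(lapply(cons, close), add = TRUE)
  out <- file(outfile, open = "w")
  on.exit(close(out), add = TRUE)
  drop_first <- function(x) sub("^[^,]*,", "", x)  # strip the leading field
  repeat {
    chunks <- lapply(cons, readLines, n = chunk_lines)
    n <- lengths(chunks)
    if (all(n == 0L)) break
    if (length(unique(n)) != 1L) stop("input files differ in row count")
    chunks <- lapply(chunks, drop_first)
    # paste the matching lines of all files together, header line included
    writeLines(do.call(paste, c(chunks, sep = ",")), out)
  }
  invisible(outfile)
}

# Usage on two small files standing in for the exported tables.
dir <- tempfile("csv")
dir.create(dir)
f1 <- file.path(dir, "a.csv")
f2 <- file.path(dir, "b.csv")
write.csv(data.frame(id = 1:3, x = c(10, 20, 30)), f1,
          row.names = FALSE, quote = FALSE)
write.csv(data.frame(id = 1:3, y = c("u", "v", "w")), f2,
          row.names = FALSE, quote = FALSE)
merged <- cbind_csv_on_disk(c(f1, f2), file.path(dir, "merged.csv"),
                            chunk_lines = 2L)
res <- read.csv(merged)
```

The merged file can then be loaded in one step with h2o.importFile(), as suggested above.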
Since memory is at a premium and R keeps all objects in RAM, avoid storing large helper data.table and h2o objects in your global environment. Consider building the list inside a function, so that temporary objects are removed once the function goes out of scope. Ideally, you build your h2o objects directly from the file source:
# BUILD LIST OF H2O OBJECTS WITHOUT HELPER COPIES
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[-1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[-1])
# CBIND ALL H2O OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)
Or even combine both lines, using a named function instead of an anonymous one. Then only the final object remains after processing.
build_h2o <- function(f) as.h2o(data.table::fread(f))[-1]
# build_h2o <- function(f) h2o.importFile(f)[-1]
test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))
Extend the function with an if when only some datasets need the first column dropped.
build_h2o <- function(f) {
  # data.table needs [, -1] to drop the first column; [-1] would drop the first row
  if (grepl("lai|lu1000|lu250|msl", f)) { tmp <- fread(f)[, -1] }
  else { tmp <- fread(f) }
  return(as.h2o(tmp))
}
Finally, if possible, leverage data.table methods like cbindlist:
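# [, -1] drops the first column; if your data.table version lacks cbindlist(), do.call(cbind, ...) is an alternative
final_dt <- cbindlist(lapply(list_of_files, function(f) fread(f)[, -1]))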
test.h2o <- as.h2o(final_dt)
rm(final_dt)
gc()

Iteratively resave a directory tree of Excel files

I regularly receive data from a source that produces a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply. The files can, however, be opened just fine by XLConnect::loadWorkbook. Unfortunately, even when I allocate huge amounts of memory to the Java virtual machine, it always crashes after reading a few files. I tried adding these lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel and then they read in just fine from readxl::read_excel. I'm hoping I could also resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, using Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process.
Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern="xlsx$")
filesPerChunk <- 5
clustSize <- 4 # or however many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files)%/%runSize + (length(files)%%runSize != 0)
library(parallel)
sheets <- lapply(seq(runs), function(i) {
  runStart <- (i - 1) * runSize + 1
  runEnd <- min(length(files), runStart + runSize - 1)
  runFiles <- files[runStart:runEnd]
  # periodically restart and stop the cluster to deal with memory leaks
  cl <- makeCluster(clustSize)
  on.exit(stopCluster(cl))
  parLapply(cl, runFiles, function(f) {
    require(XLConnect)
    loadWorkbook(f, ...)
  })
})
sheets <- unlist(sheets, recursive=FALSE) # convert list of lists to a simple list
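The first option, one Rscript process per file, could be sketched roughly as follows. This is an illustration under assumptions rather than tested XLConnect code: read.csv() stands in for the XLConnect calls so the sketch runs without Java, it saves .rds rather than .RData for brevity, and the script name read_one.R and the input files are hypothetical.

```r
# Hypothetical worker: reads one file and saves the result as .rds.
# read.csv() is a stand-in; for real spreadsheets swap in
# XLConnect::loadWorkbook() followed by XLConnect::readWorksheet().
work_dir <- tempfile("sheets")
dir.create(work_dir)
worker <- file.path(work_dir, "read_one.R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "saveRDS(read.csv(args[1]), args[2])"
), worker)

# Two small input files for illustration.
inputs <- file.path(work_dir, c("a.csv", "b.csv"))
write.csv(data.frame(x = 1:2), inputs[1], row.names = FALSE)
write.csv(data.frame(x = 3:4), inputs[2], row.names = FALSE)

rscript <- file.path(R.home("bin"), "Rscript")
sheets <- lapply(inputs, function(f) {
  out <- paste0(f, ".rds")
  status <- system2(rscript, shQuote(c(worker, f, out)))
  if (status != 0) stop("worker failed for ", f)
  readRDS(out)  # only the finished data frame enters the master session
})
combined <- do.call(rbind, sheets)
```

Because every file gets its own short-lived process, whatever the Java heap leaks is released when that process exits.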

Working with Large Number of csv files in R

I have a directory containing ~40000 csv files, each ranging in size from ~400 bytes to ~11 MB. I have written a function that reads in a csv file and calculates some basic numbers for each csv file (e.g. how many values of "female" are in each csv file). This code ran successfully on the same number of csv files when the files were smaller.
I'm using the packages parallel and doParallel to run this on my machine and receive the following error:
Error in unserialize(node$con) : error reading from connection
I suspect I'm running out of memory but I am not sure how best to handle the increased size of the files I'm working with. My code is as follows:
Say that 'fpath' is the path to my directory where all these csvs live:
fpath<-"../Desktop/bigdirectory"
filelist <- as.vector(list.files(fpath,pattern="site.csv"))
(f <- file.path(fpath,filelist))
# demographics csv in specified directory
dlist <- as.vector(list.files(fpath,pattern="demographics.csv"))
(d <- file.path(fpath,dlist))
demos <- fread(d,header=T,sep=",")
cl <- makeCluster(4)
registerDoParallel(cl)
setDefaultCluster(cl)
clusterExport(NULL,c('transit.grab'))
clusterEvalQ(NULL,library(data.table))
sdemos <- demo.grab(dl[[1]],demos)
stmp <- parLapply(cl,f,FUN=transit.grab,sdemos)
And the function 'transit.grab' is the following:
transit.grab <- function(sitefile,selected.demos){
  require(sqldf)
  demos <- selected.demos
  # Renaming sitefile columns
  sf <- fread(sitefile,header=T,sep=",")
  names(sf)[1] <- c("id")
  # Selecting transits from site file using list of selected users
  sdat <- sqldf('select sf.* from sf inner join demos on sf.id=demos.id')
  return(sdat)
}
I'm not looking for someone to debug code, as I know it runs properly for a smaller amount of data, but rather I desperately need suggestions on how to implement this code for ~6.7 GB of data. Any and all feedback is welcome, thanks!
UPDATE:
As suggested, I replaced sqldf() with merge(), which reduced my computation time by half when tested on a smaller directory. When observing my memory usage via Activity Monitor, my trend is pretty flat. BUT now when I try running my code on the large directory, my R session crashes.

Append new data to an existing dataframe (RDS) in R

I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.
saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame
On the second... nth iteration, I have the code only process the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.
df2 <- readRDS("H:/Documents/tweet.df.rds") #read in permanent
tmp.df2 <- rbind(df2, tmp.df) #append new to existing
saveRDS(tmp.df2, file="H:/Documents/tweet.df.rds") #save it
rm(df2) #housecleaning
rm(tmp.df2) #housecleaning
This approach is risky because whenever the RDS is open for reading/writing, another process wanting to touch that file has to wait. As the base file gets bigger, the risk increases.
Is there something like an appendRDS (I know literally there isn't) that can achieve what I want: iterative updating of a single data frame, saved to a file, that uses appending rather than complete replacement?
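I think you can safeguard your process by using connections, opening each one and closing it before the next process takes over.
con <- gzfile("tmp.rds", "rb")  # readRDS needs a binary-mode connection
df <- readRDS(con)
close(con)
df.new <- rbind(df,df)  # placeholder: append the new rows here
con <- gzfile("tmp.rds", "wb")
saveRDS(df.new, con)
close(con)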
Update:
You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.
while(isOpen(con)) { # untested but something of this nature should work
  Sys.sleep(2)
}
Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file, since data frames are simply lists of columns, so presumably they are serialized one column at a time, so only the last column ends near the end of the file.
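A minimal sketch of the numbered-files idea, assuming a single writer process (numbering by counting existing files is not safe with concurrent writers). The helper names append_chunk and read_all_chunks and the chunk_NNNNNN.rds naming scheme are made up for illustration.

```r
# Each batch of new rows goes into its own file, so nothing that was
# already written is ever rewritten.
append_chunk <- function(df, dir) {
  n <- length(list.files(dir, pattern = "^chunk_[0-9]+\\.rds$"))
  saveRDS(df, file.path(dir, sprintf("chunk_%06d.rds", n + 1L)))
}

# Readers rebuild the full data frame on demand.
read_all_chunks <- function(dir) {
  files <- sort(list.files(dir, pattern = "^chunk_[0-9]+\\.rds$",
                           full.names = TRUE))
  do.call(rbind, lapply(files, readRDS))
}

store <- tempfile("tweets")
dir.create(store)
append_chunk(data.frame(id = 1:2, text = c("a", "b")), store)
append_chunk(data.frame(id = 3L, text = "c"), store)
all_rows <- read_all_chunks(store)
```

The zero-padded numbers keep the files in write order when sorted by name.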
If you want to stick with a single file but minimize the risk of reading inconsistent data from an RDS file, you can read it in, do the append operation, and then write it out to a temp file and rename the temp file to the original name once it is finished. Then at least your period of risk is not dependent on the size of the file. I'm not familiar with what kind of atomicity is guaranteed by various filesystems when renaming a file to an existing name, but it's probably better than the time taken by saveRDS.
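The write-then-rename approach might look like the sketch below. The helper name save_rds_via_rename is hypothetical. The temp file is created in the same directory as the target so that the rename stays on one filesystem; whether the rename is atomic depends on the platform and filesystem, as noted above.

```r
# Write the full object to a temp file next to the target, then rename
# it over the target in a single step.
save_rds_via_rename <- function(object, path) {
  tmp <- tempfile(pattern = "rds_tmp_", tmpdir = dirname(path))
  saveRDS(object, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)
    stop("could not rename ", tmp, " to ", path)
  }
  invisible(path)
}

target <- file.path(tempdir(), "tweet.df.rds")
save_rds_via_rename(data.frame(id = 1:2), target)

# Append: read, bind, and replace the file through the helper.
old <- readRDS(target)
save_rds_via_rename(rbind(old, data.frame(id = 3L)), target)
final <- readRDS(target)
```

The slow saveRDS call only ever touches the temp file, so a reader sees either the old file or the new one.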
