Inefficient memory usage in R when turning xml into dataframes

I'm trying to figure out how to stop R from consuming an unreasonable amount of RAM during my script run. My goal is to convert each of a set of 100MB-ish xml files (parts of a large 10GB xml) to a data frame and save it as a csv file, so that the csv files can be merged later. I tried converting the large xml directly, with no success: it crashed not only R but the whole system, which then needed a reboot.
library(XML)
library(plyr)
library(data.table)

# vector with the xml file names in my working directory
filelist <- list.files(pattern = "origfile\\.\\d*\\.xml")

for (i in seq_along(filelist)) {
  data <- xmlParse(filelist[i])
  xml_data <- xmlToList(data)
  df <- ldply(xml_data, rbind)
  fwrite(df, file = paste0("newfile", i, ".csv"), row.names = FALSE)
  rm(data, xml_data, df)
  gc()
}
The problem is that every time the for loop starts a new iteration, memory usage keeps growing: data from previous iterations is not released, even though I delete the variables explicitly. Sometimes RAM usage drops to 60%, but then it grows again. Once some limit is reached, the exception "cannot allocate vector of size x MB" is thrown. The PC has 16GB of RAM, and R shouldn't need more than 1.5-2GB for a single loop iteration.
What should be changed in the code, or what should be done in general, to keep the script from hitting the RAM limit?
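For reference, a commonly suggested variant of the loop above (not part of the original question) frees the C-level document that xmlParse() creates, since gc() only tracks R-level memory. A minimal sketch, assuming the XML package's free() releases the libxml2 document as documented:

library(XML)
library(plyr)
library(data.table)

filelist <- list.files(pattern = "origfile\\.\\d*\\.xml")

for (i in seq_along(filelist)) {
  doc <- xmlParse(filelist[i])   # allocates a C-level libxml2 document that gc() cannot see
  xml_data <- xmlToList(doc)     # copy the contents into a plain R list
  free(doc)                      # explicitly release the libxml2 memory
  df <- ldply(xml_data, rbind)
  fwrite(df, file = paste0("newfile", i, ".csv"))
  rm(doc, xml_data, df)
  gc()
}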

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")

cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)

m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
  # get extent of subset
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)
  # crop main raster to subset extent
  subset <- raster::crop(data, ext)
  # ...
  # perform some processing steps on the raster subset
  # ...
  # save results to a separate file
  saveRDS(subset, paste0("output_folder/", row, "_", col))
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file every time it is called. This seems to be standard behavior of the raster package, but it becomes a problem because these temp files are only deleted after the whole code has been executed, and they take up far too much disk space in the meantime (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset@file@name). However, this no longer works when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this: removeTmpFiles().
You should be able to use f <- filename(subset) and avoid reading from slots (@). I do not see why you would not be able to remove it, but perhaps it needs some fiddling with the path?
Temp files are only created when the raster package deems it necessary, based on the RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory).
Another approach is to provide a filename argument to crop. Then you know what the filename is, and you can delete it. Of course you need to take care not to overwrite data from different tasks, so you may need to use some unique id associated with each task, as sketched below.
saveRDS() won't work if the raster is backed by a temp file (as it will disappear).
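A minimal sketch of the explicit-filename idea, intended for the body of the %dopar% loop above; the .tif extension, the use of tempdir(), and writeRaster() for persisting the result are assumptions, not part of the original answer:

# inside the %dopar% block, after computing `ext` for this row/col
tmp_name <- file.path(tempdir(), sprintf("crop_%d_%d.tif", row, col))  # unique per task
subset <- raster::crop(data, ext, filename = tmp_name, overwrite = TRUE)

# ... processing steps on the raster subset ...

# persist the result as a regular raster file (saveRDS would lose the values
# once the backing file is deleted, as noted above)
raster::writeRaster(subset, filename = paste0("output_folder/", row, "_", col, ".tif"),
                    overwrite = TRUE)
file.remove(tmp_name)  # the intermediate file can now be removed safely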

How to restart R and continue a benchmark script from previous line (on Windows)?

I want to benchmark the time and profile memory used by several functions (regression with random effects and other analysis) applied to different dataset sizes.
My computer has 16GB RAM and I want to see how R behaves with large datasets and what is the limit.
In order to do it I was using a loop and the package bench.
After each iteration I clean the memory with gc(reset=TRUE).
But when the dataset is very large, the garbage collector doesn't work properly; it only frees part of the memory.
At the end all the memory stays filled, and I need to restart my R session.
My full dataset is called allDT and I do something like this:
for (NN in (1:10) * 100000) {
  gc(reset = TRUE)
  myDT <- allDT[sample(.N, NN)]
  assign(paste0("time", NN), mark(
    model1 = glmer(Out ~ var1 + var2 + var3 + (1 | City/ID), data = myDT),
    model2 = glmer(Out ~ var1 + var2 + var3 + (1 | ID), data = myDT),
    iterations = 1, check = FALSE))
}
That way I can get the results for each size.
The method is not fair because by the end the memory hasn't been properly cleaned.
I've thought of an alternative: restart the whole R program after every iteration (exit R and start it again, which is the only way I've found to get the memory fully cleaned), load the data again, and continue from the last step.
Is there any simple way to do that, or any alternative?
Maybe I need to save the results to disk every time, but it will be difficult to keep track of the last executed line, especially if R hangs.
I may need to create an external batch file and run a loop calling R at every iteration, though I'd prefer to do everything from R without any external scripting/batch files.
One thing I do for benchmarks like this is to launch another instance of R and have that other R instance return the results to stdout (or simpler, just save it as a file).
Example:
times <- c()
for (i in seq_along(param)) {
  system(sprintf("Rscript functions/mytest.r %s", param[i]))
  times[i] <- readRDS("/tmp/temp.rds")
}
In the mytest.r file, read in the parameters and save the results to a file.
args <- commandArgs(trailingOnly = TRUE)
NN <- as.numeric(args[1])  # command-line arguments arrive as character strings
allDT <- readRDS("mydata.rds")
...
# save results
saveRDS(myresult, file = "/tmp/temp.rds")
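A self-contained variant of the same pattern using the callr package instead of Rscript plus a temp file; callr is not mentioned in the original answer, and the toy workload below (sorting a random vector) merely stands in for the real data loading and model fits:

library(callr)

sizes <- c(1e5, 2e5, 5e5)
times <- numeric(length(sizes))

for (i in seq_along(sizes)) {
  # callr::r() runs the function in a fresh R session and returns its value,
  # so every iteration starts from a clean memory state
  times[i] <- callr::r(
    function(n) {
      x <- rnorm(n)                          # stand-in for loading a data subset
      system.time(sort(x))[["elapsed"]]      # stand-in for the benchmarked model fit
    },
    args = list(n = sizes[i])
  )
}
times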

Memory issues in-memory dataframe. Best approach to write output?

I’m running into memory issues in my R script while processing a huge folder of files. I have to perform several operations per file and then output one row per file into my results data frame.
Sometimes the resulting data frame has hundreds of rows pasted together in one row, as if it got stuck on the same line (it seems that rbind is not working correctly when the load is huge).
I think the issue arises from keeping a temporary data frame in memory to append results to, so I’m taking another approach:
A loop reads every file one by one, processes it, then opens a connection to the results file, writes a line, closes the connection, and moves on to the next file. It came to mind that avoiding a big data frame in memory and writing immediately to file could solve my issues.
I assume this is very inefficient, so my question is: is there another way of efficiently appending output line by line, instead of binding an in-memory data frame and writing it to disk at the end?
I’m aware of the many options (sink, cat, writeLines, ...); my doubt is which one to use to avoid conflicts and be the most efficient given the conditions.
I have been using the following snippet:
library(data.table)

filepaths <- list.files(dir, full.names = TRUE)
resultFilename <- "/path/to/resultFile.txt"

for (i in seq_along(filepaths)) {
  content <- fread(filepaths[i], header = FALSE, sep = ",")
  ### some manipulation for the content
  results <- content[1]
  fwrite(results, resultFilename, col.names = FALSE, quote = FALSE, append = TRUE)
}

finalData <- fread(resultFilename, header = FALSE, sep = ",")
In my use case, for ~2000 files and tens of millions of rows, the processing time decreased by over 95% compared with using read.csv and incrementally growing a data.frame inside the loop. As you can see in https://csgillespie.github.io/efficientR/importing-data.html (section 4.3.1) and https://www.r-bloggers.com/fast-csv-writing-for-r/, fread and fwrite are very fast data I/O functions.

freeing up memory in R

In R, I am trying to combine and convert several sets of time-series data from http://www.truefx.com/?page=downloads into an xts object. However, the files are large and there are many of them, which is causing me issues on my laptop. They are stored as csv files that have been compressed into zip files.
Downloading them and unzipping them is easy enough (although it takes up a lot of space on the hard drive).
Loading the 350MB+ files for one month's worth of data into R is reasonably straightforward with the new fread() function in the data.table package.
Some data.table transformations are done (inside a function) so that the timestamps can be read easily and a mid column is produced. Then the data.table is saved as an RData file on the hard drive, all references to the data.table object are removed from the workspace, and gc() is run after removal. However, when looking at the R session in my Activity Monitor (on a Mac), it still looks like it is taking up almost 1GB of RAM, and things seem a bit laggy. I was intending to load several years' worth of the csv files at the same time, convert them to usable data.tables, combine them, and then create a single xts object, which seems infeasible if just one month uses 1GB of RAM.
I know I can sequentially download each file, convert it, save it, shut down R, and repeat until I have a bunch of RData files that I can just load and bind, but I was hoping there might be a more efficient way to do this, so that after removing all references to a data.table, RAM usage returns to "normal" or startup levels. Are there better ways of clearing memory than gc()? Any suggestions would be greatly appreciated.
In my project I had to deal with many large files. I organized the routine around the following principles:
Isolate memory-hungry operations in separate R scripts.
Run each script in a new process, which is destroyed after execution, so the system gets the used memory back.
Pass parameters to the scripts via a text file.
Consider the toy example below.
Data generation:
setwd("/path/to")
write.table(matrix(1:5e7, ncol=10), "temp.csv") # 465.2 Mb file
slave.R - memory consuming part
setwd("/path/to")
library(data.table)
# simple processing
f <- function(dt){
dt <- dt[1:nrow(dt),]
dt[,new.row:=1]
return (dt)
}
# reads parameters from file
csv <- read.table("io.csv")
infile <- as.character(csv[1,1])
outfile <- as.character(csv[2,1])
# memory-hungry operations
dt <- as.data.table(read.csv(infile))
dt <- f(dt)
write.table(dt, outfile)
master.R - executes slaves in separate processes
setwd("/path/to")
# 3 files processing
for(i in 1:3){
# sets iteration-specific parameters
csv <- c("temp.csv", paste("temp", i, ".csv", sep=""))
write.table(csv, "io.csv")
# executes slave process
system("R -f slave.R")
}

Append new data to an existing dataframe (RDS) in R

I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.
saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame
On the second... nth iteration, I have the code only process the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.
df2 <- readRDS("H:/Documents/tweet.df.rds") #read in permanent
tmp.df2 <- rbind(df2, tmp.df) #append new to existing
saveRDS(tmp.df2, file="H:/Documents/tweet.df.rds") #save it
rm(df2) #housecleaning
rm(tmp.df2) #housecleaning
This approach is risky because whenever the RDS is open for reading/writing, another process wanting to touch that file has to wait. As the base file gets bigger, the risk increases.
Is there something like an appendRDS (I know literally there isn't) that can achieve what I want - iteratively updating a single data frame saved to a file - using appending rather than complete replacement?
I think you can safeguard your process by using connections, opening and closing them before the next process takes over.
con <- file("tmp.rds")
open(con)
df <- readRDS(con)
df.new <- rbind(df,df)
saveRDS(df.new, con)
close(con)
Update:
You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.
while (isOpen(con)) { # untested but something of this nature should work
  Sys.sleep(2)
}
Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file, since data frames are simply lists of columns, so presumably they are serialized one column at a time, meaning only the last column ends near the end of the file.
If you want to stick with a single file but minimize the risk of reading inconsistent data from an RDS file, you can read it in, do the append operation, then write it out to a temp file and rename the temp file to the original name once it is finished. That way at least your period of risk is not dependent on the size of the file. I'm not familiar with what kind of atomicity is guaranteed by various filesystems when renaming a file to an existing name, but it's probably better than the time taken by saveRDS.
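A minimal sketch of that write-then-rename idea, reusing the file path and the tmp.df object from the question; whether file.rename() is atomic still depends on the OS and filesystem, as noted above:

rds_path <- "H:/Documents/tweet.df.rds"
tmp_path <- paste0(rds_path, ".tmp")

df.old <- readRDS(rds_path)        # read in the permanent data frame
df.new <- rbind(df.old, tmp.df)    # append the new rows from this iteration
saveRDS(df.new, file = tmp_path)   # write the full result to a temporary file first
file.rename(tmp_path, rds_path)    # swap it in, so readers never see a half-written file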
