Set temporary directory for raster() in R not working - raster

I have a large raster data set that I am using with Maxent to make predictions about habitat suitability. I have had trouble with speed, so I rewrote my code to process in parallel, based on this post.
library(foreach)

ncores <- 4
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl, ncores)

rows <- 1:nrow(env)
split <- sort(rows %% ncores) + 1
outname <- "envOut"

prediction <- foreach(i = unique(split), .combine = c) %dopar% {
  rows_sub <- rows[split == i]
  # crop the environment raster down to this worker's block of rows
  sub <- raster::crop(env, raster::extent(env, min(rows_sub), max(rows_sub), 1, ncol(env)))
  # predict on the cropped subset and write the result straight to disk
  raster::predict(object = sub, model = model1,
                  filename = paste("I:/GIS/Projects/LISU/", outname, i, ".tif", sep = ""))
  gc()
}
stopCluster(cl)
However, when I did so, over 500 GB of hard drive space was consumed and I was unable to generate an output raster. Based on these posts, I added the following line before my block of code:
rasterOptions(tmpdir="I:/GIS/tmpdir")
The I: drive is an external hard drive with plenty of space. However, the command seems to have been ignored: when I reran the parallel prediction, my hard drive filled up again. Any advice?

In Windows, I set my environment variable TMPDIR=I:\GIS\tmpdir.
When I started R, typing tempdir() retrieved the correct path to the temporary directory that I wanted to use.
I am still not sure what rasterOptions(tmpdir=...) actually does.
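For reference, rasterOptions(tmpdir = ...) sets the directory the raster package uses for its temporary files, but only for the R session in which it is called; parallel workers are separate R sessions, which is likely why the option set in the master session appeared to be ignored. A minimal sketch of setting it per worker, assuming the same objects and paths as in the question:
prediction <- foreach(i = unique(split), .combine = c) %dopar% {
  # each worker is its own R session, so set the raster temp directory here
  raster::rasterOptions(tmpdir = "I:/GIS/tmpdir")
  rows_sub <- rows[split == i]
  sub <- raster::crop(env, raster::extent(env, min(rows_sub), max(rows_sub), 1, ncol(env)))
  raster::predict(object = sub, model = model1,
                  filename = paste0("I:/GIS/Projects/LISU/", outname, i, ".tif"))
}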

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
library(foreach)

grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")

cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)

m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
  # get extent of subset
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)
  # crop main raster to subset extent
  subset <- raster::crop(data, ext)
  # ...
  # perform some processing steps on the raster subset
  # ...
  # save results to a separate file
  saveRDS(subset, paste0("output_folder/", row, "_", col))
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file every time it is called. This seems to be standard behavior of the raster package, but it becomes a problem because these temp files are only deleted after the whole code has finished executing, and in the meantime they take up far too much disk space (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset@file@name). However, this no longer works when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this: removeTmpFiles().
You should be able to use f <- filename(subset) and avoid reading from slots (@). I do not see why you would not be able to remove it, but perhaps it needs some fiddling with the path?
temp files are only created when the raster package deems it necessary, based on RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory)
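As a small illustration of that last point, using the data raster from the question (the memory values below are arbitrary examples, not recommendations):
# check whether raster would keep this object in RAM, and why or why not
raster::canProcessInMemory(data, verbose = TRUE)

# loosen the conservative defaults if plenty of RAM is available
raster::rasterOptions(maxmemory = 1e10, memfrac = 0.8)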
Another approach is to provide a filename argument to crop. Then you know what the filename is, and you can delete it. Of course you need to take care of not overwriting data from different tasks, so you may need to use some unique id associated with it.
saveRDS() won't work if the raster is backed by a temp file (as that file will disappear).
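A rough sketch of that filename-based approach, assuming the temporary file naming below and that, per the previous note, the cell values are pulled into memory before the backing file is deleted:
m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)

  # explicit, per-task file name, so this worker knows exactly what to delete
  tmpname <- file.path(tempdir(), paste0("crop_", row, "_", col, ".tif"))
  subset <- raster::crop(data, ext, filename = tmpname, overwrite = TRUE)

  # ... processing steps ...

  # pull the values into memory before saving, since the backing file goes away
  saveRDS(raster::getValues(subset), paste0("output_folder/", row, "_", col))

  rm(subset)
  file.remove(tmpname)
}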

How to restart R and continue a benchmark script from previous line (on Windows)?

I want to benchmark the time and profile memory used by several functions (regression with random effects and other analysis) applied to different dataset sizes.
My computer has 16GB RAM and I want to see how R behaves with large datasets and what is the limit.
To do this I am using a loop and the bench package.
After each iteration I clean the memory with gc(reset = TRUE).
But when the dataset is very large, the garbage collector doesn't work properly; it only frees part of the memory.
At the end all the memory stays filled, and I need to restart my R session.
My full dataset is called allDT and I do something like this:
for (NN in (1:10) * 100000) {
  gc(reset = TRUE)
  myDT <- allDT[sample(.N, NN)]
  assign(paste0("time", NN), mark(
    model1 = glmer(Out ~ var1 + var2 + var3 + (1 | City/ID), data = myDT),
    model2 = glmer(Out ~ var1 + var2 + var3 + (1 | ID), data = myDT),
    iterations = 1, check = FALSE))
}
That way I can get the results for each size.
The method is not fair because at the end the memory doesn't get properly cleaned.
I've thought of an alternative: restart the whole R program after every iteration (exit R and start it again; this is the only way I've found to get the memory fully cleaned), load the data again, and continue from the last step.
Is there any simple way to do this, or any alternative?
Maybe I need to save the results to disk every time, but it will be difficult to keep track of the last executed line, especially if R hangs.
I may need to create an external batch file and run a loop calling R at every iteration, though I'd prefer to do everything from R without any external scripting or batch files.
One thing I do for benchmarks like this is to launch another instance of R and have that other R instance return the results to stdout (or, simpler, just save them to a file).
Example:
times <- c()
for (i in seq_along(param)) {
  system(sprintf("Rscript functions/mytest.r %s", param[i]))
  times[i] <- readRDS("/tmp/temp.rds")
}
In the mytest.r file, read in the parameters and save the results to a file.
args <- commandArgs(trailingOnly = TRUE)
NN <- as.integer(args[1])   # command-line arguments arrive as character strings
allDT <- readRDS("mydata.rds")
# ...
# save results
saveRDS(myresult, file = "/tmp/temp.rds")
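Building on this, a sketch of a driver loop that survives crashes or hangs: each dataset size writes its own result file, so rerunning the loop simply skips the sizes that already finished. The results/ folder, the file names, and the extra command-line argument for the output path are assumptions for illustration, not part of the answer above:
sizes <- (1:10) * 100000L   # integer literal keeps names like "time_100000" rather than "time_1e+05"
dir.create("results", showWarnings = FALSE)

for (NN in sizes) {
  outfile <- file.path("results", paste0("time_", NN, ".rds"))
  if (file.exists(outfile)) next   # already finished in a previous run; skip it
  # mytest.r would read the size as args[1] and the output path as args[2]
  system(sprintf("Rscript functions/mytest.r %d %s", NN, outfile))
}

# collect everything once all sizes have completed
results <- lapply(sizes, function(NN) readRDS(file.path("results", paste0("time_", NN, ".rds"))))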

Is there a way to transfer R objects to separate R sessions on linux?

I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e6^2), 1e6)
saveRDS(big_data, file = "big_data.Rds")

# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data + rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`

# run this program a bunch
for (i in 1:1000) {
  system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R` open a terminal and do
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data + 1
print("yahoo!")
# save this as `my_improved_program.R`

# now do the following:
for (i in 1:1000) {
  system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, that I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
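Applied to the example data from the question, that looks like this; the actual speed-up depends on the data and the disk, so it is worth benchmarking rather than assuming:
saveRDS(big_data, file = "big_data.Rds", compress = FALSE)   # larger file, but faster to write and read
big_data <- readRDS("big_data.Rds")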
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)

Running system() inside foreach()

I am running R version 3.4.1 on Windows 7.
I need to run a certain program using N=500 bootstrapped data files that I have stored in a working directory. For each of the N runs the executable produces 3 more data files, all of which are stored in the same working directory. I'd like to use parallel computing to speed this process up.
The loop requires one starting file ('starter') that gets modified at every iteration of the loop. This is a really important part that I can't avoid. Also, at every iteration any pre-existing data files are removed and then the executable is called using system(). The executable produces the three data files called Report, CompReport, and covar.
In a normal loop the process looks something like this:
starter <- SS_readstarter(file = "starter.ss")   # read starter file
for (iboot in 1:N) {
  # point the starter file at the iboot-th bootstrap data file
  starter$datfile <- paste("BootData", iboot, ".ss", sep = "")
  # replace the starter file with the new one
  SS_writestarter(starter, overwrite = TRUE)
  # any old files are removed
  file.remove("Report.sso", paste("Report_", iboot, ".sso", sep = ""))
  file.remove("CompReport.sso", paste("CompReport_", iboot, ".sso", sep = ""))
  file.remove("covar.sso", paste("covar_", iboot, ".sso", sep = ""))
  # then the program is called
  system("ss3")
  # the files produced get pushed to the working directory
  file.copy("Report.sso", paste("Report_", iboot, ".sso", sep = ""))
  file.copy("CompReport.sso", paste("CompReport_", iboot, ".sso", sep = ""))
  file.copy("covar.sso", paste("covar_", iboot, ".sso", sep = ""))
}
The process above takes 5 days to complete, so I've been exploring options for parallelization. What I've tried so far doesn't work, so any help or suggestions are appreciated. My idea was to use a foreach loop while making sure I'm not reading and overwriting the starter file. This is what I have:
# read starter file
starter_strt <- SS_readstarter(file = "starter.ss")
iter <- 2   # just for an example here, even though I said N = 500 above
cores <- parallel::detectCores() - 1
cl <- makeCluster(cores)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())
foreach(iboot = 1:iter, .packages = 'r4ss', .export = "starter_strt", system("ss3")) %dopar% {
  # code to create the starter file so I don't overwrite it
  datfile <- data.frame(rep(paste("BootData", iboot, ".ss", sep = ""), 2))
  colnames(datfile) <- "datfile"
  starter_list <- cbind(starter_strt, datfile)
  # replace starter file with modified version
  SS_writestarter(starter_list, overwrite = TRUE)
  # delete files as above
  # run program
  system("ss3")
  # the files produced get pushed to the working directory
  file.copy("Report.sso", paste("Report_", iboot, ".sso", sep = ""))
  file.copy("CompReport.sso", paste("CompReport_", iboot, ".sso", sep = ""))
  file.copy("covar.sso", paste("covar_", iboot, ".sso", sep = ""))
}
The program runs through once but then I get a warning (see below) and the expected files are missing from the directory.
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): starter_strt
I'm pretty stumped at this point. I saw this example Running multiple R scripts using the system() command where the system command is called in a parLapply function. Maybe this is what I need? Help and/or suggestions are greatly appreciated.
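For what it's worth, here is a rough sketch of the parLapply pattern from that linked example, with one extra assumption layered on top: each bootstrap run gets its own sub-directory, so parallel runs cannot overwrite each other's starter.ss and *.sso files. The executable name (ss3.exe), the copied file list, and the run_ directory naming are all assumptions for illustration, not something taken from the question:
library(parallel)

basedir <- getwd()   # directory holding starter.ss, the BootData*.ss files, and the executable
cl <- makeCluster(parallel::detectCores() - 1)
clusterEvalQ(cl, library(r4ss))

results <- parLapply(cl, 1:iter, function(iboot, basedir) {
  # give every bootstrap run its own directory so parallel runs don't collide
  rundir <- file.path(basedir, paste0("run_", iboot))
  dir.create(rundir, showWarnings = FALSE)
  file.copy(file.path(basedir, c("ss3.exe", "starter.ss",   # executable name is an assumption
                                 paste0("BootData", iboot, ".ss"))), rundir)

  oldwd <- setwd(rundir)
  on.exit(setwd(oldwd))

  starter <- SS_readstarter(file = "starter.ss")
  starter$datfile <- paste0("BootData", iboot, ".ss")
  SS_writestarter(starter, overwrite = TRUE)

  system("ss3")   # writes Report.sso, CompReport.sso, covar.sso inside rundir

  # copy the outputs back with the iteration number in the name
  file.copy("Report.sso",     file.path(basedir, paste0("Report_", iboot, ".sso")))
  file.copy("CompReport.sso", file.path(basedir, paste0("CompReport_", iboot, ".sso")))
  file.copy("covar.sso",      file.path(basedir, paste0("covar_", iboot, ".sso")))
  iboot
}, basedir = basedir)

stopCluster(cl)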

saveRDS inflating size of object

This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans it and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10 MB.
It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.
For example (schematically):
db.connection <- db.connection.object

build_model_list <- function(db.connection) {
  clean_and_build_models <- function(db.connection, other.parameters) {
    db.data <- get_db_data(db.connection, some.parameters)   # retrieves DB data; externally defined
    clean.data <- clean_data(db.data, some.parameters)       # cleans and filters data based on parameters; externally defined
    lm.model <- lm_model(clean.data)                         # builds lm model based on clean.data; externally defined
    return(list(lm.model, other.parameters))
  }
  looped.model.object <- llply(some.parameters, clean_and_build_models)
  return(looped.model.object)
}

model.list <- build_model_list(db.connection)
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that the model.list object, which is only 10 MB in memory, inflates to many GB when I save it locally as an RDS file or try to upload it to AWS S3.
I should note that although the function processes very large quantities of data (~5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (which are part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This, however, has some big disadvantages.
Firstly, it's trial and error and follows no clear logical order; with frequent crashes and a couple of hours taken to build the list object, it makes for a very long debugging cycle.
Secondly, it means my main function will be many hundreds of lines long, which will make future alterations and debugging much trickier.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.
It took a bit of digging but I did actually find a solution in the end.
It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:
https://blogs.oracle.com/R/entry/is_the_size_of_your
It turns out that the lm.object$terms component includes an environment that references the objects that were present in the global environment when the model was built. Under certain circumstances, saveRDS will try to pull those environment objects into the saved object.
As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically: it was effectively trying to compress ~100 GB of data.
To test whether this is what's causing the problem, execute the following code:
as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL))))
This will tell you if the $terms component is inflating.
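To see the mechanism in isolation, here is a minimal, self-contained sketch (the fit_with_baggage name and the sizes are made up for illustration): a model fitted inside a function drags that function's environment along through $terms, and stripping the reference shrinks the serialized size.
fit_with_baggage <- function() {
  baggage <- rnorm(1e6)   # ~8 MB that ends up referenced by the model's formula environment
  d <- data.frame(x = rnorm(100), y = rnorm(100))
  lm(y ~ x, data = d)
}
m <- fit_with_baggage()

# per-component sizes, as in the diagnostic above
as.matrix(lapply(m, function(x) length(serialize(x, NULL))))

length(serialize(m$terms, NULL))   # large: drags in 'baggage' via the .Environment attribute
attr(m$terms, ".Environment") <- NULL
length(serialize(m$terms, NULL))   # small once the environment reference is removed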
The following code will remove the environmental references from the $terms component:
rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment"))
Be warned, though: it will also remove all the global environment objects it references.
For model objects you could also simply delete the reference to the environment.
For example, like this:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")
This unfortunately breaks the model, but you can add an environment manually after loading it again.
lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()
This helped me in my specific use case and looks a bit safer to me...
Neither of these two solutions worked for me.
Instead I have used:
downloaded_object <- storage_download(connection, "path")
read_RDS <- readRDS(downloaded_object)
The answer by mhwh mostly solved my problem, but I needed the additional step of creating an empty list and copying into it only what was relevant from the model object. This might be due to additional (undocumented) environment references associated with the model class I used.
mm <- felm(formula = formula, data = data, keepX = TRUE, ...)
# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$... <- anything else from mm you might need
mm_cp$terms <- terms(mm)
attr(mm_cp$terms, ".Environment") <- NULL
saveRDS(mm_cp, file = "path_to_save.RDS")
Then when we need to use it:
mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()
In my case the file went from 5.5 GB to 13 KB. Additionally, reading in the file used to allocate more than 32 GB of memory, over six times the file size. This also reduced execution time significantly (no need to recreate various environments?).
Environment references sound like an excellent contender for a new chapter in the R Inferno book.

Resources