I am trying to create a global table that is produced by asynchronously running parallel processes. The processes are completely independent, but they should all append to the same global variable (this is reactive in R Shiny, so I either need a callback function that fires once all futures are done with their task, which would be very nice but I don't know how to do it, or I need to constantly update the table as new results come in).
I tried the following approach, which just locks up (probably because all processes are assigning to the same variable; when I change 'a' to 'b' it works, but then the result is useless):
library("listenv")
library("future")
plan(multiprocess)
futureVals <- listenv()
options(future.globals.onMissing = "ignore")
a<-0
b<-0
for(i in 1:5){
futureVals[[i]] <- futureAssign(x='a', value={
a <- a+1
print(a)
})
}
futureVals2 <- as.list(futureVals)
print(a)
How can I achieve this goal?
It is not possible for future (or other parallel, background R workers) to assign values to variables in the master R process. Any results need to be returned as values. This is a fundamental property of all parallel/asynchronous processing in R.(*)
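For what it's worth, here is a minimal sketch of that value-returning pattern: each future computes its piece independently, and the master process collects the values and appends them into one table (compute_row() is just a hypothetical stand-in for your real task):
library(future)
plan(multisession)

# Hypothetical per-process task; replace with your real computation
compute_row <- function(i) {
  data.frame(id = i, value = i^2)
}

# Launch independent futures; each one returns its result as a value
futs <- lapply(1:5, function(i) future(compute_row(i)))

# value() blocks until each future is resolved, so this collects
# everything in the master process once all workers are done
results <- do.call(rbind, lapply(futs, value))
print(results)
In a Shiny app, the promises package linked below builds on the same idea, letting you attach a callback with then() that runs once a future resolves.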
Having said this, you might be interested in https://rstudio.github.io/promises/articles/shiny.html.
PS. (*) Your expectations of futureAssign() seem to be incorrect.
Related
I'm struggling to clearly explain this problem.
Essentially, something seems to have happened within the R environment: none of the code I write inside my functions is working, and no data is being saved. If I type a command directly into the console it works (e.g. Monkey <- 0), but if I type it within a function, nothing is stored when I run the function.
It could be that I'm missing a glaring error in the code, but I noticed the problem when I accidentally clicked on the debugger and tried to exit out of the browser[1] prompt which appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold = 0) {
  directory <- paste(getwd(), "/", directory, "/", sep = "")
  file.list <- list.files(directory)
  number <- 1:length(file.list)
  monkey <- c()
  for (i in number) {
    x <- paste(directory, file.list[i], sep = "")
    y <- read.csv(x)
    t <- sum(complete.cases(y))
    if (t >= threshold) {
      correl <- cor(y$sulfate, y$nitrate, use = 'pairwise.complete.obs')
      monkey <- append(monkey, correl)
    }
  }
  #correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
  #summary(correl)
}

corr('specdata', 150)
monkey
It's a scoping issue. Functions create their own environment, which is not the global environment.
Using <- assigns in the local environment. To save an object to the global environment, use <<-.
Here's some information on R environments.
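A minimal illustration of the difference (in a fresh session, outside of your code):
f_local  <- function() { x <- 1 }    # assigns only inside the function's environment
f_global <- function() { x <<- 1 }   # assigns in the global environment

f_local();  exists("x")    # FALSE
f_global(); exists("x")    # TRUE
That said, <<- is usually a last resort; returning the value, as the other answers suggest, is the cleaner fix.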
I suggest you take a look at a tutorial on using functions in R.
Briefly (and sorry for the rough explanation): objects that you define within a function exist ONLY inside that function, unless you explicitly export them, for example with the return() function (one of the possible approaches).
browser() is indeed used for debugging; it keeps you inside the function and lets you access the objects created inside it.
In addition, to increase the probability of getting useful answers, I suggest you post a self-contained, working piece of code that allows the issue to be reproduced quickly. Here you are reading files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)
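As an aside on the return pattern: the last expression evaluated in a function body is its return value, so ending corr() with monkey (or return(monkey)) is all that's missing. A tiny self-contained sketch of the same shape (not using your files):
collect_squares <- function(n) {
  out <- c()
  for (i in 1:n) {
    out <- append(out, i^2)
  }
  out            # last expression = the function's return value
}

res <- collect_squares(3)
res
# [1] 1 4 9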
I want to benchmark the time and profile the memory used by several functions (regression with random effects and other analyses) applied to different dataset sizes.
My computer has 16 GB of RAM and I want to see how R behaves with large datasets and where the limit is.
In order to do it I was using a loop and the package bench.
After each iteration I clean the memory with gc(reset=TRUE).
But when the dataset is very large, the garbage collector doesn't work properly; it frees only part of the memory.
At the end all the memory stays filled, and I need to restart my R session.
My full dataset is called allDT and I do something like this:
library(data.table)  # allDT is a data.table (the .N syntax below)
library(lme4)        # glmer()
library(bench)       # mark()

for (NN in (1:10) * 100000) {
  gc(reset = TRUE)
  myDT <- allDT[sample(.N, NN)]
  assign(paste0("time", NN), mark(
    model1 = glmer(Out ~ var1 + var2 + var3 + (1 | City/ID), data = myDT),
    model2 = glmer(Out ~ var1 + var2 + var3 + (1 | ID), data = myDT),
    iterations = 1, check = FALSE))
}
That way I can get the results for each size.
The method is not fair, because at the end the memory doesn't get properly cleaned.
I've thought of an alternative: restart the whole R program after every iteration (exit R and start it again; this is the only way I've found to get the memory truly cleaned), reload the data, and continue from the last step.
Is there any simple way to do that, or any alternative?
Maybe I need to save the results to disk every time, but it will be difficult to keep track of the last executed line, especially if R hangs.
I may need to create an external batch file and run a loop calling R at every iteration, though I would prefer to do everything from R without any external scripting/batch files.
One thing I do for benchmarks like this is to launch another instance of R and have that other R instance return the results to stdout (or, simpler, just save them to a file).
Example:
times <- c()
for (i in 1:length(param)) {
  system(sprintf("Rscript functions/mytest.r %s", param[i]))
  times[i] <- readRDS("/tmp/temp.rds")
}
In the mytest.r file, read in the parameters and save the results to a file:
# mytest.r
args <- commandArgs(trailingOnly = TRUE)
NN <- as.numeric(args[1])       # command-line arguments arrive as character
allDT <- readRDS("mydata.rds")
...
# save results
saveRDS(myresult, file = "/tmp/temp.rds")
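If the worry is R hanging part-way through, a small variation on the same idea is to write one result file per parameter value and have the driver skip values that already have a file (the results/time_<NN>.rds naming and the second command-line argument here are just hypothetical choices):
# driver: one result file per sample size, so a crash loses at most one step
for (NN in (1:10) * 100000L) {
  outfile <- sprintf("results/time_%d.rds", NN)
  if (file.exists(outfile)) next            # already done in an earlier run
  system(sprintf("Rscript functions/mytest.r %d %s", NN, outfile))
}

# inside mytest.r, write to the file name passed as the second argument:
# saveRDS(myresult, file = commandArgs(trailingOnly = TRUE)[2])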
I use R to run Ant Colony Optimization and usually repeat the same optimization several times to cross-validate my results. I want to save time by running the processes in parallel with the foreach and doParallel packages.
A reproducible example of my code would be very long so I'm hoping this is sufficient. I think I managed to get the code running like this:
result <- list()
short <- function(n) {
  for (n in 1:10) {
    result[[n]] <- ACO(data, ...)
  }
}
foreach(n = 1:50) %dopar% short(n)
Within the ACO() function I continuously create objects with intermediate results (e.g. the current pheromone levels) which I save using write.table(..., append=TRUE) to keep track of the iterations and their results. Now that I'm running the processes in parallel, the file I write contains results from all processes and I'm not able to tell which process the data belongs to. Therefore, I'd like to write different files for each process.
What's the best way, in general, to save intermediate results when using parallel processing?
You can use the log4r package to write the info you need to a log file. More info about the package here.
An example of the code which you have to put in your short function:
# Import the log4r package.
library('log4r')

# Create a new logger object with create.logger().
logger <- create.logger()

# Set the logger's file output.
logfile(logger) <- 'base.log'

# Set the current level of the logger.
level(logger) <- 'INFO'

# Log messages at or above the current level. At priority level INFO,
# a call to debug() won't print anything, so use info() here.
info(logger, 'Iteration and result info')
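To get a separate file per parallel process, which is what the question asks for, one option is to key the file name on the worker's process ID; a sketch, assuming you create the logger inside short() so that each worker builds its own:
library('log4r')

# One log file per worker process (the naming scheme is just an example)
logger <- create.logger()
logfile(logger) <- sprintf('aco_worker_%d.log', Sys.getpid())
level(logger) <- 'INFO'

info(logger, 'Iteration and result info')
The same Sys.getpid()-based naming also works if you prefer to stay with write.table(..., append = TRUE).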
This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans it and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10 MB.
It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.
e.g.
db.connection <- db.connection.object

build_model_list <- function(db.connection) {
  clean_and_build_models <- function(db.connection, other.parameters) {
    get_db_data <- function(db.connection, some.parameters) {# Retrieve db data}  ## Externally defined
    db.data <- get_db_data()
    build_models <- function(db.data, some.parameters)  ## Externally defined
    clean_data <- function(db.data, some.parameters) {# Cleans and filters data based on parameters}  ## Externally defined
    clean.data <- clean_data()
    lm_model <- function(clean.data) {# Builds lm model based on clean.data}  ## Externally defined
    lm.model <- lm_model()
    return(list(lm.model, other.parameters))
  }  ## Externally defined
  looped.model.object <- llply(some.parameters, clean_and_build_models)
  return(looped.model.object)
}

model.list <- build_model_list()
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that the 'model.list' object, which is only 10 MB in memory, inflates to many GBs when I save it locally as RDS or try to upload it to AWS S3.
I should note that although the function processes very large quantities of data (~5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (which are part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This however has some big disadvantages.
Firstly, it's trial and error and follows no clear logical order; with frequent crashes and a couple of hours needed to build the list object, it's a very long debugging cycle.
Secondly, it means my main function will be many hundreds of lines long, which will make future alterations and debugging much more tricky.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.
It took a bit of digging but I did actually find a solution in the end.
It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:
https://blogs.oracle.com/R/entry/is_the_size_of_your
It turns out that the lm.object$terms component includes an environment component that references the objects present in the global environment when the model was built. Under certain circumstances, when you call saveRDS(), R will try to draw those environment objects into the saved object.
As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically, as it was actually trying to compress ~100 GB of data.
To test whether this is what's causing the problem, execute the following code:
as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL))))
This will tell you if the $terms component is inflating.
The following code will remove the environmental references from the $terms component:
rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment"))
Be warned, though: it will also remove all the global environment objects it references.
For model objects you could also simply delete the reference to the environment.
For example, like this:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")
This unfortunately breaks the model, but you can add an environment manually after loading it again.
lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()
This helped me in my specific use case and looks a bit safer to me...
Neither of these two solutions worked for me.
Instead I have used:
downloaded_object <- storage_download(connection, "path")
read_RDS <- readRDS(downloaded_object)
The answer by mhwh mostly solved my problem, but with the additional step of creating an empty list and copying into it whatever is relevant from the model object. This might be due to additional (undocumented) environment references associated with the model class I used.
library(lfe)  # felm()

mm <- felm(formula = formula, data = data, keepX = TRUE, ...)

# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$ <- something else from mm you might need ...
mm_cp$terms <- terms(mm)
attr(mm_cp$terms, ".Environment") <- NULL

saveRDS(mm_cp, file = "path_to_save.RDS")
Then when we need to use it:
mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()
In my case the file went from 5.5 GB to 13 KB. Additionally, reading the file used to allocate >32 GB of memory, more than 6 times the file size. This also reduced execution time significantly (no need to recreate various environments?).
Environment references sound like an excellent contender for a new chapter in the R Inferno book.
I've noticed that R keeps the index from for loops stored in the global environment, e.g.:
for (ii in 1:5){ }
print(ii)
# [1] 5
Is it common for people to have any need for this index after running the loop?
I never use it, and am forced to remember to add rm(ii) after every loop I run (first, because I'm anal about keeping my namespace clean, and second, for memory, because I sometimes loop over lists of data.tables; in my code right now, I have 357 MB worth of dummy variables wasting space).
Is there an easy way to get around this annoyance?
Perfect would be a global option to set (a la options(keep_for_index = FALSE)); something like for(ii in 1:5, keep_index = FALSE) could be acceptable as well.
In order to do what you suggest, R would have to change the scoping rules for for loops. This will likely never happen, because I'm sure there is code out there in packages that relies on it. You may not use the index after the for loop, but given that loops can break() at any time, the final iteration value isn't always known ahead of time. And having this as a global option would again cause problems with existing code in working packages.
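As a small illustration of that point, after an early break() the index is the only record of where the loop stopped:
for (ii in 1:100) {
  if (ii^2 > 50) break
}
ii
# [1] 8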
As pointed out, it's far more common to use sapply or lapply loops in R. Something like
for (i in 1:4) {
  lm(data[, 1] ~ data[, i])
}
becomes
sapply(1:4, function(i) {
  lm(data[, 1] ~ data[, i])
})
You shouldn't be afraid of functions in R. After all, R is a functional language.
It's fine to use for loops when you want more control, but you will have to take care of removing the indexing variable with rm(), as you've pointed out. Unless you're using a different indexing variable in each loop, I'm surprised that they are piling up. I'm also surprised that, if they are data.tables, they are adding additional memory, since data.tables don't make deep copies by default as far as I know. The only memory "price" you would pay is a simple pointer.
I agree with the comments above. Even if you have to use a for loop (relying just on side effects, not on functions' return values), it would be a good idea to structure your code into several functions and store your data in lists.
However, there is a way to "hide" the index and all temporary variables inside the loop: by calling the for function in a separate environment:
do.call(`for`, alist(i, 1:3, {
  # ...
  print(i)
  # ...
}), envir = new.env())
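A quick check after running the above (assuming there was no i in your workspace beforehand):
exists("i")
# [1] FALSE (the index only ever existed in the temporary environment)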
But ... if you could put your code in a function, the solution is more elegant:
for_each <- function(x, FUN) {
  for (i in x) {
    FUN(i)
  }
}
for_each(1:3, print)
Note that with a "for_each"-like construct you don't even see the index variable.