I am running a rather lengthy job that I need to replicate 100 times, so I have turned to the foreach capability in R, which I then run on an 8-core cluster through a shell script. I am trying to write all of my results from each run to the same file. I have included a simplified version of my code.
cl<-makeCluster(core-1)
registerDoParallel(cl,cores=core)
SigEpsilonSq<-list()
SigLSq<-list()
RatioMat<-list()
foreach(p=1:100) %dopar%{
functions defining my variables{...}
for(i in 1:fMaxInd){
rhoSqjMatr[,i]<-1/(1+Bb[i])*(CbAdj+AbAdj*XjBarAdj+BbAdj[i]*XjSqBarAdj)/(dataZ*dataZ)
sigmaEpsSqV[i]<-mean(rhoSqjMatr[,i])
rhoSqjMatr[,i]<-rhoSqjMatr[,i]/sigmaEpsSqV[i]
biasCorrV[,i]<-sigmaEpsSqV[i]/L*gammaQl(rhoSqjMatr[,i])
Qcbar[,i]<-Qflbar-biasCorrV[,i]
sigmaExtSq[,i]<-sigmaSqExt(sigmaEpsSqV[i], rhoSqjMatr[,i])
ratioMatr[,i]<-sigmaExtSq[,i]/(sigmaL*sigmaL)#ratio (sigma_l^e)^2/(sigmaL)^2
}
sigmaEpsSqV<-as.matrix(sigmaEpsSqV)
SigEpsilonSq[[p]]<-sigmaEpsSqV
SigLSq[[p]]<-sigmaExtSq
RatioMat[[p]]<-ratioMatr
} #End of the dopar loop
stopCluster(cl)
write.csv(SigEpsilonSq,file="Sigma_Epsilon_Sq.csv")
write.csv(SigLSq,file="Sigma_L_Sq.csv")
write.csv(RatioMat,file="Ratio_Matrix.csv")
When the job completes, my .csv files are empty. I believe I'm not quite understanding how foreach saves results and how I can access them. I would like to avoid having to merge files manually. Also, do I need to write
stopCluster(cl)
at the end of my foreach loop or do I wait until the very end? Any help would be much appreciated.
This is not how foreach works; you should look at some examples. You need to capture the value that foreach() returns, and you can use .combine to control how the output from your parallelized jobs is aggregated.
Also, instead of this:
sigmaEpsSqV<-as.matrix(sigmaEpsSqV)
SigEpsilonSq[[p]]<-sigmaEpsSqV
SigLSq[[p]]<-sigmaExtSq
RatioMat[[p]]<-ratioMatr
you should return something like this from the loop body:
list(SigEpsilonSq = as.matrix(sigmaEpsSqV), SigLSq = sigmaExtSq, RatioMat = ratioMatr)
You can also use rbind, cbind, c,... to aggregate the results into one final output.
You can even write your own combine function, for example:
.combine=function(x,y)rbindlist(list(x,y))
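For instance, here is a minimal sketch of both styles (the loop body below is only a stand-in for the real computation, and rbindlist() comes from the data.table package):
library(doParallel)   # also loads foreach
library(data.table)
cl <- makeCluster(4)  # 4 workers, just as an example
registerDoParallel(cl)
# built-in combiner: stack one row per iteration with rbind
res1 <- foreach(p = 1:100, .combine = rbind) %dopar% {
  data.frame(p = p, value = sqrt(p))  # stand-in for the real computation
}
# custom combiner: the same idea, merging pairs of results with data.table::rbindlist
res2 <- foreach(p = 1:100,
                .combine = function(x, y) rbindlist(list(x, y)),
                .packages = "data.table") %dopar% {
  data.table(p = p, value = sqrt(p))
}
stopCluster(cl)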
The solution below should work. The output should be a list of lists. However, it might be painful to retrieve the results and save them in the correct format; if so, you should design your own .combine function.
library(doParallel)  # also loads foreach
cl <- makeCluster(core - 1)
registerDoParallel(cl)
# no need to pre-allocate SigEpsilonSq/SigLSq/RatioMat; foreach collects the results itself
results <- foreach(p = 1:100) %dopar% {  # by default each iteration's value becomes one list element
functions defining my variables{...}
for(i in 1:fMaxInd){
rhoSqjMatr[,i]<-1/(1+Bb[i])*(CbAdj+AbAdj*XjBarAdj+BbAdj[i]*XjSqBarAdj)/(dataZ*dataZ)
sigmaEpsSqV[i]<-mean(rhoSqjMatr[,i])
rhoSqjMatr[,i]<-rhoSqjMatr[,i]/sigmaEpsSqV[i]
biasCorrV[,i]<-sigmaEpsSqV[i]/L*gammaQl(rhoSqjMatr[,i])
Qcbar[,i]<-Qflbar-biasCorrV[,i]
sigmaExtSq[,i]<-sigmaSqExt(sigmaEpsSqV[i], rhoSqjMatr[,i])
ratioMatr[,i]<-sigmaExtSq[,i]/(sigmaL*sigmaL)#ratio (sigma_l^e)^2/(sigmaL)^2
}
list(SigEpsilonSq = as.matrix(sigmaEpsSqV), SigLSq = sigmaExtSq, RatioMat = ratioMatr)  # the value returned for iteration p
} #End of the dopar loop
stopCluster(cl)
#Then you extract and save results
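For instance, a minimal sketch of that extraction step, assuming each iteration returned the named list shown above (cbind is used here, but how you bind the pieces depends on the shape you want in the CSV files):
# results[[p]] is the named list returned by iteration p
SigEpsilonSq <- do.call(cbind, lapply(results, `[[`, "SigEpsilonSq"))
SigLSq <- do.call(cbind, lapply(results, `[[`, "SigLSq"))
RatioMat <- do.call(cbind, lapply(results, `[[`, "RatioMat"))
write.csv(SigEpsilonSq, file = "Sigma_Epsilon_Sq.csv")
write.csv(SigLSq, file = "Sigma_L_Sq.csv")
write.csv(RatioMat, file = "Ratio_Matrix.csv")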
I'm attempting to write an R script in a way that remains as automated as possible. To this end, I am trying to create a for loop to execute a function on multiple files. The outputs need to be saved as objects for the purposes of the program I am using and therefore each output from the for loop needs to have a distinct name. This is the code I have so far:
filenames <- as.list(Sys.glob("*.ab1"))
SeqOb <- list()
for (i in filenames)
{
SeqOb <- readsangerseq(i)
}
"readsangerseq" is the function I'm attempting to execute to create multiple SeqOb objects. What I've read from other discussions led me to create an empty list in which to store my output objects, but I can't seem to figure out how to make the for loop write them as distinct outputs.
If you would like to continue using the for loop and want distinct outputs instead of a list, you may consider using assign(paste()) to give each file a unique object name. Although, as a relative newcomer to R myself, I'm starting to learn there are more elegant ways than for loops as well, such as MrFlick's answer.
for (i in 1:length(filenames)) {
  # You may be able to substitute your function in the line below
  assign(paste("SomeNamingRule", i, sep = ""), readsangerseq(filenames[[i]]))
}
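Alternatively, here is a minimal sketch of the list-based approach that the empty SeqOb list in the question was aiming for (assuming readsangerseq() is already loaded and accepts a file path, as in the question):
filenames <- Sys.glob("*.ab1")
# one list element per file, named after the file it came from
SeqOb <- lapply(filenames, readsangerseq)
names(SeqOb) <- basename(filenames)
# an individual result is then available as, e.g., SeqOb[["sample01.ab1"]]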
I want to benchmark the time and profile the memory used by several functions (regression with random effects and other analyses) applied to different dataset sizes.
My computer has 16GB RAM and I want to see how R behaves with large datasets and what is the limit.
In order to do it I was using a loop and the package bench.
After each iteration I clean the memory with gc(reset=TRUE).
But when the dataset is very large the garbage collector doesn't work properly; it only frees part of the memory.
At the end all the memory stays filled, and I need to restart my R session.
My full dataset is called allDT and I do something like this:
for (NN in (1:10)*100000) {
gc(reset=TRUE)
myDT <- allDT[sample(.N,NN)]
assign(paste0("time",NN), mark(
model1 = glmer(Out~var1+var2+var3+(1|City/ID),data=myDT),
model2 = glmer(Out~var1+var2+var3+(1|ID),data=myDT),
iterations = 1, check=F))
}
That way I can get the results for each size.
The method is not fair because at the end the memory doesn't get properly cleaned.
An alternative I've thought of is to restart the whole R program after every iteration (exit R and start it again; this is the only way I've found to get the memory fully cleaned), reloading the data and continuing from the last step.
Is there any simple way to do that, or any alternative?
Maybe I need to save the results to disk every time, but it will be difficult to keep track of the last executed line, especially if R hangs.
I may need to create an external batch file and run a loop calling R at every iteration, though I'd prefer to do everything from R without any external scripting/batch files.
One thing I do for benchmarks like this is to launch another instance of R and have that other R instance return the results to stdout (or simpler, just save it as a file).
Example:
times <- list()
for (i in 1:length(param)) {
  # run one benchmark in a fresh R process, then read back its result
  system(sprintf("Rscript functions/mytest.r %s", param[i]))
  times[[i]] <- readRDS("/tmp/temp.rds")
}
In the mytest.r file, read in the parameters and save the results to a file.
args <- commandArgs(trailingOnly=TRUE)
NN <- as.integer(args[1])  # command-line arguments come in as character strings
allDT <- readRDS("mydata.rds")
...
# save results
saveRDS(myresult, file="/tmp/temp.rds")
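A small variation on the same idea (the file names below are hypothetical, not from the answer above) writes one results file per dataset size, so an interrupted benchmark can be resumed without redoing the sizes that already finished:
# driver loop: skip any size whose results file already exists
for (NN in (1:10) * 100000L) {
  outfile <- sprintf("results_%d.rds", NN)
  if (file.exists(outfile)) next
  system(sprintf("Rscript functions/mytest.r %d %s", NN, outfile))
}
# and inside mytest.r: saveRDS(myresult, file = args[2])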
I use R to run Ant Colony Optimization and usually repeat the same optimization several times to cross-validate my results. I want to save time by running the processes in parallel with the foreach and doParallel packages.
A reproducible example of my code would be very long so I'm hoping this is sufficient. I think I managed to get the code running like this:
result <- list()
short <- function(n){
for(n in 1:10){
result[[n]] <- ACO(data, ...)}}
foreach(n=1:50) %dopar% short(n)
Within the ACO() function I continuously create objects with intermediate results (e.g. the current pheromone levels) which I save using write.table(..., append=TRUE) to keep track of the iterations and their results. Now that I'm running the processes in parallel, the file I write contains results from all processes and I'm not able to tell which process the data belongs to. Therefore, I'd like to write different files for each process.
What's the best way, in general, to save intermediate results when using parallel processing?
You can use the log4r package to write the information you need to a log file; see the package documentation for more details.
An example of the code you would put in your short function:
# Import the log4r package.
library('log4r')
# Create a new logger object with create.logger().
logger <- create.logger()
# Set the logger's file output.
logfile(logger) <- 'base.log'
# Set the current level of the logger.
level(logger) <- 'INFO'
# Log a message. At priority level INFO, a call to debug() would be suppressed,
# so use info() to make sure the message actually gets written.
info(logger, 'Iteration and result info')
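To keep the parallel runs from writing into one shared file, one option (my own assumption, not part of the example above) is to build the log file name from the worker's process id, so each process gets its own log:
library('log4r')
# Sys.getpid() differs across worker processes, so each one writes to its own file
logger <- create.logger()
logfile(logger) <- sprintf('aco_worker_%d.log', Sys.getpid())
level(logger) <- 'INFO'
info(logger, 'Iteration and result info')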
I am trying to create a global table that is produced by asynchronously running parallel processes. They are completely independent, but they should append to the same global variable (this is reactive in R Shiny, so I either need a callback function that fires once all futures are done with their task, which would be very nice but I don't know how to do, or I need to constantly update the table as new results come in).
I tried the following approach, which just locks (probably because all processes are assigning to the same variable; when I change 'a' to 'b' it works, but then the result is useless):
library("listenv")
library("future")
plan(multiprocess)
futureVals <- listenv()
options(future.globals.onMissing = "ignore")
a<-0
b<-0
for(i in 1:5){
futureVals[[i]] <- futureAssign(x='a', value={
a <- a+1
print(a)
})
}
futureVals2 <- as.list(futureVals)
print(a)
How can I achieve this goal?
It is not possible for future (or other parallel, background R workers) to assign values to variables in the master R process. Any results need to be returned as values. This is a fundamental property of all parallel/asynchronous processing in R.(*)
Having said this, you might be interested in https://rstudio.github.io/promises/articles/shiny.html.
PS. (*) Your expectations of futureAssign() seem to be incorrect.
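To make the value-returning pattern concrete, here is a minimal sketch (the per-task data frame is made up for illustration): each future returns its piece of the table, and the master process combines the pieces once they are resolved.
library(future)
plan(multisession)  # background R workers
futs <- lapply(1:5, function(i) {
  future({
    # each task builds and *returns* its own piece of the table
    data.frame(task = i, result = i^2)
  })
})
# collect the values in the master process and append them into one table
results <- do.call(rbind, lapply(futs, value))
print(results)
In a Shiny app, the promises package linked above plays the role of value() without blocking the session.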
When running this code:
library(TeachingDemos)
etxtStart(dir=getwd(), file="Nofunciona.txt")
etxtComment('Just a test')
for(i in 1:10){
cat("###",i,":\n")
my.sample = sample(100)
print(summary(my.sample))
qqnorm(my.sample)
etxtPlot(width=7.5)
}
etxtStop()
I only get a file named "Nofunciona.txt" with a text line "Just a test" and the commands to include the graphs, but nothing about the results of cat() or print(summary()), although I can see the results on the console.
If I replace that loop with these two loops:
for(i in 1:10){
cat("###",i,":\n")
my.sample = sample(100)
print(summary(my.sample))
}
for(i in 1:10){
qqnorm(my.sample)
if(archivo) etxtPlot(width=7.5)
}
etxtStop()
Then I obtain a file with the text results of cat() and summary(), and also the commands to include the graphs at the end. I know that with the second for loop I obtain the same graph ten times; that is not relevant here.
It seems I cannot obtain graphical results and text results at the same time inside a for loop. Why doesn't the first version work?
Any idea?
Thanks.
This happens because it is assumed that you do not want the etxtPlot command itself to show up in the transcript or command history, so when that function is called it sets a variable that tells the workhorse internal function (the one called by the task manager) to temporarily skip outputting commands and results. Outside of a loop this works correctly, because the suppression of output only lasts for the call to etxtPlot and everything else is properly recorded. The problem when you do this inside a loop is that everything in the loop is processed as a single step (see ?addTaskCallback for the details on how things are handled), so suppressing the command and output from etxtPlot ends up also suppressing the commands and output from everything else in the loop.
A possible workaround is to run the command:
trace(etxtPlot, edit=TRUE)
Then change the TRUE to FALSE on the second-to-last line of the code. Now you will see all the commands and output (including the calls to etxtPlot), but the plots will all come before the output, because the commands to include the plots are inserted at each iteration while the other output is only inserted after the loop has completed.
You might consider using the knitr package as an alternative, specifically the stitch or spin functions if you don't want to create a full template file but just want some code processed. They don't produce a real-time transcript, but they deal better with automatic plot insertion.
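For example, a minimal sketch with a hypothetical script name: save the loop from the question in a plain R script and run spin() on it, which interleaves the printed output and the plots automatically.
library(knitr)
# qq_loop.R contains the cat()/summary()/qqnorm() loop from the question
spin("qq_loop.R")  # writes qq_loop.Rmd and knits it to qq_loop.md with the plots inline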