Printing from mclapply in RStudio

I am using mclapply from within RStudio and would like to have an output to the console from each process but this seems to be suppressed somehow (as mentioned for example here: Is mclapply guaranteed to return its results in order?).
How could I get RStudio to print something like
x <- mclapply(1:20, function(i) cat(i, "\n"))
to the console?
I've tried print(), cat(), and write(), but none of them seem to work. I also tried setting mc.silent = FALSE explicitly, to no effect.

Parallel processing with GUIs is problematic. I write a lot of parallel code and it's constantly crashing my colleague's computer because he insists on using RStudio instead of console R.
From what I read, RStudio "does not propagate the output of forked processes to the RStudio console. If you are doing this, it is best to start R via a shell."
This makes sense as a workaround for the RStudio people, because parallel processing typically breaks GUIs when people try to output to the GUI from a bunch of different processes. It works in the console (albeit often not in order), but parallel processing gurus will pinch their noses when they hear about any I/O from a forked thread.
If you must have output from forked threads, save it in a string and return it, then collect and print from the main process. Or just use a console for your parallel runs. What I tell my colleague is to do all his debugging and development in RStudio using lapply(), then switch to a console for the real run.
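For example, a minimal sketch of that pattern (the list fields value and log are just made-up names for illustration):
library(parallel)
res <- mclapply(1:20, function(i) {
  msg <- sprintf("worker %d done", i)  # build the message instead of printing it
  list(value = i^2, log = msg)
}, mc.cores = 4)
cat(sapply(res, `[[`, "log"), sep = "\n")  # all printing happens in the parent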

Here's a workaround which uses shell echo to print to the R console in RStudio:
#' Prints a message using shell echo; useful for printing messages
#' from inside mclapply when running in RStudio
message_parallel <- function(...) {
  system(sprintf('echo "\n%s\n"', paste0(..., collapse = "")))
}
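For example (assuming a Unix-alike, where mclapply actually forks):
x <- mclapply(1:4, function(i) message_parallel("finished task ", i))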

Just expanding a little on the solution used by the asker, i.e. writing to a file to check progress:
write.file <- '/temp_output/R_progress'
n <- 1000000  # total number of iterations
time1 <- proc.time()[3]
outstuff <- unlist(mclapply(1:n, function(i) {
  if (i %% 1000 == 0) {
    file.create(write.file)
    fileConn <- file(write.file)
    writeLines(paste0(i, '/', n, ' ', (i / n * 100)), fileConn)
    close(fileConn)
  }
  # do your stuff here
}, mc.cores = 6))
print(proc.time()[3] - time1)
And then you can monitor from a console with
tail -c +0 -f '/temp_output/R_progress'

Restart R session without interrupting the for loop

In my for loop, I need to free some RAM. So I delete some objects with rm(), then run gc(), but the RAM usage stays the same.
So I use .rs.restartR() instead of gc() and it works: a sufficient part of my RAM is freed after the restart of the R session.
My problem is that the for loop is interrupted by the R restart. Do you have an idea how to automatically continue the for loop after the .rs.restartR() command?
I just stumbled across this post as I'm having a similar problem with rm() not clearing memory as expected. Like you, if I kill the script, remove everything using rm(list=ls(all.names=TRUE)) and restart, the script takes longer than it did initially. However, restarting the session using .rs.restartR() then sourcing again works as expected. As you say, there is no way to 'refresh' the session while inside a loop.
My solution was to write a simple bash script which calls my .r file.
Say you have a loop in R that runs from 1 to 3 and you want to restart the session after each iteration. My bash script 'runR.sh' could read as follows:
#!/bin/bash
for i in {1..3}
do
  echo "Rscript myRcode.r $i"  # check the call to the R script is as expected
  Rscript myRcode.r $i
done
Then at the top of 'myRcode.r':
args <- commandArgs()
print(args) #list the command line arguments.
myvar <- as.numeric(args[6])
and remove your for (myvar in...){}, keeping just the contents of the loop.
You'll see from print(args) that your input from your shell script is the 6th element of the array, hence args[6] in the following line when assigning your variable. If you're passing in a string, e.g. a file name, then you don't need as.numeric of course.
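As an aside, commandArgs(trailingOnly = TRUE) returns only the arguments you passed yourself, so the index no longer depends on how R was invoked:
args <- commandArgs(trailingOnly = TRUE)
myvar <- as.numeric(args[1])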
Running ./runR.sh will then call your script and hopefully solve your memory problem. The only minor issue is that you have to reload your packages each time, unlike when using .rs.restartR(), and may have to repeat other bits that ordinarily would only be run once.
It works in my case, I would be interested to hear from other more seasoned R/bash users whether there are any issues with this solution...
By saving the iteration as an external file, and writing an rscript which calls itself, the session can be restarted within a for loop from within rstudio. This example requires the following steps.
# Save the iteration as a separate .RData file in the working directory:
iter <- 1
save(iter, file = "iter.RData")
Create a script which calls itself for a certain number of iterations. Save the following script as "test_script.R"
### load the iteration
library(rstudioapi)
load("iter.RData")
### insert your function here
time_now <- Sys.time()
### save the function's output to a file
save(time_now, file = paste0("time_", iter, ".Rdata"))
### update the iteration
iter <- iter + 1
save(iter, file = "iter.RData")
### restart the session, calling the script again
if (iter < 5) {
  restartSession(command = 'source("test_script.R")')
}
Do you have an idea to automatically go on the for loop after the .rs.restartR() command ?
It is not possible.
Okay, you could configure your R system to do something like this, but it sounds like a bad idea. I'm not really sure if you want to restart the for loop from the beginning or pick it up where it left off. (I'm also very confused that you seem to have been able to enter commands in the R console while a for loop was executing. I think there's more here that you are not telling us.)
You can use your Rprofile.site file to automatically run commands when R starts. You could set it up to automatically run your for loop code whenever R starts. But this seems like a bad idea. I think you should find a different sort of fix for your problem.
Some of the things you could do to help the situation: have your for loop write output for each iteration to disk and also write some sort of log to disk so you know where you left off. Maybe write a function around your for loop that takes an argument of where to start, so that you can "jump in" at any point.
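For example, a minimal sketch of that idea (the checkpoint file name progress.rds is made up):
run_loop <- function(n, checkpoint = "progress.rds") {
  start <- if (file.exists(checkpoint)) readRDS(checkpoint) + 1 else 1
  if (start > n) return(invisible(NULL))  # everything already done
  for (i in start:n) {
    # ... do one iteration's work and write its output to disk here ...
    saveRDS(i, checkpoint)  # record the last completed iteration
  }
}
After a restart, calling run_loop(n) again resumes at the first unfinished iteration.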
With this approach, rather than "restarting R and automatically picking up the loop", a better bet would be to use Rscript (or similar) and use R or the command line to sequentially run each iteration (or batch of iterations) in its own R session.
The best fix would be to solve the memory issue without restarting. There are several questions on SO about memory management - try the answers out and if they don't work, make a reproducible example and ask a new question.
You can make your script recursive by having it source itself after restarting the session.
Make sure the script takes the loop's current status into account: save the status to a .rds file before restarting the session, then read that file back in from inside the loop after the restart. This lets the loop start from where it was before the R session restarted.
I just found out about this command 'restartSession'. I'm using it because I was also running into memory consumption issues as the garbage collector will not give back the RAM to the OS (Linux).
library(rstudioapi)
restartSession(command = "print('x')")
An approach independent of RStudio:
If you want to run this in RStudio, use the terminal rather than the R console; otherwise use rstudioapi::restartSession() as in the other answers, which is not recommended (it crashes).
Create the iterator and load the script (in a system terminal this would be):
R -e 'saveRDS(1,"i.rds"); source("Script.R")'
Script.R file:
# read the iterator
i <- readRDS("i.rds")
print(i)
# read the process id of the previous loop so it can be killed
tryCatch(pid <- readRDS(file = "pid.rds"), error = function(e) NA)
if (exists("pid")) {
  library(tools)
  tools::pskill(pid, SIGKILL)
}
# update objects and the iterator
i <- i + 1
# ... process ...
pid <- Sys.getpid()
# save the iterator
saveRDS(i, file = "i.rds")
# save the process ID so it can be closed in the next loop
saveRDS(pid, file = "pid.rds")
### restart the session, calling the script again
if (i <= 20) {
  print(paste("Processing of", i - 1, "ended, restarting"))
  assign('.Last', function() system('Rscript Script.R'))
  q(save = 'no')
}

real-time printing to console with R in jupyter

Using an R GUI or just R from a command line, this code results in integers being printed 0.2 seconds apart.
In contrast when I use R in a jupyter notebook, all of the printing happens only after the loop is complete.
for (x in 1:10) {
  print(x)
  Sys.sleep(0.2)
}
I tried to force real-time printing inside of Jupyter with
for (x in 1:10) {
  print(x)
  flush.console()
  Sys.sleep(0.2)
}
...to no effect. The results were the same -- printing from within a for loop in jupyter always seems to be delayed until after the loop.
Is there a way to ensure the notebook outputs the results of print statements in a real time way?
Currently, the only way to "trigger" processing of printed output is either using message("text") (instead of print("text") or cat("text")) or not writing it inside the loop but in a statement of its own.
The underlying problem is in https://github.com/IRkernel/IRkernel/issues/3 and a proposed fix is in https://github.com/hadley/evaluate/pull/62: it needs a change in evaluate to allow flush.console() and friends to work. The gist of the problem: we use evaluate to execute the code, and evaluate processes the output one statement at a time, handling the output only after the statement completes. Unfortunately, in this case a for loop is just one statement (as is everything in {...} blocks), so printed output appears in the client only after the for loop is done.
A workaround is using the IRdisplay package and the display_...() functions instead of print()/cat() (or plots...). But this needs full control over the printed output: either everything in a statement uses print (and gets delayed until the complete statement has finished) or nothing in it should print (or plot). If a called function prints something, the output would end up in the wrong order ({print("a"); display_text("b"); print("c")} would end up as b a c). Using capture.output() might get you around this limitation, if you really have to... If you use plots, there are currently no workarounds apart from writing the plot to disk and sending it via display_png(..) and friends.
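For example, a sketch of the IRdisplay workaround (assuming the IRdisplay package is installed in the kernel's library):
library(IRdisplay)
for (x in 1:10) {
  display_text(paste0(x, "\n"))  # appears immediately, unlike print()
  Sys.sleep(0.2)
}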
This is not an issue any more; the following code works in JupyterLab now:
for (x in 1:10) {
  print(x)
  flush.console()
  Sys.sleep(0.2)
}
The R version of the answer to the Python question Flush output in for loop in Jupyter notebook works for me:
cat(paste0('Your text', '\r'))
Apparently \r will trigger a flush.
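For example, a crude in-place progress indicator built on that behaviour:
for (i in 1:10) {
  cat(paste0('step ', i, '/10\r'))
  Sys.sleep(0.2)
}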

RStudio - is it possible to run code in the background

Question regarding RStudio. Suppose I am running some code in the console:
> code1()
Assume that code1() prints nothing to the console but takes an hour to complete. I want to work on something else while I wait for code1(). Is that possible? Is there a function like runInBackground which I can use as follows:
> runInBackground(code1())
> code2()
The alternatives are running two RStudio instances or writing a batch file that uses Rscript to run code1(), but I wanted to know if there is something easier I can do without leaving the RStudio console. I tried to browse through R's help documentation but didn't come up with anything (or maybe I didn't use the proper keywords).
The future package (I'm the author) provides this:
library("future")
plan(multisession)
future(code1())
code2()
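If you need the result of code1() later, assign the future and collect it with value(), which blocks until the result is available:
f <- future(code1())
code2()
v <- value(f)  # waits here only if code1() hasn't finished yet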
FYI, if you use
plan(cluster, workers = c("n1", "n3", "remote.server.org"))
then the future expression is resolved on one of those machines. Using
plan(future.BatchJobs::batchjobs_slurm)
will cause it to be resolved via a Slurm job scheduler queue.
This question is closely related to Run asynchronous function in R
You can always do this, which is not ideal but works for most purposes:
shell(cmd = 'Rscript.exe some_script.R', wait=FALSE)
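Note that shell() exists only on Windows; on macOS/Linux the analogous call would be:
system('Rscript some_script.R', wait = FALSE)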
RStudio as of version 1.2 provides this feature. To run a script in the background, select "Start Job" in the "Jobs" panel. You also have the option of copying the background job's result into the working environment.
The mcparallel() function in the parallel package will do the trick, if you are on Linux, that is ...
library(parallel)
Job1 <- mcparallel(code1())    # starts code1() in a forked child process
# ... run other code here ...
JobResult1 <- mccollect(Job1)  # blocks until the job has finished

Restart R within RStudio

I'm trying to call a simple Python script from within R using system2(). I've read some vague information saying that if 'too much' memory is used, it won't work.
If I load a large dataset and use some information in it as arguments to pass to system2(), it will only work if I manually click "Restart R" in RStudio.
What I want:
df <- read.csv('some_large_file.csv')
###extracting some info called 'args_vec'
for (arg in args_vec) {
  system2('python', arg)
}
This won't work as is. The for loop is simply passed over.
What I need:
df <- read.csv('some_large_file.csv')
###extracting some info called 'args_vec'
###something that 'restarts' R
for (arg in args_vec) {
  system2('python', arg)
}
This answer doesn't quite get what I want. Namely, it doesn't work for me within RStudio, and it calls "system" (which presents the same problem as "system2" in this case). In fact, when I put the answer referenced above in my Rprofile.site file, it just immediately closed RStudio.
I tried the suggestion as a normal function (rather than using "makeActiveBinding"), and it didn't quite work either.
## restart R in an R session -- doesn't work
makeActiveBinding("refresh", function() { system("R --save"); q("no") }, .GlobalEnv)
## nor did this:
refresh <- function() { system("R --save"); q("no") }
I tried a number of variations of these two options above, but this is getting long for what feels like a simple question. There's a lot I don't yet understand about the startup process and "makeActiveBinding" is a bit mysterious. Can anyone point me in the right direction?
In RStudio, you can restart the R session with:
Command/Ctrl + Shift + F10
You can also use:
.rs.restartR()
RStudio has this undocumented .rs.restartR(), which is supposed to do just that: restart R.
However, it does not unload the packages that were loaded, nor does it clean the environment, so I have some doubts about whether it restarts R at all.
If you use RStudio, use the menu item Session > Restart R or the associated keyboard shortcut Ctrl+Shift+F10 (Windows and Linux) or Command+Shift+F10 (Mac OS). Additional keyboard shortcuts make it easy to restart development where you left off, i.e. to say “re-run all the code up to HERE”:
In an R script, use Ctrl+Alt+B (Windows and Linux) or Command+Option+B (Mac OS)
In R markdown, use Ctrl+Alt+P (Windows and Linux) or Command+Option+P (Mac OS)
If you run R from the shell, use Ctrl+D or q() to quit, then restart R.
Have you tried embedding the function call within the apply function, rather than a for loop?
I've had some pieces of code that ran the system out of memory in a for loop run perfectly with apply. It might help?
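For example, a sketch applied to a loop like the asker's (args_vec as in the question):
# instead of: for (arg in args_vec) system2('python', arg)
invisible(lapply(args_vec, function(arg) system2('python', arg)))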
For those who aren't limited to a command and want something that actually resets the system (no prior state, no loaded packages, no variables, etc.), you can select Terminate R from the Session menu.
It is a bit awkward (it asks you if you are sure). If anyone knows something like clear all or really clear classes in MATLAB, let me know!

Hook into Rgui console?

Is it possible to change the behaviour of the R console such that, for instance,
before each command execution a fortune() is printed, or
similar to browser(), the prompt is altered, and some new commands (c,n,Q) are introduced?
I am looking for an alternative to readline() that keeps the history function (key up) intact.
I am using R on winxp with Rgui, but a portable solution would be great.
Yes, though it is independent of the GUI, as it just uses the callback mechanism, as in, for example, the question R: Display a time clock in the R command line.
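For example, a minimal sketch using base R's task-callback mechanism (note that the callback runs after each top-level command completes, not before it; assumes the fortunes package is installed):
library(fortunes)
addTaskCallback(function(expr, value, ok, visible) {
  print(fortune())  # show a fortune after every completed command
  TRUE              # returning TRUE keeps the callback registered
}, name = "fortuneHook")
# removeTaskCallback("fortuneHook") turns it off again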
