Rmpi send/receive hanging?

I am writing the code for a population MCMC. I will try to provide as much information as I think could help, so please bear with me.
I am using tempered distributions and I want to perform exchange moves, i.e. moves that propose to swap the values of two chains.
What I have done (exchange happening in the master)
I had initially written this by:
letting each chain mutate for a specified number of iterations n;
at every n-th iteration, sending the slaves' results to the master and attempting an exchange of parameters between chains there;
then sending the updated values back to the slaves and repeating the process.
What I want to achieve (exchange directly between slaves)
This is working fine, but I wanted to clean up my code and remove unnecessary communication between master and slaves. That is, let the slaves communicate directly with each other.
So, assuming I am spawning 10 slaves:
at iteration n, I want slave1-slave2, slave3-slave4, ..., slave9-slave10 to communicate with each other;
at iteration 2*n, I want slave2-slave3, slave4-slave5, ..., slave8-slave9 to communicate with each other;
and so on, so that samples travel through the temperature ladder.
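As a sanity check, the pairing rule can be written compactly. A minimal sketch (the partner_of helper and its phase argument are mine, not part of the code below; ranks are the 1-based slave ranks):

partner_of <- function(rank, phase, n_chains) {
    # phase alternates 1, 2, 1, 2, ... at successive exchange points (like oddFlag below)
    p <- if (rank %% 2 == phase %% 2) rank + 1 else rank - 1
    if (p >= 1 && p <= n_chains) p else NA  # boundary chains sit out this round
}
# sapply(1:10, partner_of, phase = 1, n_chains = 10) gives 2 1 4 3 6 5 8 7 10 9
# sapply(1:10, partner_of, phase = 2, n_chains = 10) gives NA 3 2 5 4 7 6 9 8 NA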
The problem
And this is where I am facing a problem.
I think I am managing to send a value from one slave to another (my print statement "Successfully sent" gets printed in the log files of the slaves that are sending), but it doesn't seem to be received (my "Successfully received" statement doesn't get printed in the log of the partner slaves).
And the program just hangs. I think I may have caused a deadlock, but I am not sure what I have done wrong.
Could you please advise? I have used this Parallel Tempering R code as a guide:
http://www.lindonslog.com/mathematics/parallel-tempering-r-rmpi/
Please see my code below.
Many thanks!
Sofia
ind <- mpi.comm.rank()
oddFlag <- 0  ### flags which code path the odd/even-numbered slaves take
for (i in 1:TotalIter) {
    ##### normal MCMC move (single-chain mutation) - logL.current
    if (i %% exchangeInterval == 0) {  ### every nth (currently 5th) iteration, attempt an exchange
        message("\n\nAttempt an exchange move")
        oddFlag <- oddFlag + 1
        exchange <- 0
        logL.partner <- 0
        if (ind %% 2 == oddFlag %% 2) {  ### when oddFlag is even, this branch concerns even-numbered slaves; when odd, odd-numbered slaves
            ind.partner <- ind + 1
            if (0 < ind.partner && ind.partner < (noChains + 1)) {
                message("This is the slave: ", ind, " and its partner is: ", ind.partner)
                message("The tag for receiving logL.partner is: ", ind.partner)
                logL.partner <- mpi.recv.Robj(source=ind.partner, tag=ind.partner)  #### receive the logL of the partner
                message("Successfully received")
                message("This is the logL.partner: ", logL.partner)
                exchanges.attempted <- exchanges.attempted + 1
                if (runif(1) < min(1, exp((logL.partner - estimatorSelf)*(temper[ind] - temper[ind.partner])))) {  ############# exp((chain2 - chain1)*(T1 - T2))
                    message("I exchanged the values")
                    exchange <- 1
                    print(exchange)
                    exchanges.accepted <- exchanges.accepted + 1
                }
                mpi.send.Robj(obj=exchange, dest=ind.partner, tag=15*ind)
            }
            if (exchange == 1) {
                ### exchange parameters with mpi.send.Robj/mpi.recv.Robj functions
            }
        } else {  ### when oddFlag is even, this branch concerns odd-numbered slaves; when odd, even-numbered slaves
            ind.partner <- ind - 1
            if (0 < ind.partner && ind.partner < (noChains + 1)) {
                message("This is the slave: ", ind, " and its partner is: ", ind.partner)
                message("The tag for sending logL.current is: ", ind)
                mpi.send.Robj(obj=logL.current, dest=ind.partner, tag=ind)  ### send logL to the partner
                message("Successfully sent")
                exchange <- mpi.recv.Robj(source=ind.partner, tag=15*ind.partner)
                message("I received the exchange message")
            }
            if (exchange == 1) {
                ### exchange parameters send/receive functions
            }
        }
    }
}

This seems very strange. To my knowledge the send and receive commands are blocking, so for the send to complete without the receive completing strikes me as a bit odd. Have you tried the code in the guide? If you like, I can send you the entire R code just to see if it works on your system. If the code in the guide works, then you can safely say that it isn't a problem with your MPI setup.
One thing you might try: for the receive command, instead of using ind.partner as the source you could use mpi.any.source() (I think that is the correct one), which will accept a message from any source. If this resolves your deadlock, it could be a problem with the tags (though by eye nothing looks wrong to me).
Another thing you might try is removing the "source=" and "tag=" arguments on the receive commands. I notice that in my code I don't have any there, whilst I do for the send commands. Perhaps that was causing me problems too, but I can't quite remember. Let me know how it goes, I hope it all works out.
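For example, a minimal sketch of a wildcard receive for debugging (this assumes Rmpi's mpi.any.source() and mpi.any.tag() helpers, plus mpi.get.sourcetag(), which reports the sender and tag of the last received message):

logL.partner <- mpi.recv.Robj(source = mpi.any.source(), tag = mpi.any.tag())
info <- mpi.get.sourcetag()  # c(source, tag) of the message just received
message("Received from rank ", info[1], " with tag ", info[2])

If the wildcard version unblocks, the printed source/tag pair should show which send the original, fully-specified receive was actually waiting for.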

Related

Sabre Scribe Scripting Specifically Looping

Does anybody have any tips for looping and continuing? For example, I placed about 2500 PNRs on a queue, and I need to add a remark to each of them. Is it possible for a script to add the remark and then move to the next PNR?
Loops are supported in Scribe but have to be built manually, by creating your own iteration variables and breaking the loop manually when you know the work is complete.
What you are describing is definitely possible, but working in queues can be difficult, as there are many possible responses when you try to end the PNRs. You'll have to capture the response to detect whether you need to do something else to get out of the error condition (e.g. if a PNR warning indicates you have to double-end the record).
If possible, it's likely simpler to work off the queue by collecting all PNR locators, then looping through that list, adding your remarks, and ending the PNRs. You'll still have to capture the response to determine whether the PNR actually ended properly, but you won't have to deal with the buggy queue behavior. A basic Scribe loop sample is below. Note that I haven't been a Scribe developer for a while and I wrote this in Notepad, so there might be some errors in here, but hopefully it's a good starting point.
DEFINE [ROW=N:8]           ;iteration variable/counter
DEFINE [LOCATOR_FILE=*:60] ;file path
DEFINE [TEMP_LOCATOR=*:6]  ;PNR locator variable, read from the temp file
DEFINE [BREAK=*:1]         ;loop-breaking variable
OPEN F=[LOCATOR_FILE] L=0  ;open the file of locators
[BREAK] = ""
[ROW] = 0
REPEAT
  [ROW] = [ROW] + 1
  [TEMP_LOCATOR] = ""      ;reset the temp locator variable; if the read fails, this will break our loop
  READ F=[LOCATOR_FILE] R=[ROW] C=1 [TEMP_LOCATOR]
  IF $[TEMP_LOCATOR] = 6 THEN ;test length of locator; if this is 6 chars, you have a good one, pull it up and add your remark
    »*[TEMP_LOCATOR]{ENTER}«  ;retrieve the PNR
    »"5YOUR REMARK HERE"{ENTER}«
    »ER{ENTER}«
    ;trap errors
    READ F="EMUFIND:" R=0 C=0 [TEMP_LOCATOR] ;read for the locator being present on this screen, which should indicate that the ER was successful - you'll have to trap other errors here though
    IF [#SYSTEM_ERROR] = 0 THEN ;this locator was found, ER appears successful
      »I{ENTER}«              ;ignore this PNR and move to the next one
    ELSE
      [BREAK] = "Y"           ;error found after ER, break the loop. Maybe show a popup box or something, up to you
    ENDIF
  ELSE                        ;no locator found in file, break the loop
    [BREAK] = "Y"
  ENDIF
UNTIL [BREAK] = "Y"
CLOSE [LOCATOR_FILE]

Manual API rate limiting

I am trying to write a manual rate-limiting function for the rgithub package. So far this is what I have:
library(rgithub)
library(magrittr)  # for the %>% pipe used below
library(plyr)      # for try_default()

pull <- function(i) {
  commits <- get.pull.request.commits(owner = owner, repo = repo, id = i,
                                      ctx = get.github.context(), per_page = 100)
  links <- digest_header_links(commits)
  number_of_pages <- links[2, ]$page
  if (number_of_pages != 0)
    try_default(for (n in 1:number_of_pages) {
      if (as.integer(commits$headers$`x-ratelimit-remaining`) < 5)
        Sys.sleep(as.integer(commits$headers$`x-ratelimit-reset`) -
                  (as.POSIXct(Sys.time()) %>% as.integer()))
      else
        get.pull.request.commits(owner = owner, repo = repo, id = i,
                                 ctx = get.github.context(), per_page = 100, page = n)
    }, default = NULL)
  else
    return(commits)
}

list <- c(500, 501, 502)
pull_lists <- lapply(list, pull)
list <- c(500, 501, 502)
pull_lists <- lapply(list, pull)
The intention is that if the x-ratelimit-remaining variable goes below a certain threshold, the script should wait until the time specified in x-ratelimit-reset has passed, and then continue.
The function runs fine, but I have some doubts about whether it actually does the rate limiting or whether it somehow skips that step. Hence I ask: a) how can I find out whether it actually does rate limiting, and b) if not, how can I rewrite it so that it does? Would a while condition/loop perhaps be better?
You can test whether it does the rate limiting by changing 5 to a large enough number and adding a display of the timing of Sys.sleep using:
print(system.time(Sys.sleep(...)))
That said, the function seems OK to me; unfortunately I cannot test it easily, as rgithub is not available for my version of R (3.1.3).
Not a canonical answer, but a working example.
You should add some logging to your script, even something as simple as write.csv(append=TRUE).
I've implemented an automatic anti-DDoS process that prevents your IP from being banned by the exchange market. You can find it in jangorecki/Rbitcoin/R/utils.R.
Rbitcoin.last_api_call is an environment object stored in the package namespace, a kind of session-level package cache.
This can help you set up something similar in your package.
You should also consider an optional parallel-capable version, e.g. linking to a database with concurrent reads. My function can easily be modified to queue calls and recheck the timing every X seconds.
Edit
I forgot to add that the mentioned function supports multiple source systems. That allows you, for example, to extend your rgithub to Bitbucket, etc., and still effectively manage API rate limiting.
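Since the question asks whether a while loop would be better, here is a minimal, untested sketch of a guard that could be called before each API request (the throttle name and its arguments are hypothetical; the header names are the ones from the question):

throttle <- function(headers, threshold = 5) {
    remaining <- as.integer(headers$`x-ratelimit-remaining`)
    if (!is.na(remaining) && remaining < threshold) {
        reset_at <- as.integer(headers$`x-ratelimit-reset`)  # epoch seconds (UTC)
        # sleep in short intervals until the reset time has actually passed
        while (as.integer(Sys.time()) < reset_at) {
            Sys.sleep(min(reset_at - as.integer(Sys.time()) + 1, 10))
        }
    }
}

The inner while loop guards against waking up before the quota has refreshed; adding a message() or the print(system.time(...)) trick from the first answer inside it makes it obvious whether the waiting branch is ever taken.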

Out-of-memory error in parfor: kill the slave, not the master

When an out-of-memory error is raised in a parfor, is there any way to kill only one MATLAB slave to free some memory, instead of having the entire script terminate?
Here is what happens by default when an out-of-memory error occurs in a parfor: the script terminates.
I wish there were a way to just kill one slave (i.e. remove a worker from the parpool) or stop using it, to release as much memory as possible from it.
If you get an out-of-memory error in the master process, there is no chance to fix this. For out-of-memory on a slave, this should do it:
The simple idea of the code: restart the parfor again and again with the missing data until you get all results. If one iteration fails, a flag (file) is written that makes all iterations throw an error as soon as the first error has occurred. This way we get "out of the loop" without wasting time producing further out-of-memory errors.
%Your intended iterator
iterator = 1:10;
%flags which indicate which iterations succeeded
succeeded = false(size(iterator));
%result array
result = nan(size(iterator));
FLAG = 'ANY_WORKER_CRASHED';
while ~all(succeeded)
    fprintf('Another try\n')
    %determine which iterations still need to be done
    todo = iterator(~succeeded);
    %initialize array for the remaining results
    partresult = nan(size(todo));
    %initialize flags which indicate which iterations succeeded (we cannot
    %throw errors inside the workers, that would throw away results)
    partsucceeded = false(size(todo));
    %the flag file indicates that some worker crashed. Have to use a
    %file-based solution; I don't know a better one.
    delete(FLAG);
    try
        parfor falseindex = 1:sum(~succeeded)
            realindex = todo(falseindex);
            try
                % The flag is used to let all other workers jump out of the
                % loop as soon as one calculation has crashed.
                if exist(FLAG, 'file')
                    error('some other worker crashed');
                end
                % insert your code here
                %dummy code which randomly throws an exception
                if rand < .5
                    error('hit out of memory')
                end
                partresult(falseindex) = realindex*2;
                % End of user code
                partsucceeded(falseindex) = true;
                fprintf('trying to run %d and succeeded\n', realindex)
            catch ME
                % catch errors within workers to preserve work
                partresult(falseindex) = nan;
                partsucceeded(falseindex) = false;
                fprintf('trying to run %d but it failed\n', realindex)
                fclose(fopen(FLAG, 'w'));
            end
        end
    catch
        %reduce pool size by 1
        newsize = matlabpool('size') - 1;
        matlabpool close
        matlabpool(newsize)
    end
    %put the results of the current round into the full result
    result(~succeeded) = partresult;
    succeeded(~succeeded) = partsucceeded;
end
After quite a bit of research, and a lot of trial and error, I think I may have a decent, compact answer. What you're going to do is:
Declare some max memory value. You can set it dynamically using the MATLAB function memory, but I like to set it directly.
Call memory inside your parfor loop, which returns the memory information for that particular worker.
If the memory used by the worker exceeds the threshold, cancel the task that worker was working on. Now, here it gets a bit tricky. Depending on the way you're using parfor, you'll either need to delete or cancel either the task or the worker. I've verified that the code below works when there is one task per worker, on a remote cluster.
Insert the following code at the beginning of your parfor contents. Tweak as necessary.
memLimit = 280000000; %// This doesn't have to be in the parfor. Everything else does.
memData = memory;
if memData.MemUsedMATLAB > memLimit
    task = getCurrentTask();
    cancel(task);
end
Enjoy! (Fun question, by the way.)
One other option to consider: since R2013b, you can open a parallel pool with 'SpmdEnabled' set to false. This allows MATLAB worker processes to die without the whole pool being shut down; see the doc here: http://www.mathworks.co.uk/help/distcomp/parpool.html. Of course, you still need to arrange somehow to shut down the workers.

Implementation of simple polling of results file

For one of my dissertation's data collection modules, I have implemented a simple polling mechanism. This is needed because I make each data collection request (one of many) as an SQL query, submitted via a Web form, which is simulated by RCurl code. The server processes each request and generates a text file with the results at a specific URL (RESULTS_URL in the code below). Regardless of the request, the URL and file name are the same (I cannot change that). Since the processing time differs between data requests, and some requests may take a significant amount of time, my R code needs to "know" when the results are ready (i.e. when the file has been re-generated), so that it can retrieve them. The following is my solution for this problem.
POLL_TIME <- 5 # polling timeout in seconds
In function srdaRequestData(), before making data request:
# check and save 'last modified' date and time of the results file
# before submitting data request, to compare with the same after one
# for simple polling of results file in srdaGetData() function
beforeDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
beforeDate <<- strptime(beforeDate, "%a, %d %b %Y %X", tz="GMT")
<making data request is here>
In function srdaGetData(), called after srdaRequestData()
# simple polling of the results file
repeat {
  if (DEBUG) message("Waiting for results ...", appendLF = FALSE)
  afterDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz="GMT")
  delta <- difftime(afterDate, beforeDate, units = "secs")
  if (as.numeric(delta) != 0) { # file modified, results are ready
    if (DEBUG) message(" Ready!")
    break
  }
  else { # no results yet, wait out the timeout and check again
    if (DEBUG) message(".", appendLF = FALSE)
    Sys.sleep(POLL_TIME)
  }
}
<retrieving request's results is here>
The module's main flow/sequence of events is linear, as follows:
Read/update configuration file
Authenticate with the system
Loop through the data requests specified in the configuration file (via lapply()),
performing the following for each request:
{
...
Make request: srdaRequestData()
...
Retrieve results: srdaGetData()
...
}
The issue with the code above is that it doesn't seem to work as expected: upon making the data request, the code should print "Waiting for results ..." and then, periodically checking the results file for modification (re-generation), print progress dots until the results are ready, at which point it prints a confirmation. However, the actual behavior is that the code waits a long time (I intentionally made one request long-running), printing nothing, and then apparently retrieves the results and prints both "Waiting for results ..." and " Ready!" at the same time.
It seems to me like some kind of synchronization issue, but I can't figure out what exactly. Or maybe it's something else and I'm somehow missing it. Your advice and help will be much appreciated!
In a comment to the question, I believe MrFlick solved the issue: the polling logic appears to be functional; the problem is that the progress messages are out of sync with current events on the system.
By default, the R console output is buffered. This is by design: to speed things up and avoid the distracting flicker that may be associated with frequent messages, etc. We tend to forget this fact, particularly after we've been using R in a very interactive fashion, running various ad-hoc statements at the console (the console buffer is automatically flushed just before returning the > prompt).
It is, however, possible to get message() and, more generally, console output in "real time", either by explicitly flushing the console after each critical output statement with the flush.console() function, or by disabling buffering at the level of the R GUI (right-click on the console and see the 'Buffered output' Ctrl+W item; this is also available in the Misc menu).
Here's a toy example of the explicit use of flush.console. Note the use of cat() rather than message(), as the former doesn't automatically add a CR/LF to the output. message() is useful, however, because its messages can be suppressed with suppressMessages() and the like. Also, as shown in the comment, you can cat the "\b" (backspace) character to make the numbers overwrite one another.
CountDown <- function() {
  for (i in 9:1) {
    cat(i)
    # alternatively to cat(i) use: message(i)
    flush.console()  # <<<<<<< immediate output to console
    Sys.sleep(1)
    cat(" ")  # also try cat("\b") instead ;-)
  }
  cat("... Blast-off\n")
}
The output is the following. What is of course not evident in this print-out is that it took 10 seconds overall, with one number printed every second, before the final "Blast-off"; if you remove the flush.console() statement, the output will come all at once, after 10 seconds, i.e. when the function terminates (unless the console is not buffered at the level of the GUI).
CountDown()
9 8 7 6 5 4 3 2 1 ... Blast-off
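Applied back to the polling loop from the question, the fix is just to flush after each progress message. A minimal sketch, reusing the question's own variable names (DEBUG, RESULTS_URL, beforeDate, POLL_TIME); the initial message is printed once, before the loop, so the dots follow it on the same line:

if (DEBUG) { message("Waiting for results ...", appendLF = FALSE); flush.console() }
repeat {
  afterDate <- url.exists(RESULTS_URL, .header = TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz = "GMT")
  if (as.numeric(difftime(afterDate, beforeDate, units = "secs")) != 0) {
    if (DEBUG) { message(" Ready!"); flush.console() }
    break
  }
  if (DEBUG) { message(".", appendLF = FALSE); flush.console() }
  Sys.sleep(POLL_TIME)
}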

How to successfully interrupt R from tcltk scripting widget?

I am trying to create a scripting widget for R using the tcltk package, but I don't know how to create a STOP button to interrupt a script coming from the widget. Basically, I would like a button, a menu option, and/or a key binding that will interrupt the current script execution, but I can't figure out how to make it work.
One (non-ideal) strategy is to just use the RGui STOP button (or <ESC> or <Ctrl-c> on the console), but this seems to cause the tk widget to hang permanently.
Here's a minimal example of the widget, based on the tcl/tk examples (http://bioinf.wehi.edu.au/~wettenhall/RTclTkExamples/evalRcode.html):
require(tcltk)
tkscript <- function() {
  tt <- tktoplevel()
  txt <- tktext(tt, height=10)
  tkpack(txt)
  run <- function() {
    code <- tclvalue(tkget(txt, "0.0", "end"))
    e <- try(parse(text=code))
    if (inherits(e, "try-error")) {
      tkmessageBox(message="Syntax error", icon="error")
      return()
    }
    print(eval(e))
  }
  tkbind(txt, "<Control-r>", run)
}
tkscript()
In the scripting widget, if you try executing Sys.sleep(20) and then interrupting from the console, the widget hangs. The same thing happens if you run, for example, an infinite loop like while(TRUE) 2+2.
I think what I'm experiencing may be similar to the bug reported here: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14730
Also, I should mention that I'm running this on R 3.0.0 on Windows (x64), so maybe the problem is platform-specific.
Any thoughts on how to interrupt the running script without causing the widget to hang?
It depends on what the script is doing; a script that is sitting waiting for the user to do something is easy to interrupt (since you can make it listen for your interruption message), but a script that is running an intensive loop is rather more tricky. The possible solutions depend on the version of Tcl inside.
Interpreter Cancellation — Requires 8.6
If you are using Tcl 8.6, you can use interpreter cancellation to stop the script. All you have to do is arrange for:
interp cancel -unwind
to be run, and the script will return control back to you. A reasonable way of doing this would be to use the extra Tcl package TclX (or Expect) to install a signal handler that will run the command when a signal is received:
package require Tcl 8.6
package require TclX
# Our signal handler
proc doInterrupt {} {
    # Print a message so you can see what's happening
    puts "It goes boom!"
    # Unwind the stack back to the R code
    interp cancel -unwind
}
# Install it...
signal trap sigint doInterrupt
# Now evaluate the code which might try to run forever
Adding signal handling in earlier versions is possible, but not quite as easy, as you can't guarantee that things will return control to you so cleanly; the stack unwinding isn't there.
Execution Time Limiting — Requires 8.5 (or 8.6)
The other thing you could try is setting an execution time limit on a slave interpreter and running the user script in that slave. The time limit machinery will then guarantee a trap back to you every so often, giving you a chance to check for interruption and a way to do the stack unwinding. This is a considerably more complex method.
proc nextSecond {} {
    clock add [clock seconds] 1 second
}
interp create child
proc checkInterrupt {} {
    if {["decide if the R code wanted an interrupt"]} {
        # Do nothing
        return
    }
    # Reset the time limit to another second ahead
    interp limit child time -seconds [nextSecond]
}
interp limit child time -seconds [nextSecond] -command checkInterrupt
interp eval child "the user script"
Think of this mechanism as being a lot like how an operating system works; and yes, it can stop a tight loop.
Use a Subprocess — Any version of Tcl
The most portable mechanism is to run the script in a subprocess (with the tclsh program; the exact name varies by version, platform and distribution, but it's all variations on that) and just kill off that subprocess when it is no longer wanted, with pskill. The downside of this is that you cannot (easily) carry any state over from one execution to another; subprocesses are pretty isolated from each other. The other methods described above can leave the state accessible from another run: they do a real interrupt, whereas this destroys it.
Also, I don't know exactly how to start the subprocess in such a way that you can communicate with it from R while it is still running; system and system2 don't seem to give quite enough control, and hacking something together with forking is non-portable. An R expert is needed here. Alternatively, use a Tcl script (running inside the R process) to do it with:
set executable "tclsh"; # Adjust this line
set scriptfile "file/where/you/put/the_user/script.tcl"
# Open a bi-directional pipe to talk to the subprocess
set pipeline [open |[list $executable $scriptfile] "r+"]
# Get the subprocess's PID
set thePID [pid $pipeline]
That is actually reasonably portable to Windows (if not perfectly so), but intermediate states with forking are not.
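On the R side, once the subprocess's PID is known (e.g. reported back by the Tcl pipeline above), killing it is straightforward. A minimal sketch, assuming the PID has been communicated back into an R variable thePID:

library(tools)
# thePID: PID of the tclsh subprocess running the user's script
pskill(thePID, SIGINT)   # ask it to stop; SIGKILL is the heavy-handed fallback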
I seem to have found a solution that prevents the tk widget from hanging, by embedding the eval in a tryCatch that handles interrupt conditions. Unfortunately, it requires interruption from the console rather than from the widget, but it does work. tryCatch is pretty poorly documented, so I'm putting this out here in case anyone else has similar needs in the future.
require(tcltk)
tkscript <- function() {
  tt <- tktoplevel()
  txt <- tktext(tt, height=10)
  tkpack(txt)
  run <- function() {
    code <- tclvalue(tkget(txt, "0.0", "end"))
    e <- try(parse(text=code))
    if (inherits(e, "try-error")) {
      tkmessageBox(message="Parse error", icon="error")
      tkfocus(txt)
      return()
    }
    e <- tryCatch(eval(e),
                  error = function(errmsg)
                    tkmessageBox(message=as.character(errmsg), icon="error"),
                  interrupt = function(errmsg)
                    tkmessageBox(message=as.character(errmsg), icon="error")
    )
    print(e)  # already evaluated inside the tryCatch above
  }
  tkbind(txt, "<Control-r>", run)
}
tkscript()
Another strategy I stumbled across is using tools::pskill (in the form pskill(Sys.getpid(), SIGINT) as a tk menu option) to interrupt the process, but, at least on Windows, this terminates the entire R process (including the tk widget). So that's not a great solution, but at least it exits everything as an absolute fallback.
