R - get worker name when running in parallel

I am running a function in parallel. To get progress updates on the state of the work, I would like one, and only one, worker to report periodically on its progress. My natural thought for how to do this would be to have the function that the workers execute check the name of the worker, and only give the status update if the name matches a particular value. But I can't find a reliable way to determine this in advance.
In Julia, for instance, there is a simple myid() function that gives a worker's ID (i.e. 1, 2, etc.). I am looking for something equivalent in R. The best I've found so far is to have each worker call Sys.getpid(), but I don't know a reliable way to write my script so that I'll know in advance which pid will be assigned to a worker.
The basic functionality I'm looking for is sketched below, except that I need R's equivalent of the myid() function:
library(parallel)

Test_Fun = function(a){
  for (idx in 1:10){
    Sys.sleep(1)
    if (myid() == 1){
      print(idx)
    }
  }
}

mclapply(1:4, Test_Fun, mc.cores = 4)

The parallel package doesn't provide a worker ID function as of R 3.3.2. There also isn't a mechanism provided to initialize the workers before they start to execute tasks.
I suggest that you pass an additional task ID argument to the worker function by using the mcmapply function. If the number of tasks is equal to the number of workers, the task ID can be used as a worker ID. For example:
library(parallel)

Test_Fun = function(a, taskid){
  for (idx in 1:10){
    Sys.sleep(1)
    if (taskid == 1){
      print(idx)
    }
  }
}

mcmapply(Test_Fun, 1:4, 1:4, mc.cores = 4)
But if there are more tasks than workers, you'll only see the progress messages for the first task. You can work around that by initializing each of the workers when they execute their first task:
WORKERID <- NA  # indicates worker is uninitialized

Test_Fun = function(a, taskid){
  if (is.na(WORKERID)) WORKERID <<- taskid
  for (idx in 1:10){
    Sys.sleep(1)
    if (WORKERID == 1){
      print(idx)
    }
  }
}

cores <- 4
mcmapply(Test_Fun, 1:8, 1:cores, mc.cores = cores)
Note that this assumes that mc.preschedule is TRUE, which is the default. If mc.preschedule is FALSE and the number of tasks is greater than the number of workers, the situation is much more dynamic because each task is executed by a different worker process and the workers don't all execute concurrently.
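For illustration, here is a minimal sketch of the mc.preschedule = FALSE case, reusing Test_Fun and cores from above (an illustrative assumption, not part of the original answer). With prescheduling disabled, each task runs in its own freshly forked process, so WORKERID is re-initialized to the task's own id every time and no longer identifies a single stable reporter:

# Assumes the WORKERID / Test_Fun / cores definitions above; every task is a fresh fork,
# so only the tasks whose taskid happens to be 1 will print progress.
mcmapply(Test_Fun, 1:8, 1:cores, mc.cores = cores, mc.preschedule = FALSE)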

Related

Manually interrupt a loop in R and continue below

I have a loop in R that does very time-consuming calculations. I can set a max-iterations variable to make sure it doesn't continue forever (e.g. if it is not converging), and gracefully return meaningful output.
But sometimes the iterations could be stopped well before max-iterations is reached. Does anyone have an idea how to give the user the opportunity to interrupt the loop, without having to wait for user input after each iteration? Preferably something that works in RStudio on all platforms.
I cannot find a function that listens for keystrokes or similar user input without halting until the user does something. Another solution would be to listen for a change in a global variable, but I don't see how I could change such a variable's value while a script is running.
The best idea I can come up with is to have a second script create a file whose existence the first script checks for, breaking out of the loop if it is there. But that is indeed an ugly hack.
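For reference, that file-based workaround could look roughly like this (the flag file name "stop_now" and the toy loop body are illustrative assumptions):

a <- 0
for (i in 1:1000) {
  a <- a + 1
  Sys.sleep(1)                        # stand-in for the time-consuming calculation
  if (file.exists("stop_now")) break  # create this file from another session or a shell to stop
}
print(a)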
Inspired by Edo's reply, here is an example of what I want to do:
test.it <- function(t) {
  a <- 0
  for (i in 1:10) {
    a <- a + 1
    Sys.sleep(t)
  }
  print(a)
}
test.it(1)
As you see, when I interrupt by hitting the red stop button in RStudio, I break out of the whole function, not just the loop.
Also inspired by Edo's response I discovered the withRestarts function, but I don't think it catches interrupts.
I tried to create a loop as you described it.
a <- 0
for (i in 1:10) {
  a <- a + 1
  Sys.sleep(1)
  if (i == 5) break
}
print(a)
If you let it go till the end, a will be equal to 5, because of the break.
If you stop it manually by clicking the stop sign in the RStudio console, you get a lower number.
So it actually works as you would like.
If you want a better answer, you should post a reproducible example of your code.
EDIT
Based on the edit you posted, try this.
It's a tryCatch solution that returns the last available value of a:
test_it <- function(t) {
  a <- 0
  tryCatch(
    for (i in 1:10) {
      a <- a + 1
      message("I'm at ", i)
      Sys.sleep(t)
      if (i == 5) break
    },
    interrupt = function(e) {a}  # swallow the interrupt; the current a is returned below
  )
  a
}
test_it(1)
If you stop it by clicking the Stop Sign, it returns the last value a is equal to.
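As a small usage note (the variable name last_a is just an illustrative choice): because the interrupt is caught inside the function, the returned value can still be captured even when you stop the run partway through:

last_a <- test_it(1)  # press the stop sign while this runs
last_a                # holds the last value a reached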

R: when debugging, is there a way to skip a specific statement?

I think most people have run into the same problem as me:
f1 = function(){
  function1()      # takes 1 hour
  b = function2()  # takes 2 hours
  c = function3(b)
  statement1
  statement2
  ...
}
Suppose function1 and function2 are very time consuming; I want to skip at least one of them to see whether the rest of my function works.
Question 1:
Is there a way to skip function1?
Question 2:
Is there a way to skip function2? function2 produces the result b, which is critical for the rest of the function to continue. In Java there is a way to hack in a value for b and let the process continue; is that also possible in R?
1) When in the debugger you can redefine the functions on the spot. For example, any time before getting to the point where function1 is invoked enter this into the debugger:
function1 <- list
Now invoking function1() actually invokes list().
This could alternately be done outside of f1 before invoking it. In that case we may wish to store function1 in another name first to make it easy to revert back to it.
function1.orig <- function1
function1 <- list
Later, after we have completed our debugging, we can revert function1 back by writing:
function1 <- function1.orig
2) For function2 you may wish to redefine it as follows where 32 is the critical value needed later.
function2 <- function() 32
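To make the pattern concrete, here is a toy sketch (the slow body, the value 32, and the .orig naming are illustrative assumptions, not taken from the original question):

function2 <- function() { Sys.sleep(7200); 42 }  # stands in for the slow original
function2.orig <- function2                      # keep a copy so it can be restored
function2 <- function() 32                       # stub returning the critical value b needs
b <- function2()                                 # returns 32 immediately
function2 <- function2.orig                      # revert once debugging is done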

Python crashes when I put process.join in script when using multiprocessing

I have been researching multiprocessing and came upon an example of it on a website. However, when I try to run that example on my MacBook Retina, nothing happens. The following is the example:
import random
import multiprocessing

def list_append(count, id, out_list):
    """
    Creates an empty list and then appends a
    random number to the list 'count' number
    of times. A CPU-heavy operation!
    """
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000   # Number of random numbers to add
    procs = 2         # Number of processes to create

    # Create a list of jobs and then iterate through
    # the number of processes appending each process to
    # the job list
    jobs = []
    for i in range(0, procs):
        out_list = list()
        process = multiprocessing.Process(target=list_append,
                                          args=(size, i, out_list))
        jobs.append(process)

    # Start the processes (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()

    print("List processing complete.")
As it turns out after I put a print statement in the 'list_append' function, nothing printed, so the problem is actually not the j.join() but rather the j.start() bit.
When you create a process with multiprocessing.Process, you prepare a sub-function to be run in a different process asynchronously. The computation starts when you call the start method. The join method waits for the computation to be done. So if you just start the process and do not wait for it to complete (or join it), nothing will appear to happen, because the process is killed when your program exits.
Here, one issue is that you are not using an object that can be shared across processes. When you use a plain list(), each process works on a different list in its own memory. That local copy is discarded when the worker process exits, and the list in the main process stays empty. If you want to exchange data between processes, you should use a multiprocessing.Queue:
import random
import multiprocessing

def list_append(count, id, out_queue):
    """
    Puts 'count' random numbers on the output
    queue, tagged with the worker id.
    A CPU-heavy operation!
    """
    for i in range(count):
        out_queue.put((id, random.random()))

if __name__ == "__main__":
    size = 10000   # Number of random numbers to add
    procs = 2      # Number of processes to create

    # Create a list of jobs and then iterate through
    # the number of processes appending each process to
    # the job list
    jobs = []
    q = multiprocessing.Queue()
    for i in range(0, procs):
        process = multiprocessing.Process(target=list_append,
                                          args=(size, i, q))
        process.start()
        jobs.append(process)

    # Drain the queue before joining the workers
    result = []
    for k in range(procs*size):
        result += [q.get()]

    # Wait for all the processes to finish
    for j in jobs:
        j.join()

    print("List processing complete. {}".format(result))
Note that this code can hang quite easily if you do not correctly compute the number of results sent back through out_queue.
If you try to retrieve too many results, q.get will wait for an extra result that will never come. If you do not retrieve all the results from q, your processes will freeze, as out_queue fills up and out_queue.put stops returning. Your processes will thus never exit, and you will not be able to join them.
If your computations are independent, I strongly advise looking at higher-level tools like Pool, or an even more robust third-party library like joblib, as they take care of these aspects for you. (See this answer for some insights on Process vs Pool/joblib.)
I actually reduced size, as the program becomes too slow if you try to put too many objects in a Queue. If you need to pass a lot of small objects, try passing all of them in one batch:
import random
import multiprocessing

def list_append(count, id, out_queue):
    # Generate 'count' random numbers and send them back as a single batch
    a = [random.random() for i in range(count)]
    out_queue.put((id, a))

if __name__ == "__main__":
    size = 10000   # Number of random numbers to add
    procs = 2      # Number of processes to create

    jobs = []
    q = multiprocessing.Queue()
    for i in range(0, procs):
        process = multiprocessing.Process(target=list_append,
                                          args=(size, i, q))
        process.start()
        jobs.append(process)

    # One result per process, each carrying a whole batch
    result = [q.get() for _ in range(procs)]

    for j in jobs:
        j.join()

    print("List processing complete.")

How to interrupt at loop end

How can I write a loop in R so that pressing Esc stops it not somewhere in the middle but at the end of the loop? In other words, I want to define points where it is safe to interrupt the loop.
I am running an iteration that updates a plot every second. If the visual results don't develop in the right direction, I want to be able to interrupt the iteration, tweak some parameters and continue. Right now I do this by pressing Esc or the stop button in the GUI, but that corrupts some data because the interrupt lands in the middle of a calculation, so I am not able to continue the iteration.
Here is an option using withRestarts and hijacking the "abort" restart. Try running the code and hitting "ESC" while the loop is going:
# Make up a task that takes some time
task_to_perform_in_loop <- function() for(i in 1:100) Sys.sleep(0.01)

# Run task in loop
{
  stopAtEnd <- FALSE
  for(i in 1:5) {
    cat("Loop #", i, "\n", sep="")
    withRestarts(
      task_to_perform_in_loop(),
      abort=function(e) {
        message("Will stop when loop complete")
        stopAtEnd <<- TRUE
      }
    )
  }
  if(stopAtEnd) stop("Stopping After Finishing Loop")
  cat("Continuing After Finishing Loop")
}
Just keep in mind this will prevent you from truly exiting the loop short of sending a kill command to R.

stop a running mcparallel job prematurely

I have three tasks:
1. is disk I/O bound
2. is network I/O bound
3. is CPU bound on a remote machine
The result of 3 will tell me whether the answer I want will come from task 1 or task 2. Since each task requires separate resources, I'd like to start all three tasks with mcparallel, then wait on the result from the third task and determine whether to terminate task 1 or task 2. However, I can not determine how to prematurely cancel an mcparallel task from within R. Is it safe to just kill the PID of the forked process from a call to system()? If not, is there a better way to cancel the unneeded computation?
I don't think the parallel package supports an official way to kill a process started via mcparallel, but my guess is that it's safe to do, and you can use the pskill function from the tools package to do it. Here's an example:
library(parallel)
library(tools)

fun1 <- function() {Sys.sleep(20); 1}
fun2 <- function() {Sys.sleep(20); 2}
fun3 <- function() {Sys.sleep(5); sample(2, 1)}

f1 <- mcparallel(fun1())
f2 <- mcparallel(fun2())
f3 <- mcparallel(fun3())

r <- mccollect(f3)
if (r[[1]] == 1) {
  cat('killing fun1...\n')
  pskill(f1$pid)
  print(mccollect(f1))
  r <- mccollect(f2)
} else {
  cat('killing fun2...\n')
  pskill(f2$pid)
  print(mccollect(f2))
  r <- mccollect(f1)
}
print(r)
It's usually dangerous to randomly kill threads within a multi-threaded application because they might be holding a shared lock of some kind, but these of course are processes, and the master process seems to handle the situation just fine.
The current version of parallel::mccollect() supports a wait argument.
Simply pass FALSE to quit any running jobs prematurely:
> mccollect(wait = FALSE)
