I wrote a program in C that starts a further program written in Expect. The Expect program provides a numerical output for each execution step, in order to identify an error and return it to the C program, which performs the visualization through UART. Within the C program, I launch the Expect script with: ret = system("sudo /usr/bin/expect ./expTest.exp ... arguments ...");
The program is compiled with gcc and runs perfectly if I log in as root and launch it from the shell. But if I start it automatically at startup via rc.local, the C program executes in every detail until the launch of Expect, where I get an execution error, as if I had a problem with root permissions or something like that. I cannot see the error itself, but through a UART port connected to the C program I can see that the Expect subprogram is not executed after launch.
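Since the error is invisible at boot, one way to narrow it down is to capture the Expect script's stderr to a file and decode the status that system() returns. This is a debugging sketch, not a fix: the log path /tmp/expTest.err is an arbitrary choice, and the real call would append the question's arguments. A common culprit under rc.local is the restricted environment (PATH, working directory, sudo possibly requiring a tty), which the captured stderr should reveal.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(void) {
    /* Append the script's stderr to a log file that survives the boot,
       so it can be inspected later over UART or SSH. */
    int ret = system("sudo /usr/bin/expect ./expTest.exp 2>>/tmp/expTest.err");

    if (ret == -1) {
        perror("system");   /* the shell itself could not be spawned */
    } else if (WIFEXITED(ret)) {
        printf("expect exited with status %d\n", WEXITSTATUS(ret));
    } else if (WIFSIGNALED(ret)) {
        printf("expect was killed by signal %d\n", WTERMSIG(ret));
    }
    return 0;
}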
I distribute an OpenMPI-based application using the SLURM launcher srun. When one of the processes crashes, I would like to detect that in the other PEs and perform some actions. I am aware that OpenMPI does not provide fault tolerance, but I still need to perform a graceful exit in the other PEs.
To do this, every PE has to be able:
To continue running despite the crash of another PE.
To detect that one of the PEs crashed.
Currently I'm focusing on the first task. According to the manual, srun has a --no-kill flag. However, it does not seem to work for me. I see the following log messages:
srun: error: node0: task 0: Aborted // this is where I crash the PE deliberately
slurmstepd: error: node0: [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: ***STEP 12123.0 ON node0 CANCELLED AT 2020-12-02 ***
srun: error: node0: task 1: Killed // WHY?!
Why does it happen? Is there any other relevant flag or environment variable, or any configuration option that might help?
To reproduce the problem, one can use the following program (it uses Boost.MPI for brevity, but has the same effect without Boost as well):
#include <boost/mpi.hpp>

int main() {
    using namespace boost::mpi;
    environment env;    // initializes MPI
    communicator comm;  // wraps MPI_COMM_WORLD
    comm.barrier();     // make sure every rank has started
    if (comm.rank() == 0) {
        throw 0;        // crash rank 0 deliberately with an uncaught exception
    }
    while (true) {}     // the remaining ranks keep running
}
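For reference, a possible way to build and launch the reproducer (the compiler wrapper and Boost library names are distribution-dependent assumptions):

mpic++ reproducer.cpp -o reproducer -lboost_mpi -lboost_serialization
srun -n 2 ./reproducer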
According to the documentation that you linked, the --no-kill flag only affects the behaviour in case of a node failure.
In your case you should be using the --kill-on-bad-exit=0 option, which prevents the rest of the tasks from being killed if one of them exits with a non-zero exit code.
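For example (a sketch; the task count and executable name are placeholders):

srun --kill-on-bad-exit=0 -n 2 ./reproducer

If I remember correctly, the same behaviour can also be requested via the SLURM_KILL_BAD_EXIT input environment variable; check the srun man page for your version.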
I am running an external executable and capturing the result as an object. The .exe is a tool for selecting a population based on genetic parameters and genetic value predictions. The program executes and writes output as requested, but fails to exit. There is no error, and when it is stopped manually it exits with status code 0. How can I get this call to exit and continue, as it would with other system calls?
The call is formatted as seen below:
t <- tryCatch(system2("OPSEL.exe", args = "CMD.txt", timeout = 10))
I've tried running this in a command shell with the two files referenced above, and it exits appropriately.
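Two details may be worth checking here, if I read the semantics of system2 right: tryCatch with no handlers is a no-op, and when system2 hits its timeout it kills the child with a warning (not an error) and returns exit status 124. So a sketch that actually traps the timeout would look like this (file names as in the question):

# Trap the timeout warning from system2 and inspect the exit status.
t <- withCallingHandlers(
  system2("OPSEL.exe", args = "CMD.txt", timeout = 10),
  warning = function(w) message("system2 warned: ", conditionMessage(w))
)
if (identical(t, 124L)) {
  message("OPSEL.exe was killed after the 10 s timeout")
}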
I have the following situation: in my R script I start a third-party program with system2. The program is called many times and, unfortunately, it is not very stable and sometimes crashes. When this happens, control is not returned to R until I kill the program manually via the Task Manager.
What I would like to do: If the program has not returned control after 10 minutes, kill it automatically.
I could of course wrap the program in C++, Java or similar, implement this functionality in the wrapper, and call the wrapper from R. Quite possibly I could also utilize Rcpp.
However, I wonder if there is any way to achieve this in R directly?
Btw: I am on Windows 7.
Thanks for any hints!
If you are on a unix-like system, you can prefix your system call with the unix command timeout. Example:
# system command that times out
> exitcode = system('timeout 1 sleep 20')
> exitcode
[1] 124
# system command that does not time out
> exitcode = system('timeout 2 sleep 1')
> exitcode
[1] 0
system returns the exit status of the process so you can check whether it is 0 (OK) or 124 (timed out).
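On Windows, where the unix timeout utility is not available, R (>= 3.5.0) also accepts a timeout argument on system and system2 directly; a sketch, with unstable_program.exe standing in for the real third-party program:

# Kill the child if it has not returned within 600 s (10 minutes).
# On timeout the call is terminated with a warning and, as with the
# unix utility, the return value is 124.
exitcode <- system2("unstable_program.exe", args = "some_args", timeout = 600)
if (exitcode == 124) message("program hung and was killed after 10 minutes")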
I'm trying to execute a command remotely through Robot Framework; it fails and gives me a wrong exit status of 13.
But if we run it manually, the exit status of TTman.sh is 112, which is actually a pass (these are not the standard return codes).
Am I doing something wrong here?
You are not getting the return code of the remote command; in fact, the RC 13 you are getting is most probably from Robot Framework itself - on run completion, its RC is the number of failed cases, i.e. 13 cases should have failed when you observed this.
To get the return code of your command, a few changes in the case are needed; this is how the next-to-last line should look, with explanations below:
${rc} =    Execute Command    your_command_from_the_question &>/dev/null; echo $?
First, all the output of your command (stdout & stderr) is redirected to /dev/null, so that it is not returned. Then the special variable $? is printed - it holds the RC of the last executed command (and is available in most *sh variants, like bash).
Finally, that value is stored in the ${rc} Robot Framework variable, and you can do whatever checks you need on it further in the case.
This approach has one drawback: as stderr is hidden, you will not be able to see any errors coming from the command. But if it were not hidden, the errors would be interleaved with the RC, which would require further processing of the ${rc} variable to get the desired value. If you need the stderr output in case of failures, change accordingly.
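For completeness, a minimal case built around that line (assuming SSHLibrary for the remote connection, since the question does not name the library; the host and credentials are placeholders, and 112 is the expected RC from the question):

*** Settings ***
Library    SSHLibrary

*** Test Cases ***
TTman RC Is A Pass
    Open Connection    my.host.example
    Login    myuser    mypassword
    ${rc} =    Execute Command    ./TTman.sh &>/dev/null; echo $?
    Should Be Equal As Integers    ${rc}    112

Note that SSHLibrary's Execute Command can also return the code directly (return_rc=True), which avoids hiding stderr altogether.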
P.S. Don't add screenshots of source in a question; they are much less usable than a text version.
I am running a long job on a cluster of computers. On occasion the process is interrupted and I have to restart it manually; there is considerable downtime when the interruptions occur overnight. I was wondering if there is a way to run a supervisor script in Julia that monitors whether the job is running in another instance of Julia, restarts the process if it is interrupted, and terminates once the job is finished. Unfortunately, I do not know exactly how to check that the process is running or how to restart it. Here is the rough idea I have:
state = true
while state == true
    # check every minute
    sleep(60)
    # read the output file to check whether the process is finished
    data = readcsv("outputfile.csv")
    if size(data, 1) < N
        # some function to check if the process is running
        if isrunning() == true
            # do nothing, keep running
        else
            # some function to spawn a new instance of julia
            # and run the code
            include("myscript.jl")
        end
    else
        # job finished, exit the while loop
        state = false
    end
end
Right tool for the right job: use your command-line shell.
If something is untimely terminated, it exits with an error status code. E.g. in Bash:
until julia myscript.jl; do
    echo "Failed/Interrupted. Restarting in 5s. Press Ctrl-C now to interrupt."
    sleep 5
done
Julia is not unusable as a command-line runner either, so you could do the same in Julia:
while true
    try
        run(`julia myscript.jl`)  # run the script as a separate process
        break                     # it finished normally, so stop retrying
    catch
        println("Failed/Interrupted. Restarting in 5s. Press Ctrl-C now to interrupt.")
        sleep(5)
    end
end