I work with Rscript on a cluster. qmake (a specialized version of GNU make for the cluster) is used to parallelize jobs across several nodes. But Rscript seems to need to write an Xauthority file, and this causes an error when all the nodes are working at the same time. As a result, my makefile-based pipeline stops after the first group of parallelized tasks and doesn't start the next group, even though the results of the first group are fine.
I'm also invoking /usr/bin/xvfb-run (https://en.wikipedia.org/wiki/Xvfb) when running Rscript.
I've already changed the ssh config (ForwardX11 yes), but the problem persists.
I also tried giving each job its own Xauthority file name (the -f option of xvfb-run), but it didn't work.
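For reference, the kind of invocation I tried looks roughly like this (a sketch; myscript.R is a placeholder, and $$ expands to each job's shell PID so every job gets a distinct file):
# sketch: each job writes to its own authority file instead of the shared ~/.Xauthority
/usr/bin/xvfb-run -f "/tmp/xauth_$$" Rscript myscript.R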
Here is the error which appears at the beginning of the process:
/usr/bin/xauth: error in locking authority file .Xauthority
Here is the error which appears before the process stops:
/usr/bin/xvfb-run: line 171: kill: (44402) - No such process
qmake: *** [Data/Median/median_B00FTXC.tmp] Error 1
I am trying to run my Julia code on multiple nodes of a cluster which uses Moab and Torque as its scheduler and resource manager.
In an interactive session where I requested 3 nodes, I load the julia and openmpi modules and run:
mpirun -np 72 --hostfile $PBS_NODEFILE -display-allocation julia --project=. "./estimation/test.jl"
mpirun does successfully recognize my 3 nodes, since it displays:
====================== ALLOCATED NODES ======================
comp-bc-0383: slots=24 max_slots=0 slots_inuse=0 state=UP
comp-bc-0378: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
comp-bc-0372: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
However, after that it returns an error message:
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 48; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: comp-bc-0372
Executable: /opt/aci/sw/julia/1.5.3_gcc-4.8.5-ips/bin/julia
--------------------------------------------------------------------------
What could be the possible cause of this? Is it because it has trouble accessing julia from the other nodes? (I think this is the case, because the code runs as long as -np X has X <= 24, which is the number of slots on one node; as soon as X >= 25, it fails to run.)
Here is a good manual on how to work with modules and mpirun: UsingMPIstacksWithModules
To sum up what is written in the manual:
It should be highlighted that modules are nothing more than a structured way to manage your environment variables; so whatever hurdles there are with modules apply equally well to environment variables.
What you need is to export the environment variables in your mpirun command with -x PATH -x LD_LIBRARY_PATH. To see if this worked, you can then run:
mpirun -np 72 --hostfile $PBS_NODEFILE -display-allocation -x PATH -x LD_LIBRARY_PATH which julia
Also, you should consider giving the whole path of the file you want to run, i.e. /path/to/estimation/test.jl instead of ./estimation/test.jl, since your working directory is not the same on every node. (In general, it is always safer to use whole paths.)
By using whole paths, you should also be able to use /path/to/julia (that is, the output of which julia) instead of only julia; this way you should not need to export the environment variables.
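Putting the two suggestions together, the final command would look something like this (a sketch; /path/to/julia stands for whatever which julia prints on your system, and the project path is a placeholder):
mpirun -np 72 --hostfile $PBS_NODEFILE -display-allocation -x PATH -x LD_LIBRARY_PATH /path/to/julia --project=/path/to/project /path/to/estimation/test.jl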
I am trying to run an executable called swat_edit.exe in R. It works perfectly when I run it directly in the command prompt, and also when I run it directly in the Terminal tab in R. However, when I try to write a function in R to run the executable, I get an error (I get a number of different errors...).
I have tried to use different methods of running the file:
1: I used system("swat_edit"), which returns the following error:
Unhandled Exception: System.IO.IOException: The handle is invalid.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.Console.set_CursorVisible(Boolean value)
at SWEdit.Program.Run(String[] args)
at SWEdit.Program.Main(String[] args)
[1] 17234
2: I used shell("swat_edit"), which returns the exact same error as (1).
3: I used shell.exec("swat_edit"). This works, but it opens the executable in a new window, which then runs for a few seconds and closes (as intended). I need the program to run in the R terminal window so it can run many iterations in the background without disrupting other things. This is not a viable option.
4: I tried using terminalSend(ID,"swat_edit") (from the rstudioapi package). This works in that it sends the command to the terminal window in R. When I move there and hit enter, it executes perfectly, running in the terminal window like I want it to. However, I need to run many iterations, so this is not viable either. I tried using KeyboardSimulator to move to the Terminal tab and hit enter (which worked), but this also does not let me use the PC for other purposes while my code is running.
5: I tried using terminalExecute("swat_edit"), which returns the following error code:
Error calling capture_console_output: 87
[Process completed]
[Exit code: -532462766]
6: I tried making a python file that runs swat_edit.exe, and then running that file in R. The python file works when I run it by itself, from the command prompt, or from the terminal in R. It does not, however, work when I try to run it in the R terminal using terminalExecute (same error as in (5)).
NOTE: I have another executable called swat.exe (entirely different program) that works with all of the above-mentioned methods.
So in summary: swat_edit.exe runs perfectly in the command prompt and the R terminal, but does not work when I try to run it using R code (either system(), shell(), or terminalExecute()).
I can't figure out the difference between terminalExecute() and typing the string into the terminal and hitting enter, but apparently there is something happening in between...
It will be tedious to reproduce this since it uses external programs, but if anyone has any idea about the error messages or how I can copy a string and run it in the terminal without any interference, that would be greatly appreciated.
EDIT: I found a method that solves my problem. I created a .bat file that runs swat_edit minimized. I was able to run this .bat file with the shell function (or any of the other commands I mentioned) in R. This doesn't answer why I was having the issues I described, and it doesn't let me run swat_edit in the R terminal, but it's good enough for me.
The .bat file was simply the following:
"START /MIN /WAIT C:\~\SWAT_Edit.exe"
I am trying to understand how, for a command of the form exec echo ... > test, the shell arranges for output redirection when the process associated with the shell itself is replaced by that of echo. I would very much appreciate any help.
Output redirection of an external command is achieved by manipulating file descriptors, typically using dup2, a system call that makes a chosen file descriptor refer to an existing open file. In this case, the standard output of the process that executes echo is made to point to the target output file.
The shell does a close equivalent of the following steps:
create a copy of the current process by invoking fork(); the subsequent steps happen in the child process. (This step is omitted when the command is invoked with exec, as in your example.)
open test for writing and remember the file descriptor
make file descriptor 1 refer to that same file, by calling dup2(fd, 1)
call execlp or similar to replace current process execution with execution of echo
The echo command simply writes to standard output, a fancy name for file descriptor number 1. It doesn't care if that ends up writing to a TTY, to a file, or to another program. The above steps make sure that file descriptor 1 really points to the open file test.
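You can replay steps 2-4 by hand from a shell prompt; a minimal sketch, reusing the file name test from your example:
exec 1>test   # the shell opens "test" and points fd 1 (stdout) at it, as in steps 2-3
echo hello    # echo writes to fd 1 as always; the text now lands in the file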
I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing an MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via the CLI, and all of the R environment variables (fetched with Sys.getenv()) are the same, except that there is some additional classpath stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When run with Oozie, the command printed to stderr fails with an exit status of 1. When I run the command on user#host.name myself, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from the CLI.
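One further check that might narrow it down: the R environment variables match, but the shell environment the streaming command inherits might not. A sketch (the file names are arbitrary):
# dump the environment from inside the Oozie-launched job, e.g. in the wrapper that calls hadoop
env | sort > /tmp/env_oozie.txt
# dump it again from an interactive shell as the same user on the same node
env | sort > /tmp/env_cli.txt
# compare the two
diff /tmp/env_cli.txt /tmp/env_oozie.txt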
I am building a Qt GUI application via Jenkins. I added 3 build steps:
Building the test executable
Running the test executable
Compiling a coverage report with gcovr
For some reason, the shell task for running the test executable stops right after the executable finishes; even a simple echo placed after it does not run. The tests are written with Google Test and output xUnit XML files, which are analyzed after the build.
Some tests start the application's user interface, so I installed the Jenkins xvnc plugin to get them to run.
The build tasks are as follows:
Build
cd $WORKSPACE/projectfiles/QMake
sh createbin.sh
Test
cd $WORKSPACE/bin
./Application --gtest_output=xml
Coverage Report
cd $WORKSPACE/projectfiles/QMake/out
gcovr -x -o coverage.xml
Now, an echo at the end of the first build task is correctly printed, but an echo at the end of the second is not. The third build task is therefore not even run, although the Google Test output is visible. I thought that maybe the problem is that some of the Google Tests fail, but why would the script stop executing just because the tests fail?
Maybe someone can give me a hint on why the second task stops.
Edit
The console output looks like this:
Updating svn://repo/ to revision '2012-11-15T06:43:15.228 -0800'
At revision 2053
no change for svn://repo/ since the previous build
Starting xvnc
[VG5] $ vncserver :10
New 'ubuntu:10 (jenkins)' desktop is ubuntu:10
Starting applications specified in /var/lib/jenkins/.vnc/xstartup
Log file is /var/lib/jenkins/.vnc/ubuntu:10.log
[VG5] $ /bin/sh -xe /tmp/hudson7777833632767565513.sh
+ cd /var/lib/jenkins/workspace/projectfiles/QMake
+ sh createbin.sh
... Compiler output ...
+ echo Build Done
Build Done
[VG5] $ /bin/sh -xe /tmp/hudson4729703161621217344.sh
+ cd /var/lib/jenkins/workspace/VG5/bin
+ ./Application --gtest_output=xml
Xlib: extension "XInputExtension" missing on display ":10".
[==========] Running 29 tests from 8 test cases.
... Test output ...
3 FAILED TESTS
Build step 'Execute shell' marked build as failure
Terminating xvnc.
$ vncserver -kill :10
Killing Xvnc4 process ID 1953
Recording test results
Skipping Cobertura coverage report as build was not UNSTABLE or better ...
Finished: FAILURE
Generally, if one Build Step fails, the rest will not be executed.
Pay attention to this line from your log:
[VG5] $ /bin/sh -xe
The -x makes the shell print each command in console before execution.
The -e makes the shell exit with error if any of the commands failed.
A "fail" in this case, would be a return code of not 0 from any of the individual commands.
You can verify this by running this directly on the machine:
./Application --gtest_output=xml
echo $?
If the echo $? displays 0, it indicates successful completion of the previous command. If it displays anything else, it indicates an error code from the previous command (from ./Application), and Jenkins treats it as such.
Now, there are several things at play here. First is that your second Build Step (essentially a temporary shell script /tmp/hudson4729703161621217344.sh) is set to fail if one command fails (the default behaviour). When the Build Step fails, Jenkins will stop and fail the whole job.
You can fix this particular behaviour by adding set +e to the top of your second Build Step. With set +e, the script (Build Step) will no longer fail because an individual command failed (it will display an error for that command and continue).
However, the overall result of the script (Build Step) is the exit code of the last command. Since in your OP you only have 2 commands in the script, and the last one is failing, the whole script (Build Step) is considered a failure, despite the set +e that you've added. Note that if you added an echo as the 3rd command, this would actually work, since the last script command (the echo) would be successful; however, this "workaround" is not what you need.
What you need is proper error handling added to your script. Consider this:
set +e    # do not fail the whole script when an individual command fails
cd $WORKSPACE/bin && ./Application --gtest_output=xml
if ! [ $? -eq 0 ]; then
    echo "Tests failed, however we are continuing"
else
    echo "All tests passed"
fi
Three things are happening in the script:
First, we are telling the shell not to exit on the failure of individual commands
Then I've added basic error handling in the second line. The && means "execute ./Application if and only if the previous cd was successful". You never know: maybe the bin folder is missing, or whatever else could happen. BTW, the && internally works on the same "exit code equals 0" principle
Lastly, there is now proper error handling for the result of ./Application. If the result is not 0, we show that it failed; else we show that it passed. Note that since the last command is now not the (potentially) failing ./Application, but an echo from one of the if-else branches, the overall result of the script (Build Step) will be a success (i.e. 0), and the next Build Step will be executed.
BTW, you could just as well put all 3 of your Build Steps into a single Build Step with proper error handling, along the lines of the sketch below.
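For illustration, a combined single Build Step could look roughly like this (a sketch assembled from the commands in your question; adjust the error handling to your needs):
set +e
# Step 1: build; abort early, since there is no point testing without a binary
cd $WORKSPACE/projectfiles/QMake && sh createbin.sh
if [ $? -ne 0 ]; then
    echo "Build failed"
    exit 1
fi
# Step 2: run the tests; a test failure should not abort the script
cd $WORKSPACE/bin && ./Application --gtest_output=xml
if [ $? -ne 0 ]; then
    echo "Tests failed, however we are continuing"
fi
# Step 3: produce the coverage report either way
cd $WORKSPACE/projectfiles/QMake/out && gcovr -x -o coverage.xml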
Yes... this answer may be a little longer than what's required, but I wanted you to understand how Jenkins and the shell treat exit codes.