LSF bsub: job always in PENDING state, not going to RUN state

I'm stuck on a small problem.
I'm launching many bsub commands at the same time each one on a specified host:
bsub -sp 20 -W 0:5 -m $myhostname -q "myQueue" -J "mkdir_script" -o $log_file "script_to_launch param1 param2 param3"
All of this runs inside a for loop, one iteration per hostname.
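For completeness, the loop looks roughly like this (hostnames.txt and the log path are placeholders for my real values):

for myhostname in $(cat hostnames.txt); do
    log_file="/tmp/mkdir_script.${myhostname}.log"
    bsub -sp 20 -W 0:5 -m "$myhostname" -q "myQueue" -J "mkdir_script" \
         -o "$log_file" "script_to_launch param1 param2 param3"
done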
The problem is that everything is OK for all hosts except one (always the same one). The job is always in PENDING state, and is not moving to RUN state.
The script just checks for a folder and creates it if it is not there (so a very small task).
Is there a way to see what happens on that host and why my job is not going to RUN state?
PS: I just found the bjobs -p command and I have the following message:
Not specified in job submission: 81 hosts;
Closed by LSF administrator: 3 hosts;
What does this message mean?

The -m option limits you to a particular host, which excludes 81 hosts. The other three have been closed by your system administrator. You would have to contact them to find out why.
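If you want to check the host state yourself before contacting the administrator, the standard LSF commands should show it; something along these lines (the host name and job id are placeholders):

bhosts -w $myhostname     # shows whether the host is ok, closed_Adm, unavail, ...
bhosts -l $myhostname     # long format, may include why the host was closed
bjobs -lp <jobid>         # detailed pending reasons for one specific job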

Why does OpenMPI use a different server given a different -n setting?

I am testing out OpenMPI, provided and compiled by another user (I am using soft links to his directories for bin, include, etc. - all the mandatory directories), but I ran into this weird thing:
First of all, if I run mpirun with an -n setting <= 10, it works as shown below. testrunmpi.py simply prints out "run." from each core.
# I am in serverA.
bash-3.2$ /home/karl/bin/mpirun -n 10 ./testrunmpi.py
run.
run.
run.
run.
run.
run.
run.
run.
run.
run.
However, when I try running with -n greater than 10, I run into this:
bash-3.2$ /home/karl/bin/mpirun -n 24 ./testrunmpi.py
karl@serverB's password: Could not chdir to home directory /home/karl: No such file or directory
bash: /home/karl/bin/orted: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 19203) died unexpectedly with status 127 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
bash-3.2$
bash-3.2$
Permission denied, please try again.
karl@serverB's password:
Permission denied, please try again.
karl@serverB's password:
I see that the work is dispatched to serverB, while I am on serverA. I don't have an account on serverB. But if I invoke mpirun with -n <= 10, the work stays on serverA.
This is strange, so I checked out /home/karl/etc/openmpi-default-hostfile and tried setting the following:
serverA slots=24 max_slots=24
serverB slots=0 max_slots=32
But the problem persists and I still get the same error message as above. What must I do to have my program run on serverA only?
The default hostfile in Open MPI is system-wide, i.e. its location is determined while the library is being built and installed and there is no user-specific version of it. The actual location can be obtained by running the ompi_info command like this:
$ ompi_info --param orte orte | grep orte_default_hostfile
MCA orte: parameter "orte_default_hostfile" (current value: <LOOK HERE>, data source: default value)
You can override the list of hosts in several different ways. First, you can provide your own hostfile via the -hostfile option to mpirun. If so, you don't have to put hosts with zero slots inside it - simply omit machines that you have no access to. For example:
localhost slots=10 max_slots=10
serverA slots=24 max_slots=24
You can also change the path to the default hostfile by setting the orte_default_hostfile MCA parameter:
$ mpirun --mca orte_default_hostfile /path/to/your/hostfile -n 10 executable
Instead of passing the --mca option each time, you can set the value in an exported environment variable called OMPI_MCA_orte_default_hostfile. This could be set in your shell's dot-rc file, e.g. in .bashrc if using Bash.
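For example, a single line like this in .bashrc should be enough (the path is of course a placeholder):

export OMPI_MCA_orte_default_hostfile=/path/to/your/hostfile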
You can also specify the list of nodes directly via the -H (or -host) option.
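For instance, to keep the job on serverA only, something like one of these should work (sketches based on the options above; the hostfile path is a placeholder, and depending on the Open MPI version you may need to adjust the slot counts):

# with your own hostfile containing only "serverA slots=24 max_slots=24"
mpirun -hostfile ~/my_hostfile -n 24 ./testrunmpi.py
# or naming the host directly
mpirun -H serverA -n 24 ./testrunmpi.py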

tcp_probe module does no output

I'm trying to plot the TCP congestion window and the slow start threshold using iperf and the tcp_probe module. I do exactly what is described here to obtain the data:
modprobe tcp_probe port=5001
chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe >/tmp/tcpprobe.out &
TCPCAP=$!
iperf -i 10 -t 100 -c receiver
kill $TCPCAP
Oops!
/tmp/tcpprobe.out is empty :(
This is Ubuntu 11.04 x86, and I already tried the same on Ubuntu 11.04 x64.
Any suggestions?
I was having the same problem. What worked for me was:
modprobe -r tcp_probe
sudo modprobe tcp_probe port=5002 full=1
sudo chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe > /tmp/tcpprobe.out &
TCPCAP=$!
iperf -c <servers IP address here> -p 5002 -t 100 -i 1
sudo kill $TCPCAP
Check the iperf parameters to see whether those (-t 100 -i 1) are what you need by typing:
man iperf
I/O functions in the C standard library are buffered by default, usually 4 KB, so fread() only returns when the buffer is full or at EOF. You can use a small buffer, e.g. 128 bytes:
dd if=/proc/net/tcpprobe ibs=128 obs=128
Now the messages are flushed quickly.
By default tcp_probe logs only when the cwnd changes; try modprobe tcp_probe ... full=1.
Linux source code reference: http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/net/ipv4/tcp_probe.c#L47
I had a similar issue: the tcp_probe module only produces output at non-obvious time intervals. I've created a modified version of it that flushes on every received TCP segment. This slows down the system, but makes it possible to better monitor short-lived connections like HTTP.
Find the source code to the module here.
Another issue which causes no output is the file permission of the output file tcpprobe.out. When I cat /proc/net/tcpprobe directly I can see the output, but when redirecting the output to the file, the output file size is 0, which points to a permission issue...
A very late answer, but I have been struggling with this issue myself. I was trying out the version Dyna provided, yet still got no output, regardless of the parameters used. In the end, I found that the order of operations was the problem.
The way I was using tcp_probe was: install/activate the module, run some TCP application (I was running some TCP unit tests), then start the copy process for /proc/net/tcpprobe (as shown in the other answers), and then remove/stop the module. The correct way is to start the copy process (everything except the kill) BEFORE you perform the TCP-intensive activity. Keep the cat process running while you perform the TCP activity and only kill it afterwards.
A pretty humbling experience for me, as it took hours to figure this out. Hopefully, people find this useful.
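To make the ordering explicit, here is the whole sequence from the answers above, rearranged into the order described (the port number and iperf options are only examples):

sudo modprobe tcp_probe port=5001 full=1
sudo chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe > /tmp/tcpprobe.out &    # start the reader FIRST
TCPCAP=$!
iperf -c <server IP> -p 5001 -t 100 -i 1        # then generate the TCP traffic
sudo kill $TCPCAP                                # stop the reader afterwards
sudo modprobe -r tcp_probe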

How to reload a spawned script for nginx fast cgi

Below is my command for spawning a FastCGI script for nginx.
spawn-fcgi -d /home/ubuntu/workspace -f /home/ubuntu/workspace/index.py -a 127.0.0.1 -p 9001
Now, let's say I want to make changes to the index.py script and reload without bringing down the system. How do I reload the spawned program so that new connections use the updated program while the existing ones finish? For now I am killing the spawned process and running the command again. I am hoping for something more graceful.
I tried this by the way.
sudo kill -1 `sudo lsof -t -i:9001`
I have recently made something similar for node.js.
The idea is to have index.py as a very simple bootstrap script (which doesn't actually change much over time). It should catch SIGHUP, and reload/reread the application files (which are expected to change frequently).
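On the sending side, a minimal sketch could look like this (it assumes spawn-fcgi is started with a PID file via -P; the pidfile path is a placeholder):

spawn-fcgi -d /home/ubuntu/workspace -f /home/ubuntu/workspace/index.py \
    -a 127.0.0.1 -p 9001 -P /var/run/index-fcgi.pid

# after editing the application files, ask the bootstrap script to reload:
kill -HUP $(cat /var/run/index-fcgi.pid)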

rsync over ssh hangs in a file sync daemon

I am writing a file syncing application where I collect events from the filesystem whenever a file is modified, and then later I copy it over to a remote share via rsync over ssh. In my setup I have a slot which is connected to a QTimer. Every 5 seconds I pick a file from a sqlite db for synchronization and call QProcess::start with the following parameters:
/usr/bin/rsync -a /aufs/another-test-folder/testfile286.txt --rsh="ssh -p 8023" user@myserver.de:/home/neox/another-test-folder/testfile286.txt --rsync-path="mkdir -p /home/neox/another-test-folder && rsync"
I have at most 2 rsync processes running in parallel. This results in a process tree:
MyApp
 \_ rsync
 |   \_ ssh
 \_ rsync
     \_ ssh
The problem is that sometimes the application hangs and ps says that the ssh processes have gone zombie. First I tried to kill MyApp with SIGKILL, but no luck. Then I moved on to killing rsync and ssh, but still no luck. The whole tree hangs. And if I try to start the daemon from another console, or even try to ssh to another box, I can't. My idea here is that ssh blocks some IO resources somewhere. Any idea how to solve this?
P.S. This happens randomly and not often
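For reference, I am also considering a more defensive variant of the command with client-side timeouts, so a hang at least cannot last forever (the option values are only examples, not a diagnosis of the zombie-ssh issue):

/usr/bin/rsync -a --timeout=60 \
    --rsh="ssh -p 8023 -o BatchMode=yes -o ServerAliveInterval=15 -o ServerAliveCountMax=3" \
    /aufs/another-test-folder/testfile286.txt \
    user@myserver.de:/home/neox/another-test-folder/testfile286.txt \
    --rsync-path="mkdir -p /home/neox/another-test-folder && rsync"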

How to get the trace of process?

Suppose there is a process which has been in an inactive state for many days, and I want to know until what time the process was in an active state. Other than the log records, where can I get that information?
This is on a Unix platform.
Use the strace debugging utility. You can attach to an already running process, save the output to a log file, and analyse it later.
[root@localhost ~]#
[root@localhost ~]# strace -o log -p 7166
Process 7166 attached - interrupt to quit
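If you also want timestamps in the trace (which helps with the "until what time was it active" question), flags like these should work; the PID is just the example from above:

strace -f -tt -T -o /tmp/proc-7166.log -p 7166
# -tt  prints absolute timestamps with microsecond resolution
# -T   shows the time spent in each system call
# -f   also follows child processes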
