Autosys jobs hung - autosys

We have jobs getting stuck in the AutoSys R11 screen because the application server is down.
Is there any way to monitor whether AutoSys itself is up and running?
Note: the stuck jobs show as completed in the database, but the dependent jobs cannot start, and the front end still shows the jobs in running status.
Please help.

chk_auto_up command will check if application server, event server,
scheduler and agent are working fine.
chase command checks that jobs the database reports as starting or running are actually running on their agents.
autoping command checks if agent is able to communicate with the
application server.
Check the log files of the components with the commands below:
autosyslog -e (scheduler)
autosyslog -s (server)
autosyslog -d j (job)
Check the status of each component manually with the commands below:
unisrvcntr status waae_server.$AUTOSERV
unisrvcntr status waae_agent-$AGENT_NAME
unisrvcntr status waae_webserver.$AUTOSERV
unisrvcntr status waae_sched.$AUTOSERV
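
To monitor this continuously, the status commands above could be wrapped in a small script run from cron. The following is a minimal Python sketch, not a tested AutoSys integration. It assumes that `unisrvcntr status` exits non-zero when a component is not running (verify this on your installation); the instance name `ACE` and the function names are hypothetical.

```python
import subprocess

# Hypothetical instance name; replace with your $AUTOSERV value.
AUTOSERV = "ACE"

# Service names taken from the unisrvcntr commands listed above.
COMPONENTS = [
    f"waae_server.{AUTOSERV}",
    f"waae_sched.{AUTOSERV}",
    f"waae_webserver.{AUTOSERV}",
]


def run_status(component):
    """Run `unisrvcntr status <component>` and return its exit code."""
    result = subprocess.run(
        ["unisrvcntr", "status", component],
        capture_output=True,
        text=True,
    )
    return result.returncode


def find_down_components(components, status_fn=run_status):
    """Return the components whose status command exited non-zero.

    Assumption: a non-zero exit code means "not running".
    """
    return [name for name in components if status_fn(name) != 0]
```

Calling `find_down_components(COMPONENTS)` from a cron job and sending an alert when the returned list is non-empty would give a basic up/down monitor. The `status_fn` parameter exists only so the logic can be exercised without AutoSys installed.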


About expected timeouts in RobotFramework SSHLibrary

I’m a newb with respect to Robot Framework.
I’m writing a test procedure that is expected to
connect to another machine
perform an image update (which causes the unit to close all services and reboot itself)
re-connect to the unit
run a command that returns a known string.
This is all supposed to happen within the __init__.robot module
What I have noticed is that I must invoke the upgrade procedure in a synchronous, or blocking way, like so
Execute Command    sysupgrade upgrade.img
This succeeds in upgrading the unit, but the robotframework script hangs executing the command. I suspect this works because it keeps the ssh session alive long enough for the upgrade to reach a critical junction where the session is closed by the remote host, the host expects this, the upgrade continues, and this does not cause the upgrade to fail.
But the remote host appears to close the ssh session in such a way that the robotframework script does not detect it, and the script hangs indefinitely.
Trying to work around this, I tried invoking the remote command like so
Execute Command    sysupgrade upgrade.img &
But then the update fails because the connection appears to drop, leaving the upgrade procedure incomplete.
If instead I execute it like this
Execute Command    sysupgrade upgrade.img &
Sleep    600
Then this also fails, for some reason I am unable to deduce.
However, if I invoke it like this
Execute Command    sysupgrade upgrade.img    timeout=600
Then the command succeeds in updating the unit, and after the set timeout period the robotframework script does indeed resume. But since it has hit the timeout, the test has (from the point of view of robotframework) failed.
But this is actually an expected failure, and should be ignored. The rest of the script would then reconnect to the host and continue the remaining test(s)
Is there a way to treat the timeout condition as non-fatal?
Here is the code. As explained above, the __init__.robot initialization module is expected to perform the upgrade and then reconnect, leaving the other xyz.robot files to run and continue testing the applications.
The __init__.robot file:
*** Settings ***
| Library | OperatingSystem |
| Library | SSHLibrary |
Suite Setup       ValidationInit
Suite Teardown    ValidationTeardown

*** Keywords ***
ValidationInit
    Enable SSH Logging    validation.log
    Open Connection    ${host}
    Login    ${username}    ${password}
    # Upload the firmware to the unit.
    Put File    ${firmware}    upgrade.img    scp=ALL
    # Perform firmware upgrade on the unit.
    log    "Launch upgrade on unit"
    Execute Command    sysupgrade upgrade.img    timeout=600
    log    "Restart comms"
    Close All Connections
    Open Connection    ${host}
    Login    ${username}    ${password}

ValidationTeardown
    Close All Connections
    log    "End tests"
This should work :
Comment    Change ssh client timeout configuration
Set Client Configuration    timeout=600
Comment    "Launch upgrade on unit"
SSHLibrary.Write    sysupgrade upgrade.img
SSHLibrary.Read Until    expectedResult
Close All Connections
You can use 'Run Keyword And Ignore Error' to ignore the failure. Alternatively, if you do not care about the command's result, I think you should use Write instead of Execute Command.
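
Whichever way the upgrade is launched, the unit reboots afterwards, so the reconnect step has to wait until SSH is reachable again. Below is a minimal Python sketch of such a wait. The function name `wait_for_port` and its parameters are hypothetical and not part of SSHLibrary. Note that an open port only shows that something accepts TCP connections; it does not guarantee that login will succeed immediately.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=600.0, interval=5.0):
    """Poll until a TCP connection to host:port succeeds or timeout expires.

    Returns True once the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A short per-attempt timeout keeps a rebooting host from blocking us.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            # Sleep before the next attempt, but never past the deadline.
            time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

Since Robot Framework can import a Python module as a library, a function like this could be called as the keyword `Wait For Port` before the second `Open Connection`.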

Rabbitmq: Node down

I intermittently get a node down error on RabbitMQ.
I see the error below when I execute sudo rabbitmqctl status or sudo rabbitmqctl list_queues:
Error: unable to connect to node : nodedown
connected to epmd (port 4369) on host-name
epmd reports node 'rabbit' running on port 25672
can't establish TCP connection, reason: timeout
suggestion: blocked by firewall?
version: {rabbit,"RabbitMQ","3.6.9"}
os: Ubuntu 16.04
I have checked the hostname, which looks fine and has not changed since the installation.
I am also able to telnet to localhost 25672.
What could be the reason behind this error and possible solution?
One more question: I am checking the node status using the API below
curl -s -X GET http://edx:edx@127.0.0.1:15672/api/healthchecks/node/
Is this API suitable for checking the health status of the node? Please suggest anything else that would work better. I have set up a shell script that calls this API and restarts the rabbitmq-server service if the status is not ok. The script is executed from cron every minute.
Looks like your rabbitmq node is... down. rabbitmqctl needs a running node to perform these commands.
If you're using systemd, you can check the service status:
service rabbitmq-server status
Or just try to restart the node:
rabbitmqctl start_app
Being able to telnet to port 25672 only shows that the port used for inter-node and CLI tool communication is accepting connections; RabbitMQ does not serve clients on that port (by default, it listens on 5672).
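
For the cron-driven health check described in the question, the response body can be parsed instead of only checking that the request succeeded. This is a minimal Python sketch under the assumption that the management plugin's health check endpoint returns JSON with a `status` field equal to `"ok"` when the node is healthy; the function names are hypothetical, and the URL and credentials are the ones from the question.

```python
import base64
import json
import urllib.request


def is_healthy(body):
    """Return True if the health check response body reports status "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, TypeError, AttributeError):
        # Not JSON, or not a JSON object: treat as unhealthy.
        return False


def fetch_health(url, user, password, timeout=10):
    """GET the health check URL with HTTP basic auth and return the body text."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A caller would pass the result of `fetch_health("http://127.0.0.1:15672/api/healthchecks/node", "edx", "edx")` to `is_healthy`, treating a connection error or timeout as unhealthy as well. Restarting only after several consecutive failures, rather than on the first one, would avoid restarting a node that was merely slow to respond.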

Airflow: New DAG is not found by webserver

In Airflow, how should I handle the error "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database"?
I've copied a new DAG to an Airflow server, and have tried:
unpausing it and refreshing it (basic operating procedure, given in this previous answer https://stackoverflow.com/a/42291683/160406)
restarting the webserver
restarting the scheduler
stopping the webserver and scheduler, resetting the database (airflow resetdb), then starting the webserver and scheduler again
running airflow backfill (suggested here Airflow "This DAG isnt available in the webserver DagBag object ")
running airflow trigger_dag
The scheduler log shows it being processed with no errors occurring, and I can interact with it and view its state through the CLI, but it still does not appear in the web UI.
Edit: the webserver and scheduler are running on the same machine with the same airflow.cfg. They're not running in Docker.
They're run by Supervisor, which runs them both as the same user (airflow). The airflow user has read, write and execute permission on all of the dag files.
This helped me...
pkill -9 -f "airflow scheduler"
pkill -9 -f "airflow webserver"
pkill -9 -f "gunicorn"
then restart the airflow scheduler and webserver.
Just had this issue myself. After changing permissions, resetting the meta database, restarting the webserver and even making some potential code changes to rectify the situation, the DAG still didn't appear.
However, I noticed that even though we were stopping the webserver, our gunicorn process was still running. Killing these processes and then starting everything back up resulted in success.
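
Before resorting to `pkill -9`, it can help to see which processes actually survived the stop. The following is a minimal Python sketch that filters `ps -eo pid,args` output for leftover Airflow or gunicorn processes; the function names are hypothetical and the match is a plain substring test.

```python
import subprocess


def find_processes(ps_output, patterns):
    """Return (pid, command) pairs from `ps -eo pid,args` output matching any pattern."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        if any(pattern in command for pattern in patterns):
            matches.append((pid, command))
    return matches


def list_airflow_leftovers():
    """Run ps and report processes that look like Airflow or gunicorn workers."""
    output = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    return find_processes(
        output, ["airflow webserver", "airflow scheduler", "gunicorn"]
    )
```

If `list_airflow_leftovers()` still returns entries after the webserver and scheduler were stopped, those are the processes to kill before starting everything again.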
I had the same problem on an airflow installed from a Docker image
What I did was:
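1- delete all .pyc files
2- delete the DAG's rows from the metadata database using:
# hook and dag_input are not defined in this snippet: presumably a database
# hook for the metadata DB and the id of the affected DAG.
for t in ["xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag"]:
    sql = "delete from {} where dag_id='{}'".format(t, dag_input)
    hook.run(sql, True)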
3- restart webserver & scheduler
4- Execute airflow upgradedb
It resolved the problem for me.
If the airflow_home and dags_folder config parameters are the same for the scheduler, web UI and command line interface, the only causes of the error
This DAG isn't available in the webserver DagBag object
can be file permissions or an error in the Python script.
Please check:
Run the DAG as a normal Python script and check for errors.
The user in airflow.cfg and the one creating the DAG should be the same, or the DAG file should have execute permission for the airflow user.
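
The first check can be scripted so that import errors surface with a full traceback. This is a minimal sketch using only the standard library; `check_dag_file` is a hypothetical helper, not an Airflow API. It expects a path ending in `.py` and should be run as the airflow user in the same Python environment as the webserver.

```python
import importlib.util
import traceback


def check_dag_file(path):
    """Import a DAG file as a module; return the traceback text on failure, else None."""
    spec = importlib.util.spec_from_file_location("dag_under_test", path)
    module = importlib.util.module_from_spec(spec)
    try:
        # Executes the file's top-level code, as a DAG loader would.
        spec.loader.exec_module(module)
    except Exception:
        return traceback.format_exc()
    return None
```

A return value of `None` means the file imported cleanly, which rules out syntax and import errors but not permission problems on the webserver side.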
With Airflow 1.9 I don't experience the problem with zombie gunicorn processes.
I do a simple restart: systemctl restart airflow-webserver and it forces webserver to refresh DAG status.

How can I stop the http server, downloaded using 'npm install http-server'?

How can I stop the http server that was installed with the 'npm install http-server' command in a terminal (console) and then launched?
Simply press Ctrl+C. If you read the output after you launch it, you should see:
Starting up http-server, serving xxx
Available on:
http://<some ip>:<some port>
Hit CTRL-C to stop the server
It's built on Node, so if it gets stuck you can stop it by killing the node process. List all the node process IDs, work out which one belongs to your server, and kill that one.

Boot script execution order (rc.local)?

With some great help from another user on here I've managed to create a script which writes the necessary network configurations required to /etc/network/interfaces and allow public access to a DomU server.
I’ve placed this script in the /etc/rc.local file, and executed chmod u+x /etc/rc.local to enable it.
The server is a DomU Ubuntu server on a host (Dom0), and rc.local doesn't seem to be executing before the network is brought up at boot/creation time.
So the configuration changes are being made to the /etc/network/interfaces file, but are not active once the boot process completes. I have to reboot once more before the changes take effect.
I've tried adding /etc/init.d/networking restart to the end of the rc.local script (before exit 0), but with no joy.
I also tried adding the script to the S35networking file, but again without success.
Any advice or suggestions on getting this script to execute before the network device is brought up would be greatly appreciated.
