OpenContrail Collector and ZooKeeper - Daemons stuck in initialization - OpenStack

I have a problem with OpenContrail. I do not understand why, but when I run:
---code_begin---
root# watch contrail-status
---code_end---
I get a lot of daemons showing that they are stuck in the initializing state. I have tried to stop and restart these daemons, but to no avail. I have also looked at the ZooKeeper status, but do not know how to get it back up. Any thoughts?
Here is the contrail-status output:
---code_begin---
Every 2.0s: contrail-status          Mon Jul 30 20:19:24 2018
== Contrail Control ==
supervisor-control: active
contrail-control initializing (Number of connections:4, Expected:5 Missing: IFMap:IFMapServer)
contrail-control-nodemgr initializing (Collector connection down)
contrail-dns active
contrail-named active
== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen initializing (Collector, Zookeeper:Zookeeper[Connection time-out] connection down)
contrail-analytics-api initializing (Collector, UvePartitions:UVE-Aggregation[Partitions:0] connection down)
contrail-analytics-nodemgr initializing (Collector connection down)
contrail-collector initializing
contrail-query-engine initializing (Collector connection down)
contrail-snmp-collector initializing (Collector, Zookeeper:Zookeeper[Connection time-out] connection down)
contrail-topology initializing (Collector, Zookeeper:Zookeeper[Connection time-out] connection down)
== Contrail Config ==
supervisor-config: active
unix:///var/run/supervisord_config.sockno
== Contrail Database ==
contrail-database: inactive (disabled on boot)
== Contrail Supervisor Database ==
supervisor-database: active
contrail-database active
contrail-database-nodemgr active
kafka initializing
---code_end---
For the Cassandra and ZooKeeper services I get a blank return:
---code_begin---
root#ntw02:~# contrail-cassandra-status
root#ntw02:~# service contrail-control status
contrail-control RUNNING pid 21499, uptime 5:05:39
root#ntw02:~# service contrail-collector status
contrail-collector RUNNING pid 1132, uptime 17 days, 5:45:34
---code_end---
Interestingly, I restarted contrail-collector on 07/28/2018, but the reported uptime does not match that date.
I am going to try killing the process, restarting it, and then provide the output.
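For reference, a minimal sketch of the sort of ZooKeeper and Cassandra health checks I mean, assuming the default ZooKeeper client port 2181 and that Cassandra's nodetool is on the PATH:
---code_begin---
# ZooKeeper liveness checks (four-letter commands on the client port, default 2181)
echo ruok | nc localhost 2181    # a healthy server answers "imok"
echo stat | nc localhost 2181    # prints mode (leader/follower/standalone) and client connections

# Cassandra ring state (requires nodetool from the Cassandra install)
nodetool status
---code_end---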

So after a long wait I finally found the answer to this question. Apparently Contrail ships a script, db_manage.py, that cleans up entries specifically related to the database. You should be able to execute it with:
---code_begin---
python /usr/lib/python2.7/dist-packages/vnc_cfg_api_server/db_manage.py
---code_end---
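The script usually takes an operation argument, and the exact operation names vary between Contrail releases, so it is worth listing them first. A sketch, assuming the same packaging path as above and that a read-only check operation exists on your release:
---code_begin---
# List the operations this build of db_manage.py supports
python /usr/lib/python2.7/dist-packages/vnc_cfg_api_server/db_manage.py --help

# Then run the read-only consistency check before any cleanup operation
# (verify the operation name against the --help output for your release)
python /usr/lib/python2.7/dist-packages/vnc_cfg_api_server/db_manage.py check
---code_end---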

Related

How to deploy a full-text index listener on NebulaGraph Database?

Here are the steps (and problems):
1. Stop NebulaGraph on 172.16.0.17: sudo /usr/local/nebula/scripts/nebula.service stop all, then kill -9 to stop the listener process.
2. Restart the service: sudo /usr/local/nebula/scripts/nebula.service start all
3. Start the listener: ./bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged-listener.conf
4. On the 172.16.0.20 NebulaGraph node, create a new space and use it, then add the listener: ADD LISTENER ELASTICSEARCH 172.16.0.17:9789
5. SHOW LISTENER. Here is the problem: it is offline.
The root cause is that one step is missing: we must sign in to the text service before adding the listener, i.e., before step 4. Otherwise, an error like the following occurs:
---code_begin---
Running on machine: k3s01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1130 19:25:45.874213 20331 MetaClient.cpp:636] Send request to "172.16.0.17":9559, exceed retry limit
E1130 19:25:45.874900 20339 MetaClient.cpp:139] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
---code_end---
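For completeness, a hedged sketch of that missing step, run against the graph service before step 4; the Elasticsearch address (172.16.0.17:9200) and the nebula-console connection parameters are assumptions, so adjust them to your deployment:
---code_begin---
# Register the Elasticsearch endpoint with the graph service before ADD LISTENER.
# 172.16.0.17:9200 is an assumed Elasticsearch address; replace it with your own.
nebula-console -addr 172.16.0.20 -port 9669 -u root -p nebula \
  -e 'SIGN IN TEXT SERVICE (172.16.0.17:9200, HTTP); SHOW TEXT SEARCH CLIENTS;'
---code_end---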

Airflow tasks ending up in Retry state without logs

Hi, I'm currently running Airflow on a Dataproc cluster. My DAGs used to run fine, but I'm now facing an issue where tasks end up in the 'retry' state without any logs when I click on task instance -> logs in the Airflow UI.
I see the following error in the terminal where I started the Airflow webserver:
---code_begin---
2022-06-24 07:30:36.544 [ERROR] Executor reports task instance
<TaskInstance: **task name** 2022-06-23 07:00:00+00:00 [queued]> finished (failed)
although the task says its queued. Was the task killed externally?
None
[2022-06-23 06:08:33,202] {models.py:1758} INFO - Marking task as UP_FOR_RETRY
2022-06-23 06:08:33.202 [INFO] Marking task as UP_FOR_RETRY
---code_end---
What I have tried so far:
- restarted the webserver
- started the server from 3 different ports
- re-ran the backfill command with 3 different timestamps
- deleted the DAG runs for my DAG, created a new DAG run, and then re-ran the backfill command
- cleared the PID as mentioned in "How do I restart airflow webserver?" and restarted the webserver
None of these worked. This issue has persisted for the past two days; I'd appreciate any help here. At this point I'm guessing this has to do with a shared DB, but I'm not sure how to fix it.
<<update>> What I also found is that these tasks eventually go to a success or failure state. When that happens the logs are available, but there are still no logs for the retry attempts in $airflow_home or our remote directory.
The issue was that there was another Celery worker listening on the same queue. Since this second worker was not configured properly, it was failing the task and not writing the logs to the remote location.
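A hedged sketch of how such a stray worker can be spotted, assuming a CeleryExecutor setup where workers run on the cluster nodes; the Celery app module path below is an assumption and varies by Airflow version:
---code_begin---
# Look for unexpected Celery worker processes on each node
ps aux | grep '[c]elery worker'

# Ask the broker which workers are consuming which queues
# (the app module path differs between Airflow versions; adjust as needed)
celery --app airflow.executors.celery_executor.app inspect active_queues
---code_end---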

Artifactory service fails to start upon Fedora 35 reboot

I have installed jfrog-artifactory-oss (v7.31.11-73111900.x86_64) on Fedora 35 and enabled it as a system service to start at boot. But whenever I boot up my OS, the server never starts properly. I always need to kill the PID of the running Artifactory process. If I then do sudo service artifactory restart, it brings up the server cleanly and everything is good. How can I avoid having to do this little dance? Is there something about OS boot-up that is causing Artifactory to get thrown off?
When the server is not running properly after boot-up, console.log contains lines like:
---code_begin---
2022-01-27T08:35:38.383Z [shell] [INFO] [] [artifactoryManage.sh:69] [main] - Artifactory Tomcat already started
2022-01-27T08:35:43.084Z [jfac] [WARN] [d84d2d549b318495] [o.j.c.ExecutionUtils:165] [pool-9-thread-2] - Retry 900 Elapsed 7.56 minutes failed: Registration with router on URL http://localhost:8046 failed with error: UNAVAILABLE: io exception. Trying again
---code_end---
That shows that the server is not running properly, but doesn't give a clear idea of what to try next. Any suggestions?
Two things to check:
1. How the artifactory.service file in the systemd directory is set up.
2. Whenever the OS is rebooted, what error appears in the logs; check all the logs.
Hint: from the warning shared, it seems the Router service is not able to start when the OS is rebooted, so whenever the OS is rebooted and the issue comes up, check router-service.log for any errors/warnings; a sketch of both checks follows below.
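A minimal sketch of those checks, assuming a default v7 installation layout under /opt/jfrog (adjust the paths if your JFROG_HOME points elsewhere):
---code_begin---
# 1. Inspect the unit systemd actually runs at boot (dependencies, ExecStart, timeouts)
systemctl cat artifactory.service
systemctl status artifactory.service

# 2. After a reboot, collect the boot-time journal and the router log
journalctl -u artifactory -b --no-pager
tail -n 200 /opt/jfrog/artifactory/var/log/router-service.log   # assumed default log path
---code_end---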

airflow - all tasks being queued and not moved to execution

Airflow 1.8.1. The scheduler, worker, and webserver are running in separate Docker containers on AWS.
The system was operational, but now for some reason all tasks are staying in the queued state...
There are no errors in the scheduler logs.
In the worker I see this error (not sure if it's related, since the scheduler should move tasks out of the queued state):
---code_begin---
[2018-01-23 20:46:00,428] {base_task_runner.py:95} INFO - Subtask: [2018-01-23 20:46:00,428] {models.py:1122} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'success' state which is not a valid state for execution. The task must be cleared in order to be run.
---code_end---
I tried reboots, airflow clear, and then the resetdb command, but that did not help.
Any idea what else can be done to fix that problem?
Thanks
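For reference, the error above says the task instance must be cleared before it can run again; a minimal sketch of a targeted clear with the 1.8 CLI, where the DAG id, task id, and date range are placeholders:
---code_begin---
# Clear only the affected task instances so the scheduler can re-run them
# (my_dag, my_task, and the dates are placeholders)
airflow clear my_dag -t my_task -s 2018-01-22 -e 2018-01-23
---code_end---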

Unicorn is killed automatically

I'm using Unicorn in a staging environment (Ubuntu). When a build process is started, Unicorn is killed automatically with the following logs:
---code_begin---
I, [2014-09-23T06:59:58.912673 #16717] INFO -- : reaped #<Process::Status: pid 16720 exit 0> worker=0
I, [2014-09-23T06:59:58.913144 #16717] INFO -- : reaped #<Process::Status: pid 16722 exit 0> worker=1
I, [2014-09-23T06:59:58.913464 #16717] INFO -- : master complete
---code_end---
I'm unable to work out why this is happening.
It seems your Unicorn server was gracefully shut down by a SIGQUIT sent to the master process. In that case, the master process reaps all of its worker processes after they have finished their current request and then shuts itself down. Unicorn supports a couple more signals to trigger certain behaviour (e.g. adding or removing workers, reloading itself, ...). You can learn more about that in Unicorn's SIGNALS documentation.
The SIGQUIT is probably caused by your deployment process, which likely tries to reload/restart your Unicorn but does something strange. Generally, you should check your Unicorn init script or your deployment process to see which signals are sent (e.g. by using the kill command).
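For reference, a minimal sketch of how those signals are sent from a shell, assuming the master PID is written to a pidfile (the path below is a placeholder; use whatever the pid directive in your unicorn config or init script points at), and that you would send only the one signal you need:
---code_begin---
# Placeholder pidfile path; use the location from your unicorn config / init script
PIDFILE=/path/to/unicorn.pid

kill -QUIT "$(cat "$PIDFILE")"   # graceful stop: workers finish the current request, then the master exits
kill -HUP  "$(cat "$PIDFILE")"   # reload the config file and gracefully restart the workers
kill -USR2 "$(cat "$PIDFILE")"   # re-execute the running binary (zero-downtime upgrade)
kill -TTIN "$(cat "$PIDFILE")"   # increment the number of workers
kill -TTOU "$(cat "$PIDFILE")"   # decrement the number of workers
---code_end---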
