nova-compute service State changes every second (UP<->DOWN) - openstack

The state of my nova-compute service flips between UP and DOWN every second. As a result, instance creation sometimes fails and sometimes succeeds.
Please let me know if you need any additional information.
Thank you.
Compute Service State 1
Compute Service State 2

This can happen when the clocks are out of sync between nodes, especially if the nodes running nova-api have even minor drift.
Can you run date on all nodes at the same time?
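To illustrate why a small drift can make the state flap, here is a minimal Python sketch of a heartbeat-age check. It is a toy model and not nova's actual code: the function names, the 60-second threshold, and the 10-second report interval are assumptions chosen for illustration (compare them with service_down_time and report_interval in your own nova.conf).

```python
from datetime import datetime, timedelta

SERVICE_DOWN_TIME = 60  # seconds; assumed threshold for this sketch


def is_service_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Treat the service as up if its last heartbeat is recent enough."""
    elapsed = abs((now - last_heartbeat).total_seconds())
    return elapsed <= down_time


def observed_states(skew_seconds, report_interval=10, checks=6, step=5):
    """Simulate a node whose clock lags the checker by skew_seconds.

    The node stamps a heartbeat with its own (lagging) clock every
    report_interval seconds; the checker compares that stamp with its
    own clock every step seconds.
    """
    start = datetime(2024, 1, 1, 12, 0, 0)
    states = []
    for i in range(checks):
        now = start + timedelta(seconds=i * step)
        # Time of the most recent heartbeat, in the checker's time.
        last_true = start + timedelta(
            seconds=(i * step // report_interval) * report_interval
        )
        # The same heartbeat as stamped by the lagging node.
        last_stamped = last_true - timedelta(seconds=skew_seconds)
        states.append(is_service_up(last_stamped, now))
    return states
```

With no skew every check reports the service as up. With a skew just below the threshold (57 seconds here), the apparent heartbeat age moves back and forth across the threshold, so the reported state alternates between up and down.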

RMAN does not delete archive logs not applied on GG Downstream

There is an issue in a topology of 3 hosts.
The primary has a scheduled OS task (every hour) that uses RMAN to delete archive logs older than 3 hours. The archivelog deletion policy is configured to "Applied on all standby".
There are 2 remote log_archive_dest entries: Physical Standby and Downstream. Every day, "RMAN-08120: WARNING: archived log not deleted, not yet applied by standby" appears in the task's logs, and then it resolves in 2-3 hours.
I checked V$ARCHIVED_LOG during the issue and found that the redo logs are not applied on the Downstream server. I have not yet caught the issue on the Downstream server itself; during the "good" periods all the apply processes are enabled, but the dba_apply_progress view reports an apply_time of 01-JAN-1970 for the messages.
The dba_capture view shows that the capture processes' status is ENABLED, and status_change_time is approximately the time the RMAN issue resolved.
I'm new to GoldenGate, Streams, and Downstream technologies. I've read the Oracle reference docs but couldn't find anything about a schedule for capture or apply processes.
Could someone please help me figure out what else to check or what to read?
Grateful for every response.
Thanks.

What happens when two nodes attempt to write at the same time in 2PC?

Does anyone know what happens when two nodes try to write data at the same time and both initiate the 2PC protocol? Does the request from one node get aborted while the other succeeds? Would the failed node then retry with some exponential backoff?
If not, what happens?
Usually resource managers do not allow both nodes to participate in the same transaction, in the same transaction branch, at the same time. The second node/binary/thread that tries to join the transaction will probably get a timeout or some other error on the xa_start(..., TMJOIN) call.
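As a rough illustration of one writer losing, here is a toy Python sketch of two transactions going through the prepare phase against the same participants. It does not model XA or xa_start/TMJOIN; it swaps in a simple per-key write lock, and every name in it (Participant, prepare_all, demo) is hypothetical.

```python
class Participant:
    """Toy resource manager that holds one write lock per key."""

    def __init__(self):
        self.locks = {}   # key -> id of the transaction holding the lock
        self.data = {}    # committed values
        self.staged = {}  # (txid, key) -> value waiting for commit

    def prepare(self, txid, key, value):
        """Vote yes only if no other transaction holds the key."""
        holder = self.locks.get(key)
        if holder is not None and holder != txid:
            return False
        self.locks[key] = txid
        self.staged[(txid, key)] = value
        return True

    def commit(self, txid):
        """Apply staged writes of txid and release its locks."""
        for entry in [e for e in self.staged if e[0] == txid]:
            self.data[entry[1]] = self.staged.pop(entry)
            del self.locks[entry[1]]

    def abort(self, txid):
        """Discard staged writes of txid and release its locks."""
        for entry in [e for e in self.staged if e[0] == txid]:
            self.staged.pop(entry)
            if self.locks.get(entry[1]) == txid:
                del self.locks[entry[1]]


def prepare_all(txid, participants, key, value):
    """Phase 1: every participant must vote yes."""
    return all(p.prepare(txid, key, value) for p in participants)


def demo():
    """Transaction A prepares first; B collides, aborts, then retries."""
    p1, p2 = Participant(), Participant()
    a_ok = prepare_all("A", [p1, p2], "x", 1)
    b_ok = prepare_all("B", [p1, p2], "x", 2)  # p1 already locked by A
    for p in (p1, p2):
        p.abort("B")
        p.commit("A")
    after_a = dict(p1.data)
    b_retry_ok = prepare_all("B", [p1, p2], "x", 2)
    for p in (p1, p2):
        p.commit("B")
    return a_ok, b_ok, after_a, b_retry_ok, dict(p1.data)
```

In this sketch the second transaction gets a "no" vote in the prepare phase and has to abort; whether it then retries (with backoff or otherwise) is up to the caller, not the protocol.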

JobRunr - Trying to run multiple recurring jobs with Spring Boot

I am using JobRunr to run my background jobs, and I let users set up recurring jobs through an endpoint like the one below:
@PostMapping("/schedule-recurring")
public String scheduleRecurring(@RequestBody ExecutionJob executionJob) {
    // Schedule a recurring job under the given id and cron expression.
    return BackgroundJob.scheduleRecurrently(executionJob.getId(), executionJob.getCronExpression(),
            () -> jobService.executeSomeJob(executionJob, JobContext.Null));
}
These jobs could run for 5 minutes or 10 minutes, or sometimes take up to 4 hours, depending on how many records there are to process. Right now I have only one background job server, since I am building a POC. In the future we plan to scale to one instance per customer, with one or more background job servers depending on the client license.
My issue is that I have 2 recurring jobs scheduled 1 hour apart. The first recurring job runs for more than 1 hour, so the second job is never executed: it is not triggered because no background job worker is available to pick it up. I am thinking of adding a check to the job so that it triggers itself only if a background job worker is available. But is there a better approach, in which the schedule-recurring method itself queues the job when no background job worker is available?
Thanks in advance.

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problem for 2 months until Sunday. There was a blackout in a system on which my Airflow tasks depend, and some tasks were queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, all the new tasks now get triggered at the proper time, but they are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the third point is the reason why it is not working.
The scheduler and the webserver were working; I restarted the scheduler anyway and still get the same outcome. I also deleted the data in the MySQL database for one job, and it is still not running.
I also saw a couple of posts saying that a task does not run when depends_on_past is set to true and the previous runs failed, because the next one will then never be executed. I checked, and that is not my case.
Any ideas or input would be really appreciated. Thanks.
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). From reading the Airflow code, it seems the scheduler queries for DAG runs to examine (that is, to consider running task instances for), and this query is limited to that number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were in up_for_retry), the scheduler won't consider other DAG runs even though these could run fine.
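The effect described above can be sketched with a toy model in Python. This is not Airflow's scheduler code: the function names, the dictionary layout, and especially the assumption that blocked DAG runs always sort ahead of runnable ones are hypothetical simplifications used only to show how a row limit can starve other runs.

```python
def runs_examined(dag_runs, limit=20):
    """Return the DAG runs one scheduler loop looks at in this toy model."""
    return dag_runs[:limit]


def schedulable(dag_runs, limit=20):
    """Ids of examined runs that are not blocked."""
    return [r["id"] for r in runs_examined(dag_runs, limit) if not r["blocked"]]


def build_backlog(blocked_count):
    """Blocked runs first (assumed ordering), then one healthy run."""
    backlog = [{"id": f"blocked_{i}", "blocked": True} for i in range(blocked_count)]
    backlog.append({"id": "run_ok", "blocked": False})
    return backlog
```

With 25 blocked runs and the limit of 20, schedulable(build_backlog(25)) is empty, because the healthy run never makes it into the examined window. Raising the limit above the backlog size (for example limit=50) or clearing the blocked runs lets it through.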

Control-M Prerequisites - Make job dependent on server availability

I want to know if I can add pre-requisite conditions for a job based on server availability. Suppose Job J runs from job server JS, and it interacts with two other servers SERVER1 and SERVER2.
I want to configure job J so that it runs only when both SERVER1 and SERVER2 are available. If either server is down, the job should wait for it to come back online.
I don't know if this should be a comment or an answer, but what you are looking for is not natively available within Control-M.
The simplest solution I can think of is to configure a sleep job to run on each of SERVER1 and SERVER2 and make them predecessors of job J. These sleep jobs will only run when the agents on SERVER1/2 are available, which confirms server availability before job J executes.
Alternatively, you could write a script that loops until SERVER1/2 respond to pings and then completes, and configure this job as a predecessor of job J.
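A minimal Python sketch of such a waiting script is shown below. Note the swap: instead of ICMP ping (which often needs elevated privileges) it probes with a TCP connect, and the port (22), the intervals, and the function names are all assumptions to adapt to your environment.

```python
import socket
import time


def is_reachable(host, port=22, timeout=3.0):
    """Probe a host with a TCP connect (assumed stand-in for ping)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_servers(hosts, probe=is_reachable, interval=30.0,
                     max_wait=None, sleep=time.sleep):
    """Block until every host answers the probe.

    Returns True once all hosts responded, or False if max_wait
    seconds of sleeping passed first (None means wait forever).
    """
    waited = 0.0
    pending = list(hosts)
    while True:
        # Keep only the hosts that still do not respond.
        pending = [h for h in pending if not probe(h)]
        if not pending:
            return True
        if max_wait is not None and waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

A wrapper script could call wait_for_servers(["SERVER1", "SERVER2"]) and exit with code 0 on True and a non-zero code on False, so that the predecessor job only ends OK once both servers answer.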
I'm still a newbie with Control-M, but we have implemented a solution with similar goals, using a job hook to check nodes.
Assume your target server (node) is called JS and interacts with SERVER1 (let's call it node01). Any number of servers/nodes can be added later; let's look at it with just one node.
Overview components:
Jobs are created to monitor changes and check the status in both the OK and NOT OK states
A quantitative resource is created for each node, for example node01_run (or stacked, as you wish)
Jobs include the quantitative resource "node01_run" and require at least 1 free resource
If everything is OK, jobs should run as expected
If downtime is detected, the quantitative resource (QR) is changed to 0, so affected jobs should not run
If the node is up again, the quantitative resource is set back to the original value (10, 100, 1000, ...) and the jobs should run again as usual
Job name: node01_check_resource
Job Goal ---> Check if the quantitative resource already exists
Job OS Command ---> ecaqrtab LIST node01_run
Result yes ---> do nothing,
Result no ---> Job node01_create_resource, command: ecaqrtab ADD node01_run 100 (or as many as you wish)
Job name: node01_check (cyclic)
Job Goal ---> Check if node up
Job OS Command ---> However you define that your node is up: check webservice, check uptime, wmi result, ping, ...
Result up ---> rerun job in x minutes
Result no ---> go for next job
Job name: node01_up-down
Job Goal ---> Case for switching between status up and status down
Job OS Command ---> ecaqrtab UPDATE node01_run 0
On-Do action: ---> when the job has ended, remove the condition so that node01_check cannot start (as it is defined as a cyclic job)
Job name: node01_check_down (cyclic)
Job Goal ---> Check status while known status is down
Job OS Command ---> As defined in node01_check
Result down ---> Do nothing as job is defined as cyclic
Result up ---> Remove some conditions
Job name: node01_down-up
Job Goal ---> Switching all back to normal operation
Job OS Command ---> ecaqrtab UPDATE node01_run 100
Action ---> Add condition that node01_check can run again
You can define such job hooks for multiple nodes, and in each job you can define which nodes must be up and running (that is, where the quantitative resource is higher than 0). You can also monitor multiple hosts and have them set the same resource, as you wish.
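The switching logic of the node01_up-down and node01_down-up jobs can be summarized in a short Python sketch. It only builds the ecaqrtab UPDATE command strings quoted above and does not run anything; the function names and the default capacity of 100 are assumptions for illustration.

```python
def qr_command(known_up, node_up, resource="node01_run", capacity=100):
    """Return the ecaqrtab command for a state change, or None if unchanged."""
    if known_up and not node_up:
        # Node just went down: block dependent jobs.
        return f"ecaqrtab UPDATE {resource} 0"
    if not known_up and node_up:
        # Node is back: restore the original resource count.
        return f"ecaqrtab UPDATE {resource} {capacity}"
    return None


def replay(observations, resource="node01_run", capacity=100):
    """Feed a series of up/down check results and collect the commands."""
    known_up = True  # assume the node starts in the up state
    commands = []
    for node_up in observations:
        command = qr_command(known_up, node_up, resource, capacity)
        if command is not None:
            commands.append(command)
        known_up = node_up
    return commands
```

For a check history of up, down, down, up, replay([True, False, False, True]) yields one update to 0 followed by one update back to 100, which matches the idea that the cyclic check jobs do nothing while the state stays the same.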
I hope this helps further, unless you have found a suitable solution already.
