Fail ansible play fast on concurrent async task failure - asynchronous

I've got an ansible play (split across several included playbooks) which looks a bit like this:
- name: Build packer images
  <snip>
  register: packer_run
  async: 2700 # 45 minutes
  poll: 0
- name: Do other stuff
  <snip>
...
- name: Check Packer build finished
  async_status:
    jid: "{{ packer_run.ansible_job_id }}"
  register: packer_result
  until: packer_result.finished
  retries: 30 # allow 15 mins
  delay: 30
Note that the async task sets poll=0 so that subsequent tasks run concurrently. However, even if the async task fails immediately, Ansible only reports the failure after it has finished the subsequent tasks and reached the async status check. This is really annoying because it can take ~30 minutes to find out that the async task failed 30 seconds into the run.
Is there a way to preserve the concurrent behaviour but have ansible fail as soon as the async task fails?
I guess the obvious answer is "don't do the concurrency with Ansible", but as this is running in a GitHub Actions workflow (which doesn't allow concurrent steps), this approach provides a reasonable workaround apart from the "slow failure" problem.

Even though the 'Build packer images' task fails immediately, its status is only examined when 'Check Packer build finished' is reached.
You can keep a 'Check Packer build finished' style task wherever you need to verify the status, i.e. check it multiple times at different points in the playbook. Or 'Build packer images' could log its status to a file and the other tasks could read that.
In any case, if nothing ever checks whether the async task has failed, that information will never show up in the Ansible console.
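For illustration, here is a minimal sketch of the "check at multiple points" pattern. The packer build command shown is a placeholder (the real command is snipped in the question), and the intermediate "fail fast" task is an assumption about where you might want an early check:

- name: Build packer images
  command: packer build template.pkr.hcl   # placeholder for the real build command
  register: packer_run
  async: 2700 # 45 minutes
  poll: 0

- name: Do other stuff
  debug:
    msg: "other concurrent work runs here"

# async_status without an until loop fails the task (and the play) right away
# if the background job has already failed; if the job is still running it
# just reports finished: 0 and the play carries on.
- name: Fail fast if the Packer build has already died
  async_status:
    jid: "{{ packer_run.ansible_job_id }}"
  register: packer_early

- name: More other stuff
  debug:
    msg: "more concurrent work runs here"

- name: Check Packer build finished
  async_status:
    jid: "{{ packer_run.ansible_job_id }}"
  register: packer_result
  until: packer_result.finished
  retries: 30 # allow 15 mins
  delay: 30

The trade-off is that the play still only fails at the points where a check runs, so place the early checks after the longer intermediate tasks.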

Related

Airflow task improperly has an `upstream_failed` status after previous task succeeded after 1 retry

I have two tasks A and B. Task A failed once, but the retry succeeded and it is marked as a success (green). I would expect Task B to run normally since the Task A retry succeeded, but it is marked as upstream_failed and was not triggered. Is there a way to fix this behavior?
The Task B has an ALL_SUCCESS trigger rule.
I am using Airflow 2.0.2 on AWS (MWAA).
I tried restarting the scheduler.
upstream_failed comes from the scheduler flow, or when dependencies are set to a failed state; you can check the states from the Task Instances view.
In retry mode:
Task A will stay in the up_for_retry state until it exceeds its number of retries.
If trigger_rule is set to all_success (the default trigger rule), Task B will not trigger until Task A has finished, assuming everything is running correctly.
Could you add the DAG implementation?

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problems for 2 months until Sunday. There was a blackout in a system on which my Airflow tasks depend, and some tasks were queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get triggered at the proper time but they are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the third point is the reason why it is not working.
The scheduler and the webserver were working; however, I restarted the scheduler and I am still having the same outcome. I also deleted the data in the MySQL database for one job and it is still not running.
I also saw a couple of posts that said it is not running because depends_on_past was set to true and, if the previous runs failed, the next one will never be executed. I checked that as well and it is not my case.
Any input would be really appreciated.
Any ideas? Thanks
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). Checking the Airflow code, it seems that the scheduler queries for DAG runs to examine (i.e. to consider running task instances for), and this query is limited to that number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were in up_for_retry), the scheduler won't consider other DAG runs even though those could run fine.
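For example, you could raise that limit in airflow.cfg or via the corresponding environment variable; the value 50 below is just an arbitrary illustration, not a recommendation:

# airflow.cfg
[scheduler]
max_dagruns_per_loop_to_schedule = 50

# or, equivalently, as an environment variable
AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE=50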

task must be cleared before being run

I have a task that's scheduled to run hourly, however it's not being triggered. When I look at the Task Instance Details it says:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
If this task instance does not start soon please contact your Airflow administrator for assistance.
If I clear the task in the UI I am able to execute it through terminal but it does not run when scheduled.
Why do I have to manually clear it after every run?

Control-M Prerequisites - Make job dependent on server availability

I want to know if I can add pre-requisite conditions for a job based on server availability. Suppose Job J runs from job server JS, and it interacts with two other servers SERVER1 and SERVER2.
I want to configure job J such that it runs only when SERVER1 and SERVER2 are available. In case any of the two servers is down, the job should wait for servers to come back online.
I don't know if this should be a comment or an answer, but what you are looking for is not natively available within Control-M.
The simplest solution I can think of for you is to configure a sleep job to run on SERVER1 and SERVER2 and have them as predecessors to job J. These sleep jobs will only run when the agents on SERVER1/2 are available, thereby confirming server availability prior to the execution of job J.
Alternatively, you could write a script that loops until SERVER1/2 respond to pings and then completes, and configure this job as a predecessor to job J.
I'm still a newbie in Control-M, but we have implemented a solution with similar goals using a job hook to probe nodes.
Assume your target server (node) is called JS and it interacts with SERVER1 (let's call it node01). Any number of servers/nodes can be added later; let's start with just one node.
Overview of the components:
Jobs are created to monitor changes and to check the status in both the OK and NOT OK cases
A quantitative resource is created for each node, for example node01_run (or stacked, as you wish)
Jobs require the quantitative resource "node01_run" with at least 1 free unit
If everything is OK, the jobs run as expected
If downtime is recognized, the quantitative resource (QR) is changed to 0, so the affected jobs will not run
If the node is up again, the quantitative resource is set back to the original value (10, 100, 1000, ...) and the jobs run again as usual
Job name: node01_check_resource
Job Goal ---> Check if the quantitative resource already exists
Job OS Command ---> ecaqrtab LIST node01_run
Result yes ---> do nothing
Result no ---> Job node01_create_resource, command: ecaqrtab ADD node01_run 100 (or as many as you wish)
Job name: node01_check (cyclic)
Job Goal ---> Check if the node is up
Job OS Command ---> Whatever you define as proof that your node is up: check a web service, check uptime, a WMI result, a ping, ...
Result up ---> rerun the job in x minutes
Result down ---> go on to the next job
Job name: node01_up-down
Job Goal ---> Handle the switch from status up to status down
Job OS Command ---> ecaqrtab UPDATE node01_run 0
On-Do action ---> when the job has ended, remove the condition so that node01_check cannot start again (as it is defined as a cyclic job)
Job name: node01_check_down (cyclic)
Job Goal ---> Check the status while the known status is down
Job OS Command ---> As defined in node01_check
Result down ---> Do nothing, as the job is defined as cyclic
Result up ---> Remove some conditions
Job name: node01_down-up
Job Goal ---> Switch everything back to normal operation
Job OS Command ---> ecaqrtab UPDATE node01_run 100
Action ---> Add the condition so that node01_check can run again
You can define such job hooks for multiple nodes, and in each job you can define which nodes have to be up and running (meaning their quantitative resource is higher than 0). Multiple hosts can be monitored while still setting the same resource, as you wish.
I hope this helps further, unless you have found a suitable solution already.

Task with no status leads to DAG failure

I have a DAG that fetches data from Elasticsearch and ingests it into the data lake. The first task, BeginIngestion, fans out into several tasks (one for each resource), and those tasks fan out into more tasks (one for each shard). After the shards are fetched, the data is uploaded to S3 and the flow then converges into a task EndIngestion, followed by a task AuditIngestion.
It had been executing correctly, but now all tasks complete successfully while the "closing task" EndIngestion remains with no status. When I refresh the webserver's page, the DAG is marked as Failed.
This image shows successful upstream tasks, with the task end_ingestion with no status and the DAG marked as Failed.
I also dug into the task instance details and found
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Trigger Rule: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 49, 'successes': 49}, upstream_task_ids=['s3_finish_upload_ingestion_raichucrud_complain', 's3_finish_upload_ingestion_raichucrud_interaction', 's3_finish_upload_ingestion_raichucrud_company', 's3_finish_upload_ingestion_raichucrud_user', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_location', 's3_finish_upload_ingestion_raichucrud_companytoken', 's3_finish_upload_ingestion_raichucrud_indexevolution', 's3_finish_upload_ingestion_raichucrud_companyindex', 's3_finish_upload_ingestion_raichucrud_producttype', 's3_finish_upload_ingestion_raichucrud_categorycomplainsto', 's3_finish_upload_ingestion_raichucrud_companyresponsible', 's3_finish_upload_ingestion_raichucrud_category', 's3_finish_upload_ingestion_raichucrud_additionalfieldoption', 's3_finish_upload_ingestion_raichucrud_privatecontactconfiguration', 's3_finish_upload_ingestion_raichucrud_phone', 's3_finish_upload_ingestion_raichucrud_presence', 's3_finish_upload_ingestion_raichucrud_responsible', 's3_finish_upload_ingestion_raichucrud_store', 's3_finish_upload_ingestion_raichucrud_socialprofile', 's3_finish_upload_ingestion_raichucrud_product', 's3_finish_upload_ingestion_raichucrud_macrorankingpresenceto', 's3_finish_upload_ingestion_raichucrud_macroinfoto', 's3_finish_upload_ingestion_raichucrud_raphoneproblem', 's3_finish_upload_ingestion_raichucrud_macrocomplainsto', 's3_finish_upload_ingestion_raichucrud_testimony', 's3_finish_upload_ingestion_raichucrud_additionalfield', 's3_finish_upload_ingestion_raichucrud_companypageblockitem', 's3_finish_upload_ingestion_raichucrud_rachatconfiguration', 's3_finish_upload_ingestion_raichucrud_macrorankingitemto', 's3_finish_upload_ingestion_raichucrud_purchaseproduct', 's3_finish_upload_ingestion_raichucrud_rachatproblem', 's3_finish_upload_ingestion_raichucrud_role', 's3_finish_upload_ingestion_raichucrud_requestmoderation', 's3_finish_upload_ingestion_raichucrud_categoryproblemto', 's3_finish_upload_ingestion_raichucrud_companypageblock', 's3_finish_upload_ingestion_raichucrud_problemtype', 's3_finish_upload_ingestion_raichucrud_key', 's3_finish_upload_ingestion_raichucrud_macro', 's3_finish_upload_ingestion_raichucrud_url', 's3_finish_upload_ingestion_raichucrud_document', 's3_finish_upload_ingestion_raichucrud_transactionkey', 's3_finish_upload_ingestion_raichucrud_catprobitemcompany', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_categoryinfoto', 's3_finish_upload_ingestion_raichucrud_marketplace', 's3_finish_upload_ingestion_raichucrud_macroproblemto', 's3_finish_upload_ingestion_raichucrud_categoryrankingto', 's3_finish_upload_ingestion_raichucrud_macrorankingto', 's3_finish_upload_ingestion_raichucrud_categorypageto']
As you can see, the "Trigger Rule" field says that one of the tasks is in a non-successful state, but at the same time the stats show that all upstream tasks are marked as successful.
If I reset the database, it doesn't happen, but I can't reset it for every execution (hourly). I also don't want to reset it.
Can anyone shed some light on this?
PS: I am running in an EC2 instance (c4.xlarge) with LocalExecutor.
[EDIT]
I found in the scheduler log that the DAG is in deadlock:
[2017-08-25 19:25:25,821] {models.py:4076} DagFileProcessor157 INFO - Deadlock; marking run failed
I guess this may be due to some exception handling.
I have had this exact issue before; in my case, my code was generating duplicate task IDs. It looks like in your case there is also a duplicate ID:
s3_finish_upload_ingestion_raichucrud_privatecontactinteraction
This is probably a year late for you, but hopefully this will save others lots of debugging time :)
