How to identify common errors at the DAG level - Airflow

I am trying to set up a DAG (calling it Indentify_Common_error_dag) that can read error logs from an S3 bucket and identify the common errors. Let's say I run the DAG once a week: it should read the error logs from S3 and give me a report that says how many times each error occurred.
Example: if my other DAGs fail every day due to a SQL compilation error, then when I run my DAG (Indentify_Common_error_dag) once a week and it reads the logs from S3, it should tell me that my other DAGs failed due to a SQL compilation error 4 times that week.
I have 200+ DAGs, and they fail due to many errors: SQL compilation errors, duplicate rows, failed Bash commands, and other issues. The Indentify_Common_error_dag DAG would help me identify which common errors occurred in the last week or month, which would help me modify my 200+ DAGs to handle those errors on their own.
I don't know if this really makes sense, and I also don't know whether a DAG can be set up like this. I am new to Airflow, so if there is a way of identifying this, please let me know.
Any suggestions or feedback would be great. I really appreciate your ideas.
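Roughly, I imagine something like the sketch below, assuming the task logs land in S3 under a known prefix and that a few regular expressions are enough to classify the errors. The bucket name, prefix, connection id, and error patterns are placeholders, not my real setup:

```python
# Rough sketch only: bucket, prefix, connection id, schedule and error
# patterns are placeholders and would need to match your remote-logging setup.
import re
from collections import Counter
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

ERROR_PATTERNS = {
    "SQL compilation error": re.compile(r"SQL compilation error", re.IGNORECASE),
    "Duplicate row": re.compile(r"duplicate row", re.IGNORECASE),
    "Bash command failed": re.compile(r"bash command failed", re.IGNORECASE),
}


@dag(schedule_interval="@weekly", start_date=datetime(2021, 1, 1), catchup=False)
def indentify_common_error_dag():

    @task
    def scan_logs() -> dict:
        hook = S3Hook(aws_conn_id="aws_default")              # assumed connection id
        keys = hook.list_keys(bucket_name="my-log-bucket",    # placeholder bucket
                              prefix="logs/") or []
        counts = Counter()
        for key in keys:
            content = hook.read_key(key, bucket_name="my-log-bucket")
            for name, pattern in ERROR_PATTERNS.items():
                if pattern.search(content):
                    counts[name] += 1
        return dict(counts)

    @task
    def report(counts: dict):
        # Replace the print with an email/Slack notification in a real setup.
        for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
            print(f"{name}: {n} occurrences in the last week")

    report(scan_logs())


indentify_common_error_dag()
```

Filtering the keys by a date prefix before reading them would keep the weekly run fast if the bucket holds a lot of logs.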

Related

Airflow upstream task in "none status" status, but downstream tasks executed

As shown in the screenshot:
Any idea why this would occur, and how would I go about troubleshooting it?
--
Update:
It's "none status" not "queued" as I originally interpreted
The DAG run occurred on 3/8 and last relevant commit was on 3/1. But I'm having trouble finding the same DAG run....will keep investigating
It's not Queued status. It's None status.
This can happen in one of the following cases:
The task drop_staging_table_if_exists was added after create_staging_table started to run.
The task drop_staging_table_if_exists used to have a different task_id in the past.
The task drop_staging_table_if_exists was somewhere else in the workflow and you changed the dependencies after the DAG run started.
Note that Airflow currently doesn't support DAG versioning (it will be supported in future versions once AIP-36 DAG Versioning is completed). This means that Airflow constantly reloads the DAG structure, so changes that you make are also reflected on past runs. This is by design, and it's very useful for cases where you want to backfill past runs.
Either way, if you start a new run or clear this specific run, the issue you are facing will be resolved.

XCom variable push is failing

In an Airflow DAG, I am trying to write data to return.json and push it via XCom. This task fails randomly with the error below:
**{taskinstance.py:1455} ERROR - Unterminated string starting at: line 1**
Once I retry the task, it executes successfully. I am not able to understand why some tasks fail randomly in the flow; sometimes none of the tasks fail during an execution. Could someone please help me with this?
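For what it's worth, here is a minimal sketch of the write, assuming the /airflow/xcom/return.json sidecar pattern (e.g. KubernetesPodOperator with do_xcom_push=True) is what is in use, which is an assumption on my part. Writing to a temporary file and renaming it into place means the JSON can never be read half-written, which is one possible explanation for an intermittent "Unterminated string" parse error:

```python
# Hypothetical sketch: assumes the container writes /airflow/xcom/return.json
# for the xcom sidecar to pick up. os.replace() makes the final write atomic.
import json
import os

XCOM_PATH = "/airflow/xcom/return.json"


def write_xcom(payload: dict) -> None:
    tmp_path = XCOM_PATH + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())         # make sure the bytes reach disk
    os.replace(tmp_path, XCOM_PATH)  # atomic rename into place


write_xcom({"rows_processed": 1234})
```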

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problems for 2 months until Sunday. There was a blackout in a system on which my Airflow tasks depend, and some tasks were queued for 2 days. After that, we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get triggered at the proper time, but they are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the third point is the reason why it is not working.
The scheduler and the webserver were working; nevertheless, I restarted the scheduler and I am still getting the same outcome. I also deleted the data in the MySQL database for one job, and it is still not running.
I also saw a couple of posts saying that tasks do not run when depends_on_past is set to true and the previous run failed, so the next one is never executed. I checked that as well, and it is not my case.
Any input would be really appreciated.
Any ideas? Thanks
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). Looking at the Airflow code, the scheduler queries for DAG runs to examine (i.e. to consider running task instances for), and this query is limited to that number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were up-for-retry), the scheduler won't consider other DAG runs even though those could run fine.
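For reference, a quick way to check the value your deployment is actually using; it can be raised via airflow.cfg or the corresponding environment variable, and the snippet below only reads it:

```python
# Prints the scheduler's current limit on DAG runs examined per scheduling loop.
# It can be changed via [scheduler] max_dagruns_per_loop_to_schedule in airflow.cfg
# or the AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE environment variable.
from airflow.configuration import conf

print(conf.getint("scheduler", "max_dagruns_per_loop_to_schedule"))
```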

Triggering an Airflow DAG from the terminal always keeps the running state

I am trying to use airflow trigger_dag dag_id to trigger my DAG, but it just shows the running state and doesn't do anything more.
I have searched through many questions, but they all just say the DAG is paused. The problem is that my DAG is unpaused, yet it still keeps the running state.
Note: I can use one DAG to trigger another in the Web UI, but it doesn't work from the command line.
Please see the snapshot below.
I had the same issue many times. The state of the task is not running, and it is not queued either; it's stuck after we 'clear'. Sometimes I found the task goes into the Shutdown state before getting stuck, and after a long time the instance fails while the task status stays white. I have solved it in several ways; I can't say the exact reason or solution, but try one of these:
Run the trigger dag command again with the same execution date and time, instead of using the clear option.
Try a backfill; it will run only the unsuccessful instances.
Or try a different time within the same interval; it will create another instance which is fresh and doesn't have the issue.

Is there any way to pass the error text of a failed Airflow task into another task?

I have a DAG that contains a number of tasks, the last of which runs only if any of the previous tasks fail. This task simply posts to a Slack channel that the DAG run experienced errors.
What I would really like is for the message sent to the Slack channel to contain the actual error logged in the task logs, to provide immediate context and perhaps save Ops from having to dig through the logs.
Is this at all possible?
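One hedged approach, in case it helps: an on_failure_callback receives the task context, which includes the exception that failed the task, so the error text can be posted from the failing task itself rather than being passed to a downstream task. This swaps the downstream-task pattern for a callback; the webhook URL below is a placeholder, and a Slack connection or the Slack provider operators could be used instead:

```python
# Sketch of an on_failure_callback that posts the caught exception to Slack.
# SLACK_WEBHOOK_URL is a placeholder; in practice it would come from an
# Airflow Connection or Variable rather than being hard-coded.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def notify_slack_on_failure(context):
    ti = context["task_instance"]
    error = context.get("exception")   # the exception that failed the task
    message = (
        f":red_circle: {ti.dag_id}.{ti.task_id} failed "
        f"(run {context['run_id']}): {error}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


# Attach it via default_args so every task in the DAG reports its own error:
default_args = {"on_failure_callback": notify_slack_on_failure}
```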
