Airflow - backend database to get reports

I am trying to get some useful information from the Airflow backend database. I need the following details:
How many times a particular job failed.
Which task has failed over and over.
The problem is that all our tasks depend on their upstream tasks, so when one fails, we fix the issue and mark it as success. This changes the status in the database as well. Is there a place where I can get historical records?
The following query shows which task failed. However, if I mark it as success from the UI, the status is updated in the database as well, and I have no way to show that it had failed.
select * from dag_run where dag_id='test_spark'

Currently, there is no native way, but you can check the log table: it adds a record whenever there is an action via the CLI or UI.
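As a rough sketch of how the log table could answer both questions, the Python snippet below counts failure events per task. It assumes the log table has dag_id, task_id and event columns and that a failure is recorded with event = 'failed'; check both against your own metadata database. The helper names count_failures and demo_connection are hypothetical, and the in-memory SQLite table only stands in for the real backend so the query can be tried safely.

```python
import sqlite3

# Assumption: the Airflow `log` table has dag_id, task_id and event columns,
# and a task failure is recorded with event = 'failed'. Verify both against
# your own metadata database before relying on this.
FAILURE_COUNT_SQL = """
    SELECT dag_id, task_id, COUNT(*) AS failures
    FROM log
    WHERE event = 'failed' AND dag_id = ?
    GROUP BY dag_id, task_id
    ORDER BY failures DESC, task_id
"""


def count_failures(conn, dag_id):
    """Return (dag_id, task_id, failures) rows, most-failed task first."""
    return conn.execute(FAILURE_COUNT_SQL, (dag_id,)).fetchall()


def demo_connection():
    """Build an in-memory stand-in for the log table (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (dag_id TEXT, task_id TEXT, event TEXT)")
    conn.executemany(
        "INSERT INTO log VALUES (?, ?, ?)",
        [
            ("test_spark", "load", "failed"),
            ("test_spark", "load", "success"),  # marked as success afterwards
            ("test_spark", "load", "failed"),
            ("test_spark", "extract", "failed"),
            ("other_dag", "load", "failed"),
        ],
    )
    return conn
```

Because the log rows are appended rather than updated, marking a task as success later adds a new row and leaves the earlier failure rows in place. Against PostgreSQL with psycopg2 the placeholder would be %s instead of ?.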

Related

DAG seems to be missing from DagBag error in Airflow 2.4.0

I updated my Airflow setup from 2.3.3 to 2.4.0 and started to get these errors in the UI: DAG <dag name> seems to be missing from DagBag. The scheduler log shows: ERROR - DAG <dag name> not found in serialized_dag table
One of my Airflow instances seemed to work well for the old DAGs, but when I add new DAGs I get the error. On the other Airflow instance, every DAG produced this error, and the only way out of this mess was to delete the db and init it again. The error message appears when I click the DAG from the main view.
Deleting the db is not a solution I want to use in the future. Is there any other way this can be fixed?
Side note:
It's also weird that I use the same Airflow image in both of my instances, yet one instance has the newly added Datasets menu in the top bar and the other doesn't.
My setup:
Two isolated Airflow main instances (dev, prod) with CeleryExecutor, each with 10 worker machines. I'm running the setup on each machine using a docker compose conf and a shared .env file that ensures the setup is the same on the main machine and the worker machines.
Airflow version: 2.4.0 (same error in 2.4.1)
PSQL: 13
Redis:6.2.4
UPDATE:
Still unresolved. The new DAG is shown in the Airflow UI and it can be activated, but running it is not possible. I think there's no other solution than to reset the db.
I have found no official references for this fix, so use it carefully and back up your db first :)
I encountered the same problem after upgrading to Airflow 2.4.1 (from 2.3.4). Pre-existing DAGs still worked properly, but for new DAGs I saw the error you mentioned.
Debugging, I found in the scheduler logs:
{manager.py:419} ERROR - Add View Menu Error: (psycopg2.errors.NotNullViolation) null value in column "id" of relation "ab_view_menu" violates not-null constraint
DETAIL: Failing row contains (null, DAG:my-new-dag-id).
[SQL: INSERT INTO public.ab_view_menu (name) VALUES (%(name)s) RETURNING public.ab_view_menu.id]
[parameters: {'name': 'DAG:my-new-dag-id'}]
This seems to be the cause of the problem: a null value for the id column, which prevents the DAG from being loaded.
I also saw similar errors when running airflow db upgrade.
After checking the ab_view_menu database table, I noticed that a sequence exists for its primary key (ab_view_menu_id_seq), but it was not linked to the column.
So I linked it:
ALTER TABLE ab_view_menu ALTER COLUMN id SET DEFAULT NEXTVAL('public.ab_view_menu_id_seq'::REGCLASS);
ALTER SEQUENCE ab_view_menu_id_seq OWNED BY ab_view_menu.id;
SELECT setval('ab_view_menu_id_seq', (SELECT max(id) FROM ab_view_menu));
The same consideration applies to other tables:
ab_permission
ab_permission_view
ab_permission_view_role
ab_register_user
ab_role
ab_user
ab_user_role
ab_view_menu
With this fix applied to the sequences, the problem seems to be solved.
NOTE: the database used is PostgreSQL.
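To avoid typing the three statements by hand for every table in the list, they can be generated. The Python sketch below does only that: it prints SQL text and does not touch the database. It assumes every listed table has an id primary key and a sequence named <table>_id_seq, which the answer confirms only for ab_view_menu; the function names are hypothetical.

```python
# Tables named in the answer. Assumption: each has an integer `id` primary
# key and a sequence called <table>_id_seq, as ab_view_menu does.
FAB_TABLES = [
    "ab_permission",
    "ab_permission_view",
    "ab_permission_view_role",
    "ab_register_user",
    "ab_role",
    "ab_user",
    "ab_user_role",
    "ab_view_menu",
]


def relink_sequence_sql(table):
    """Return the three statements that re-link a table's id sequence."""
    seq = f"{table}_id_seq"
    return [
        f"ALTER TABLE {table} ALTER COLUMN id "
        f"SET DEFAULT NEXTVAL('public.{seq}'::REGCLASS);",
        f"ALTER SEQUENCE {seq} OWNED BY {table}.id;",
        f"SELECT setval('{seq}', (SELECT max(id) FROM {table}));",
    ]


def all_statements(tables=FAB_TABLES):
    """Flatten the statements for every table into one list."""
    return [stmt for table in tables for stmt in relink_sequence_sql(table)]


if __name__ == "__main__":
    # Review the output, then run it in psql after backing up the database.
    print("\n".join(all_statements()))
```

Check that each sequence actually exists before running the generated statements, since a missing sequence makes the ALTER statements fail.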
Given your latest comment, it sounds like you are running two Airflow versions with two different schedulers connected to the same database.
If one has access to DAGs that the other doesn't, that alone would already explain the errors you are seeing about the missing DAG.
Please share some more details on your setup and we can look into this in more depth.

Send an alert when a dag did not run google cloud

I have a DAG in Airflow whose run is not scheduled but triggered by an event. I would like to send an alert when the DAG has not run in the last 24 hours. My problem is that I am not really sure which tool is best for the task.
I tried to solve it with the Logs Explorer. I was able to write a fairly good query filtering by textPayload, but it seems that tool is designed to send an alert when a specific log is present, not when it is missing. (Maybe I missed something?)
I also checked Monitoring, where I could set up an alert when logs are missing; however, in this case I was not able to write any query that filters logs by textPayload.
Thank you in advance if you can help me!
You could set up a separate alert DAG that notifies you if other DAGs haven't run in a specified amount of time. To get the last run time of a DAG, use something like this:
from airflow.models import DagRun
dag_runs = DagRun.find(dag_id=dag_id)
dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
Then you can take dag_runs[0] and compare its date with the current server time. If the difference is greater than 24 hours, raise an alert.
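The comparison step could look roughly like the Python sketch below. The helper is_stale is a hypothetical name, not part of Airflow, and it assumes the run's date is a timezone-aware datetime so that it can be subtracted from the current UTC time.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_run, now=None, max_age=timedelta(hours=24)):
    """Return True if there is no run, or the latest run is older than max_age.

    Assumption: last_run is a timezone-aware datetime (or None if the DAG
    has never run).
    """
    if last_run is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_run > max_age
```

Inside the alert DAG this might be called as is_stale(dag_runs[0].execution_date if dag_runs else None), sending the notification when it returns True.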
I was able to do it in Monitoring. I did not need the filtering query that I used in the Logs Explorer. I needed to create an Alerting Policy, filtered by workflow_name, task_name and location. In the configure trigger section I was able to choose "Metric absence" with a 1 day absence time, which solved my original problem.
Of course, it could also be solved by setting up a new DAG, but setting up an Alerting Policy seems easier.

Airflow Dataflow job status error even though the Dataflow template ran successfully

I am orchestrating a Dataflow template job via Composer, using DataflowTemplatedJobStartOperator and DataflowJobStatusSensor to run the job. I am getting the following error from the sensor.
Failure log of DataflowJobStatusSensor
job_status = job["currentState"]
KeyError: 'currentState'
Error
The Dataflow template job runs successfully, but DataflowJobStatusSensor always fails with this error. I have attached a screenshot of the whole orchestration.
[2022-02-11 04:18:11,057] {dataflow.py:100} INFO - Waiting for job to be in one of the states: JOB_STATE_DONE.
[2022-02-11 04:18:11,109] {credentials_provider.py:300} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-02-11 04:18:11,776] {taskinstance.py:1152} ERROR - 'currentState'
Traceback (most recent call last):
Code
wait_for_job = DataflowJobStatusSensor(
    task_id="wait_for_job",
    job_id="{{task_instance.xcom_pull('start_x_job')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    location=gce_region,
)
XCom value:
return_value
{"id": "2022-02-12_02_35_39-14489165686319399318", "projectId": "xx38", "name": "start-x-job-0b4921", "type": "JOB_TYPE_BATCH", "currentStateTime": "1970-01-01T00:00:00Z", "createTime": "2022-02-12T10:35:40.423475Z", "location": "us-xxx", "startTime": "2022-02-12T10:35:40.423475Z"}
Any clue why I am getting the KeyError on 'currentState'?
Thanks
After checking the documentation for version 1.10.15, it gives you the option to run Airflow providers (from version 2.0.*) on Airflow 1.10. So you shouldn't have issues. As described in my comments, you should be able to run example_dataflow, although you might need to update the code to reflect your version.
From what I see in your error message, have you also checked your credentials as described on the Google Cloud Connection page? Use the example or a small DAG run with the operators to test your connection. You can find video guides like this video. Remember that the credentials must be within reach of your Airflow application.
Also, if you are using google-dataflow-composer, you should be able to set up your credentials as shown in the DataflowTemplateOperator Configuration.
As a final note, if you find it messy to move forward with the Airflow migration and the latest updates, your best approach is to use the Kubernetes operator. In the short term, this lets you build an image with the latest updates; you only have to pass the credential info to the image, and you can update your Docker image to the latest version and it will still work regardless of the version of Airflow that you are using. It's a short-term solution; you should still consider migrating to 2.0.*.

Is there any way to pass the error text of a failed Airflow task into another task?

I have a DAG defined that contains a number of tasks, the last of which is only run if any of the previous tasks fail. This task simply posts to a Slack channel that the DAG run experienced errors.
What I would really like is if the message sent to the Slack channel contained the actual error that is logged in the task logs, to provide immediate context to the error and perhaps save Ops from having to dig through the logs.
Is this at all possible?

How to reschedule a coordinator job in OOZIE without restarting the job?

When I changed the start time of a coordinator job in job.properties in Oozie, the job did not pick up the changed time; instead it keeps running at the old scheduled time.
Old job.properties:
startMinute=08
startTime=${startDate}T${startHour}:${startMinute}Z
New job.properties:
startMinute=07
startTime=${startDate}T${startHour}:${startMinute}Z
The job is not running at the changed time (the 7th minute); it still runs at the 8th minute of every hour.
Could you please let me know how I can make the job pick up the updated properties (the changed timing) without restarting or killing the job?
You can't really change the timing of the coordinator via any method provided by Oozie (v3.3.2). When you submit a job, the contents of the properties file are stored in the database, whereas the actual workflow is in HDFS.
Every time the coordinator executes, the workflow must be present at the path specified in the properties during job submission, but the properties file itself is not needed. In other words, the properties file does not come into the picture after the job has been submitted.
One hack is to update the time directly in the database using an SQL query, but I am not sure about its implications. The property might become inconsistent across the database.
Otherwise, you have to kill the job and submit a new one.
Note: Oozie provides a way to change the concurrency, end time and pause time, as specified in the official docs.
