Airflow - Chart Query failing

I am new to Airflow and I am trying to create charts for my DAG but I keep getting the following error:
SQL execution failed. Details: 'NoneType' object has no attribute 'get_pandas_df'
My query looks like:
SELECT dag_id, execution_date, count(*) as ccount FROM task_instance GROUP BY dag_id, execution_date

In the web UI, if you click the pencil-shaped edit button, you'll see that the connection the chart uses (the Conn Id field) is airflow_db. That means you need to make sure the airflow_db connection (in the top menu, Admin -> Connections) is configured to point to the database you set in the airflow.cfg file.
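To compare the two from Python, here is a quick hedged sketch (the import paths assume an Airflow 1.x-era install, which is when the chart feature existed; adjust them for your version):
from airflow import settings
from airflow.configuration import conf
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == 'airflow_db').first()
if conn is None:
    print("No airflow_db connection is defined")
else:
    print("airflow_db connection -> host: %s, schema: %s, login: %s" % (conn.host, conn.schema, conn.login))
print("sql_alchemy_conn in airflow.cfg ->", conf.get('core', 'sql_alchemy_conn'))
session.close()
If the two do not point at the same database, update the airflow_db connection under Admin -> Connections.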

Related

DAG seems to be missing from DagBag error in Airflow 2.4.0

I updated my Airflow setup from 2.3.3 to 2.4.0 and I started to get these errors in the UI: DAG <dag name> seems to be missing from DagBag. The scheduler log shows ERROR - DAG <dag name> not found in serialized_dag table.
One of my Airflow instances seemed to work well for the old DAGs, but when I add new DAGs I get the error. On the other Airflow instance, every DAG was producing this error, and the only way out of this mess was to delete the db and init it again. The error message appears when I click the DAG from the main view.
Deleting the db is not a solution I want to rely on in the future; is there any other way this can be fixed?
Side note:
It's also odd that I use the same Airflow image in both of my instances, yet one instance has the newly added Datasets menu in the top bar and the other doesn't.
My setup:
Two isolated Airflow main instances (dev, prod) with CeleryExecutor, each with 10 worker machines. I'm running the setup on each machine using a docker compose config and a shared .env file that ensures the setup is the same on the main machine and the worker machines.
Airflow version: 2.4.0 (same error in 2.4.1)
PSQL: 13
Redis: 6.2.4
UPDATE:
Still unresolved. The new DAG is shown in the Airflow UI and can be activated, but running it is not possible. I think there's no other solution than to reset the db.
I have found no official references for this fix, so use it carefully and back up your db first :)
I have encountered the same problem after the upgrade to Airflow 2.4.1 (from 2.3.4). Pre-existing DAGs still worked properly, but for new DAGs I saw the error you mentioned.
Debugging, I found in the scheduler logs:
{manager.py:419} ERROR - Add View Menu Error: (psycopg2.errors.NotNullViolation) null value in column "id" of relation "ab_view_menu" violates not-null constraint
DETAIL: Failing row contains (null, DAG:my-new-dag-id).
[SQL: INSERT INTO public.ab_view_menu (name) VALUES (%(name)s) RETURNING public.ab_view_menu.id]
[parameters: {'name': 'DAG:my-new-dag-id'}]
which seems to be the cause of the problem: a null value for the id column, which prevents the DAG from being loaded.
I also saw similar errors when running airflow db upgrade.
After a check on the ab_view_menu database table I noticed that a sequence exists for its primary key (ab_view_menu_id_seq), but it was not linked to the column.
So I linked it:
ALTER TABLE ab_view_menu ALTER COLUMN id SET DEFAULT NEXTVAL('public.ab_view_menu_id_seq'::REGCLASS);
ALTER SEQUENCE ab_view_menu_id_seq OWNED BY ab_view_menu.id;
SELECT setval('ab_view_menu_id_seq', (SELECT max(id) FROM ab_view_menu));
The same consideration applies to other tables:
ab_permission
ab_permission_view
ab_permission_view_role
ab_register_user
ab_role
ab_user
ab_user_role
ab_view_menu
With this fix on the sequences the problem seems to be solved.
NOTE: the database used is PostgreSQL
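If you want to apply the same re-linking to all of the tables above in one go, here is a hedged sketch in Python with psycopg2; it assumes the default ab_<table>_id_seq sequence names and uses a placeholder connection string, so adapt it to your metadata DB and back it up first:
import psycopg2

# FAB tables whose id sequences may need re-linking (from the list above).
TABLES = [
    "ab_permission", "ab_permission_view", "ab_permission_view_role",
    "ab_register_user", "ab_role", "ab_user", "ab_user_role", "ab_view_menu",
]

# Placeholder connection string -- point it at your Airflow metadata DB.
conn = psycopg2.connect("postgresql://airflow:airflow@localhost:5432/airflow")
conn.autocommit = True
with conn.cursor() as cur:
    for table in TABLES:
        seq = f"{table}_id_seq"
        cur.execute(f"ALTER TABLE {table} ALTER COLUMN id SET DEFAULT nextval('public.{seq}'::regclass);")
        cur.execute(f"ALTER SEQUENCE {seq} OWNED BY {table}.id;")
        # COALESCE guards against an empty table, where max(id) would be NULL.
        cur.execute(f"SELECT setval('{seq}', (SELECT COALESCE(max(id), 1) FROM {table}));")
conn.close()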
Given your latest comment, it sounds like you are running two airflow versions with two different schedulers connected to the same database.
If one has access to DAGs that the other doesn't, that alone would already explain the "missing from DagBag" errors you are seeing.
Please share some more details on your setup and we can look into this more in depth.

Airflow - backend database to get reports

I am trying to get some useful information from the Airflow backend. I need the following details:
How many times a particular job failed.
which task has failed over and over.
The problem is that all our tasks depend on their upstream tasks, so when one fails, we fix the issue and mark it as success. This changes the status in the database as well. Is there a place I can get historical records?
The following query shows which task failed. However, if I mark it as success from the UI, the status is updated in the database as well, and I have no way to show that it had originally failed.
select * from dag_run where dag_id='test_spark'
Currently, there is no native way, but you can check the log table -- it adds a record whenever there is an action via the CLI or UI.
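For example, a hedged sketch that reads the log table through the ORM (the import path and the exact event strings, such as 'success', 'failed' or 'clear', depend on your Airflow version and on which UI/CLI actions were taken):
from airflow import settings
from airflow.models import Log

session = settings.Session()
rows = (
    session.query(Log.dttm, Log.dag_id, Log.task_id, Log.event, Log.owner)
    .filter(Log.dag_id == 'test_spark')  # the dag_id from the question
    .order_by(Log.dttm)
    .all()
)
for dttm, dag_id, task_id, event, owner in rows:
    print(dttm, dag_id, task_id, event, owner)
session.close()
Entries recording a manual "mark success" action against a task are a hint that the task had originally failed and was fixed by hand.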

Airflow connection list check through python operator

Before executing the DAG, I want to check whether a particular connection id is present in the connection list or not. I don't have any mechanism for retaining a connection: even if I create a connection through the GUI, all the connections get removed when the server reboots.
The following is the task I thought I should add, but then I got an ASCII error when I ran it, maybe because the command returns a table that is not adequately parsed by the logger.
import logging

from airflow import settings
from airflow.models import Connection
from airflow.operators.bash_operator import BashOperator


def create_connection(**kwargs):
    print(kwargs.get('ds'))
    # List existing connections via the CLI (1.x syntax) and capture the output.
    list_conn = BashOperator(
        task_id='list_connections',
        bash_command='airflow connections --list',
        xcom_push=True)
    conns = list_conn.execute(context=kwargs)
    logging.info(conns)
    if not conns:
        new_conn = Connection(conn_id='xyz', conn_type='s3',
                              host='https://api.example.com')
        session = settings.Session()
        session.add(new_conn)
        session.commit()
        logging.info('Connection is created')
Question: Is there any way to find out, within the Airflow DAG itself, whether the connection has been added or not? If it's already there, then I would not create a new connection.
session.query(Connection) should do the trick.
from airflow import settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator

def list_connections(**context):
    session = settings.Session()
    # Return plain conn_ids so the result serializes cleanly to XCom.
    conn_ids = [conn.conn_id for conn in session.query(Connection)]
    session.close()
    return conn_ids

list_conn = PythonOperator(
    task_id='list_connections',
    python_callable=list_connections,
    provide_context=True,
)
Please make sure all this code is contained within tasks, or, to phrase it correctly, that it executes at run time instead of load time. Adding the code directly to the DAG file causes it to run at load time, which is not recommended.
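Building on that, a minimal sketch of a task callable that only creates the connection when the conn_id is missing (the conn_id and host values are just the placeholders from the question):
from airflow import settings
from airflow.models import Connection

def ensure_connection(**context):
    session = settings.Session()
    # Look up the specific conn_id instead of listing everything.
    existing = (
        session.query(Connection)
        .filter(Connection.conn_id == 'xyz')
        .first()
    )
    if existing is None:
        session.add(Connection(conn_id='xyz', conn_type='s3',
                               host='https://api.example.com'))
        session.commit()
    session.close()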
The accepted answers work perfectly. I had a scenario where I needed to get a connection by connection id in order to create the DAG, so I had to fetch it outside the task, in the DAG definition itself.
The following code worked for me:
from airflow.hooks.base_hook import BaseHook

# `connection` here is the conn_id string; an exception is raised if it is not defined
conn = BaseHook.get_connection(connection)
Hope this might help someone! :)
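If you only need a yes/no check at parse time, here is a small hedged sketch built on the same call (the exact exception class differs between versions, so this catches the base AirflowException):
from airflow.exceptions import AirflowException
from airflow.hooks.base_hook import BaseHook

def connection_exists(conn_id):
    try:
        BaseHook.get_connection(conn_id)
        return True
    except AirflowException:
        # Raised (in one form or another) when the conn_id is not defined.
        return False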

How to list all the failed tasks in Airflow using Data Profiling --> Ad Hoc Query

I have a DAG which runs every 4 hours. Every day the first run of the day fails, while the remaining runs pass successfully. The Recent Tasks column shows all tasks as passed, but when I click the DAG, I can see the first run of the day as failed in the Tree View.
How can I get the list of all the failed runs/tasks for a particular DAG from Data Profiling? I don't want to modify anything in the production environment.
Go to Data Profiling -> Ad Hoc Query -> airflow_db and then execute the following query:
SELECT * FROM task_instance WHERE state = 'failed' AND dag_id = 'your_dag_id';
It will list all the failed tasks of that particular dag_id.
If you want the failed instances of a specific task_id within that dag_id, execute the query below:
SELECT * FROM task_instance WHERE state = 'failed' AND dag_id = 'your_dag_id' AND task_id = 'your_task_id';
Likewise, you can perform any query you need; you can see all the fields present in task_instance by querying SELECT * FROM task_instance.
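If the Ad Hoc Query page is not available to you, the same read-only lookup can be done through the ORM. A hedged sketch, assuming an Airflow version where task_instance still carries an execution_date column:
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

session = settings.Session()
failed = (
    session.query(TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.execution_date)
    .filter(TaskInstance.state == State.FAILED,
            TaskInstance.dag_id == 'your_dag_id')
    .order_by(TaskInstance.execution_date)
    .all()
)
for dag_id, task_id, execution_date in failed:
    print(dag_id, task_id, execution_date)
session.close()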

Airflow - mark a specific task_id of given dag_id and run_id as success or failure

Can I externally (e.g. with an HTTP request) mark a specific task_id, associated with a dag_id and run_id, as success/failure?
My task is a long-running job on an external system, and I don't want my task to poll that system for its status, since we could have several thousand tasks running at the same time.
Ideally I want my task to:
make an HTTP request to start my external job
go to sleep
once the job is finished, it (the external system or the post-build action of my job) informs Airflow that the task is done (identified by task_id, dag_id and run_id)
Thanks
You can solve this by sending SQL queries directly into Airflow's metadata DB:
UPDATE task_instance
SET state = 'success',
    try_number = 0
WHERE task_id = 'YOUR-TASK-ID'
  AND dag_id = 'YOUR-DAG-ID'
  AND execution_date = '2019-06-27T16:56:17.789842+00:00';
Notes:
The execution_date filter is crucial: Airflow identifies DagRuns by execution_date, not really by their run_id. This means you really need your DagRun's execution date to make it work.
The try_number = 0 part is added because sometimes Airflow will reset the task back to failed if it notices that try_number is already at its limit (max_tries).
You can see it in Airflow's source code here: https://github.com/apache/airflow/blob/750cb7a1a08a71b63af4ea787ae29a99cfe0a8d9/airflow/models/dagrun.py#L203
Airflow doesn't yet have a REST endpoint for this. However, you have a couple of options:
- Use the Airflow command line utilities to mark the job as success, e.g. from Python using Popen (see the sketch below).
- Directly update the Airflow DB table task_instance.
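For the CLI route, a minimal sketch (it assumes an Airflow 1.x CLI, where airflow run accepts a --mark_success flag; the dag/task ids and the date are placeholders):
import subprocess

def mark_task_success(dag_id, task_id, execution_date):
    # `airflow run ... --mark_success` records the task as succeeded without running it.
    cmd = ["airflow", "run", dag_id, task_id, execution_date, "--mark_success"]
    subprocess.run(cmd, check=True)

mark_task_success("YOUR-DAG-ID", "YOUR-TASK-ID", "2019-06-27T16:56:17+00:00")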
