Multiple instances of the same DAG in Airflow are failing without any logs - airflow

I am trying to place multiple instances of the same DAG in the Google Cloud Composer dags folder and run them in Airflow. Apart from the first instance, all the other instances fail without producing any logs. These instances never even reach the running state; they fail instantly after being queued.
And since there are no logs, I can't debug anything.

Related

Prevent Airflow from triggering on scheduler restart

My Airflow scheduler went down for some reason, and when I restarted it, all the DAGs triggered simultaneously. It was as if it was catching up on the missed jobs. Also, it seems that when I modify a DAG, the workflow triggers. These unexpected triggers corrupt my data and erode my trust in the system.
Is there a way to prevent a DAG running unexpectedly unless it is the exact time (no catch-up) or unless it is manually triggered?
The airflow scheduler will, at a minimum, attempt to run the current schedule interval when it is online to do so. This means that if the scheduler process is offline for a period of time, when it comes back online it will reconcile which jobs should have run and attempt to start those jobs.
The catchup setting gives you some control: with catchup=False, the scheduler runs only the latest schedule interval and skips the earlier intervals that were missed.
Some info on catchup here: https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#catchup
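To make the catch-up behaviour concrete, here is a minimal sketch in plain Python. It is a simplified model and not Airflow's scheduler code: the function name missed_runs and the dates are made up for illustration, and it assumes a run for an interval becomes due once that interval has ended. In a real DAG the switch is the catchup argument of the DAG constructor.

```python
from datetime import datetime, timedelta


def missed_runs(next_logical_date, interval, now, catchup):
    """Return the logical dates a scheduler would start after an outage.

    Simplified model (an assumption, not Airflow internals): the run for
    the interval [t, t + interval) becomes due once t + interval <= now.
    """
    due = []
    t = next_logical_date
    while t + interval <= now:
        due.append(t)
        t += interval
    if not catchup:
        # Only the most recent completed interval is kept.
        return due[-1:]
    return due


# Daily DAG; the scheduler was down and comes back on Jan 5 at 06:00.
first_missed = datetime(2022, 1, 2)
back_online = datetime(2022, 1, 5, 6, 0)
one_day = timedelta(days=1)

with_catchup = missed_runs(first_missed, one_day, back_online, catchup=True)
without_catchup = missed_runs(first_missed, one_day, back_online, catchup=False)
```

With catch-up enabled, all three missed intervals (Jan 2, 3 and 4) are started at once; with it disabled, only the Jan 4 interval is. Note that even without catch-up one run still starts on restart, which matches the point that the scheduler always attempts the current interval.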
Is there a way to prevent a DAG running unexpectedly unless it is the exact time (no catch-up) or unless it is manually triggered?
There is no way to tell Airflow to only attempt to schedule the job at the exact time the job is supposed to run (and never attempt again after the fact). You can set the schedule interval to None and the job will never be scheduled, however. You can manually trigger the job through the UI or via the Airflow API in this case.
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#cron-presets
preset | meaning
-------+----------------------------------------------------------------
None | Don’t schedule, use for exclusively “externally triggered” DAGs
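For a DAG whose schedule is None, a run has to be created from outside. As a rough sketch, this is how a trigger request for the Airflow 2 stable REST API could be built with the standard library. The webserver address, DAG id, credentials and conf payload are all placeholders, and it assumes basic authentication is enabled for the API on your deployment. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, conf, username, password):
    """Build (but do not send) a POST request that creates a DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_trigger_request(
    "http://localhost:8080",       # hypothetical webserver address
    "example_dag",                 # hypothetical DAG id
    {"input_path": "example.txt"},  # hypothetical run configuration
    "admin",                       # hypothetical credentials
    "admin",
)
# To actually trigger the run: urllib.request.urlopen(request)
```

The conf dictionary is what the tasks receive as the run configuration, so a DAG that expects it may fail when triggered without one.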

Airflow 2.2.4 manually triggered DAG stuck in 'queued' status

I have two DAGs in my airflow scheduler, which were working in the past. After needing to rebuild the docker containers running airflow, they are now stuck in queued. DAGs in my case are triggered via the REST API, so no actual scheduling is involved.
Since there are quite a few similar posts, I ran through the checklist of this answer from a similar question:
Do you have the airflow scheduler running?
Yes!
Do you have the airflow webserver running?
Yes!
Have you checked that all DAGs you want to run are set to On in the web ui?
Yes, both DAGS are shown in the WebUI and no errors are displayed.
Do all the DAGs you want to run have a start date which is in the past?
Yes, the constructor of both DAGs looks as follows:
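from airflow import DAG
from airflow.utils.dates import days_ago

# args (the default_args dict) is defined elsewhere in the DAG file
dag = DAG(
    dag_id='image_object_detection_dag',
    default_args=args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['helloworld'],
)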
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
No, I trigger my DAGs manually via the REST API.
If nothing else works, you can use the web ui to click on the dag, then on Graph View. Now select the first task and click on Task Instance. In the paragraph Task Instance Details you will see why a DAG is waiting or not running.
Here is the output of what this paragraph is showing me:
What is the best way to find out why the tasks won't leave the queued state and run?
EDIT:
Out of curiosity I tried to trigger the DAG from within the WebUI, and now both runs executed (the one triggered from the WebUI failed, but that was expected, since there was no config set).

Airflow - task fail without log

I have a pipeline that starts by getting a file from a GCS bucket and then runs several tasks on it.
Sometimes, after a big update, I have to rerun every file one by one. That's what I did, and everything went well for 190 files (of around 290 in total), but suddenly one task failed without any log. This task should have launched a K8s pod, but nothing happened: no logs in Airflow, no pod launched in K8s. If I clear the task, it still fails the same way.
The task fails as soon as it is launched. I restarted the instance, and I have space on the disk... I have no idea what to do.

Determining if a DAG is executing

I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that polls an SFTP site to find new files. If any are found, I create custom task IDs for the dynamically created tasks and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
for file_path in directory_list:
    ...  # SFTP code that GETs the remote files
That part works fine. It seems both the airflow webserver and airflow scheduler are iterating through all the DAGs once a second and actually running the code that retrieves the directory_list. This means I'm hitting the SFTP site ~2 x a second to authenticate and pull a list of files. I'd like to have some conditional code that only executes if the DAG is actually being run.
When an SFTP site uses password authentication, the # of times I connect really isn't an issue. One site requires key authentication and if there are too many authentication failures in a short timespan, the account is locked. During my testing, this seems to happen occasionally for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or executing manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?
You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG is executed with each scheduler loop. Or put another way, every time the scheduler loop processes the files in your DAG directory it is interpreting all the code in your DAG files. Anything not in a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler as well as any external systems you are making calls to.
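The difference between top-level code and code inside a task can be shown with a small plain-Python sketch. Nothing here is Airflow API: list_remote_files stands in for the SFTP listing, and the two parse_dag_file_* functions are hypothetical stand-ins for what happens each time the scheduler interprets a DAG file. In a real DAG, the deferred function would be handed to an operator (for example as the python_callable of a PythonOperator).

```python
calls = {"list": 0}


def list_remote_files():
    """Stand-in for the SFTP directory listing; counts how often it is hit."""
    calls["list"] += 1
    return ["a.csv", "b.csv"]


def parse_dag_file_eager():
    # Same as calling the listing at the top level of the DAG file:
    # it runs on every parse.
    files = list_remote_files()
    return {"tasks": [f"get_{name}" for name in files]}


def parse_dag_file_lazy():
    # The listing is wrapped in a callable, so parsing only registers it.
    def fetch_all():
        return [f"fetched {name}" for name in list_remote_files()]

    return {"tasks": {"fetch_all": fetch_all}}


for _ in range(5):  # five scheduler parse loops
    parse_dag_file_eager()
eager_calls = calls["list"]

calls["list"] = 0
for _ in range(5):
    dag = parse_dag_file_lazy()
lazy_calls_after_parsing = calls["list"]

result = dag["tasks"]["fetch_all"]()  # one actual task execution
lazy_calls_after_run = calls["list"]
```

In the eager version the remote site is contacted once per parse; in the lazy version it is contacted only when the task actually runs.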
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs through the UI well. This is mostly the result of the Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exists in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
Adding a task to a DAG makes it appear (in the UI) that all DAG runs from before the task existed never ran that task at all. In those runs the task will have a None state, and the DAG run will be set to success or failed depending on the outcome of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks you will lose the ability to review history of the DAG. For example, if you run a DAG with task_x in the first 20 DAG runs but remove it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when the DAG runs are idempotent. This means that re-running any DAG run should have the same effect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding tasks to and removing tasks from previous DAG runs, so that the results of re-running are not the same.
Solution Options
You have at least two options moving forward:
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if you are not using a distributed executor) or to an Airflow Variable (this will result in more reads to the Airflow DB), and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files so that every file that exists is processed within a single task run. This will make the DAGs idempotent and you will maintain accurate history through the logs.
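Option 2 could look roughly like the sketch below. It is only an illustration of the idea: FakeSftp is a hypothetical in-memory stand-in for the real connection, and transfer_all plays the role of the operator's execute logic. The point is that the listing and a single authentication happen inside the task at run time, and every file found is handled in that one task run.

```python
class FakeSftp:
    """In-memory stand-in for an SFTP connection (hypothetical, for illustration)."""

    def __init__(self, files):
        self.files = dict(files)
        self.logins = 0

    def connect(self):
        self.logins += 1

    def listdir(self, path):
        return sorted(self.files)

    def get(self, name):
        return self.files[name]

    def remove(self, name):
        del self.files[name]


def transfer_all(sftp, store, path="/"):
    """Task body: authenticate once, then move every file found at run time."""
    sftp.connect()
    moved = []
    for name in sftp.listdir(path):
        store[name] = sftp.get(name)  # copy to the destination first
        sftp.remove(name)             # then delete the remote source
        moved.append(name)
    return moved


sftp = FakeSftp({"b.csv": b"2", "a.csv": b"1"})
store = {}
first_run = transfer_all(sftp, store)
second_run = transfer_all(sftp, store)  # nothing left to move
```

The task list of the DAG never changes, so the UI history stays intact, and a run that finds no files simply moves nothing.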
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.

How to stop Airflow running tasks from 'off' dags

I created some DAGs, ran them and stopped them in the middle of their execution (with the OFF button).
The UI still shows 'Running tasks' for those stopped DAGs though.
I tried to 'clear' those tasks, and now they are shown in blue, in the 'shutdown' state.
I am wondering if those tasks are counted in the total of running tasks, and blocking other tasks from starting (with my current configuration, only 32 tasks can run in parallel). Is there a way to clean completely the DAGs that I don't need anymore and to make sure the tasks are not blocking anything and making Airflow slower?
Thanks!
You can delete all of the DAG's data from the dag_run and task_instance tables in the metadata database.
You can also do this through the Airflow Webserver UI by navigating to
Browse -> DAG Runs
& Browse -> Task Instances
And deleting all the records relevant to the dag id.
One note though is that the task and DAG status fields on the main page may take a while to reflect the changes.
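As a rough illustration of the database route, the sketch below uses an in-memory SQLite database with a heavily simplified schema. Only the table names dag_run and task_instance and the dag_id column mirror Airflow; the other columns, the DAG ids and the purge_dag helper are assumptions made for the example. A real metadata database has more columns and related tables, so back it up before trying anything similar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag_run (dag_id TEXT, state TEXT);
    CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, state TEXT);
    INSERT INTO dag_run VALUES ('old_dag', 'running'), ('live_dag', 'running');
    INSERT INTO task_instance VALUES
        ('old_dag', 't1', 'shutdown'),
        ('old_dag', 't2', 'running'),
        ('live_dag', 't1', 'running');
""")


def purge_dag(conn, dag_id):
    """Delete run and task-instance rows for one DAG in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM task_instance WHERE dag_id = ?", (dag_id,))
        conn.execute("DELETE FROM dag_run WHERE dag_id = ?", (dag_id,))


purge_dag(conn, "old_dag")
remaining_tasks = conn.execute(
    "SELECT dag_id, task_id FROM task_instance"
).fetchall()
remaining_runs = conn.execute("SELECT dag_id FROM dag_run").fetchall()
```

Only the rows belonging to the unwanted DAG are removed; records of other DAGs are left untouched.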
