DAG Run Logs are not time sensitive - airflow

Hi all and thanks for reading my question.
I've noticed, after some weeks of deploying and changing DAGs, that the logs only reflect the structure of the latest design.
For example, I have a DAG with 3 sequential operators. That DAG has run 10 times with a mix of success and failure. Before the 11th run, I re-deployed the DAG with 2 sequential operators (removing 1 operator).
Now the DAG shows all the logs (11 runs) based on the latest design (2 operators) and does not switch to the 3-operator view when I look at older runs.
Is there a way to "fix" this?
Thanks and regards,
Rama

I think you are talking about the structure of the DAG in the UI.
When you change the structure (add or remove tasks), the UI applies the structural changes to previous runs as well.
If you want to keep the history of the old structure, the best way is to create a new DAG or rename the DAG:
from datetime import timedelta

from airflow import DAG

with DAG(
    dag_id='dag_name_1.0',  # change to 'dag_name_1.1' when the structure changes
    schedule_interval="0 2 * * *",
    dagrun_timeout=timedelta(minutes=60),
    default_args=default_args,
    max_active_runs=1,
    catchup=False,
    doc_md=__doc__,
) as dag:
    ...
You can still access the older logs, just not through the UI. Check base_log_folder in your Airflow config file; the log files for every run are kept under that directory.
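If you want to look it up programmatically, the configured value can also be read through Airflow's config API; a minimal sketch (in Airflow 1.10 the option lives under the core section rather than logging):

    from airflow.configuration import conf

    # Directory that holds the per-task log files; the exact value
    # depends on your deployment (e.g. something like /opt/airflow/logs).
    print(conf.get("logging", "base_log_folder"))

Inside that folder the logs are organised in per-DAG and per-task subdirectories, so the files from the old 3-operator runs are still there even after the structure changes.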

Related

How can I tell if an Airflow dag run finished without failing using the Airflow DB?

I want to extract statistics on our Airflow processes using its database. One of these statistics is how many DAG runs finished smoothly, without any failures or reruns. Using the try_number column for this doesn't help, since it also counts automatic retries. I want to count only the cases in which an engineer had to rerun or resume the DAG run.
Thank you.
If I understand correctly, you want to get all DAG runs that never had a failed task in them? You can do this by excluding every run that has an entry in the task_fail table (matching on both dag_id and run_id, since run_ids alone are not unique across DAGs):
SELECT dr.*
FROM dag_run dr
WHERE NOT EXISTS (
    SELECT 1
    FROM task_fail tf
    WHERE tf.dag_id = dr.dag_id
      AND tf.run_id = dr.run_id
);
This of course would not catch other kinds of manual intervention, such as an engineer marking a task as successful because it was stuck in a running or deferred state.
One note: as of Airflow 2.5.0 you can add notes to DAG runs and task instances in the UI when manually intervening. Those notes are stored in the dag_run_note and task_instance_note tables.
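If you prefer to stay in Python, roughly the same filter can be expressed with Airflow's ORM models; a sketch, assuming Airflow 2.3+ where task_fail stores run_id:

    from airflow.models import DagRun, TaskFail
    from airflow.utils.session import create_session

    with create_session() as session:
        # (dag_id, run_id) pairs that had at least one task failure
        failed = (
            session.query(TaskFail.dag_id, TaskFail.run_id)
            .distinct()
            .subquery()
        )
        # DAG runs with no matching entry in task_fail
        clean_runs = (
            session.query(DagRun)
            .outerjoin(
                failed,
                (DagRun.dag_id == failed.c.dag_id)
                & (DagRun.run_id == failed.c.run_id),
            )
            .filter(failed.c.run_id.is_(None))
            .all()
        )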

Is it a good practice to use airflow metadatabase to control pipelines?

I'm developing an Airflow pipeline that will run for multiple tenants. This DAG will be triggered via API and is separated into batches, which are controlled by a SQL metadatabase following some business rules.
Each batch has a batch_id used to control the batches, and it is passed to the DAG conf via the API. The batch id is the creation timestamp combined with the tenant and the filetype, for example: tenant1_20221120123323 ... tenant2_20221120123323. These batches can contain two filetypes (for example purposes), and for each filetype a DAG is triggered (DAG1 for filetype 1 and DAG2 for filetype 2). From the file perspective, the batch id is combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline: process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse] and filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse. Here filetype2_exists is a BranchPythonOperator that verifies whether DAG_2 was triggered; if it was, it merges the data produced by DAG2 with DAG1's data before executing write_new_data_to_warehouse.
Based on this DAG model, there will be one DAG run for each tenant. So, the DAG can have multiple DAG runs running in parallel if we trigger more than one DAG run (one per tenant). Here is my first question:
Is it good practice to work with multiple DAG runs of the same DAG instead of working with dynamic DAGs? With dynamic DAGs I would end up with process_data_on_spark_tenant1,
process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach hundreds.
Now, consider that filetype2 may or may not be present in the batch, and that I use the model mentioned above (one single DAG with multiple DAG runs running in parallel, one per tenant). The only idea I have for checking whether DAG2 was triggered for the current batch (i.e., filetype2 was present in the batch) is to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadatabase and check whether there is a dag_run_id containing the current batch_id and filetype2, and, with a sensor, wait for the DAG run state to become success. Then I could run the read_data_from_filetype2 task. Otherwise, if no dag_run_id with the batch_id and filetype2 is registered in the Airflow metadatabase, I can proceed directly to write_new_data_to_warehouse.
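For what it's worth, the run_id can indeed be chosen explicitly when triggering through the stable REST API, and the same id can later be used to poll the run state instead of scanning the metadatabase; a sketch (host, credentials and DAG id below are placeholders):

    import requests

    AIRFLOW_URL = "http://localhost:8080"   # placeholder
    AUTH = ("user", "password")             # placeholder

    batch_id = "tenant1_20221120123323"
    run_id = f"manual__{batch_id}_filetype2"

    # Trigger DAG_2 with a custom dag_run_id and pass the batch via conf.
    resp = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/DAG_2/dagRuns",
        auth=AUTH,
        json={
            "dag_run_id": run_id,
            "conf": {"batch_id": batch_id, "filetype": "filetype2"},
        },
    )
    resp.raise_for_status()

    # Later, poll the run state using the same id.
    state = requests.get(
        f"{AIRFLOW_URL}/api/v1/dags/DAG_2/dagRuns/{run_id}",
        auth=AUTH,
    ).json()["state"]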
Here's the other question:
Is it good practice to modify the dag_run_id and use it, combined with the Airflow metadatabase, to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that results in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadatabase and keep parallel DAG runs in one single DAG?
Or is there a better approach for this problem?
Thank You.

How to choose how often Apache Airflow scheduler updates a DAG?

As stated in the Apache Airflow documentation, I can control how often a DAG is updated by setting the configuration variable min_file_process_interval in the airflow.cfg file:
min_file_process_interval
Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval number of seconds. Updates to DAGs are reflected after this interval. Keeping this number low will increase CPU usage.
However, I didn't find any clue or best practice about which value I should set for min_file_process_interval.
Example
My DAG changes once a day. By default min_file_process_interval is set to 30 seconds. That means most of the re-parsing is useless: as long as the DAG doesn't change, the updated DAG and the previous DAG are identical, and re-parsing only consumes resources and generates logs. But if I parse the DAG only once a day, do I risk running the wrong version of the DAG if it changes after the daily parse, or is the DAG also re-parsed just before a run?
What value should I set for min_file_process_interval in this case?
EDIT: As stated in Elad's answer responding to a previous version of this question, dynamic DAGs should be avoided. However, if I do have dynamic DAGs, how should I choose min_file_process_interval?
You are mixing two different things.
min_file_process_interval controls how often Airflow scans the .py files and updates the DAGs within Airflow. When you deploy a new .py file, Airflow needs to read it and register it in the metastore database; this setting determines how often that happens.
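If you want to confirm which value your scheduler is actually running with, the effective setting can be read through the config API; a minimal sketch:

    from airflow.configuration import conf

    # Effective parse interval in seconds (the default is 30).
    print(conf.getint("scheduler", "min_file_process_interval"))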
For your use case the DAG code should not be updated every day. In fact, it should not be updated at all; it should just run every day. Your DAG just needs to be able to handle the correct file per date. Your code can be something like:
from airflow import DAG
from airflow.providers.ftp.sensors.ftp import FTPSensor

with DAG(
    dag_id='stackoverflow',
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Waits for a file or directory to be present on FTP.
    sensor_op = FTPSensor(
        task_id='sensor_task',
        path='/my_folder/{{ ds }}/file.csv',  # path to your file on the server
        fail_on_transient_errors=False,
        ftp_conn_id='ftp_default',
    )
    # Operator to process the file
    operator_op = SomeOperator()

    sensor_op >> operator_op
That DAG will start a run every day. The first operator is a sensor, so if the file for that day isn't present yet, the workflow waits; only once the file appears does it continue to the second operator, which should process the file.
Note that the path parameter of FTPSensor is templated. This means you can use macros like {{ ds }}, which will give you a path that contains each day's date, like:
/my_folder/2021-05-01/file.csv
/my_folder/2021-05-02/file.csv
/my_folder/2021-05-03/file.csv
You can also do path='/my_folder/{{ ds }}.csv' which will give:
/my_folder/2021-05-01.csv
/my_folder/2021-05-02.csv
/my_folder/2021-05-03.csv

Airflow DAG Multiple Runs

I have a DAG that I want to run again after each successful completion. For example, I want to run it 10 times and then stop. Is there a way to accomplish this? I tried scheduling with cron, but it doesn't seem clean, and triggering the DAG via the UI multiple times doesn't work either (the runs execute in parallel).
I found a solution to my use case. It combines depends_on_past=True (mentioned by @Hitesh Gupta) with the following setting in your airflow.cfg file:
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
This allowed us to have only one active DAG run at a time and also not to start the next DAG run if there was a failure in the previous run. I tested this on Airflow version 1.10.1.
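On newer Airflow versions the same behaviour can be scoped to a single DAG instead of the global config; a sketch under that assumption (the DAG id and schedule are made up):

    from datetime import datetime

    from airflow import DAG

    with DAG(
        dag_id="run_sequentially",                # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        max_active_runs=1,                        # only one active run at a time
        default_args={"depends_on_past": True},   # tasks wait for their previous run to succeed
        catchup=False,
    ) as dag:
        ...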
You can, in addition to supplying a start_date, provide your DAG with an end_date.
Quoting the docstring
:param start_date: The timestamp from which the scheduler will attempt
    to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None
    for open ended scheduling
:type end_date: datetime.datetime
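Applied to the "run it N times and stop" requirement, a fixed date window would look roughly like the sketch below (the DAG id and dates are made up; the exact number of scheduled runs depends on how the schedule's intervals line up with end_date):

    from datetime import datetime

    from airflow import DAG

    with DAG(
        dag_id="limited_run_window",         # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        end_date=datetime(2023, 1, 10),      # no runs are scheduled past this date
        schedule_interval="@daily",
        catchup=True,                        # let the scheduler create the runs in the window
    ) as dag:
        ...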
While unrelated, also have a look at the following scheduler settings in airflow.cfg, as mentioned in this article:
run_duration
num_runs
UPDATE-1
In his article Use apache airflow to run task exactly once, @Andreas P has described a clever technique which I believe can be adapted to your use case. While even that won't be a very tidy solution, it would at least allow you to specify beforehand the number of runs (an integer) for the DAG instead of an end_date.
Alternatively (assuming you implement the above approach), rather than baking this skip-DAG-after-max-runs functionality into each DAG, you can create a separate orchestrator DAG that disables a given DAG after its maximum number of runs has passed.
You have to set the depends_on_past property. It is set under the DAG's default arguments section and makes each task depend on its own instance in the previous DAG run. This should fix your problem.

Airflow on demand DAG with multiple instances running at the sametime

I am trying to see if Airflow is a good fit for this scenario. At present, I have a DAG that looks for a trigger file on S3, creates an EMR cluster, submits a Spark job, and then deletes the EMR cluster.
My requirement is to convert this into an on-demand run. There will be many users running the export from the application, and for each export run I will have to call this DAG. That means more than one instance of the same DAG will be running at the same time.
I know we can make an API call to trigger a DAG, but I am not sure whether we can run more than one instance of a DAG at the same time. Has anyone had a similar use case?
I am handling this with max_active_runs
dag = DAG(
    'dev_clickstream_v1',
    max_active_runs=5,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=2),
    params=PARAMS,
)
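For the per-export parameters, each API-triggered run can carry its own conf payload, which a task can read at runtime; a sketch on top of the DAG above (the task and key names are made up, and on Airflow 1.10 the operator would also need provide_context=True):

    from airflow.operators.python import PythonOperator

    def start_export(**context):
        # The payload passed when triggering the run, e.g. {"export_id": "1234"}.
        export_id = (context["dag_run"].conf or {}).get("export_id")
        print(f"Starting export {export_id}")

    start_export_op = PythonOperator(
        task_id="start_export",
        python_callable=start_export,
        dag=dag,
    )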
