I have a DAG script in Airflow and it auto-refreshes every 30 seconds. I want to either disable this or, if possible, set it to a higher time limit.
Also, as suggested here, I have set min_file_process_interval and dag_dir_list_interval to higher values and restarted the Airflow server, but the DAG still gets refreshed within the next 30 seconds.
Please suggest a workaround to disable this auto-refresh or delay it to a longer interval. Thanks
Update:
worker_refresh_interval=some_number
https://airflow.apache.org/docs/stable/configurations-ref.html#worker-refresh-interval
Set this variable to a higher value in the airflow.cfg file and the DAG won't refresh until that interval has passed.
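For reference, a sketch of the relevant airflow.cfg settings (the values are illustrative and in seconds; worker_refresh_interval lives in the webserver section, the other two in the scheduler section):

[webserver]
# seconds to wait before refreshing a batch of webserver (gunicorn) workers
worker_refresh_interval = 3600

[scheduler]
# seconds after which a DAG file is re-parsed
min_file_process_interval = 3600
# seconds between scans of the DAGs folder for new files
dag_dir_list_interval = 3600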
Related
I am completely new to Airflow and am trying to grasp the concepts of scheduling and default args.
I have a scenario where I would like to schedule my DAG hourly to do a data transfer task between a source and a database. What I am trying to understand is: let's say one of my DAG runs was triggered at 00:00. Now, if it takes more than an hour for this run to successfully complete all of its tasks (say 1 hour 30 minutes), does it mean that the next DAG run that was supposed to be triggered at 01:00 will NOT get triggered, but the DAG run for 02:00 will?
Yes.
In order to avoid this, you need catchup=True on the DAG object.
Reference : https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
(Search for Catchup)
The Airflow scheduler is used to monitor all DAGs and tasks in Airflow. Default arguments can be used to create tasks with default parameters in a DAG.
The first DAG run is based on start_date, and subsequent runs follow schedule_interval sequentially. The scheduler doesn't trigger tasks until the period they cover has ended. For your requirement you can set catchup=True on the DAG so that it runs for each completed interval and the scheduler executes them sequentially. Catchup is used to start a DAG run for every data interval that has not yet been run since the last one.
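For illustration, a minimal hourly DAG with catchup enabled might look like the sketch below (Airflow 2-style imports; the DAG id, start date, and operator are placeholders, not taken from the question):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='hourly_transfer',            # hypothetical DAG id
    start_date=datetime(2021, 5, 1),     # first data interval starts here
    schedule_interval='@hourly',
    catchup=True,                        # schedule every missed interval since start_date
) as dag:
    transfer = BashOperator(
        task_id='transfer_task',
        bash_command='echo "transferring data for {{ ds }}"',
    )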
In our application, one DAG run processes 50 records every 10 minutes. To load historical data (~80k records) in a short period of time, we increased max_active_runs to 3 and decreased the interval to 2 minutes.
When the DAG starts execution, in the first task we pick up the first 50 records that are eligible, mark them as IN-PROGRESS, and proceed.
The issue we noticed is that when multiple DAG runs start their execution at the same time (when there is some delay from the previous DAG run), the same records are picked up by more than one DAG run.
Is there a possibility to force a delay between multiple active DAG runs?
As stated in the Apache Airflow documentation, I can control how often a DAG is updated by setting the configuration variable min_file_process_interval in my airflow.cfg file:
min_file_process_interval
Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval number of seconds. Updates to DAGs are reflected after this interval. Keeping this number low will increase CPU usage.
However, I didn't find any clue or best practice about which value I should set for min_file_process_interval.
Example
My DAG changes once a day. By default min_file_process_interval is set to 30 seconds. This means that most of the time re-parsing the DAG is useless: as long as the DAG doesn't change, the re-parsed DAG and the previous DAG are the same. It consumes resources and generates logs. But if I re-parse the DAG only once a day, do I risk running the wrong DAG if the DAG changes after the daily re-parse, or is the DAG also re-parsed just before each run?
What value for min_file_process_interval should I set in this case ?
EDIT: As stated in Elad's answer responding to a previous version of this question, dynamic DAGs should be avoided. However, if I do have dynamic DAGs, how should I choose min_file_process_interval?
You are mixing two different things.
min_file_process_interval determines how often Airflow scans the .py files and updates the DAGs within Airflow. Consider that when you deploy a new .py file, Airflow needs to read it and register it in the metastore database - so the setting is about how often that happens.
For your use case the DAG code should not be updated every day - in fact it should not be updated at all. It should just run every day. Your DAG just needs to be able to handle the correct file per date. Your code can be something like:
from airflow import DAG
from airflow.providers.ftp.sensors.ftp import FTPSensor

with DAG(dag_id='stackoverflow',
         default_args=default_args,
         schedule_interval="@daily",
         catchup=False
         ) as dag:

    # Waits for a file or directory to be present on FTP.
    sensor_op = FTPSensor(
        task_id='sensor_task',
        path='/my_folder/{{ ds }}/file.csv',  # path to your file on the server
        fail_on_transient_errors=False,
        ftp_conn_id='ftp_default'
    )

    # Operator to process the file (placeholder for the operator of your choice)
    operator_op = SomeOperator()

    sensor_op >> operator_op
In that DAG a run will start every day. The first operator is a sensor, so if the file for that day isn't present the workflow will wait until it appears; only once it appears will the workflow continue to the second operator, which should process the file.
Note that the path parameter of FTPSensor is templated. This means you can use macros like {{ ds }}, which will give you a path that contains each day's date, like:
/my_folder/2021-05-01/file.csv
/my_folder/2021-05-02/file.csv
/my_folder/2021-05-03/file.csv
You can also do path='/my_folder/{{ ds }}.csv' which will give:
/my_folder/2021-05-01.csv
/my_folder/2021-05-02.csv
/my_folder/2021-05-03.csv
I have a DAG 'abc' scheduled to run every day at 7 AM CST. For some reason, I do not want to run tomorrow's instance. How can I skip that particular instance? Is there any way to do that using the command line? Appreciate any help on this.
I believe you can preemptively create a DAG run for the future date in the UI under Browse -> DAG Runs -> Create, initializing it in the success (or failed) state, which should prevent the scheduler from creating a new run when the time comes. I think you can do this on the CLI with trigger_dag as well, but you'll just need to separately update its state because it'll default to running.
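A rough sketch of the CLI route (Airflow 1.x-era command names; the DAG id and date are illustrative, and newer versions use airflow dags trigger instead):

# create the run for the future date; it will be created in the "running" state
airflow trigger_dag -e "2021-05-03T07:00:00" abc
# you would then still need to mark that run as success (or failed), e.g. from
# the UI, so the scheduler skips that interval when the time comes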
I think you can set the start_date to the day after tomorrow, or whatever date you want your DAG to run, as long as it is in the future, but the schedule interval will stay the same: every day at 7 AM. You can set the start date in default_args.
I have an airflow dag specified as shown in the picture above.
The git_pull_datagenerator_batch_2 is supposed to be delayed by the TimeDeltaSensor wait_an_hour.
However, the task git_pull_datagenerator seems to be delayed as well, although it does not have a dependency on wait_an_hour. (The whole DAG is scheduled at 2019-12-10T20:00:00, but git_pull_datagenerator started one hour later than that.)
I have checked the Airflow documentation but could not find any clues.
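Since the picture is not reproduced here, a rough sketch of the layout described might be as follows (Airflow 2-style imports; the schedule, start date, and bash commands are assumptions for illustration, only the task names come from the question):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensor

with DAG(dag_id='datagenerator',
         start_date=datetime(2019, 12, 10),
         schedule_interval='@hourly',
         catchup=False) as dag:

    git_pull_datagenerator = BashOperator(
        task_id='git_pull_datagenerator',
        bash_command='echo "pull batch 1"')

    wait_an_hour = TimeDeltaSensor(
        task_id='wait_an_hour',
        delta=timedelta(hours=1))

    git_pull_datagenerator_batch_2 = BashOperator(
        task_id='git_pull_datagenerator_batch_2',
        bash_command='echo "pull batch 2"')

    # only batch 2 depends on the sensor; git_pull_datagenerator has no upstream
    # dependency, so with a parallel executor it should not wait for the sensor
    wait_an_hour >> git_pull_datagenerator_batch_2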
I'm assuming your schedule interval is hourly? A DAG run with an execution date of 2019-12-10T20:00:00 on an @hourly schedule interval is expected to run at or shortly after 2019-12-10T21:00:00, when hour 20 has "completed". I don't think it has anything to do with the sensor.
This is a common Airflow pitfall:
Airflow was developed as a solution for ETL needs. In the ETL world,
you typically summarize data. So, if I want to summarize data for
2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be
right after all data for 2016-02-19 becomes available.
If this is what is happening, wait_an_hour started at 2019-12-10T21:00:00 and git_pull_datagenerator_batch_2 at 2019-12-10T22:00:00.
It turns out that the default executor is a SequentialExecutor, which causes all of the tasks to run in a linear order.
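If parallel task execution is wanted, the executor can be switched in airflow.cfg, roughly as sketched below (LocalExecutor assumes a non-SQLite metadata database such as PostgreSQL or MySQL):

[core]
# SequentialExecutor (the default with SQLite) runs one task at a time;
# LocalExecutor lets independent tasks such as git_pull_datagenerator and
# wait_an_hour run in parallel
executor = LocalExecutor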