Airflow DAG starts immediately

I have a question, please. Below are the parameters of my Airflow DAG:
import datetime as dt

default_args = {
    'owner': 'me',
    'email': ['tig.bena@gmail.com', 'tig.bena@yahoo.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'start_date': dt.datetime(2024, 3, 4, 9, 55, 0),
}
I launch Airflow with Docker Compose. The problem is that my DAG runs immediately when I run docker-compose up, even though the start date is in 2024.
Any idea, please?
Thank you.

Airflow may launch a DAG immediately even if the start date is in the future because of a feature called catchup. When catchup is enabled, Airflow runs all the past instances of a DAG run, up to the current date and time.
This feature is usually used to quickly bring a DAG up to date after it has been turned off or has missed some runs. You can control catchup in the DAG definition by setting the catchup parameter to either True or False.
If you set catchup=False, Airflow will only create runs for future intervals of the DAG and will not run anything for past intervals.
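For example, a minimal sketch (hypothetical dag_id and task, assuming Airflow 2.3+ for EmptyOperator) of passing catchup=False when defining the DAG:

import datetime as dt
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow >= 2.3

with DAG(
    dag_id="my_dag",                                  # hypothetical DAG id
    start_date=dt.datetime(2024, 3, 4, 9, 55, 0),
    schedule_interval="@daily",
    catchup=False,   # only schedule future intervals; skip the backlog
) as dag:
    EmptyOperator(task_id="noop")

catchup can also be disabled for all DAGs by setting catchup_by_default = False in airflow.cfg.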

Related

How to fix "DAG seems to be missing"?

I want to run a simple DAG, "test_update_bq", but when I go to localhost I see this: DAG "test_update_bq" seems to be missing.
There are no errors when I run airflow initdb, and when I run airflow test test_update_bq update_table_sql 2015-06-01 it completes successfully and the table is updated in BigQuery. DAG:
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'Anna',
    'depends_on_past': True,
    'start_date': datetime(2017, 6, 2),
    'email': ['airflow@airflow.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}

schedule_interval = "00 21 * * *"

# Define DAG: set ID and assign default args and schedule interval
dag = DAG(
    'test_update_bq',
    default_args=default_args,
    schedule_interval=schedule_interval,
    template_searchpath=['/home/ubuntu/airflow/dags/sql_bq'],
)

update_task = BigQueryOperator(
    dag=dag,
    allow_large_results=True,
    task_id='update_table_sql',
    sql='update_bq.sql',
    use_legacy_sql=False,
    bigquery_conn_id='test',
)

update_task  # single task, no dependencies to set
I would be grateful for any help.
From /logs/scheduler:
[2019-10-10 11:28:53,308] {logging_mixin.py:95} INFO - [2019-10-10 11:28:53,308] {dagbag.py:90} INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:53,333] {scheduler_job.py:1532} INFO - DAG(s) dict_keys(['test_update_bq']) retrieved from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:53,383] {scheduler_job.py:152} INFO - Processing /home/ubuntu/airflow/dags/update_bq.py took 0.082 seconds
[2019-10-10 11:28:56,315] {logging_mixin.py:95} INFO - [2019-10-10 11:28:56,315] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=3600, pid=11761
[2019-10-10 11:28:56,318] {scheduler_job.py:146} INFO - Started process (PID=11761) to work on /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,324] {scheduler_job.py:1520} INFO - Processing file /home/ubuntu/airflow/dags/update_bq.py for tasks to queue
[2019-10-10 11:28:56,325] {logging_mixin.py:95} INFO - [2019-10-10 11:28:56,325] {dagbag.py:90} INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,350] {scheduler_job.py:1532} INFO - DAG(s) dict_keys(['test_update_bq']) retrieved from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,399] {scheduler_job.py:152} INFO - Processing /home/ubuntu/airflow/dags/update_bq.py took 0.081 seconds
Restarting the Airflow webserver helped.
So I killed the gunicorn process on Ubuntu and then restarted the Airflow webserver.
This error is usually due to an exception happening when Airflow tries to parse a DAG. The DAG gets registered in the metastore (and is therefore visible in the UI), but it wasn't parsed by Airflow. Take a look at the Airflow logs; you might see an exception causing this error.
None of the responses helped me solve this issue.
However, after spending some time I found out how to see the exact problem.
In my case I ran Airflow (v2.4.0) using the Helm chart (v1.6.0) inside Kubernetes, which created multiple containers. I got into the running container using ssh and executed two commands using Airflow's CLI, which helped me a lot to debug and understand the problem:
airflow dags report
airflow dags reserialize
In my case the problem was that the database schema didn't match the Airflow version.

Airflow 1.10.1 - Change TimeZone

I am running Airflow (1.10.1) inside a VM on GCP via Docker. I have already changed the local time of my VM, and in the config (airflow.cfg) I set the default timezone to my country's (America/Sao_Paulo), but the home screen still shows UTC and consequently processing is done in UTC too. Is there anything else I can do?
Complementing the given answer, I was able to change the execution according to my timezone inside the DAG through the code below:
import pendulum
from datetime import timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': pendulum.datetime(year=2019, month=7, day=26).astimezone('America/Sao_Paulo'),
    'depends_on_past': False,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting
    # at least 5 minutes
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': slack_msg,  # slack_msg is defined elsewhere in the original DAG file
}

dag = DAG(
    dag_id=nm_dag,  # nm_dag is defined elsewhere in the original DAG file
    default_args=default_args,
    schedule_interval='40 11 * * *',
    dagrun_timeout=timedelta(minutes=60),
)
From the documentation:
Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not convert them to the end user’s time zone in the user interface. There it will always be displayed in UTC. Also templates used in Operators are not converted.
Time zone information is exposed and it is up to the writer of DAG to process it accordingly.
You can change it by setting the default_timezone option in the Airflow config file, or by setting the AIRFLOW__CORE__DEFAULT_TIMEZONE environment variable at runtime.
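As an alternative to converting the date after the fact, here is a minimal sketch of a timezone-aware start_date built with pendulum, following the pattern from Airflow's time-zone documentation (values are illustrative):

import pendulum
from datetime import datetime

# A timezone object from pendulum (which Airflow itself depends on).
# Airflow stores the date in UTC internally but respects the offset
# when computing the schedule.
local_tz = pendulum.timezone("America/Sao_Paulo")

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 7, 26, tzinfo=local_tz),
}

As the quoted documentation says, the UI on Airflow 1.10 will still display dates in UTC; only the scheduling respects the offset.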

Airflow Task failure/retry workflow

I have retry logic for tasks and it's not clear how Airflow handles task failures when retries are turned on.
Their documentation just states that on_failure_callback gets triggered when a task fails, but if that task fails and is also marked for retry, does that mean that both the on_failure_callback and the on_retry_callback would be called?
Retry logic/parameters take effect before failure logic/parameters. So if you have a task set to retry twice, it will attempt to run again two times (executing on_retry_callback each time) before failing (and then executing on_failure_callback).
An easy way to confirm the sequence is to set email_on_retry and email_on_failure to True and see the order in which the emails arrive. That lets you confirm first-hand that it retries before failing.
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'start_date': datetime(2019, 2, 8),
    'email': ['you@work.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
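To watch both callbacks fire in order, here is a minimal sketch (hypothetical DAG and callback names, assuming Airflow 2.x import paths) with a task that always fails:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_retry(context):
    # Called after each failed attempt that still has retries left.
    print(f"retrying, attempt {context['task_instance'].try_number}")

def notify_failure(context):
    # Called once, after the final attempt has failed.
    print("all retries exhausted, task failed")

def always_fail():
    raise RuntimeError("boom")

with DAG(
    dag_id="retry_callback_demo",   # hypothetical DAG id
    start_date=datetime(2019, 2, 8),
    schedule_interval=None,         # trigger manually to observe the order
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fail_then_give_up",
        python_callable=always_fail,
        retries=1,
        retry_delay=timedelta(minutes=1),
        on_retry_callback=notify_retry,
        on_failure_callback=notify_failure,
    )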

Airflow scheduler fails to pickup scheduled DAG's but runs when triggered manually

I have an Airflow 1.10.2 installation with Python 3.5.6.
Metadata is stored in a MySQL database, and LocalExecutor is used for execution.
I have created a sample helloworld.py DAG with the schedule below.
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'Ashish',
    'depends_on_past': False,
    'start_date': datetime(2019, 2, 15),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('Helloworld', schedule_interval='56 6 * * *', default_args=default_args)
But the scheduler didn't pick up this DAG at the scheduled time, whereas when I run it manually from the UI it runs perfectly fine.
My concern is why the scheduler fails to pick up the DAG run at the scheduled time.
I think you are confused about start_date. Your current schedule is set to start at 6:56 AM UTC on 2/15/2019. With this schedule, the DAG will run tomorrow with no problem. This is because Airflow runs jobs at the end of an interval, not at the beginning.
start_date is not when you want the DAG to be triggered, but when you want the scheduling interval to start. If you wanted your job to run today, the start date should be 'start_date': datetime(2019, 2, 14). Then your current daily scheduling interval would have ended at 6:56 AM today as intended and your DAG would have run.
Taken from this answer.
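A minimal sketch of the corrected definition (Airflow 1.x style, matching the question), annotated with the interval semantics described above:

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'Ashish',
    # Move the start date one day back: the first interval then runs from
    # 2019-02-14 06:56 to 2019-02-15 06:56 UTC, and the DAG run for that
    # interval is triggered when the interval *ends*, i.e. on 2019-02-15.
    'start_date': datetime(2019, 2, 14),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('Helloworld', schedule_interval='56 6 * * *', default_args=default_args)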

What's the proper sequence of Airflow commands to run to schedule a DAG?

I don't understand what command(s) I need to run in order to get a DAG scheduled. Let's say I tested the DAG using airflow test dag_name task_id_1 2017-06-22 and the second task with airflow test dag_name task_id_2 2017-06-22.
I ran airflow trigger_dag dag_name, but does that just instantiate a DAG run for that moment?
Let's say I want the dag_name's timing/scheduling to look like:
'start_date': datetime.datetime(2017, 6, 22, 18),
'end_date': datetime.datetime(2017, 6, 23, 20),
schedule_interval = datetime.timedelta(1)
So I just want to schedule and run it today and tomorrow, starting at 18:00 UTC today and 24 hours after that.
Now what command or list of commands am I supposed to run? Do I have to run airflow scheduler every time I want to add and schedule a DAG?
trigger_dag triggers a DAG run immediately. To schedule the DAG, just put the file in the DAGs folder, go to the Airflow UI, and enable the DAG; as long as the scheduler (airflow scheduler) is running, it creates the scheduled runs for you.
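For illustration, a minimal sketch (hypothetical dag_id and task, Airflow 1.x-era import path to match the question) of a file you would drop into the DAGs folder for that schedule:

import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path

default_args = {
    'owner': 'me',
    'start_date': datetime.datetime(2017, 6, 22, 18),
    'end_date': datetime.datetime(2017, 6, 23, 20),
}

dag = DAG(
    'dag_name',                                    # hypothetical DAG id
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # one run per 24 hours
)

DummyOperator(task_id='noop', dag=dag)

With the scheduler running and the DAG enabled, the run for the 2017-06-22 18:00 interval is created once that interval ends, i.e. at 2017-06-23 18:00.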
