Airflow 1.10.1 - Change TimeZone

I am running Airflow (1.10.1) inside a VM on GCP via Docker. I have already changed the local time of my VM, and in the config (airflow.cfg) I also set the default timezone to my country's (America/Sao_Paulo), but the home screen still shows UTC and, consequently, processing is done in UTC too. Is there anything else I can do?

Complementing the given answer, I was able to make execution follow my timezone inside the DAG with the code below:
import pendulum
from datetime import timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    # timezone-aware start date
    'start_date': pendulum.datetime(year=2019, month=7, day=26).astimezone('America/Sao_Paulo'),
    'depends_on_past': False,
    'email': ['airflow#airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': slack_msg,  # slack_msg: callback defined elsewhere in the original code
}

dag = DAG(
    dag_id=nm_dag,  # nm_dag: DAG name defined elsewhere in the original code
    default_args=default_args,
    schedule_interval='40 11 * * *',
    dagrun_timeout=timedelta(minutes=60),
)

From the documentation:
Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not convert them to the end user’s time zone in the user interface. There it will always be displayed in UTC. Also templates used in Operators are not converted.
Time zone information is exposed and it is up to the writer of DAG to process it accordingly.

You can change it by setting the correct timezone value in the AIRFLOW__CORE__DEFAULT_TIMEZONE environment variable at runtime, or in the corresponding default_timezone option in the Airflow config file.
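For example, in airflow.cfg under the [core] section, or as an environment variable in the container environment (using the timezone from the question; either form should work):
[core]
default_timezone = America/Sao_Paulo
# or, equivalently:
# AIRFLOW__CORE__DEFAULT_TIMEZONE=America/Sao_Paulo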

Related

Airflow DAG starts immediately

I have a question, please. Below are the parameters of my Airflow DAG:
import datetime as dt

default_args = {
    'owner': 'me',
    'email': ['tig.bena#gmail.com', 'tig.bena#yahoo.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'start_date': dt.datetime(2024, 3, 4, 9, 55, 00),
}
I launch Airflow with Docker Compose. The problem is that my DAG runs immediately when I run docker-compose up, even though the start date is in 2024.
Any ideas, please?
Thank you.
Airflow may launch a DAG immediately even if the start date appears to be in the future because of a feature called catchup. When catchup is enabled, Airflow creates DAG Runs for all past schedule intervals of the DAG, up to the current date and time.
This feature is usually used to quickly bring a DAG up to date after it has been turned off or has missed some runs. You can control this behavior by setting the DAG's catchup parameter to either True or False.
If you set catchup=False, Airflow will only create runs for future intervals and will not create runs for past intervals.
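A minimal sketch of where the parameter goes (the DAG id and schedule here are made up for illustration; this assumes an Airflow 2.x setup like the docker-compose one in the question):
from datetime import datetime
from airflow import DAG

# catchup=False tells the scheduler not to create runs for past intervals;
# only intervals from "now" onwards are scheduled.
dag = DAG(
    dag_id='example_no_catchup',           # made-up name for illustration
    start_date=datetime(2024, 3, 4, 9, 55),
    schedule_interval='@daily',
    catchup=False,
)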

Apache Airflow does not enforce dagrun_timeout

I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code below:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}

dag = DAG('min_timeout', default_args=default_args,
          schedule_interval=timedelta(minutes=5),
          dagrun_timeout=timedelta(seconds=30), max_active_runs=1)

t1 = BashOperator(
    task_id='fast_task',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',
    dag=dag)

t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that Airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs a loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Bear in mind the "at least" part, because the scheduler performs some actions that may delay the next cycle of its loop.
These actions include:
parsing the DAG files
filling up the DagBag
checking the DagRuns and updating their state
scheduling the next DagRun
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler only performs its next cycle after the task completes.
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout won't work for non-scheduled DagRuns (e.g. manually triggered ones), or if the number of active DagRuns is less than the max_active_runs parameter.
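If a DAG-wide limit can't be enforced this way, one workaround, as the question already hints, is to put execution_timeout into default_args so it applies to every task. A sketch along those lines (reusing the names from the example above):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# execution_timeout set in default_args is passed to every operator,
# so each task is killed after 30 seconds, approximating a DAG-wide limit.
default_args = {
    'owner': 'me',
    'start_date': datetime(2019, 6, 1),
    'execution_timeout': timedelta(seconds=30),
}

dag = DAG('min_timeout', default_args=default_args,
          schedule_interval=timedelta(minutes=5), max_active_runs=1)

slow_task = BashOperator(task_id='slow_task', bash_command='sleep 45', dag=dag)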

How to fix "DAG seems to be missing"?

I want to run a simple DAG "test_update_bq", but when I go to localhost I see this: DAG "test_update_bq" seems to be missing.
There are no errors when I run airflow initdb, and when I run airflow test test_update_bq update_table_sql 2015-06-01 it completes successfully and the table is updated in BQ. DAG:
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'Anna',
    'depends_on_past': True,
    'start_date': datetime(2017, 6, 2),
    'email': ['airflow#airflow.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}

schedule_interval = "00 21 * * *"

# Define DAG: set ID and assign default args and schedule interval
dag = DAG('test_update_bq', default_args=default_args,
          schedule_interval=schedule_interval,
          template_searchpath=['/home/ubuntu/airflow/dags/sql_bq'])

update_task = BigQueryOperator(
    dag=dag,
    allow_large_results=True,
    task_id='update_table_sql',
    sql='update_bq.sql',
    use_legacy_sql=False,
    bigquery_conn_id='test',
)

update_task
I would be grateful for any help.
/logs/scheduler
[2019-10-10 11:28:53,308] {logging_mixin.py:95} INFO - [2019-10-10 11:28:53,308] {dagbag.py:90} INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:53,333] {scheduler_job.py:1532} INFO - DAG(s) dict_keys(['test_update_bq']) retrieved from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:53,383] {scheduler_job.py:152} INFO - Processing /home/ubuntu/airflow/dags/update_bq.py took 0.082 seconds
[2019-10-10 11:28:56,315] {logging_mixin.py:95} INFO - [2019-10-10 11:28:56,315] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=3600, pid=11761
[2019-10-10 11:28:56,318] {scheduler_job.py:146} INFO - Started process (PID=11761) to work on /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,324] {scheduler_job.py:1520} INFO - Processing file /home/ubuntu/airflow/dags/update_bq.py for tasks to queue
[2019-10-10 11:28:56,325] {logging_mixin.py:95} INFO - [2019-10-10 11:28:56,325] {dagbag.py:90} INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,350] {scheduler_job.py:1532} INFO - DAG(s) dict_keys(['test_update_bq']) retrieved from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,399] {scheduler_job.py:152} INFO - Processing /home/ubuntu/airflow/dags/update_bq.py took 0.081 seconds
Restarting the Airflow webserver helped.
I killed the gunicorn process on Ubuntu and then restarted the Airflow webserver.
This error is usually due to an exception happening when Airflow tries to parse a DAG. The DAG gets registered in the metastore (and is therefore visible in the UI), but it wasn't parsed by Airflow. Take a look at the Airflow logs; you should see the exception causing this error.
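One way to surface that exception yourself (a sketch, assuming the DAGs folder path from this question) is to load the DagBag manually and print its import errors:
from airflow.models import DagBag

# Parse the DAGs folder the same way the scheduler does and show any parse errors.
dagbag = DagBag('/home/ubuntu/airflow/dags')
print(dagbag.import_errors)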
None of the responses helped me solve this issue.
However, after spending some time, I found out how to see the exact problem.
In my case I ran Airflow (v2.4.0) using the Helm chart (v1.6.0) inside Kubernetes, which created multiple containers. I opened a shell in the running container and executed two commands using Airflow's CLI, which helped me a lot to debug and understand the problem:
airflow dags report
airflow dags reserialize
In my case the problem was that the database schema didn't match the Airflow version.

Airflow Task failure/retry workflow

I have retry logic set up for tasks, and it's not clear how Airflow handles task failures when retries are turned on.
The documentation just states that on_failure_callback gets triggered when a task fails, but if that task fails and is also marked for retry, does that mean that both the on_failure_callback and the on_retry_callback would be called?
Retry logic/parameters take effect before failure logic/parameters. So if you have a task set to retry twice, it will attempt to run again two times (executing on_retry_callback each time) before failing (and then executing on_failure_callback).
An easy way to confirm the sequence is to set email_on_retry and email_on_failure to True and watch the order in which the emails arrive; you can then confirm for yourself that it retries before failing. A sketch with both callbacks wired in follows the snippet below.
default_args = {
    'owner': 'me',
    'start_date': datetime(2019, 2, 8),
    'email': ['you#work.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
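Here is that sketch (the callback bodies and the DAG id are placeholders, not from the question):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def notify_retry(context):
    # Fires after each failed attempt that still has retries left.
    print("retrying:", context['task_instance'].task_id)

def notify_failure(context):
    # Fires once the final attempt has failed.
    print("failed:", context['task_instance'].task_id)

default_args = {
    'owner': 'me',
    'start_date': datetime(2019, 2, 8),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'on_retry_callback': notify_retry,
    'on_failure_callback': notify_failure,
}

dag = DAG('callback_order_demo', default_args=default_args, schedule_interval=None)

# This task always fails: notify_retry fires after the first attempt,
# notify_failure fires after the retry also fails.
always_fails = BashOperator(task_id='always_fails', bash_command='exit 1', dag=dag)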

Airflow scheduler fails to pickup scheduled DAG's but runs when triggered manually

I have an Airflow 1.10.2 installation with Python 3.5.6.
Metadata is stored in a MySQL database, and I use the LocalExecutor for execution.
I have created a sample helloworld.py DAG with the schedule below.
default_args = {
    'owner': 'Ashish',
    'depends_on_past': False,
    'start_date': datetime(2019, 2, 15),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('Helloworld', schedule_interval='56 6 * * *', default_args=default_args)
But the scheduler didn't pick up this DAG at the scheduled time, whereas when I trigger it manually from the UI it runs perfectly fine.
My concern is why the scheduler fails to pick up the DAG run at the scheduled time.
I think you are confused about start_date. Your current schedule is set to start its first interval at 6:56 AM UTC on 2019-02-15. With this schedule, the DAG will run tomorrow with no problem. This is because Airflow runs jobs at the end of an interval, not at the beginning.
start_date is not when you want the DAG to be triggered, but when you want the scheduling interval to start. If you wanted your job to run today, the start date should be 'start_date': datetime(2019, 2, 14). Then your current daily scheduling interval would have ended at 6:56 AM today as intended, and your DAG would have run.
Taken from this answer.
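To illustrate, a sketch of the same DAG with start_date moved back one day (everything else taken from the question):
from datetime import datetime, timedelta
from airflow import DAG

# With start_date 2019-02-14 and a daily schedule at 06:56, the first interval is
# 2019-02-14 06:56 -> 2019-02-15 06:56, and the run fires once that interval ends.
default_args = {
    'owner': 'Ashish',
    'start_date': datetime(2019, 2, 14),  # one interval before the first desired run
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('Helloworld', schedule_interval='56 6 * * *', default_args=default_args)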
