I'm new to Airflow. I'm following the official tutorial to set up my first DAG and task:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'admin',
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}

with DAG(
    dag_id="hello_world_dag",
    description="Hello world DAG",
    start_date=datetime(2023, 1, 16),
    schedule_interval='@daily',
    default_args=default_args
) as dag:
    task1 = BashOperator(
        task_id="hello_task",
        bash_command="echo hello world!"
    )

    task1
When I tried to run this manually, it always failed. I've checked the web server logs and the scheduler logs, and they don't show any obvious errors. I also checked the task run logs, but they're empty.
The setup is pretty simple: SequentialExecutor with SQLite. My question is: where can I see the worker logs, or any other place that might have a useful message logged?
OK, finally figured this out.
First, let me correct my question: there is actually an error in the scheduler log saying that the "BashTaskRunner" cannot be loaded. So I searched Airflow's source code and found that it was renamed to StandardTaskRunner about three years ago (link).
This is the only occurrence of the word BashTaskRunner in the whole repo, so I'm curious how the AIRFLOW_HOME/airflow.cfg is generated, since it sets this as the default task_runner value.
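For anyone hitting the same thing: airflow.cfg is generated from Airflow's default configuration template the first time Airflow starts with a given AIRFLOW_HOME, and it is not rewritten on upgrade, so a file created by an old installation can keep stale values such as BashTaskRunner. A minimal sketch, assuming the option still lives in the [core] section, to print the task_runner value Airflow actually resolves:

# check_task_runner.py -- print the task_runner value Airflow picks up from airflow.cfg
from airflow.configuration import conf

# "core"/"task_runner" is an assumption about the section/option names; adjust for your version
print(conf.get("core", "task_runner"))

If this prints BashTaskRunner, editing that line in AIRFLOW_HOME/airflow.cfg to StandardTaskRunner (or removing the option so the built-in default applies) should let the task start.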
example_trigger_controller_dag.py
import pendulum

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="example_trigger_controller_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule="@once",
    tags=["example"],
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        trigger_dag_id="example_trigger_target_dag",  # Ensure this equals the dag_id of the DAG to trigger
        conf={"message": "Hello World"},
    )
example_trigger_target_dag.py
import pendulum

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator


@task(task_id="run_this")
def run_this_func(dag_run=None):
    """
    Print the payload "message" passed to the DagRun conf attribute.

    :param dag_run: The DagRun object
    """
    print(f"Remotely received value of {dag_run.conf.get('message')} for key=message")


with DAG(
    dag_id="example_trigger_target_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=["example"],
) as dag:
    run_this = run_this_func()

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "Here is the message: $message"',
        env={"message": '{{ dag_run.conf.get("message") }}'},
    )
The task in the controller DAG finished successfully, but the task in the target DAG is stuck in the queue. Any ideas on how to solve this problem?
I ran your DAGs (with both of them unpaused) and they work fine in a completely new environment (Airflow 2.5.0, Astro CLI Runtime 7.1.0). So the issue is most likely not with your DAG code.
Tasks stuck in the queue are often a scheduler issue, mostly with older Airflow versions. I suggest you:
make sure both DAGs are unpaused when the first DAG runs (see the sketch after this list).
make sure all start_dates are in the past (though in this case usually the tasks don't even get queued)
restart your scheduler/Airflow environment
try running the DAGs while no other DAGs are running, to check whether the parallelism limit has been reached (if you are using the Kubernetes executor you should also check worker_pods_creation_batch_size, and with the Celery executor worker_concurrency and stalled_task_timeout)
take a look at your scheduler logs (at $AIRFLOW_HOME/logs/scheduler)
upgrade Airflow if you are running an older version.
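Regarding the first point about unpausing: if the target DAG tends to be paused when the controller fires, one option is to register it unpaused from the start. A minimal sketch of the target DAG header, assuming you are happy to override the paused-by-default behaviour:

import pendulum

from airflow import DAG

with DAG(
    dag_id="example_trigger_target_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=["example"],
    is_paused_upon_creation=False,  # register the DAG unpaused so triggered runs are picked up immediately
) as dag:
    ...  # same tasks as in example_trigger_target_dag.py above

Note that is_paused_upon_creation only applies the first time the DAG is registered; if the DAG already exists in the metadata database, you still need to unpause it in the UI or via the CLI.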
Dear Apache Airflow experts,
I am currently trying to make the parallel execution of Apache Airflow 2.3.x DAGs configurable via the DAG run config.
When executing the code below, the DAG creates two tasks; for the sake of my question it does not matter what the other DAG does.
Because max_active_tis_per_dag is set to 1, the two tasks will be run one after another.
What I want to achieve: I want to provide the result of get_num_max_parallel_runs (which checks the DAG run config and falls back to 1 if no value is present) to max_active_tis_per_dag.
I would appreciate any input on this!
Thank you in advance!
from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime

with DAG(
    'aaa_test_controller',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False
) as dag:

    @task
    def get_num_max_parallel_runs(dag_run=None):
        return dag_run.conf.get("num_max_parallel_runs", 1)

    trigger_dag = TriggerDagRunOperator.partial(
        task_id="trigger_dependent_dag",
        trigger_dag_id="aaa_some_other_dag",
        wait_for_completion=True,
        max_active_tis_per_dag=1,
        poke_interval=5
    ).expand(conf=['{"some_key": "some_value_1"}', '{"some_key": "some_value_2"}'])
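For reference, the pattern that definitely works in Airflow 2.3+ is feeding a task's return value into .expand(); whether that matches the goal above is unclear, because it makes the number of mapped trigger runs configurable rather than max_active_tis_per_dag itself. A sketch under that assumption, reusing the DAG and task ids from the snippet above (aaa_test_controller_mapped is a hypothetical id for this variant):

from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime

with DAG(
    'aaa_test_controller_mapped',  # hypothetical DAG id for this sketch
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False
) as dag:

    @task
    def build_confs(dag_run=None):
        # read the desired number of triggered runs from the DAG run config, defaulting to 1
        n = int(dag_run.conf.get("num_max_parallel_runs", 1))
        return [{"some_key": f"some_value_{i}"} for i in range(n)]

    TriggerDagRunOperator.partial(
        task_id="trigger_dependent_dag",
        trigger_dag_id="aaa_some_other_dag",
        wait_for_completion=True,
        max_active_tis_per_dag=1,  # still a hard-coded cap; making this value run-configurable is the open question
        poke_interval=5
    ).expand(conf=build_confs())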
I'm facing some issues trying to set up a basic DAG file inside Airflow (I also have two other files).
I'm using the LocalExecutor on Ubuntu and saved my files at "C:\Users\tdamasce\Documents\workspace", with the DAG and log files inside it.
My script is:
# step 1 - libraries
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago  # needed for the start_date below
from datetime import datetime, timedelta

# step 2
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'retries': 0
}

# step 3
dag = DAG(
    dag_id='DAG-1',
    default_args=default_args,
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)

# step 4
start = DummyOperator(
    task_id='start',
    dag=dag
)

end = DummyOperator(
    task_id='end',
    dag=dag
)
My DAG stays like that: it never shows up in the Airflow UI.
Please let me know if any additional info is needed.
As per your updated question, I can see that you placed the DAGs under the directory
"C:\Users\tdamasce\Documents\workspace" with the DAG and log files
inside it.
You need to add your DAGs to the dags_folder (specified in airflow.cfg; by default it's the $AIRFLOW_HOME/dags subfolder). Check your AIRFLOW_HOME variable and you should find a dags folder there.
You can also run airflow list_dags, which will list all the DAGs.
If you are still not able to see it in the UI, restart the servers.
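If the DAG still does not show up after moving the file into the dags_folder, a quick way to see which folder Airflow is actually scanning, and whether the file imports cleanly, is to build a DagBag from a Python shell in the same environment as the scheduler. A minimal sketch (DagBag with no arguments falls back to the configured dags_folder):

# run inside the Airflow environment, e.g. python3 check_dagbag.py
from airflow.models import DagBag

dagbag = DagBag()                 # parses the configured dags_folder
print(dagbag.dag_folder)          # the folder Airflow is scanning
print(list(dagbag.dags))          # DAG ids that were successfully loaded
print(dagbag.import_errors)       # per-file import errors, if any

A file listed in import_errors will not have its DAGs registered, which would explain them not appearing in the UI.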
I have a slightly complex setup:
I run my Airflow (v1.10.13) pipelines in local time (set to the VM's timezone). The following DAG's Monday run was marked as successful, but the task within was never scheduled (and thus has no logs whatsoever). I had some issues with the Airflow scheduler and non-UTC timezones in the past, so I wonder if that could be the reason here.
from airflow import DAG
from datetime import timedelta
from somewhere import get_localized_yesterday
import prepered_tasks as t
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': get_localized_yesterday(),
    'email': [],
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1)
}

# Schedule the DAG daily at 2 a.m.
dag = DAG(
    'descriptive_DAG_name',
    default_args=default_args,
    description='',
    schedule_interval='0 2 * * Mon-Fri',
    tags=['PROD']
)
single_task = t.task_partial(dag=dag)
single_task
The 'task_partial' is a task object embedded in a partial, so I only need to provide the dag for instantiation. This approach works as intended in my other pipelines.
I checked the usual suspects:
The scheduler is running.
The workers are running.
The DAG is turned on in the UI.
Other pipelines behave normally.
No dependencies on previous runs.
Start date lies well in the past.
There was a bug in Airflow 1.10.13 and the release was yanked.
You should upgrade to 1.10.14.
See the issue and the fix.
Quote from the issue:
After performing an upgrade to v1.10.13 we noticed that tasks in some of our DAGs were not being scheduled. After a bit of investigation we discovered that by commenting out 'depends_on_past': True the issue went away.
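Unrelated to the yanked release, but worth mentioning given the timezone concerns in the question: a dynamic start_date such as get_localized_yesterday() is generally discouraged, and the usual recommendation is a static, timezone-aware start_date so the cron expression is interpreted in the intended timezone. A minimal sketch, with Europe/Berlin standing in for whatever the VM's timezone actually is:

import pendulum
from datetime import datetime, timedelta
from airflow import DAG

local_tz = pendulum.timezone("Europe/Berlin")  # assumption: replace with the VM's timezone

dag = DAG(
    'descriptive_DAG_name',
    default_args={'owner': 'airflow', 'retry_delay': timedelta(minutes=1)},
    description='',
    schedule_interval='0 2 * * Mon-Fri',               # interpreted in local_tz because start_date is tz-aware
    start_date=datetime(2021, 1, 1, tzinfo=local_tz),  # static, timezone-aware start date
    tags=['PROD'],
)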
I have the following DAG defined in code:
from datetime import timedelta, datetime
import airflow
from airflow import DAG
from airflow.operators.docker_operator import DockerOperator
from airflow.contrib.operators.ecs_operator import ECSOperator
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2018, 9, 24, 10, 00, 00)
}

dag = DAG(
    'data-push',
    default_args=default_args,
    schedule_interval='0 0 * * 1,4',
)

colors = ['blue', 'red', 'yellow']

for color in colors:
    ECSOperator(
        dag=dag,
        task_id='data-push-for-%s' % (color),
        task_definition='generic-push-colors',
        cluster='MY_ECS_CLUSTER_ARN',
        launch_type='FARGATE',
        overrides={
            'containerOverrides': [
                {
                    'name': 'push-colors-container',
                    'command': [color]
                }
            ]
        },
        region_name='us-east-1',
        network_configuration={
            'awsvpcConfiguration': {
                'securityGroups': ['MY_SG'],
                'subnets': ['MY_SUBNET'],
                'assignPublicIp': "ENABLED"
            }
        },
    )
This should create a DAG with 3 tasks, one for each color in my colors list.
This seems good. When I run:
airflow list_dags
I see my DAG listed:
data-push
And when I run:
airflow list_tasks data-push
I see my three tasks appear as they should:
data-push-for-blue
data-push-for-red
data-push-for-yellow
I then test-run one of my tasks by entering the following into the terminal:
airflow run data-push data-push-for-blue 2017-1-23
And this runs the task, which I can see appear in my ECS cluster on the AWS dashboard, so I know for a fact that the task runs on my ECS cluster, the data is pushed successfully, and everything is great.
Now, running the DAG data-push from the Airflow UI is where I run into a problem.
I run:
airflow initdb
followed by:
airflow webserver
and then go to the Airflow UI at localhost:8080.
I see the DAG data-push in the list of DAGs and click it. To test-run the entire DAG, I click the "Trigger DAG" button, don't add any configuration JSON, and click 'Trigger'. The tree view for the DAG then shows a green circle on the right of the tree structure, seemingly indicating the DAG is 'running'. But the green circle just stays there for ages, and when I check my ECS dashboard I see no tasks actually running. So nothing happens after triggering the DAG from the Airflow UI, even though the tasks work when I run them manually from the CLI.
I am using the SequentialExecutor if that matters.
I have two main theories as to why triggering the DAG does nothing while running the individual tasks from the CLI works: maybe I am missing something in the Python code where I define the DAG (perhaps because I don't specify any dependencies between the tasks?), or maybe I am not running the Airflow scheduler. But if I am manually triggering the DAGs from the Airflow UI, I don't see why the scheduler would need to be running, or why the UI wouldn't show me an error saying this is a problem.
Any ideas?
Sounds like you did not unpause your DAG:
Toggle the On/Off switch in the upper left of the web UI, or use the CLI: airflow unpause <dag_id>.
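Two follow-up notes on the theories in the question: missing dependencies are not the problem, since tasks without dependencies are simply all eligible to run (and the SequentialExecutor runs them one at a time anyway), but the scheduler process does need to be running for queued task instances to execute, even for runs triggered manually from the UI. If you do want the pushes to run in a fixed order, a sketch of the same loop with explicit chaining (same placeholder ARNs, subnets and security groups as in the question):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator

dag = DAG(
    'data-push',
    default_args={
        'owner': 'airflow',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
        'start_date': datetime(2018, 9, 24, 10, 0, 0),
    },
    schedule_interval='0 0 * * 1,4',
)

previous = None
for color in ['blue', 'red', 'yellow']:
    push = ECSOperator(
        dag=dag,
        task_id='data-push-for-%s' % color,
        task_definition='generic-push-colors',
        cluster='MY_ECS_CLUSTER_ARN',
        launch_type='FARGATE',
        overrides={'containerOverrides': [{'name': 'push-colors-container', 'command': [color]}]},
        region_name='us-east-1',
        network_configuration={
            'awsvpcConfiguration': {
                'securityGroups': ['MY_SG'],
                'subnets': ['MY_SUBNET'],
                'assignPublicIp': 'ENABLED',
            }
        },
    )
    if previous is not None:
        previous >> push  # optional: chain blue >> red >> yellow instead of leaving them independent
    previous = push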