MWAA ECSOperator "No task found" but succeeds - airflow

I have an ECSOperator task in MWAA.
When I trigger the task, it succeeds immediately. However, the task should take time to complete, so I do not believe it is actually starting.
When I go to inspect the task run, I get the error "No tasks found".
The task definition looks like this:
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

dag = DAG(
    "my_dag",
    description = "",
    start_date = datetime.fromisoformat("2022-03-28"),
    catchup = False,
)

my_task = ECSOperator(
    task_id = "my_task",
    cluster = "my-cluster",
    task_definition = "my-task",
    launch_type = "FARGATE",
    aws_conn_id = "aws_ecs",
    overrides = {},
    network_configuration = {
        "awsvpcConfiguration": {
            "securityGroups": [ "sg-aaaa" ],
            "subnets": [ "subnet-bbbb" ],
        },
    },
    awslogs_group = "/ecs/my-task",
)

my_task
What am I missing here?

If the task had executed, it would have a log.
I think your issue is that the task you defined is not assigned to any DAG object, which is why you see the "No tasks found" error (the DAG is empty).
You should add dag=dag:
my_task = ECSOperator(
    task_id = "my_task",
    ...,
    dag=dag
)
or use a context manager to avoid this kind of issue:
with DAG(
    dag_id="my_dag",
    ...
) as dag:
    my_task = ECSOperator(
        task_id = "my_task",
        ...,
    )
If you are using Airflow 2 you can also use the @dag decorator.
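For reference, here is a minimal sketch of the decorator style in Airflow 2; the schedule and the operator arguments below are illustrative placeholders rather than values taken from your DAG:

from datetime import datetime

from airflow.decorators import dag
from airflow.providers.amazon.aws.operators.ecs import ECSOperator


@dag(start_date=datetime(2022, 3, 28), schedule_interval=None, catchup=False)
def my_dag():
    # Operators instantiated inside the decorated function are
    # automatically attached to this DAG.
    ECSOperator(
        task_id="my_task",
        cluster="my-cluster",
        task_definition="my-task",
        launch_type="FARGATE",
        overrides={},
    )


# Calling the decorated function creates the DAG object so Airflow can discover it.
my_dag()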

Related

DatabricksRunOperator Execution date

opr_run_now = DatabricksRunNowOperator(
    task_id = 'run_now',
    databricks_conn_id = 'databricks_default',
    job_id = 754377,
    notebook_params = meta_data,
    dag = dag
)
Is there a way to pass the execution date using the Databricks run operator?
What do you want to pass the execution_date to? What are you trying to achieve in the end? The following doc was helpful for me:
https://www.astronomer.io/guides/airflow-databricks
And here is an example where I am passing execution_date to be used in a python file run in Databricks. I'm capturing the execution_date using sys.argv.
from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)
from datetime import datetime, timedelta

spark_python_task = {
    "python_file": "dbfs:/FileStore/sandbox/databricks_test_python_task.py"
}

# Define params for Run Now Operator
python_params = [
    "{{ execution_date }}",
    "{{ execution_date.subtract(hours=1) }}",
]

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="databricks_dag",
    start_date=datetime(2022, 3, 11),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
    max_active_runs=1,
) as dag:
    opr_run_now = DatabricksRunNowOperator(
        task_id="run_now",
        databricks_conn_id="Databricks",
        job_id=2060,
        python_params=python_params,
    )
    opr_run_now
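For completeness, a minimal sketch of what the Databricks-side script (the databricks_test_python_task.py referenced above) could look like; the exact argument layout is an assumption for illustration, based on capturing the templated values with sys.argv:

import sys

# python_params are appended to the job's arguments, so the templated values
# arrive here as plain strings (assumed positions for this illustration).
execution_date = sys.argv[1]
execution_date_minus_1h = sys.argv[2]

print(f"Processing run for execution_date={execution_date} "
      f"(previous hour: {execution_date_minus_1h})")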
There are two ways to set up DatabricksRunOperator. One is with named arguments (as you did), which doesn't support templating. The second way is to use the JSON payload that you typically use to call api/2.0/jobs/run-now; this way also gives you the ability to pass execution_date, as the json parameter is templated.
notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/test-{{ ds }}',
    },
}

DatabricksSubmitRunOperator(task_id='notebook_task', json=notebook_task_params)
For more information see the operator docs.

Airflow Debugging: How to skip backfill job execution when running DAG in vscode

I have setup airflow and am running a DAG using the following vscode debug configuration:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "AIRFLOW__CORE__EXECUTOR": "DebugExecutor",
                "AIRFLOW__DEBUG__FAIL_FAST": "True",
                "LC_ALL": "en_US.UTF-8",
                "LANG": "en_US.UTF-8"
            }
        }
    ]
}
It runs the file, and my breakpoints in the DAG definitions break as expected. Then, at the end of the file, it executes dag.run(), I wait forever for the DAG to backfill, and my breakpoints inside the python_callable functions of the tasks never break.
What airflow secret am I not seeing?
Here is my dag:
# scheduled to run every minute, poke for a new file every ten seconds
dag = DAG(
    dag_id='download-from-s3',
    start_date=days_ago(2),
    catchup=False,
    schedule_interval='*/1 * * * *',
    is_paused_upon_creation=False
)

def new_file_detection(**context):
    print("File found...")  # a breakpoint here never lands
    pprint(context)

init = BashOperator(
    task_id='init',
    bash_command='echo "My DAG initiated at $(date)"',
    dag=dag,
)

file_sensor = S3KeySensor(
    task_id='file_sensor',
    poke_interval=10,  # every 10 seconds
    timeout=60,
    bucket_key="s3://inbox/new/*",
    bucket_name=None,
    wildcard_match=True,
    soft_fail=True,
    dag=dag
)

file_found_message = PythonOperator(
    task_id='file_found_message',
    provide_context=True,
    python_callable=new_file_detection,
    dag=dag
)

init >> file_sensor >> file_found_message

if __name__ == '__main__':
    dag.clear(reset_dag_runs=True)
    dag.run()  # this triggers a backfill job
This is working for me as expected. I can set breakpoints at the DAG level or inside the Python callables' definitions and step through them using the VSCode debugger.
I'm using the same debug settings that you provided, but I changed the parameter reset_dag_runs=True to dag_run_state=State.NONE in the dag.clear() call, as specified on the DebugExecutor docs page. I believe this changed in one of the latest releases.
Regarding backfills, I'm setting catchup=False in the DAG arguments (it works both ways). Important note: I'm running version 2.0.0 of Airflow.
Here is an example using the same code from example_xcomp.py that comes with the default installation:
Debug settings:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "internalConsole",
            "justMyCode": false,
            "env": {
                "AIRFLOW__CORE__EXECUTOR": "DebugExecutor",
                "AIRFLOW__DEBUG__FAIL_FAST": "True"
            }
        }
    ]
}
Example DAG:
import logging

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

dag = DAG(
    'excom_xample',
    schedule_interval="@once",
    start_date=days_ago(2),
    default_args={'owner': 'airflow'},
    tags=['example'],
    catchup=False
)

value_1 = [1, 2, 3]
value_2 = {'a': 'b'}

def push(**kwargs):
    """Pushes an XCom without a specific target"""
    logging.info("log before PUSH")  # <<<<<<<<<<< Before landing on breakpoint
    kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)

def push_by_returning(**kwargs):
    """Pushes an XCom without a specific target, just by returning it"""
    return value_2

def puller(**kwargs):
    """Pull all previously pushed XComs and
    check if the pushed values match the pulled values."""
    ti = kwargs['ti']
    # get value_1
    pulled_value_1 = ti.xcom_pull(key=None, task_ids='push')
    print("PRINT Line after breakpoint ")  # <<<< After landing on breakpoint
    if pulled_value_1 != value_1:
        raise ValueError("The two values differ "
                         f"{pulled_value_1} and {value_1}")
    # get value_2
    pulled_value_2 = ti.xcom_pull(task_ids='push_by_returning')
    if pulled_value_2 != value_2:
        raise ValueError(
            f'The two values differ {pulled_value_2} and {value_2}')
    # get both value_1 and value_2
    pulled_value_1, pulled_value_2 = ti.xcom_pull(
        key=None, task_ids=['push', 'push_by_returning'])
    if pulled_value_1 != value_1:
        raise ValueError(
            f'The two values differ {pulled_value_1} and {value_1}')
    if pulled_value_2 != value_2:
        raise ValueError(
            f'The two values differ {pulled_value_2} and {value_2}')

push1 = PythonOperator(
    task_id='push',
    dag=dag,
    python_callable=push,
)

push2 = PythonOperator(
    task_id='push_by_returning',
    dag=dag,
    python_callable=push_by_returning,
)

pull = PythonOperator(
    task_id='puller',
    dag=dag,
    python_callable=puller,
)

pull << [push1, push2]

if __name__ == '__main__':
    from airflow.utils.state import State

    dag.clear(dag_run_state=State.NONE)
    dag.run()

Airflow sla_miss_callback function not triggering

I have been trying to get a Slack message callback to trigger on SLA misses. I've noticed that:
- SLA misses get registered successfully in the Airflow web UI at slamiss/list/
- on_failure_callback works successfully
However, the sla_miss_callback function itself never gets triggered.
What I've tried:
- Different combinations of adding sla and sla_miss_callback at the default_args level, the DAG level, and the task level
- Checking logs on our scheduler and workers for SLA-related messages, but we haven't seen anything
- The Slack message callback function works if called from any other basic task or function
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    'start_date': airflow.utils.dates.days_ago(n=0, minute=1),
    'on_failure_callback': send_task_failed_msg_to_slack,
    'sla': timedelta(minutes=1),
    "retries": 0,
    "pool": 'canary',
    'priority_weight': 1
}

dag = airflow.DAG(
    dag_id='sla_test',
    default_args=default_args,
    sla_miss_callback=send_sla_miss_message_to_slack,
    schedule_interval='*/5 * * * *',
    catchup=False,
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=5)
)

def sleep():
    """ Sleep for 2 minutes """
    time.sleep(90)
    LOGGER.info("Slept for 2 minutes")

def simple_print(**context):
    """ Prints a message """
    print("Hello World!")

sleep = PythonOperator(
    task_id="sleep",
    python_callable=sleep,
    dag=dag
)

simple_task = PythonOperator(
    task_id="simple_task",
    python_callable=simple_print,
    provide_context=True,
    dag=dag
)

sleep >> simple_task
I was in a similar situation once.
On investigating the scheduler log, I found the following error:
[2020-07-08 09:14:32,781] {scheduler_job.py:534} INFO - --------------> ABOUT TO CALL SLA MISS CALL BACK
[2020-07-08 09:14:32,781] {scheduler_job.py:541} ERROR - Could not call sla_miss_callback for DAG
sla_miss_alert() takes 1 positional arguments but 5 were given
The problem is that your sla_miss_callback function expects only 1 argument, but it actually needs to look like this:
def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Function that alerts me that dag_id missed sla"""
    # <function code here>
For reference, check out the Airflow source code.
Note: Don't put sla_miss_callback=sla_miss_alert in default_args. It should be defined in the DAG definition itself.
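A minimal sketch of that placement (the DAG arguments here are illustrative, not a complete working example):

from datetime import datetime, timedelta

from airflow import DAG


def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Placeholder callback that just logs which DAG missed its SLA."""
    print(f"SLA missed in DAG {dag.dag_id}: {task_list}")


dag = DAG(
    dag_id="sla_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/5 * * * *",
    default_args={"sla": timedelta(minutes=1)},  # task-level SLA via default_args
    sla_miss_callback=sla_miss_alert,  # DAG-level argument, not in default_args
)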
Example of using SLA miss and Execution Timeout alerts:
First, you'll get an SLA miss alert after the task has been running for 2 minutes, and then, after 4 minutes, the task will fail with an Execution Timeout alert.
"sla": timedelta(minutes=2), # Default Task SLA time
"execution_timeout": timedelta(minutes=4), # Default Task Execution Timeout
Also, you have the log_url right in the message, so you can easily open the task log in Airflow.
Example Slack Message
import time
from datetime import datetime, timedelta
from textwrap import dedent
from typing import Any, Dict, List, Optional, Tuple

from airflow import AirflowException
from airflow.contrib.operators.slack_webhook_operator import SlackWebhookOperator
from airflow.exceptions import AirflowTaskTimeout
from airflow.hooks.base_hook import BaseHook
from airflow.models import DAG, TaskInstance
from airflow.operators.python_operator import PythonOperator

SLACK_STATUS_TASK_FAILED = ":red_circle: Task Failed"
SLACK_STATUS_EXECUTION_TIMEOUT = ":alert: Task Failed by Execution Timeout."


def send_slack_alert_sla_miss(
    dag: DAG,
    task_list: str,
    blocking_task_list: str,
    slas: List[Tuple],
    blocking_tis: List[TaskInstance],
) -> None:
    """Send `SLA missed` alert to Slack"""
    task_instance: TaskInstance = blocking_tis[0]
    message = dedent(
        f"""
        :warning: Task SLA missed.
        *DAG*: {dag.dag_id}
        *Task*: {task_instance.task_id}
        *Execution Time*: {task_instance.execution_date.strftime("%Y-%m-%d %H:%M:%S")} UTC
        *SLA Time*: {task_instance.task.sla}
        _* Time by which the job is expected to succeed_
        *Task State*: `{task_instance.state}`
        *Blocking Task List*: {blocking_task_list}
        *Log URL*: {task_instance.log_url}
        """
    )
    send_slack_alert(message=message)


def send_slack_alert_task_failed(context: Dict[str, Any]) -> None:
    """Send `Task Failed` notification to Slack"""
    task_instance: TaskInstance = context.get("task_instance")
    exception: AirflowException = context.get("exception")

    status = SLACK_STATUS_TASK_FAILED
    if isinstance(exception, AirflowTaskTimeout):
        status = SLACK_STATUS_EXECUTION_TIMEOUT

    # Prepare formatted Slack message
    message = dedent(
        f"""
        {status}
        *DAG*: {task_instance.dag_id}
        *Task*: {task_instance.task_id}
        *Execution Time*: {context.get("execution_date").to_datetime_string()} UTC
        *SLA Time*: {task_instance.task.sla}
        _* Time by which the job is expected to succeed_
        *Execution Timeout*: {task_instance.task.execution_timeout}
        _** Max time allowed for the execution of this task instance_
        *Task Duration*: {timedelta(seconds=round(task_instance.duration))}
        *Task State*: `{task_instance.state}`
        *Exception*: {exception}
        *Log URL*: {task_instance.log_url}
        """
    )
    send_slack_alert(
        message=message,
        context=context,
    )


def send_slack_alert(
    message: str,
    context: Optional[Dict[str, Any]] = None,
) -> None:
    """Send prepared message to Slack"""
    slack_webhook_token = BaseHook.get_connection("slack").password
    notification = SlackWebhookOperator(
        task_id="slack_notification",
        http_conn_id="slack",
        webhook_token=slack_webhook_token,
        message=message,
        username="airflow",
    )
    notification.execute(context)


# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    "owner": "airflow",
    "email": ["test@test.com"],
    "email_on_failure": True,
    "depends_on_past": False,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(minutes=2),  # Default Task SLA time
    "execution_timeout": timedelta(minutes=4),  # Default Task Execution Timeout
    "on_failure_callback": send_slack_alert_task_failed,
}

with DAG(
    dag_id="test_sla",
    schedule_interval="*/5 * * * *",
    start_date=datetime(2021, 1, 11),
    default_args=default_args,
    sla_miss_callback=send_slack_alert_sla_miss,  # Must be set here, not in default_args!
) as dag:
    delay_python_task = PythonOperator(
        task_id="delay_five_minutes_python_task",
        # MIKE MILLIGAN ADDED THIS
        sla=timedelta(minutes=2),
        python_callable=lambda: time.sleep(300),
    )
It seems that the only way to make the sla_miss_callback work is by explicitly declaring the arguments that it needs... nothing else has worked for me, and these arguments ('dag', 'task_list', 'blocking_task_list', 'slas', and 'blocking_tis') are otherwise not sent to the callback at all.
TypeError: print_sla_miss() missing 5 required positional arguments: 'dag', 'task_list', 'blocking_task_list', 'slas', and 'blocking_tis'
A lot of these answers are 90% complete, so I wanted to share my example using bash operators, which combines what I found in all of the responses above and other resources.
The most important things are defining sla_miss_callback in the DAG definition (not in default_args) and not passing context to the SLA function.
"""
A simple example showing the basics of using a custom SLA notification response.
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta, datetime
from airflow.operators.slack_operator import SlackAPIPostOperator
from slack import slack_attachment
from airflow.hooks.base_hook import BaseHook
import urllib


# slack alert for sla_miss
def slack_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    dag_id = slas[0].dag_id
    task_id = slas[0].task_id
    execution_date = slas[0].execution_date.isoformat()
    base_url = 'webserver_url_here'
    encoded_execution_date = urllib.parse.quote_plus(execution_date)
    dag_url = (f'{base_url}/graph?dag_id={dag_id}'
               f'&execution_date={encoded_execution_date}')
    message = (f':alert: *Airflow SLA Miss*'
               f'\n\n'
               f'*DAG:* {dag_id}\n'
               f'*Task:* {task_id}\n'
               f'*Execution Date:* {execution_date}'
               f'\n\n'
               f'<{dag_url}|Click here to view DAG>')
    sla_miss_alert = SlackAPIPostOperator(
        task_id='slack_sla_miss',
        channel='airflow-alerts-test',
        token=str(BaseHook.get_connection("slack").password),
        text=message
    )
    return sla_miss_alert.execute()


# slack alert for successful task completion
def slack_success_task(context):
    success_alert = SlackAPIPostOperator(
        task_id='slack_success',
        channel='airflow-alerts-test',
        token=str(BaseHook.get_connection("slack").password),
        text="Test successful"
    )
    return success_alert.execute(context=context)


default_args = {
    "depends_on_past": False,
    'start_date': datetime(2020, 11, 18),
    "retries": 0
}

# Create a basic DAG with our args
# Note: Don't put sla_miss_callback=sla_miss_alert in default_args. It should be defined in the DAG definition itself.
dag = DAG(
    dag_id='sla_slack_v6',
    default_args=default_args,
    sla_miss_callback=slack_sla_miss,
    catchup=False,
    # A common interval to make the job fire when we run it
    schedule_interval=timedelta(minutes=3)
)

# Add a task that will always fail the SLA
t1 = BashOperator(
    task_id='timeout_test_sla_miss',
    # Sleep 60 seconds to guarantee we miss the SLA
    bash_command='sleep 60',
    # Do not retry so the SLA miss fires after the first execution
    retries=0,
    # on_success_callback=slack_success_task,
    provide_context=True,
    # Set our task up with a 10 second SLA
    sla=timedelta(seconds=10),
    dag=dag
)

t2 = BashOperator(
    task_id='timeout_test_sla_miss_task_2',
    # Sleep 30 seconds to guarantee we miss the SLA of 20 seconds set in this task
    bash_command='sleep 30',
    # Do not retry so the SLA miss fires after the first execution
    retries=0,
    # on_success_callback=slack_success_task,
    provide_context=True,
    # Set our task up with a 20 second SLA
    sla=timedelta(seconds=20),
    dag=dag
)

t3 = BashOperator(
    task_id='timeout_test_sla_miss_task_3',
    # Sleep 60 seconds to guarantee we miss the SLA
    bash_command='sleep 60',
    # Do not retry so the SLA miss fires after the first execution
    retries=0,
    # on_success_callback=slack_success_task,
    provide_context=True,
    # Set our task up with a 30 second SLA
    sla=timedelta(seconds=30),
    dag=dag
)

t1 >> t2 >> t3
I think the airflow documentation is a bit fuzzy on this.
Instead of the method signature
def slack_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis)
modify your signature like this:
def slack_sla_miss(*args, **kwargs)
This way all the parameters get passed, and you will not get the errors you are seeing in the logs.
I learnt this from https://www.cloudwalker.io/2020/12/15/airflow-sla-management/
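For illustration, a rough sketch of that flexible signature; the positional order assumed below matches the (dag, task_list, blocking_task_list, slas, blocking_tis) arguments mentioned earlier in this thread:

def slack_sla_miss(*args, **kwargs):
    # Accept both calling styles defensively: keyword arguments if provided,
    # otherwise fall back to the assumed positional order.
    dag = kwargs.get("dag", args[0] if args else None)
    slas = kwargs.get("slas", args[3] if len(args) > 3 else [])
    for sla in slas:
        print(f"SLA missed: dag={dag.dag_id if dag else '?'} "
              f"task={sla.task_id} execution_date={sla.execution_date}")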
I had the same issue, but was able to get it working with this code:
import logging as log
import airflow
import time
from datetime import timedelta
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.python_operator import PythonOperator
from airflow import configuration
import urllib
from airflow.operators.slack_operator import SlackAPIPostOperator


def sleep():
    """ Sleep for 2 minutes """
    time.sleep(60 * 2)
    log.info("Slept for 2 minutes")


def simple_print(**context):
    """ Prints a message """
    print("Hello World!")


def slack_on_sla_miss(dag,
                      task_list,
                      blocking_task_list,
                      slas,
                      blocking_tis):
    log.info('Running slack_on_sla_miss')

    slack_conn_id = 'slack_default'
    slack_channel = '#general'

    dag_id = slas[0].dag_id
    task_id = slas[0].task_id
    execution_date = slas[0].execution_date.isoformat()

    base_url = configuration.get('webserver', 'BASE_URL')
    encoded_execution_date = urllib.parse.quote_plus(execution_date)
    dag_url = (f'{base_url}/graph?dag_id={dag_id}'
               f'&execution_date={encoded_execution_date}')

    message = (f':o: *Airflow SLA Miss*'
               f'\n\n'
               f'*DAG:* {dag_id}\n'
               f'*Task:* {task_id}\n'
               f'*Execution Date:* {execution_date}'
               f'\n\n'
               f'<{dag_url}|Click here to view>')

    slack_op = SlackAPIPostOperator(task_id='slack_failed',
                                    slack_conn_id=slack_conn_id,
                                    channel=slack_channel,
                                    text=message)
    slack_op.execute()


default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    'start_date': airflow.utils.dates.days_ago(n=0, minute=1),
    "retries": 0,
    'priority_weight': 1,
}

dag = DAG(
    dag_id='sla_test',
    default_args=default_args,
    sla_miss_callback=slack_on_sla_miss,
    schedule_interval='*/5 * * * *',
    catchup=False,
    max_active_runs=1,
)

with dag:
    sleep = PythonOperator(
        task_id="sleep",
        python_callable=sleep,
    )
    simple_task = PythonOperator(
        task_id="simple_task",
        python_callable=simple_print,
        provide_context=True,
        sla=timedelta(minutes=1),
    )

    sleep >> simple_task
I've run into this issue myself. Unlike on_failure_callback, which expects a Python callable, it appears that sla_miss_callback needs the full function call.
An example that is working for me:
def sla_miss_alert(dag_id):
    """
    Function that alerts me that dag_id missed sla
    """
    # <function code here>


def task_failure_alert(dag_id, context):
    """
    Function that alerts me that a task failed
    """
    # <function code here>


dag_id = 'sla_test'
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    'start_date': airflow.utils.dates.days_ago(n=0, minute=1),
    'on_failure_callback': partial(task_failure_alert, dag_id),
    'sla': timedelta(minutes=1),
    "retries": 0,
    "pool": 'canary',
    'priority_weight': 1
}

dag = airflow.DAG(
    dag_id='sla_test',
    default_args=default_args,
    sla_miss_callback=sla_miss_alert(dag_id),
    schedule_interval='*/5 * * * *',
    catchup=False,
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=5)
)
As far as I can tell, sla_miss_callback doesn't have access to context, which is unfortunate. Once I stopped looking for the context, I finally got my alerts.

HttpError 400 when trying to run DataProcSparkOperator task from a local Airflow

I'm testing out a DAG that I used to have running on Google Composer without error, on a local install of Airflow. The DAG spins up a Google Dataproc cluster, runs a Spark job (JAR file located on a GS bucket), then spins down the cluster.
The DataProcSparkOperator task fails immediately each time with the following error:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://dataproc.googleapis.com/v1beta2/projects//regions/global/jobs:submit?alt=json returned "Invalid resource field value in the request.">
It looks as though the URI is incorrect/incomplete, but I am not sure what is causing it. Below is the meat of my DAG. All the other tasks execute without error, and the only difference is the DAG is no longer running on Composer:
default_dag_args = {
    'start_date': yesterday,
    'email': models.Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 0,
    'retry_delay': dt.timedelta(seconds=30),
    'project_id': models.Variable.get('gcp_project'),
    'cluster_name': 'susi-bsm-cluster-{{ ds_nodash }}'
}

def slack():
    '''Posts to Slack if the Spark job fails'''
    text = ':x: The DAG *{}* broke and I am not smart enough to fix it. Check the StackDriver and DataProc logs.'.format(DAG_NAME)
    s.post_slack(SLACK_URI, text)

with DAG(DAG_NAME, schedule_interval='@once',
         default_args=default_dag_args) as dag:
    # pylint: disable=no-value-for-parameter
    delete_existing_parquet = bo.BashOperator(
        task_id = 'delete_existing_parquet',
        bash_command = 'gsutil rm -r {}/susi/bsm/bsm.parquet'.format(GCS_BUCKET)
    )
    create_dataproc_cluster = dpo.DataprocClusterCreateOperator(
        task_id = 'create_dataproc_cluster',
        num_workers = num_workers_override or models.Variable.get('default_dataproc_workers'),
        zone = models.Variable.get('gce_zone'),
        init_actions_uris = ['gs://cjones-composer-test/susi/susi-bsm-dataproc-init.sh'],
        trigger_rule = trigger_rule.TriggerRule.ALL_DONE
    )
    run_spark_job = dpo.DataProcSparkOperator(
        task_id = 'run_spark_job',
        main_class = MAIN_CLASS,
        dataproc_spark_jars = [MAIN_JAR],
        arguments=['{}/susi.conf'.format(CONF_DEST), DATE_CONST]
    )
    notify_on_fail = po.PythonOperator(
        task_id = 'output_to_slack',
        python_callable = slack,
        trigger_rule = trigger_rule.TriggerRule.ONE_FAILED
    )
    delete_dataproc_cluster = dpo.DataprocClusterDeleteOperator(
        task_id = 'delete_dataproc_cluster',
        trigger_rule = trigger_rule.TriggerRule.ALL_DONE
    )

    delete_existing_parquet >> create_dataproc_cluster >> run_spark_job >> delete_dataproc_cluster >> notify_on_fail
Any assistance with this would be much appreciated!
Unlike the DataprocClusterCreateOperator, the DataProcSparkOperator does not take the project_id as a parameter. It gets it from the Airflow connection (if you do not specify the gcp_conn_id parameter, it defaults to google_cloud_default). You have to configure your connection.
The reason you don't see this while running the DAG in Composer is that Composer configures the google_cloud_default connection for you.
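For local testing, a hedged sketch of the same operator call with an explicit connection id; the connection name is a placeholder, and the assumption is that the connection's configuration carries the GCP project:

# Assumes an Airflow connection named "my_gcp_conn" of the Google Cloud type,
# configured with the project id (and, if needed, a key file path).
run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    main_class=MAIN_CLASS,
    dataproc_spark_jars=[MAIN_JAR],
    gcp_conn_id='my_gcp_conn',  # defaults to google_cloud_default if omitted
    arguments=['{}/susi.conf'.format(CONF_DEST), DATE_CONST],
)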

Airflow dynamic DAG and Task Ids

I mostly see Airflow being used for ETL/big data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic DAG ids. So, I created a simple Python script that takes a DAG id and task id as command line parameters. However, I'm running into problems making it work. It gives a dag_id not found error. Has anyone tried this? Here's the code for the script (call it tmp.py), which I execute on the command line as python tmp.py 820 2016-08-24T22:50:00:
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

execution = '2016-08-24T22:20:00'

if len(sys.argv) > 2:
    dagid = sys.argv[1]
    taskid = 'Activate' + sys.argv[1]
    execution = sys.argv[2]
else:
    dagid = 'DAGObjectId'
    taskid = 'Activate'

default_args = {'owner': 'airflow', 'depends_on_past': False, 'start_date': date.today(), 'email': ['fake@fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}

dag = DAG(dag_id = dagid,
          default_args=default_args,
          schedule_interval='@once',
          )
globals()[dagid] = dag

task1 = BashOperator(
    task_id = taskid,
    bash_command='ls -l',
    dag=dag)

fakeTask = BashOperator(
    task_id = 'fakeTask',
    bash_command='sleep 5',
    retries = 3,
    dag=dag)

task1.set_upstream(fakeTask)

airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After numerous trials and errors, I was able to figure this out. Hopefully it will help someone. Here's how it works: you need an iterator or an external source (file/database table) to generate DAGs/tasks dynamically through a template. You can keep the DAG and task names static and just assign them ids dynamically in order to differentiate one DAG from another. You put this Python script in the dags folder. When you start the Airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a DAG (unique dag id) has already been written, it will simply skip it. The scheduler also looks at the schedule of individual DAGs to determine which one is ready for execution. If a DAG is ready for execution, it executes it and updates its status.
Here's some sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time

dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))
input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'

def my_sleeping_function(random_base):
    '''This is a function that will run within the DAG execution'''
    time.sleep(random_base)

def_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(), 'email_on_failure': False,
    'retries': 1, 'retry_delay': timedelta(minutes=2)
}

with open(input_file, 'r') as f:
    for line in f:
        args = line.strip().split(',')
        if len(args) < 6:
            continue
        dagid = 'DAA' + args[0]
        taskid = 'TAA' + args[0]
        yyyy = int(args[1])
        mm = int(args[2])
        dd = int(args[3])
        hh = int(args[4])
        mins = int(args[5])
        ss = int(args[6])
        dag = DAG(
            dag_id=dagid, default_args=def_args,
            schedule_interval='@once', start_date=datetime(yyyy, mm, dd, hh, mins, ss)
        )
        myBashTask = BashOperator(
            task_id=taskid,
            bash_command='python /home/directory/airflow/sendemail.py',
            dag=dag)
        task2id = taskid + '-X'
        task_sleep = PythonOperator(
            task_id=task2id,
            python_callable=my_sleeping_function,
            op_kwargs={'random_base': 10},
            dag=dag)
        task_sleep.set_upstream(myBashTask)
f.close()
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object!
Copying my answer from this question (only for v2.3 and above):
This feature is achieved using Dynamic Task Mapping, available only in Airflow versions 2.3 and higher.
More documentation and examples here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task


@task
def make_list():
    # This can also be from an API call, checking a database -- almost anything you like,
    # as long as the resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]


@task
def consumer(arg):
    print(list(arg))


with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
example 2:
from airflow import XComArg
task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated to show the mapped tasks.
Relevant issues here:
https://github.com/apache/airflow/projects/12
