Trouble branching DAGs in Airflow Taskflow API - airflow

I am new to Airflow. Using the TaskFlow API, I am trying to dynamically change the flow of a DAG: if a condition is met, the two-step workflow should be executed a second time.
After defining two functions/tasks, if I fix the DAG sequence as below, everything works fine.
@dag(default_args=default_args, schedule_interval=None, start_date=days_ago(2))
def genesis(**kwargs):
    @task()
    def extract():
        print("X")

    @task()
    def add_timeframe(data):
        print("Y")

    extracted_data = extract()
    timeframe_data = add_timeframe(extracted_data)
However, when I write any conditional logic to trigger the second run (either inside the DAG or after the function/task definitions), I get the error below. The error seems to be about setting upstream tasks, but the older task.set_upstream(task2) style of commands doesn't work with the TaskFlow API in Airflow 2.0.
All the examples of conditional branching I could find are based on the non-TaskFlow API. Please help.
@dag(default_args=default_args, schedule_interval=None, start_date=days_ago(2))
def genesis(**kwargs):
    @task()
    def extract():
        print("X")
        if <condition>:
            extracted_data2 = extract()
            timeframe_data2 = add_timeframe(extracted_data2)

    @task()
    def add_timeframe(data):
        print("Y")

    extracted_data = extract()
    timeframe_data = add_timeframe(extracted_data)
ERROR - Tried to create relationships between tasks that don't have DAGs yet. Set the DAG for at least one task and try again: [<Task(_PythonDecoratedOperator): add_timeframe>, <Task(_PythonDecoratedOperator): extract>]
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1086, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1260, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1300, in _execute_task
result = task_copy.execute(context=context)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/operators/python.py", line 233, in execute
return_value = self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/genesis.py", line 77, in extract
timeframe_data = add_timeframe(extracted_data)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/operators/python.py", line 294, in factory
**kwargs,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 91, in __call__
obj.set_xcomargs_dependencies()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 722, in set_xcomargs_dependencies
apply_set_upstream(arg)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 711, in apply_set_upstream
apply_set_upstream(elem)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 708, in apply_set_upstream
self.set_upstream(arg.operator)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1239, in set_upstream
self._set_relatives(task_or_task_list, upstream=True)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1205, in _set_relatives
"task and try again: {}".format([self] + task_list)

Related

Airflow DagRunAlreadyExists even after providing the custom run id and execution date

I am getting a DagRunAlreadyExists exception even after providing a custom run id and execution date.
This occurs when there are multiple requests within a second.
Here is the MWAA CLI call:
def get_unique_key():
    from datetime import datetime
    import random
    import shortuuid
    import string
    timestamp = datetime.now().strftime(DT_FMT_HMSf)
    random_str = timestamp + ''.join(random.choice(string.digits + string.ascii_letters) for _ in range(8))
    uuid_str = shortuuid.ShortUUID().random(length=12)
    return '{}{}'.format(uuid_str, random_str)

execution_date = datetime.utcnow().strftime("%Y-%m-%dT%H:%m:%S.%f")
dag_run_id = get_unique_key()
workflow_id = "my_workflow"
conf = json.dumps({"foo": "bar"})
"dags trigger {0} -c '{1}' -r {2} -e {3}".format(workflow_id, conf, dag_run_id, execution_date)
and here is the error log from the MWAA CLI, in case it helps to debug the issue.
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/airflow/__main__.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/dag_command.py", line 138, in dag_trigger
dag_id=args.dag_id, run_id=args.run_id, conf=args.conf, execution_date=args.exec_date
File "/usr/local/lib/python3.7/site-packages/airflow/api/client/local_client.py", line 30, in trigger_dag
dag_id=dag_id, run_id=run_id, conf=conf, execution_date=execution_date
File "/usr/local/lib/python3.7/site-packages/airflow/api/common/experimental/trigger_dag.py", line 125, in trigger_dag
replace_microseconds=replace_microseconds,
File "/usr/local/lib/python3.7/site-packages/airflow/api/common/experimental/trigger_dag.py", line 75, in _trigger_dag
f"A Dag Run already exists for dag id {dag_id} at {execution_date} with run id {run_id}"
airflow.exceptions.DagRunAlreadyExists: A Dag Run already exists for dag id my_workflow at 2022-10-18T06:10:28+00:00 with run id CL4Adauihkvz121928332658Gp6bsTWU
The problem is that the execution_date resolution is seconds: Airflow ignores the sub-second part, so two triggers within the same second collide on the same execution_date.
You can see in the error that no milliseconds are mentioned in the execution_date (2022-10-18T06:10:28).
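One possible workaround, sketched below (this is not MWAA-specific and the helper function is hypothetical), is to make sure no two trigger calls share the same whole second, since the execution_date is truncated to second resolution before the uniqueness check:

import threading
import time
from datetime import datetime, timezone

_lock = threading.Lock()
_last_second = 0

def next_free_execution_date():
    """Hand out ISO timestamps whose whole-second parts never repeat."""
    global _last_second
    with _lock:
        candidate = int(time.time())
        if candidate <= _last_second:
            # Same second already used: move to the next one and wait for it.
            candidate = _last_second + 1
            time.sleep(max(0.0, candidate - time.time()))
        _last_second = candidate
        return datetime.fromtimestamp(candidate, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

# Then build the same CLI call as in the question with the second-unique date:
# "dags trigger {0} -c '{1}' -r {2} -e {3}".format(workflow_id, conf, dag_run_id, next_free_execution_date())

Alternatively, simply spacing or retrying colliding requests at least one second apart achieves the same effect.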

Make Airflow show print-statements on run-time (not after task is completed)

Say I have the following DAG (stuff omitted for clarity)
# dag.py
from airflow.operators.python import PythonOperator

def main():
    print("Task 1")
    # some code
    print("Task 2")
    # some more code
    print("Done")
    return 0

t1 = PythonOperator(python_callable=main)
t1
Say the program fails at # some more code due to e.g. RAM issues; then I just get an error in my log, e.g.
[2021-05-25 12:49:54,211] {process_utils.py:137} INFO - Output:
[2021-05-25 12:52:44,605] {taskinstance.py:1482} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 493, in execute
super().execute(context=serializable_context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 117, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 531, in execute_callable
string_args_filename,
File "/usr/local/lib/python3.6/site-packages/airflow/utils/process_utils.py", line 145, in execute_in_subprocess
raise subprocess.CalledProcessError(exit_code, cmd)
subprocess.CalledProcessError: Command '['/tmp/venv2wbjnabi/bin/python', '/tmp/venv2wbjnabi/script.py', '/tmp/venv2wbjnabi/script.in', '/tmp/venv2wbjnabi/script.out', '/tmp/venv2wbjnabi/string_args.txt']' died with <Signals.SIGKILL: 9>.
[2021-05-25 13:00:55,733] {taskinstance.py:1532} INFO - Marking task as FAILED. dag_id=test_dag, task_id=clean_data, execution_date=20210525T105621, start_date=20210525T105732, end_date=20210525T110055
[2021-05-25 13:00:56,555] {local_task_job.py:146} INFO - Task exited with return code 1
but none of the print statements are printed, so I don't know where the program failed (I only know it now after debugging).
I assume this means Airflow doesn't flush the output before the task finishes. Is there a way to make Airflow flush/print at runtime?
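Not an authoritative answer, but a minimal sketch of the usual workarounds, assuming the callable's stdout is piped back to the worker and block-buffered (so buffered prints are lost when the subprocess is killed): flush explicitly, or log via a stream handler, which flushes after every record.

import logging
import sys

def main():
    # Configure logging to stdout; logging handlers flush each record as it is emitted.
    logging.basicConfig(level=logging.INFO, stream=sys.stdout)

    print("Task 1", flush=True)   # flush=True pushes the line out immediately
    # some code
    print("Task 2", flush=True)
    # some more code
    logging.info("Done")
    return 0

Running the interpreter unbuffered (python -u, or PYTHONUNBUFFERED=1 in the worker environment) can have the same effect without touching every print call.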

Apache Airflow dagrun_operator in Airflow Version 1.10.10

On migrating Airflow from v1.10.2 to v1.10.10, one of our DAGs has a task of dagrun_operator type.
The code snippet of the task looks something like the below. Please assume that the DAG dag_process_pos exists.
The DAG being triggered by the TriggerDagRunOperator is dag_process_pos, which starts with a task of type dummy_operator [just a hint, in case this could be the trouble maker].
task_trigger_dag_positional = TriggerDagRunOperator(
    trigger_dag_id="dag_process_pos",
    python_callable=set_up_dag_run_preprocessing,
    task_id="trigger_preprocess_dag",
    on_failure_callback=log_failure,
    execution_date=datetime.now(),
    provide_context=False,
    owner='airflow')

def set_up_dag_run_preprocessing(context, dag_run_obj):
    ti = context['ti']
    dag_name = context['ti'].task.trigger_dag_id
    dag_run = context['dag_run']
    trans_id = dag_run.conf['transaction_id']
    routing_info = ti.xcom_pull(task_ids="json_validation", key="route_info")
    new_file_path = routing_info['file_location']
    new_file_name = os.path.basename(routing_info['new_file_name'])
    file_path = os.path.join(new_file_path, new_file_name)
    batch_id = "123-AD-FF"
    dag_run_obj.payload = {'inputfilepath': file_path,
                           'transaction_id': trans_id,
                           'Id': batch_id}
The DAG runs all fine. In fact, the python callable of the task mentioned above runs until its last line. Then it errors out.
[2020-06-09 11:36:22,838] {taskinstance.py:1145} ERROR - No row was found for one()
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/dagrun_operator.py", line 95, in execute
replace_microseconds=False)
File "/usr/local/lib/python3.6/site-packages/airflow/api/common/experimental/trigger_dag.py", line 141, in trigger_dag
replace_microseconds=replace_microseconds,
File "/usr/local/lib/python3.6/site-packages/airflow/api/common/experimental/trigger_dag.py", line 98, in _trigger_dag
external_trigger=True,
File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/models/dag.py", line 1471, in create_dagrun
run.refresh_from_db()
File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/models/dagrun.py", line 109, in refresh_from_db
DR.run_id == self.run_id
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3446, in one
raise orm_exc.NoResultFound("No row was found for one()")
sqlalchemy.orm.exc.NoResultFound: No row was found for one()
After that, the on_failure_callback of the task is executed, and all the code of that callable runs perfectly OK, as expected. The question here is why the dagrun_operator failed after the python callable.
P.S.: The DAG being triggered by the TriggerDagRunOperator, in this case dag_process_pos, starts with a task of type dummy_operator.

execution_date jinja resolving as a string

I have an Airflow DAG that uses the following Jinja template: "{{ execution_date.astimezone('Etc/GMT+6').subtract(days=1).strftime('%Y-%m-%dT00:00:00') }}"
This template works in other DAGs, and it works when the schedule_interval for the DAG is set to timedelta(hours=1). However, when we set the schedule interval to 0 8 * * *, it throws the following traceback at runtime:
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 1426, in _run_raw_task
self.render_templates()
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 1790, in render_templates
rendered_content = rt(attr, content, jinja_context)
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2538, in render_template
return self.render_template_from_field(attr, content, context, jinja_env)
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2520, in render_template_from_field
for k, v in list(content.items())}
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2520, in <dictcomp>
for k, v in list(content.items())}
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2538, in render_template
return self.render_template_from_field(attr, content, context, jinja_env)
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2514, in render_template_from_field
result = jinja_env.from_string(content).render(**context)
File "/usr/lib64/python2.7/site-packages/jinja2/environment.py", line 1008, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/lib64/python2.7/site-packages/jinja2/environment.py", line 780, in handle_exception
reraise(exc_type, exc_value, tb)
File "<template>", line 1, in top-level template code
TypeError: astimezone() argument 1 must be datetime.tzinfo, not str
It appears the execution date being passed in is a string, not a datetime object; but I am only able to hit this error on this specific DAG and no others. I've tried deleting the DAG entirely and recreating it, with no luck.
Looks like the astimezone(..) function is misbehaving: it expects a datetime.tzinfo while you are passing it a str argument ('Etc/GMT+6')
TypeError: astimezone() argument 1 must be datetime.tzinfo, not str
While I couldn't make the exact thing work, I believe the following achieves pretty much the same effect as what you are trying to do:
{{ execution_date.in_timezone("US/Eastern") - timedelta(days=1) }}
Recall that:
the execution_date macro is a Pendulum object
in_timezone(..) converts it into a datetime.datetime(..)
then we just subtract a datetime.timedelta(days=1) from it
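For illustration, a small usage sketch (the DAG id, task id, and command are made up) that combines the question's subtract(..)/strftime(..) chain with the answer's in_timezone(..) call on a templated field:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Rendered at run time because bash_command is a templated field.
report_date = ("{{ execution_date.in_timezone('US/Eastern')"
               ".subtract(days=1).strftime('%Y-%m-%dT00:00:00') }}")

with DAG(dag_id="example_template_dag",
         schedule_interval="0 8 * * *",
         start_date=datetime(2021, 1, 1)) as dag:
    export_report = BashOperator(
        task_id="export_report",
        bash_command="echo 'exporting since {}'".format(report_date),
    )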

Airflow HttpSensor won't work

I'm trying to create a HttpSensor in Airflow using the following code:
wait_to_launch = HttpSensor(
task_id="wait-to-launch",
endpoint='http://' + socket.gethostname() + ":8500/v1/kv/launch-cluster?raw",
response_check=lambda response: True if 'oui'==response.content else False,
dag=dag
)
But I keep getting this error:
Traceback (most recent call last):
File "http_sensor_test.py", line 30, in <module>
dag=dag
File "/home/me/.local/lib/python2.7/site-packages/airflow/utils/decorators.py", line 86, in wrapper
result = func(*args, **kwargs)
File "/home/me/.local/lib/python2.7/site-packages/airflow/operators/sensors.py", line 663, in __init__
self.hook = hooks.http_hook.HttpHook(method='GET', http_conn_id=http_conn_id)
File "/home/me/.local/lib/python2.7/site-packages/airflow/utils/helpers.py", line 436, in __getattr__
raise AttributeError
AttributeError
What am I missing?
You are running into a known issue, see AIRFLOW-1030. A fix has been merged (#2180), but unfortunately it is not yet in a released version of Airflow. The fix is marked for the next release (1.9.0), but it could be weeks or months until that is out. You can run a fork of Airflow with this change, or add the updated version of the HttpSensor as a custom operator (plugin).
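If you go the custom-operator route, a rough, untested sketch (Airflow 1.8-era import paths; the class name is made up, and this is not the merged fix itself) might look like the following. The key point is importing HttpHook directly instead of going through the lazily loaded hooks attribute that raises the AttributeError:

from airflow.hooks.http_hook import HttpHook
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class PatchedHttpSensor(BaseSensorOperator):
    """HttpSensor variant that builds its HttpHook from a direct import."""

    @apply_defaults
    def __init__(self, endpoint, http_conn_id='http_default',
                 headers=None, response_check=None, *args, **kwargs):
        super(PatchedHttpSensor, self).__init__(*args, **kwargs)
        # endpoint is appended to the host defined on the http_conn_id connection.
        self.endpoint = endpoint
        self.headers = headers or {}
        self.response_check = response_check
        self.hook = HttpHook(method='GET', http_conn_id=http_conn_id)

    def poke(self, context):
        response = self.hook.run(self.endpoint, headers=self.headers)
        if self.response_check:
            return self.response_check(response)
        return True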
