(airflow) EMR steps operator -> EMR steps sensor; sensor failed -> re-trigger the upstream operator

I want to handle failures.
But when the sensor fails, only the sensor itself is retried.
I want to re-trigger the upstream operator as well.
This is the flow chart:
a -> a_sensor (failed) -> a (retry) -> a_sensor -> (done)
Can I do this?

I recommend waiting for the EMR job in the operator itself. Even though this keeps the task running and occupying a worker slot, it doesn't consume many resources, and you can easily manage the timeout, cleanup, and retry strategy:
class EmrOperator(BaseOperator):
    ...

    def execute(self, context):
        self.run_job()
        self.wait_job()

    def wait_job(self):
        while not self.is_finished():
            time.sleep(10)

    def on_kill(self):
        self.cleanup()
You can also use the official EmrAddStepsOperator, which already supports this.
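For illustration, here is a minimal sketch of that approach (my own example, not from the original answer): the wait_for_completion flag and the import path depend on your apache-airflow-providers-amazon version, and the create_emr_cluster task and SPARK_STEPS below are assumed to exist elsewhere in the DAG.

from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

add_steps = EmrAddStepsOperator(
    task_id="add_steps",
    # hypothetical upstream task that created the cluster and pushed its id to XCom
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
    steps=SPARK_STEPS,           # your EMR step definitions
    wait_for_completion=True,    # wait for the steps inside the operator, no separate sensor needed
    retries=2,                   # a retry re-runs execute(), i.e. re-submits the steps and waits again
)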
If you want to implement exactly what you described in the question: Airflow doesn't support retrying a group of tasks yet, but you can achieve it using callbacks:
a = EmrOperator(..., retries=0)
a_sensor = EmrSensor(..., retries=0, on_failure_callback=emr_a_callback)

def emr_a_callback(context):
    ti = context["ti"]
    dag_run = context["dag_run"]
    max_retries = 3
    retry_num = ti.xcom_pull(task_ids=ti.task_id, key="retry_num") or 0
    if retry_num > max_retries:
        return  # do nothing
    ti.xcom_push(key="retry_num", value=retry_num + 1)  # keep track of how often we have cleared
    task_a = dag_run.get_task_instance("<task a id>")
    task_a.state = None  # set task a's state to None so it runs again
    ti.state = None  # set the sensor's state to None so it runs again

Related

Airflow how to connect the previous task to the right next dynamic branch with multiple tasks?

I am facing this situation:
I have generated two dynamic branches. Each branch has multiple chained tasks.
This is what I need Airflow to create for me:
taskA1->taskB1 taskC1->taskD1
taskA2->taskB2... taskZ.. taskC2->taskD2
taskA3->taskB3 taskC3->taskD3
and here is my pseudocode:
def create_branch1(task_ids):
    source = []
    for task_id in task_ids:
        source += [
            Operator1(task_id='task_A{0}'.format(task_id)) >>
            Operator2(task_id='task_B{0}'.format(task_id))
        ]
    return source

def create_branch2(task_ids):
    source = []
    for task_id in task_ids:
        source += [
            Operator1(task_id='task_C{0}'.format(task_id)) >>
            Operator2(task_id='task_D{0}'.format(task_id))
        ]
    return source
create_branch1 >> dummyOperator(Z) >> create_branch2 >> end
However, what Airflow generates looks like this:
taskA1->taskB1 taskD1<-taskC1
taskA2->taskB2...taskZ...taskD2<-taskC2
taskA3->taskB3 taskD3<-taskC3
I mean that in the second branch, dummyOperator(Z) is connected to the last task of the chain (D) instead of to the first task of the chain in the second branch (C).
It seems that, no matter what, DummyOperator(task_Z) connects to the last task of the chained branches.
Do you have any idea how to tackle this issue?
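Not part of the original thread, but it may help to see why this happens: in Airflow, a >> b returns b, so each loop above ends up collecting only the downstream operators (the B and D tasks), and wiring the dummy task against that list connects it to the D tasks. A rough sketch of one workaround is to return the head task of each chain as well and wire it explicitly (task_z, end, and task_ids are assumed to exist elsewhere):

def create_branch2(task_ids):
    heads, tails = [], []
    for task_id in task_ids:
        c = Operator1(task_id='task_C{0}'.format(task_id))
        d = Operator2(task_id='task_D{0}'.format(task_id))
        c >> d
        heads.append(c)   # keep the first task of each chain...
        tails.append(d)   # ...and the last one
    return heads, tails

branch2_heads, branch2_tails = create_branch2(task_ids)
task_z >> branch2_heads    # Z now fans out to the C tasks
branch2_tails >> end       # and the D tasks converge on `end`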

Airflow 2.0 - running locally keeps running the function

I have the task below that keeps running. I know this because it runs a query in Snowflake and I keep getting the DUO push notification. every. 5. seconds! What can I do to stop this and have it run only when the DAG runs?
This is the task:
create_foreign_keys = SnowflakeQueryOperator(
    dag=dag,
    task_id='check_and_run_foreign_key_query',
    sql=SnowHook().run_fk_alter_statements(schema, query),
    trigger_rule=TriggerRule.ALL_DONE
)
This is the method being called in the sql part:
def run_fk_alter_statements(self, schema, additional_fk):
    fk_query_path = "/fkeys.sql"
    fd = open(f'{fk_query_path}', 'r')
    query = fd.read()
    fd.close()
    additions = []
    for fk in additional_fk:
        additions.append(f""" or (t2.table_name = '{fk['table_name']}' and t2.column_name = '{fk['column_name']}'
        and t1.table_name = '{fk['ref_table_name']}' and t1.column_name = '{fk['ref_column_name']}')\n""".upper())
    raw_out = self.execute_query(query.format(schema=schema, fks=''.join(additions)), fetch_all=True)
    query_jobs = []
    for raw_query in raw_out:
        query_jobs.append(raw_query[0])
    return query_jobs
The sql=SnowHook().run_fk_alter_statements(schema,query) call in your instantiation of the SnowflakeQueryOperator is actually top-level code so it will execute every time the DAG is parsed by the Scheduler. You need to find a way to have that function called within an operator's execute() method.
You could add a TaskFlow function or PythonOperator task that pushes the output of run_fk_alter_statements() to XCom, and then have the SnowflakeQueryOperator use that XCom to execute the generated SQL.
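A rough sketch of that idea (my own illustration, assuming the Airflow 2 TaskFlow API, that SnowHook and SnowflakeQueryOperator are the classes from the question, and that sql is a templated field on the operator, as it is on the stock Snowflake operator):

from airflow.decorators import task

@task(dag=dag)
def build_fk_statements(schema, additional_fk):
    # Runs at task-execution time, not while the Scheduler parses the DAG file,
    # so the Snowflake query (and the DUO push) only happens when the DAG runs.
    return SnowHook().run_fk_alter_statements(schema, additional_fk)

create_foreign_keys = SnowflakeQueryOperator(
    dag=dag,
    task_id='check_and_run_foreign_key_query',
    # The XComArg returned by the TaskFlow task resolves to the generated
    # statements at runtime and also wires the dependency between the tasks.
    sql=build_fk_statements(schema, query),
    trigger_rule=TriggerRule.ALL_DONE,
)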

airflow reschedule error: dependency 'Task Instance State' PASSED: False

I have a customized sensor that looks like the one below. The idea is that one DAG can have different tasks that start at different times, taking advantage of Airflow's built-in reschedule system.
class MySensor(BaseSensorOperator):
    def __init__(self, *, start_time, tz, **kwargs):
        super().__init__(**kwargs)
        self._start_time = start_time
        self._tz = tz

    @provide_session
    def execute(self, context, session: Session = None):
        dt_start = datetime.combine(context['next_execution_date'].date(), self._start_time)
        dt_start = dt_start.replace(tzinfo=self._tz)
        if datetime.now().timestamp() < dt_start.timestamp():
            dt_reschedule = datetime.utcnow().replace(tzinfo=UTC)
            dt_reschedule += timedelta(seconds=dt_start.timestamp() - datetime.now().timestamp())
            raise AirflowRescheduleException(dt_reschedule)
        return super().execute(context)
In the DAG, I have something like the code below. However, I notice that when the mode is 'poke' (the default), the sensor does not work properly.
with DAG(schedule_interval='0 10 * * 1-5', ...) as dag:
    task1 = MySensor(start_time=time(14, 0), mode='poke')
    task2 = MySensor(start_time=time(16, 0), mode='reschedule')
    ...
From the log, I can see the following:
{taskinstance.py:1141} INFO - Rescheduling task, mark task as UP_FOR_RESCHEDULE
[5s later]
{local_task_job.py:102} INFO - Task exited with return code 0
[14s later]
{taskinstance.py:687} DEBUG - <TaskInstance: mydag.mytask execution_date [failed]> dependency 'Task Instance State' PASSED: False, Task is in the 'failed' state which is not a valid state for execution. The task must be cleared in order to be run.
{taskinstance.py:664} INFO - Dependencies not met for <TaskInstance ... [failed]> ...
Why is rescheduling not working with mode='poke'? And when did the scheduler(?) flip the state of the task instance from "up_for_reschedule" to "failed"? Is there a better way to start each task/sensor at a different time? The sensor is an improved version of FileSensor and checks a bunch of files or files with patterns. My current option is to force every task to use mode='reschedule'.
Airflow version 1.10.12

Airflow - Get last success instance of the task

I have a need to terminate and start the EMR cluster every 24 hours from Airflow.
I have implemented logic to check whether the difference between the current execution date and the previous execution date is 1 day: if so, terminate the cluster and create a new one; otherwise, skip the execution of the EMR creation task.
But, I'm seeing a weird scenario where the DAG execution is getting marked as a success but no task is being executed!!!
So, to handle this scenario, I'm trying to check for the last success date of the EMR (XCOM will have cluster id) to terminate the cluster before I start a new one.
I have not been successful so far; any help is appreciated.
DAG Image:
The pink ones indicate a skip. If you observe closely, the emr_termination task (last row in the image) after the blank boxes should be green, just like the emr_creation task. But it got skipped, and the previous cluster was not terminated.
Code:
def emr_termination_trigger(execution_date, prev_execution_date_success, prev_execution_date, **kwargs):
    days_diff = (execution_date.date() - prev_execution_date.date()).days
    provisioned_product_id, cluster_id = None, None
    creation_tsk_status = 'N/A'
    try:
        ti = TaskInstance(emr_creation_tsk, execution_date)
        if ti.previous_ti is None:
            print("previous execution of emr creation is not available. fetching last 2 executions")
            ti = TaskInstance(emr_creation_tsk, prev_execution_date)
            if ti.previous_ti is None:
                print("last 2 executions are not available. Nothing to terminate")
                creation_tsk_status = None
            else:
                creation_tsk_status = ti.previous_ti.state
                provisioned_product_id, cluster_id = ti.previous_ti.xcom_pull(
                    task_ids=emr_creation_tsk.task_id)
        else:
            creation_tsk_status = ti.previous_ti.state
            provisioned_product_id, cluster_id = ti.previous_ti.xcom_pull(task_ids=emr_creation_tsk.task_id)
    except:
        pass

    print(f"days_diff:{days_diff} - provisioned_prd:{provisioned_product_id} - create_status:{creation_tsk_status}")
    if days_diff == 1:
        print("Inside Terminating cluster")
        terminate_cluster = TerminateEMROperator(
            task_id='terminate_cluster',
            provisioned_product_id=provisioned_product_id,
            airflow_conn_id=airflow_conn_id,
            provide_context=True,
            dag=dag)
        try:
            terminate_cluster.execute(context=kwargs)
        except Exception as e:
            print(f"Got exception while terminating the EMR cluster:{str(e)}")

Airflow set task instance status as skipped programmatically

I have a list that I loop over to create the tasks. The list is static as far as size goes.
for counter, account_id in enumerate(ACCOUNT_LIST):
    task_id = f"bash_task_{counter}"
    if account_id:
        trigger_task = BashOperator(
            task_id=task_id,
            bash_command="echo hello there",
            dag=dag)
    else:
        trigger_task = BashOperator(
            task_id=task_id,
            bash_command="echo hello there",
            dag=dag)
        trigger_task.status = SKIPPED  # is there a way to set the status of this to skipped instead of having a branch operator?
    trigger_task
I tried this manually but could not get the tasks marked as skipped:
start = DummyOperator(task_id='start')
task1 = DummyOperator(task_id='task_1')
task2 = DummyOperator(task_id='task_2')
task3 = DummyOperator(task_id='task_3')
task4 = DummyOperator(task_id='task_4')

start >> task1
start >> task2

try:
    start >> task3
    raise AirflowSkipException
except AirflowSkipException as ase:
    log.error('Task Skipped for task3')

try:
    start >> task4
    raise AirflowSkipException
except AirflowSkipException as ase:
    log.error('Task Skipped for task4')
Yes, for that you need to raise AirflowSkipException from within the task's execution (for example, in the python_callable):
from airflow.exceptions import AirflowSkipException
raise AirflowSkipException
For more information see the source code
Have a fixed number of tasks to execute per DAG. This is really fine, and it also means planning how many parallel tasks your system should handle at most without degrading downstream systems. Also, having a fixed number of tasks makes them visible in the web UI and gives you an indication of whether they were executed or skipped.
In the code below, I initialize the list with None items and then update it with values based on the data returned from the DB. In the python_callable function, I check whether the account_id is None: if so, I raise an AirflowSkipException; otherwise, I execute the function. In the UI, the tasks are visible and indicate whether they were executed or skipped (meaning there was no account_id).
def execute(account_id):
    if account_id:
        print(f'************Executing task for account_id:{account_id}')
    else:
        raise AirflowSkipException

def create_task(task_id, account_id):
    return PythonOperator(task_id=task_id,
                          python_callable=execute,
                          op_args=[account_id])

list_from_dbhook = [1, 2, 3]  # dummy list. Get records using a DB hook.

# Need to have some fixed size. Need to allocate fixed resources or # of tasks.
# Having this fixed number of tasks makes them visible in the UI instead of being purely dynamic.
record_size_limit = 5
ACCOUNT_LIST = [None] * record_size_limit
for index, account_id_val in enumerate(list_from_dbhook):
    ACCOUNT_LIST[index] = account_id_val

for idx, acct_id in enumerate(ACCOUNT_LIST):
    task = create_task(f"task_{idx}", acct_id)
    task
