Execute bash script (Kerberos kinit) when a DAG task (HiveCLIPartitionSensor) is rescheduled - airflow

I've got two tasks:
Bash Operator [kinit], which obtains a Kerberos ticket for Hadoop.
Hive Sensor [check_partition], which checks whether a partition exists.
My problem is that the Kerberos ticket is valid for 9 hours, while the Hive sensor might wait anywhere from 1 to 15 hours, because the time when the data arrives is really fickle. Therefore I would like to execute kinit each time the Hive sensor is rescheduled (every hour).
kinit = BashOperator(
    task_id="CIDF_BASH_KINIT",
    bash_command="bash kinit command",
    dag=dag
)

check_partition = HiveCLIPartitionSensor(
    task_id="CIDF_BASH_HIVE_CHECK_PARTITION",
    table='table',
    partition="partition='{}'".format('{{ ds }}'),
    poke_interval=60 * 60,
    mode='reschedule',
    retries=0,
    timeout=60 * 60 * 23,
    dag=dag
)
kinit >> check_partition

You can run a cron job, or some other scheduled background process, that renews the Kerberos ticket automatically every 5-6 hours.
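Another option, sketched below under the assumption that HiveCLIPartitionSensor is importable in your codebase (the keytab path and principal are placeholders): subclass the sensor and refresh the ticket inside poke(). In reschedule mode, poke() runs once per check, so kinit is re-executed every hour right before the partition is probed.

import subprocess

class KinitHivePartitionSensor(HiveCLIPartitionSensor):
    """Refreshes the Kerberos ticket before every poke."""

    def poke(self, context):
        # Placeholder keytab path and principal -- adjust to your environment.
        subprocess.run(
            ["kinit", "-kt", "/path/to/airflow.keytab", "svc_user@EXAMPLE.COM"],
            check=True,
        )
        return super().poke(context)

check_partition could then be declared as a KinitHivePartitionSensor with the same arguments as above, making the upstream kinit task needed only for the very first poke, if at all.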

Related

Airflow: Can operators run on an external service, and communicate with Airflow to update the progress of the DAG?

I'm currently experimenting with a new concept where the operator communicates with an external service that runs the work instead of running the operator locally, and the external service can communicate with Airflow to update the progress of the DAG.
For example, let's say we have a bash operator:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo \"This Message Shouldn't Run Locally on Airflow\"",
)
That is part of a DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(dag_id="example_dag", start_date=datetime(2022, 1, 1)) as dag:
    t1 = BashOperator(
        task_id="bash_task1",
        bash_command="echo \"t1:This Message Shouldn't Run Locally on Airflow\""
    )
    t2 = BashOperator(
        task_id="bash_task2",
        bash_command="echo \"t2:This Message Shouldn't Run Locally on Airflow\""
    )
    t1 >> t2
Is there a method in the Airflow code that will allow an external service to tell the DAG that t1 has started/completed and that t2 has started/completed, without actually running the DAG on the Airflow instance?
Airflow has a concept of Executors, which are responsible for running scheduled tasks, often via or on external services such as Kubernetes, Dask, or a Celery cluster.
https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html
The worker process communicates the progress of the task back to Airflow, typically via the metadata DB.
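The executor is a deployment-level choice made in Airflow's configuration; for example, in airflow.cfg:

[core]
executor = CeleryExecutor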

Airflow External Task Sensor with unscheduled upstream DAG

We use Airflow in a hybrid ETL system. By this I mean that some of our DAGs are not scheduled but externally triggered using the Airflow API.
We are trying to do the following: have a sensor in a scheduled DAG (DAG1) that senses that a task inside an externally triggered DAG (DAG2) has run.
For example, DAG1 runs at 11 am, and we want to be sure that DAG2 has run (due to an external trigger) at least once since 00:00. I have tried to set execution_delta=timedelta(hours=11), but the sensor senses nothing. I think the problem is that the sensor looks for a task scheduled at exactly 00:00. This won't be the case, as DAG2 can be triggered at any time from 00:00 to 11:00.
Is there any solution that can serve the purpose we need? I think we might need to create a custom Sensor, but it feels strange to me that the native Airflow Sensor does not solve this issue.
This is the sensor I'm defining:
from datetime import timedelta
from airflow.sensors import external_task

sensor = external_task.ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    mode='reschedule',
    check_existence=True,
    execution_delta=timedelta(hours=int(execution_type)),
    poke_interval=10 * 60,  # Check every 10 minutes
    timeout=1 * 60 * 60,    # Allow for 1 hour of delay in execution
)
I had the same problem & used the execution_date_fn parameter:
ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)
where the get_most_recent_dag_run function looks like this:
from airflow.models import DagRun

def get_most_recent_dag_run(dt):
    dag_runs = DagRun.find(dag_id="dag_id")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date
This matters because, for cross-DAG dependencies, the ExternalTaskSensor needs to know both the dag_id and the exact execution_date of the run it is waiting for.

How to force an Airflow Task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG("second_dag",  # dag_id may only contain alphanumerics, dashes, dots, and underscores
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
) as dag:
    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will have finished by 5 a.m. of the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new one is starting. If, for any reason, the new DAG run does not start for one week, then the old one should be able to run for that entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns.
Since your DAG runs daily, you can set the timeout to 24 hours.
with DAG("second_dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1,
         dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set a timeout on a specific task in your DAG, you should use execution_timeout on the operator.
execution_timeout: max time allowed for the execution of this task instance; if it goes beyond, it will raise and fail.
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution, you can use the Airflow DAG-run REST API together with XComs: push the current DAG run's run_id to XCom, and on subsequent runs pull that XCom and call the API to check and kill the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' "
                 "'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' "
                 "-H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
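The push and check tasks from the chain above are not spelled out in this answer; a minimal sketch of what they could look like (the task ids match the chain, the XCom key is arbitrary, and Airflow 2.x context injection is assumed):

from airflow.operators.python import PythonOperator

def push_run_id(ti, run_id):
    # Record this DagRun's run_id so the next run can find and kill it.
    ti.xcom_push(key="last_run_id", value=run_id)

def pull_previous_run_id(ti):
    # include_prior_dates=True lets xcom_pull look beyond the current run.
    return ti.xcom_pull(
        task_ids="push_current_run_id",
        key="last_run_id",
        include_prior_dates=True,
    )

push_current_run_id = PythonOperator(
    task_id="push_current_run_id",
    python_callable=push_run_id,
    dag=dag,
)

check_previous_dag_run_id = PythonOperator(
    task_id="check_previous_dag_run_id",
    python_callable=pull_previous_run_id,
    dag=dag,
)

The value returned by pull_previous_run_id is itself pushed as a return_value XCom, which the kill task can consume (for example via Jinja templating) to build the DELETE URL.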

airflow - How to run dags waiting on past success

I have configured a DAG in such a way that if the current instance has failed, the next instance won't run. However, there is a problem.
Problem
Let's say a past instance of the task failed and the current instance is in a waiting state. Once I fix the issue, how do I run the current instance without marking the past run as successful? I want to keep the history of when the task (DAG) failed.
DAG
dag = DAG(
    dag_id='test_airflow',
    default_args=args,
    tags=['wealth', 'python', 'ml'],
    schedule_interval='5 13 * * *',
    max_active_runs=1,
)

run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='lll',
    dag=dag,
    depends_on_past=True
)
I guess you could trigger a task execution via the CLI using airflow run.
There are two arguments that may help you:
-i, --ignore_dependencies - Ignore task-specific dependencies, e.g. upstream, depends_on_past, and retry delay dependencies
-I, --ignore_depends_on_past - Ignore depends_on_past dependencies (but respect upstream dependencies)
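For example, to start the waiting task instance while ignoring only its depends_on_past dependency (DAG and task ids taken from the question, date illustrative):

airflow run -I test_airflow run_after_loop 2022-01-01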

How to add manual tasks in an Apache Airflow Dag

I'm using Apache Airflow to manage a data processing pipeline. In the middle of the pipeline, some data needs to be reviewed before the next processing step, e.g.
... -> task1 -> human review -> task2 -> ...
where task1 and task2 are data processing tasks. When task1 finishes, the data it generates needs to be reviewed by a human. After the reviewer approves the data, task2 can be launched.
Human review tasks may take a very long time (e.g. several weeks).
I'm thinking of using an external database to store the human review result, and a Sensor that pokes for the review result at a time interval. But that would occupy an Airflow worker until the review is done.
Any ideas?
Piggy-backing off of Freedom's answer and Robert Elliot's answer, here is a full working example that gives the user two weeks to review the results of the first task before failing permanently:
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow import AirflowException
from airflow.operators.python_operator import PythonOperator
from my_tasks import first_task_callable, second_task_callable

TIMEOUT = timedelta(days=14)

def task_to_fail():
    raise AirflowException("Please change this step to success to continue")

dag = DAG(dag_id="my_dag", start_date=datetime(2022, 1, 1))

first_task = PythonOperator(
    dag=dag,
    task_id="first_task",
    python_callable=first_task_callable
)

manual_sign_off = PythonOperator(
    dag=dag,
    task_id="manual_sign_off",
    python_callable=task_to_fail,
    retries=1,
    retry_delay=TIMEOUT  # wait the full two weeks before the single retry fails permanently
)

second_task = PythonOperator(
    dag=dag,
    task_id="second_task",
    python_callable=second_task_callable
)

first_task >> manual_sign_off >> second_task
A colleague suggested having a task that always fails, so the manual step is simply to mark it as a success. I implemented it as follows:

def always_fail():
    raise AirflowException('Please change this step to success to continue')

manual_sign_off = PythonOperator(
    task_id='manual_sign_off',
    dag=dag,
    python_callable=always_fail
)

start >> manual_sign_off >> end
Your idea seems good to me. You can create a dedicated DAG that checks the progress of your approval process with a sensor. Use a low timeout on the sensor and an appropriate schedule on this DAG, say every 6 hours, adapted to how often these tasks are approved and how soon you need to perform the downstream tasks.
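A rough sketch of such a dedicated DAG, assuming Airflow 2 import paths and a hypothetical review_is_approved() lookup against your review database:

from datetime import datetime
from airflow import DAG
from airflow.sensors.python import PythonSensor
from my_review_db import review_is_approved  # hypothetical lookup

with DAG(
    dag_id="approval_watcher",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 */6 * * *",  # every 6 hours
    catchup=False,
) as dag:
    wait_for_approval = PythonSensor(
        task_id="wait_for_approval",
        python_callable=review_is_approved,
        mode="reschedule",      # release the worker slot between pokes
        poke_interval=10 * 60,  # poke every 10 minutes
        timeout=30 * 60,        # low timeout; the next scheduled run tries again
    )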
Before 1.10, I used the retry feature of the operator to implement the ManualSignOffTask. The operator sets retries and retry_delay, so the task is rescheduled after it fails. Each time the task is scheduled, it checks the database to see whether the sign-off is done:
If the sign-off has not been done yet, the task fails, releases the worker, and waits for the next schedule.
If the sign-off has been done, the task succeeds and the DAG run proceeds.
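A minimal sketch of this pre-1.10 pattern, with a hypothetical is_signed_off() database lookup standing in for the real check:

from datetime import timedelta
from airflow import AirflowException
from airflow.operators.python_operator import PythonOperator
from my_review_db import is_signed_off  # hypothetical lookup

def check_sign_off(**context):
    if not is_signed_off(context["ds"]):
        # Failing releases the worker; the next retry re-runs the check.
        raise AirflowException("Sign-off not done yet")

manual_sign_off = PythonOperator(
    task_id="manual_sign_off",
    python_callable=check_sign_off,
    provide_context=True,            # pre-2.0 way to receive the context
    retries=24 * 14,                 # keep checking for up to two weeks...
    retry_delay=timedelta(hours=1),  # ...once per hour
    dag=dag,
)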
After 1.10, a new task instance state, UP_FOR_RESCHEDULE, was introduced, and Sensors natively support long-running waits via mode='reschedule'.
