I have configured a DAG so that if the current instance fails, the next instance won't run. However, this causes a problem.
Problem
Let's say the past instance of the task has failed and the current instance is in a waiting state. Once I fix the issue, how can I run the current instance without marking the past run as successful? I want to keep the history showing when the task (DAG) failed.
DAG
dag = DAG(
    dag_id='test_airflow',
    default_args=args,
    tags=['wealth', 'python', 'ml'],
    schedule_interval='5 13 * * *',
    max_active_runs=1,
)
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='lll',
    dag=dag,
    depends_on_past=True,
)
You could trigger the task from the CLI using airflow run (airflow tasks run in Airflow 2).
There are two arguments that may help you:
-i, --ignore_dependencies - Ignore task-specific dependencies, e.g. upstream, depends_on_past, and retry delay dependencies
-I, --ignore_depends_on_past - Ignore depends_on_past dependencies (but respect upstream dependencies)
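As a rough sketch of how these flags fit together, the helper below assembles an Airflow 1.x-style airflow run command line for the task in the question. The function name build_run_command and the example execution date are assumptions for illustration; the flag spelling follows the help text quoted above and may differ in other Airflow versions.

```python
import shlex


def build_run_command(dag_id, task_id, execution_date, ignore_depends_on_past=True):
    """Return the argv list for an Airflow 1.x `airflow run` invocation."""
    cmd = ["airflow", "run"]
    if ignore_depends_on_past:
        # -I skips only the depends_on_past check; upstream tasks are still respected.
        cmd.append("-I")
    cmd += [dag_id, task_id, execution_date]
    return cmd


# Hypothetical execution date of the waiting run; replace it with the real one.
cmd = build_run_command("test_airflow", "run_after_loop", "2022-01-02T13:05:00")
print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine with Airflow installed, the list could be passed to subprocess.run(cmd, check=True). Because only the dependency check is skipped, the earlier failed run should stay marked as failed in the history.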
Related
I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:
    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish by 5 A.M. the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new one is starting. If, for any reason, the new DAG run does not start for a week, the old one should be able to run for the entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1,
         dagrun_timeout=timedelta(hours=24)
         ) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really need a dynamic solution, you can combine the Airflow DAG run REST API with XComs: push the current DAG run's run_id to XCom, and in each subsequent run pull that value and call the Airflow API to check for, and kill, the DAG run with that run_id. The task order would be:
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
task_id="kill_previous_dag_run",
bash_command="curl -X 'DELETE' \
'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
-H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
dag=dag
)
...
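A minimal sketch of the same call prepared from Python, assuming the Airflow 2 stable REST API with basic authentication enabled. The base URL, DAG id, run id and credentials are placeholders, and dag_run_url and build_delete_request are hypothetical helper names. The run id is URL-encoded because scheduled run ids contain characters such as : and +.

```python
import base64
import urllib.parse
import urllib.request


def dag_run_url(base_url, dag_id, run_id):
    """Build the REST endpoint for one DAG run, URL-encoding both identifiers."""
    return "{}/api/v1/dags/{}/dagRuns/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(dag_id, safe=""),
        urllib.parse.quote(run_id, safe=""),
    )


def build_delete_request(base_url, dag_id, run_id, username, password):
    """Prepare (but do not send) a DELETE request with a basic-auth header."""
    token = base64.b64encode("{}:{}".format(username, password).encode()).decode()
    return urllib.request.Request(
        dag_run_url(base_url, dag_id, run_id),
        method="DELETE",
        headers={"Authorization": "Basic " + token, "Accept": "application/json"},
    )


# Placeholder values for illustration only.
request = build_delete_request(
    "http://localhost:8080", "second_dag",
    "scheduled__2022-01-01T05:00:00+00:00", "api_user", "api_password",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the encoded URL before pointing the task at a real webserver.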
We use Airflow as an orchestrator that schedules a workflow every hour. DataprocSubmitJobOperator submits Dataproc jobs, which run Spark. The Spark job syncs data from source to target; it runs for 50 minutes and then completes, to avoid overlapping with the next scheduled run.
Intermittently, the Airflow task fails with a zombie exception. The logs show an assertion failure on pthread_mutex_lock(mu), after which the Airflow task exits. The underlying Dataproc job keeps running without issue.
What could be the cause, and how can it be fixed?
[2021-12-22 23:01:17,150] {dataproc.py:1890} INFO - Submitting job
[2021-12-22 23:01:17,804] {dataproc.py:1902} INFO - Job 27a2c88d-1308-4407-b965-aa490e2217fb submitted successfully.
[2021-12-22 23:01:17,805] {dataproc.py:1905} INFO - Waiting for job 27a2c88d-1308-4407-b965-aa490e2217fb to complete
E1222 23:45:58.299007027 1267 sync_posix.cc:67] assertion failed: pthread_mutex_lock(mu) == 0
[2021-12-22 23:46:00,943] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGABRT
Config
raw_data_sync = DataprocSubmitJobOperator(
task_id="raw_data_sync",
job=RAW_DATA_GENERATION,
location='us-central1',
project_id='1f780b38bd7b0384e53292de20',
execution_timeout=timedelta(seconds=3420),
dag=dag
)
I am new to airflow and I have written a simple SSHOperator to learn how it works.
default_args = {
    'start_date': datetime(2018, 6, 20)
}
dag = DAG(dag_id='ssh_test', schedule_interval='@hourly', default_args=default_args)
sshHook = SSHHook(ssh_conn_id='testing')
t1 = SSHOperator(
    task_id='task1',
    command='echo Hello World',
    ssh_hook=sshHook,
    dag=dag)
When I manually trigger it on the UI, the dag shows a status of running but the operator stays white, no status.
I'm wondering why my task isn't being queued. Does anyone have any ideas? My airflow.cfg is the default, if that is useful information.
Even this isn't running
dag=DAG(dag_id='test',start_date = datetime(2018,6,21), schedule_interval='0 0 * * *')
runMe = DummyOperator(task_id = 'testest', dag = dag)
Make sure you've started the Airflow Scheduler in addition to the Airflow Web Server:
airflow scheduler
Check that the Airflow scheduler is running.
Check that the Airflow webserver is running.
Check that all DAGs are set to On in the web UI.
Check that the DAGs have a start date which is in the past.
Check that the DAGs have a proper schedule (before the schedule date), as shown in the web UI.
Check that the DAG has the proper pool and queue.
I'm using Apache Airflow to manage the data processing pipeline. In the middle of the pipeline, some data need to be reviewed before the next-step processing. e.g.
... -> task1 -> human review -> task2 -> ...
where task1 and task2 are data processing tasks. When task1 finishes, the data it generates needs to be reviewed by a human. Once the reviewer approves the data, task2 can be launched.
Human review may take a very long time (e.g. several weeks).
I'm thinking of using an external database to store the review result, and a Sensor to poke for it at a fixed interval. But that would occupy an Airflow worker until the review is done.
Any ideas?
Piggybacking off of Freedom's answer and Robert Elliot's answer, here is a full working example that gives the user two weeks to review the results of the first task before failing permanently:
from datetime import timedelta
from airflow.models import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python_operator import PythonOperator
from my_tasks import first_task_callable, second_task_callable
TIMEOUT = timedelta(days=14)
def task_to_fail():
    # Always fails; a human marks this task as success to let the DAG continue.
    raise AirflowException("Please change this step to success to continue")
dag = DAG(dag_id="my_dag")
first_task = PythonOperator(
    dag=dag,
    task_id="first_task",
    python_callable=first_task_callable
)
# The first attempt fails at once and the single retry is scheduled TIMEOUT later,
# which leaves that long to mark the task as success before it fails for good.
manual_sign_off = PythonOperator(
    dag=dag,
    task_id="manual_sign_off",
    python_callable=task_to_fail,
    retries=1,
    # retry_delay sets the wait before the retry; max_retry_delay would only cap exponential backoff.
    retry_delay=TIMEOUT
)
second_task = PythonOperator(
    dag=dag,
    task_id="second_task",
    python_callable=second_task_callable
)
first_task >> manual_sign_off >> second_task
A colleague suggested having a task that always fails, so the manual step is simply to mark it as a success. I implemented it like so:
def always_fail():
    raise AirflowException('Please change this step to success to continue')
manual_sign_off = PythonOperator(
    task_id='manual_sign_off',
    dag=dag,
    python_callable=always_fail
)
start >> manual_sign_off >> end
Your idea seems good to me. You can create a dedicated DAG that checks the progress of the approval process with a sensor. Use a low timeout on the sensor and an appropriate schedule on this DAG, say every 6 hours, adapted to how often these tasks are approved and how soon you need to run the downstream tasks.
Before 1.10, I used the operator's retry feature to implement a ManualSignOffTask. The operator sets retries and retry_delay, so the task is rescheduled after it fails. Each time the task runs, it checks the database to see whether the sign-off is done:
If the sign-off has not been done yet, the task fails, releases the worker, and waits for the next retry.
If the sign-off has been done, the task succeeds and the DAG run proceeds.
Since 1.10, a new task instance state, UP_FOR_RESCHEDULE, is available, and sensors natively support long-running waits (set mode='reschedule' on the sensor).
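The sensor-based approach can be sketched as follows. This is an illustration under stated assumptions, not the original author's implementation: the review result is assumed to live in a SQLite table named reviews with columns item_id and approved (both hypothetical), and the Airflow wiring assumes PythonSensor from Airflow 2. Airflow is imported lazily inside build_sensor so the check function can be tried without an Airflow installation.

```python
import sqlite3
from datetime import timedelta


def review_approved(db_path, item_id):
    """Return True once the reviewer has approved the given item."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT approved FROM reviews WHERE item_id = ?", (item_id,)
        ).fetchone()
    finally:
        conn.close()
    # No row yet, or approved = 0, both mean "keep waiting".
    return bool(row and row[0])


def build_sensor(dag, db_path, item_id):
    """Create a sensor that gives up its worker slot between checks."""
    from airflow.sensors.python import PythonSensor  # assumes Airflow 2

    return PythonSensor(
        task_id="wait_for_review",
        python_callable=review_approved,
        op_kwargs={"db_path": db_path, "item_id": item_id},
        mode="reschedule",       # task goes to up_for_reschedule between pokes
        poke_interval=60 * 60,   # check once an hour
        timeout=timedelta(weeks=2).total_seconds(),  # fail after two weeks
        dag=dag,
    )
```

In a pipeline this would sit between the two processing tasks, for example task1 >> build_sensor(dag, "/path/to/reviews.db", "batch-1") >> task2, where the path and item id are placeholders.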
In my first foray into airflow, I am trying to run one of the example DAGS that comes with the installation. This is v.1.8.0. Here are my steps:
$ airflow trigger_dag example_bash_operator
[2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
[2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator @ 2017-04-19 15:32:38: manual__2017-04-19T15:32:38, externally triggered: True>
$ airflow dag_state example_bash_operator '2017-04-19 15:32:38'
[2017-04-19 15:33:12,918] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:33:13,229] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
running
The dag state remains "running" for a long time (at least 20 minutes by now), although from a quick inspection of this task it should take a matter of seconds. How can I troubleshoot this? How can I see which step it is stuck on?
To run any DAGs, you need to make sure two processes are running:
airflow webserver
airflow scheduler
If you only have airflow webserver running, the UI will show DAGs as running, but if you click on the DAG, none of its tasks are actually running or scheduled; they are in a Null state.
What this means is that they are waiting to be picked up by airflow scheduler. If airflow scheduler is not running, you'll be stuck in this state forever, as the tasks are never picked up for execution.
Additionally, make sure that the toggle button in the DAGs view is switched to 'ON' for the particular DAG. Otherwise it will not get picked up by the scheduler if you trigger it manually.
I too recently started using Airflow and my dags kept endlessly running. Your dag may be set on 'pause' without you realizing it, and thus the scheduler will not schedule new task instances and when you trigger the dag it just looks like it is endlessly running.
There are a few solutions:
1) In the Airflow UI, toggle the button to the left of the DAG from 'Off' to 'On'. 'Off' means the DAG is paused, so switching to 'On' allows the scheduler to pick it up and complete it. (This fixed my initial issue.)
2) In your airflow.cfg file, dags_are_paused_at_creation = True is the default, so all new DAGs are paused from the start. Change this to False, and future DAGs will be good to go right away. (I had to restart the webserver and scheduler for changes to airflow.cfg to take effect.)
3) Use the command line: $ airflow unpause [dag_id]
documentation: https://airflow.apache.org/cli.html#unpause
The below worked for me.
Make sure AIRFLOW_HOME is set.
In AIRFLOW_HOME, have the folders dags and plugins. The airflow user needs read, write, and execute permissions on them.
Make sure you have at least one DAG in the dags/ folder.
pip install celery[redis]==4.1.1
I have checked the above solution on Airflow 1.9.0.
I tried the same trick with Airflow 1.10 and it worked.