I am new to airflow and I have written a simple SSHOperator to learn how it works.
default_args = {
'start_date': datetime(2018,6,20)
}
dag = DAG(dag_id='ssh_test', schedule_interval = '#hourly',default_args=default_args)
sshHook = SSHHook(ssh_conn_id='testing')
t1 = SSHOperator(
task_id='task1',
command='echo Hello World',
ssh_hook=sshHook,
dag=dag)
When I manually trigger it on the UI, the dag shows a status of running but the operator stays white, no status.
I'm wondering why my task isn't queuing. Does anyone have any ideas? My airflow.config is the default if that is useful information.
Even this isn't running
dag=DAG(dag_id='test',start_date = datetime(2018,6,21), schedule_interval='0 0 * * *')
runMe = DummyOperator(task_id = 'testest', dag = dag)
Make sure you've started the Airflow Scheduler in addition to the Airflow Web Server:
airflow scheduler
check if airflow scheduler is running
check if airflow webserver is running
check if all DAGs are set to On in the web UI
check if the DAGs have a start date which is in the past
check if the DAGs have a proper schedule (before the schedule date) which is shown in the web UI
check if the dag has the proper pool and queue.
Related
I'm currently experimenting with a new concept where the operator will communicate with an external service to run the operator instead of running the operator locally, and the external service can communicate with Airflow to update the progress of the DAG.
For example, let's say we have a bash operator:
bash_task = BashOperator(
task_id="bash_task",
bash_command="echo \"This Message Shouldn't Run Locally on Airflow\"",
)
That is part of a DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
with DAG() as dag:
t1 = BashOperator(
task_id="bash_task1",
bash_command="echo \"t1:This Message Shouldn't Run Locally on Airflow\""
)
t2 = BashOperator(
task_id="bash_task2",
bash_command="echo \"t2:This Message Shouldn't Run Locally on Airflow\""
)
t1 >> t2
Is there a method in the Airflow code that will allow an external service to tell the DAG that t1 has started/completed and that t2 has started/completed, without actually running the DAG on the Airflow instance?
Airflow has a concept of Executors which are responsible for scheduling tasks, occasionally via or on external services - such as Kubernetes, Dask, or a Celery cluster.
https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html
The worker process communicates back to Airflow, often via the Metadata DB about the progress of the task.
I got this dag, nevrtheless when trying to run it, it stacks on Queued run. When i then trying to run manually i get error:
Error:
Only works with the Celery, CeleryKubernetes or Kubernetes executors
Code:
from airflow import DAG
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.operators.python import PythonOperator
from datetime import datetime
def helloWorld():
print('Hello World')
def take_clients():
hook = PostgresHook(postgres_conn_id="postgres_robert")
df = hook.get_pandas_df(sql="SELECT * FROM clients;")
print(df)
# do what you need with the df....
with DAG(dag_id="test",
start_date=datetime(2021,1,1),
schedule_interval="#once",
catchup=False) as dag:
task1 = PythonOperator(
task_id="hello_world",
python_callable=helloWorld)
task2 = PythonOperator(
task_id="get_clients",
python_callable=take_clients)
task1 >> task2
I guess you are trying to use RUN button from the UI.
This button is enabled only for executors that supports it.
In your Airflow setup you are using Executor that doesn't support this command.
In newer Airflow versions the button is simply disable if you you are using Executor that doesn't support it:
I assume that what you are after is to create a new run, in that case you should use Trigger Run button. If you are looking to re-run specific task then use Clear button.
you run it as LocalExecutor , you have to change your Executor to Celery, CeleryKubernetes or Kubernetes or DaskExecutor
if you using docker-compose add:
AIRFLOW__CORE__EXECUTOR: CeleryExecutor
otherwise go to airflow Executor
I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
) as dag:
task_a = BashOperator(
task_id="ToRepeat",
bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
retries =1,
)
The task takes a variable amount of time between one run and the other, and I don't have any guarantee that it will be finished within the 5 A.M of the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG should be killed only when the new DAG is starting. If, for any reason, the new DAG does not start for one week, then old DAG should be able to run for an entire week. That's why using a timeout is sub-optimal
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution; you can take help of Airflow DAGRun APIs and Xcoms; you can push your current dag run_id to Xcom and for subsequent runs you can pull this Xcom to consume with airflow API to check and kill the dag run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
task_id="kill_previous_dag_run",
bash_command="curl -X 'DELETE' \
'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
-H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
dag=dag
)
...
I have configured a dag in such a way that if current instance has failed next instance won't run. However, the problem is.
Problem
let's say past instance of the task is failed and current instance is in waiting state. Once I fix the issue how to run the current instance without making past run successful. I want to see the history when the task(dag) failed.
DAG
dag = DAG(
dag_id='test_airflow',
default_args=args,
tags=['wealth', 'python', 'ml'],
schedule_interval='5 13 * * *',
max_active_runs=1,
)
run_this = BashOperator(
task_id='run_after_loop',
bash_command='lll',
dag=dag,
depends_on_past=True
)
I guess you could trigger a task execution via cli using airflow run
There are two arguments that may help you:
-i, --ignore_dependencies - Ignore task-specific dependencies, e.g. upstream, depends_on_past, and retry delay dependencies
-I, --ignore_depends_on_past - Ignore depends_on_past dependencies (but respect upstream dependencies)
I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 6, 1),
'retries': 0,
}
dag = DAG('min_timeout', default_args=default_args, schedule_interval=timedelta(minutes=5), dagrun_timeout = timedelta(seconds=30), max_active_runs=1)
t1 = BashOperator(
task_id='fast_task',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='slow_task',
bash_command='sleep 45',
dag=dag)
t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs a loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Bear in mind at least here, because the scheduler performs some actions that may delay the next cycle of its loop.
These actions include:
parsing the dags
filling up the DagBag
checking the DagRun and updating their state
scheduling next DagRun
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler performs its next cycle after the task completes.
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout wouldn't work for non-scheduled DagRuns (e.g. manually triggered) and if the number of active DagRuns < max_active_runs parameter.