I have written an Airflow DAG as below:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2016, 7, 5),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(seconds=30),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
'test-air', default_args=default_args, schedule_interval='*/2 * * * *')
.................
.................
{{Tasks}}
As per the above config, the job should run every two minutes. But instead the scheduler shows the output below:
airflow scheduler -d test-air
[2016-07-05 15:24:02,168] {jobs.py:574} INFO - Prioritizing 0 queued jobs
[2016-07-05 15:24:02,177] {jobs.py:726} INFO - Starting 0 scheduler jobs
[2016-07-05 15:24:02,177] {jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
[2016-07-05 15:24:02,177] {jobs.py:744} INFO - Loop took: 0.012636 seconds
[2016-07-05 15:24:02,256] {models.py:305} INFO - Finding 'running' jobs without a recent heartbeat
[2016-07-05 15:24:02,256] {models.py:311} INFO - Failing jobs without heartbeat after 2016-07-05 15:21:47.256816
[2016-07-05 15:24:07,177] {jobs.py:574} INFO - Prioritizing 0 queued jobs
[2016-07-05 15:24:07,182] {jobs.py:726} INFO - Starting 0 scheduler jobs
[2016-07-05 15:24:07,182] {jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
[2016-07-05 15:24:07,182] {jobs.py:744} INFO - Loop took: 0.007725 seconds
[2016-07-05 15:24:07,249] {models.py:305} INFO - Finding 'running' jobs without a recent heartbeat
[2016-07-05 15:24:07,249] {models.py:311} INFO - Failing jobs without heartbeat after 2016-07-05 15:21:52.249706
Can somebody guide me here?
Thanks
Pari
By default, every DAG that is created is in "paused" mode. This is defined in your airflow.cfg file.
You can unpause your DAG with
$ airflow unpause test-air
and then try the scheduler again.
You can also toggle your DAG on/off from the Airflow web UI (by default it is off).
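If you want a single DAG to start unpaused without changing the global dags_are_paused_at_creation setting in airflow.cfg, newer Airflow releases also expose an is_paused_upon_creation argument on the DAG object. A minimal sketch, assuming a version that supports that argument:
from datetime import datetime, timedelta
from airflow import DAG
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 7, 5),
    'retries': 1,
    'retry_delay': timedelta(seconds=30),
}
# is_paused_upon_creation only matters the first time the DAG is registered;
# an already-registered DAG still needs "airflow unpause test-air" or the UI toggle.
dag = DAG(
    'test-air',
    default_args=default_args,
    schedule_interval='*/2 * * * *',
    is_paused_upon_creation=False,
)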
Related
I have a DAG configured like this:
AIRFLOW_DEFAULT_ARGS = {
'owner': 'airflow',
'depends_on_past': False,
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'dagrun_timeout': timedelta(hours=5)
}
DAILY_RUNNER = DAG(
'daily_runner',
max_active_runs=1,
start_date=datetime(2019, 1, 1),
schedule_interval="0 17 * * *",
default_args=AIRFLOW_DEFAULT_ARGS)
My current understanding is that retries says that a task will be retried once before failing for good. Is there a way to set a similar limit for the number of times a DAG gets retried? If I have a dag in the running state, I want to be able to set it to failed from within the UI once and have it stop rerunning.
Currently, there is no way to set retries at the DAG level.
Please refer to the answer below for retrying a set of tasks/the whole DAG in case of failures:
Can a failed Airflow DAG Task Retry with changed parameter
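To make the distinction concrete, here is a minimal sketch (with a hypothetical DAG id and task id) showing that retries is a per-task setting: tasks inherit it from default_args and can override it individually, but there is no equivalent knob on the DAG object itself.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
AIRFLOW_DEFAULT_ARGS = {
    'owner': 'airflow',
    'retries': 1,                        # every task gets one retry by default
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(
    'retry_example',                     # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval='0 17 * * *',
    max_active_runs=1,
    default_args=AIRFLOW_DEFAULT_ARGS,
)
flaky_task = BashOperator(
    task_id='flaky_task',                # hypothetical task
    bash_command='exit 1',
    retries=3,                           # per-task override of the default
    dag=dag,
)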
Airflow tasks run without any issues, and then suddenly, halfway through, a task gets stuck and the task instance details show the message below.
I cleared my entire database, but still, I am getting the same error.
I am getting this issue only for some DAGs, mostly the long-running jobs.
I am getting the error below:
[2019-07-03 12:14:56,337] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: XXXXXX.index_to_es 2019-07-01T13:30:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-07-03 12:14:56,341] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: XXXXXX.index_to_es 2019-07-01T13:30:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-07-03 05:58:51.601552+00:00.
[2019-07-03 12:14:56,342] {{logging_mixin.py:95}} INFO - [2019-07-03 12:14:56,342] {{jobs.py:2514}} INFO - Task is not able to be run
My dag looks like below
default_args = {
'owner': 'datascience',
'depends_on_past': True,
'start_date': datetime(2019, 6, 12),
'email': ['datascience@mycompany.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 3,
'retry_delay': timedelta(minutes=5),
# 'queue': 'nill',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}
def get_index_date(**kwargs):
    tomorrow = kwargs.get('templates_dict').get('tomorrow')
    return str(tomorrow).replace('-', '.')
"""
Create Dags specify its features
"""
dag = DAG(
DAG_NAME,
schedule_interval="0 9 * * *",
catchup=True,
default_args=default_args,
template_searchpath='/efs/sql')
create_table = BigQueryOperator(
dag=dag,
task_id='create_temp_table_from_query',
sql='daily_demand.sql',
use_legacy_sql=False,
destination_dataset_table=TEMP_TABLE,
bigquery_conn_id=CONNECTION_ID,
create_disposition='CREATE_IF_NEEDED',
write_disposition='WRITE_TRUNCATE'
)
"""Task to zip and export to GCS"""
export_to_storage = BigQueryToCloudStorageOperator(
task_id='export_to_GCS',
source_project_dataset_table=TEMP_TABLE,
destination_cloud_storage_uris=[CLOUD_STORAGE_URI],
export_format='NEWLINE_DELIMITED_JSON',
compression='GZIP',
bigquery_conn_id=CONNECTION_ID,
dag=dag)
"""Task to get the tomorrow execution date formatted for indexing"""
get_index_date = PythonOperator(
task_id='get_index_date',
python_callable=get_index_date,
templates_dict={'tomorrow':"{{ tomorrow_ds }}"},
provide_context=True,
dag=dag
)
"""Task to download zipped files and bulkindex to elasticsearch"""
es_indexing = EsDownloadAndIndexOperator(
task_id="index_to_es",
object=OBJECT,
es_url=ES_URI,
local_path=LOCAL_FILE,
gcs_conn_id=CONNECTION_ID,
bucket=GCS_BUCKET_ID,
es_index_type='demand_shopper',
es_bulk_batch=5000,
es_index_name=INDEX,
es_request_timeout=300,
dag=dag)
"""Define the chronology of tasks in DAG"""
create_table >> export_to_storage >> get_index_date >> es_indexing
Thanks for your help
I figured out the issue; it was an underlying infrastructure problem. I was using AWS EFS, and burst mode was blocking the workers once the throughput limit was reached. After changing to provisioned throughput mode, the workers are no longer getting stuck.
I got the idea from
ecs-airflow-1-10-2-performance-issues-operators-and-tasks-take-10x-longer
I noticed the same issue. This message was logged:
Dependencies not met for <TaskInstance:xxxxx]>, dependency 'Task Instance State' FAILED:
Task is in the 'running' state which is not a valid state for execution.
The task must be cleared in order to be run.
Similarly, I was running an hourly "delete_dags_and_then_refresh" DAG (instead of using a file share). Long-running jobs would fail until I disabled this DAG-update job. Problem solved. I will try something different to refresh DAGs.
I have recently upgraded my Airflow to 1.10.2. Some tasks in the DAG run fine, while some tasks retry more than the specified number of retries.
One of the task logs shows "Starting attempt 26 of 2". Why is the scheduler scheduling it even after two failures?
Is anyone facing a similar issue?
Example Dag -
args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2019, 3, 10, 0, 0, 0),
'retries':1,
'retry_delay': timedelta(minutes=2),
'email': ['my@myorg.com'],
'email_on_failure': True,
'email_on_retry': True
}
dag = DAG(dag_id='dag1',
default_args=args,
schedule_interval='0 12 * * *',
max_active_runs=1)
data_processor1 = BashOperator(
task_id='data_processor1',
bash_command="sh processor1.sh {{ ds }} ",
dag=dag)
data_processor2 = BashOperator(
task_id='data_processor2',
bash_command="ssh processor2.sh {{ ds }} ",
dag=dag)
data_processor1.set_downstream(data_processor2)
This may be useful.
I tried to reproduce the error you are facing in Airflow, but I couldn't generate it.
In my Airflow GUI it shows only a single retry, and then it marks the task and DAG as failed, which is the normal Airflow behavior, so I don't know why or how you're facing this issue.
(See the attached screenshot of my Airflow GUI for your DAG.)
Can you please add more details regarding the problem (like logs)?
I have a problem with Airflow: the first job in a DAG always starts and ends successfully, but the second job never starts automatically.
I try to clear the job in the UI but it doesn't start; if I want to see it running I have to delete the running jobs in the database:
delete from job where state='running'
But I don't have a lot of jobs in the running state: there is only one SchedulerJob with the latest heartbeat OK, and 16 external task sensors waiting for this DAG.
The pool has 150 slots; 16 tasks are running and 1 is scheduled.
I have the airflow scheduler running.
I have the airflow webserver running.
All DAGs are set to On in the web UI.
All the DAGs have a start date in the past.
I restarted the scheduler a few hours before.
And this is the code in Airflow:
default_args = {
'owner': 'airgia',
'depends_on_past': False,
'retries': 2,
'start_date': datetime(2018, 12, 1, 0, 0),
'email': ['xxxx#yyyy.net'],
'email_on_failure': False,
'email_on_retry': False
}
dag = DAG('trigger_snapshot',
default_args=default_args,
dagrun_timeout= timedelta(hours=22),
schedule_interval="0 0 * * 1,2,3,4,5,7",
max_active_runs=1,
catchup=False
)
set_exec_dt = PythonOperator(
task_id='set_exec_dt',
python_callable=set_exec_dt_variable,
dag=dag,
pool='capser')
lanza_crawler = PythonOperator(
task_id='lanza_crawler',
op_kwargs={"crawler_name": crawler_name},
python_callable=start_crawler,
dag=dag,
pool='capser')
copy_as_processed = PythonOperator(
task_id='copy_as_processed',
op_kwargs={"source_bucket": Variable.get("bucket"),
"source_key": snapshot_key,
"dest_bucket": Variable.get("bucket"),
"dest_key": "{0}_processed".format(snapshot_key)},
python_callable=s3move,
dag=dag,
pool='capser')
airflow_snapshot = S3KeySensor(
task_id='airflow_snapshot',
bucket_key=snapshot_key,
wildcard_match=True,
bucket_name=Variable.get("bucket"),
timeout=8*60*60,
poke_interval=120,
dag=dag,
pool='capser')
Fin_DAG_TC = DummyOperator(
task_id='Fin_DAG_TC',
dag=dag,
pool='capser')
airflow_snapshot >> lanza_crawler >> set_exec_dt >> copy_as_processed >> Fin_DAG_TC
And this is what I see when I connect to the web UI every morning:
[screenshot: task instances shown with operator null]
[EDIT]
This is the last log for the scheduler.
Here we can see the second job (lanza_crawler) being queued, but it never starts.
[2018-12-11 03:50:54,209] {{jobs.py:1109}} INFO - Tasks up for execution:
[2018-12-11 03:50:54,240] {{jobs.py:1180}} INFO - DAG trigger_snapshot has 0/16 running and queued tasks
[2018-12-11 03:50:54,240] {{jobs.py:1218}} INFO - Setting the follow tasks to queued state:
[2018-12-11 03:50:54,254] {{jobs.py:1301}} INFO - Setting the follow tasks to queued state:
[2018-12-11 03:50:54,255] {{jobs.py:1343}} INFO - Sending ('trigger_snapshot', 'lanza_crawler', datetime.datetime(2018, 12, 10, 0, 0, tzinfo=), 1) to executor with priority 4 and queue default
[2018-12-11 03:50:54,255] {{base_executor.py:56}} INFO - Adding to queue: airflow run trigger_snapshot lanza_crawler 2018-12-10T00:00:00+00:00 --local --pool capser -sd /usr/local/airflow/dags/capser/trigger_snapshot.py
[2018-12-11 03:50:54,262] {{celery_executor.py:83}} INFO - [celery] queuing ('trigger_snapshot', 'lanza_crawler', datetime.datetime(2018, 12, 10, 0, 0, tzinfo=), 1) through celery, queue=default
[2018-12-11 03:50:54,749] {{jobs.py:1447}} INFO - Executor reports trigger_snapshot.airflow_snapshot execution_date=2018-12-10 00:00:00+00:00 as success for try_number 1
/usr/local/airflow/dags/capser/trigger_snapshot.py 1.53s 2018-12-11T03:50:54
...
/usr/local/airflow/dags/capser/trigger_snapshot.py 6866 0.68s 1.54s 2018-12-11T03:56:50
And this is the last log for the worker:
[2018-12-11 03:50:52,718: INFO/ForkPoolWorker-11] Task airflow.executors.celery_executor.execute_command[9a2e1ae7-9264-47d8-85ff-cac32a542708] succeeded in 13847.525094523095s: None
[2018-12-11 03:50:54,505: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[9ff70fc8-45ef-4751-b274-71e242553128]
[2018-12-11 03:50:54,983] {{settings.py:174}} INFO - setting.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800
[2018-12-11 03:50:55,422] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2018-12-11 03:50:55,611] {{models.py:271}} INFO - Filling up the DagBag from /usr/local/airflow/dags/capser/DAG_AURORA/DAG_AURORA.py
[2018-12-11 03:50:55,970] {{cli.py:484}} INFO - Running on host ip----*.eu-west-1.compute.internal
In the AWS graphs we saw that 80% of the workers' memory was occupied, so we decided to increase the number of workers, and the problem was solved.
I am new to Airflow and have created my first DAG. Here is my DAG code. I want the DAG to start now and thereafter run once a day.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.now(),
'email': ['aaaa@gmail.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG(
'alamode', default_args=default_args, schedule_interval=timedelta(1))
create_command = "/home/ubuntu/scripts/makedir.sh "
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
task_id='create_directory',
bash_command=create_command,
dag=dag)
run_spiders = "/home/ubuntu/scripts/crawl_spiders.sh "
# t2 is the task which will invoke the spiders
t2 = BashOperator(
task_id='web_scrawl',
bash_command=run_spiders,
dag=dag)
# To set dependency between tasks. 't1' should run before t2
t2.set_upstream(t1)
The DAG is not getting picked up by Airflow. I checked the log and here is what it says:
[2017-09-12 18:08:20,220] {jobs.py:343} DagFileProcessor398 INFO - Started process (PID=7001) to work on /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,223] {jobs.py:1521} DagFileProcessor398 INFO - Processing file /home/ubuntu/airflow/dags/alamode.py for tasks to queue
[2017-09-12 18:08:20,223] {models.py:167} DagFileProcessor398 INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,262] {jobs.py:1535} DagFileProcessor398 INFO - DAG(s) ['alamode'] retrieved from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,291] {jobs.py:1169} DagFileProcessor398 INFO - Processing alamode
/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/default_comparator.py:161: SAWarning: The IN-predicate on "dag_run.dag_id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
'strategies for improved performance.' % expr)
[2017-09-12 18:08:20,317] {models.py:322} DagFileProcessor398 INFO - Finding 'running' jobs without a recent heartbeat
[2017-09-12 18:08:20,318] {models.py:328} DagFileProcessor398 INFO - Failing jobs without heartbeat after 2017-09-12 18:03:20.318105
[2017-09-12 18:08:20,320] {jobs.py:351} DagFileProcessor398 INFO - Processing /home/ubuntu/airflow/dags/alamode.py took 0.100 seconds
What exactly am I doing wrong? I have tried changing the schedule_interval to schedule_interval=timedelta(minutes=1) to see if it starts immediately, but still no luck. I can see the tasks under the DAG as expected in the Airflow UI, but with the schedule status as 'no status'. Please help me here.
This issue has been resolved by following the steps below:
1) I used a much older date for start_date and schedule_interval=timedelta(minutes=10). I also used a fixed date instead of datetime.now(): the scheduler only triggers a run after start_date plus one schedule_interval has passed, so a start_date that keeps moving forward never gets scheduled.
2) Added catchup = True in the DAG arguments.
3) Set the environment variable: export AIRFLOW_HOME=`pwd`/airflow_home.
4) Deleted airflow.db.
5) Moved the new code to the DAGs folder.
6) Ran the command 'airflow initdb' to create the DB again.
7) Turned the 'ON' switch of my DAG through the UI.
8) Ran the command 'airflow scheduler'.
Here is the code which works now:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2017, 9, 12),
'email': ['anjana@gapro.tech'],
'retries': 0,
'retry_delay': timedelta(minutes=15)
}
dag = DAG(
'alamode', catchup=False, default_args=default_args, schedule_interval="#daily")
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
task_id='create_directory',
bash_command='/home/ubuntu/scripts/makedir.sh ',
dag=dag)
# t2 is the task which will invoke the spiders
t2 = BashOperator(
task_id='web_crawl',
bash_command='/home/ubuntu/scripts/crawl_spiders.sh ',
dag=dag)
# To set dependency between tasks. 't1' should run before t2
t2.set_upstream(t1)
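As a quick sanity check, here is a minimal sketch (assuming the file above is saved in the configured DAGs folder) that confirms the scheduler can import the DAG at all, which is worth verifying before digging into start_date and schedule_interval:
from airflow.models import DagBag
# DagBag parses the configured DAGs folder the same way the scheduler does,
# so any import errors surface here before anything shows up in the UI.
dag_bag = DagBag()
print(dag_bag.import_errors)          # expect an empty dict
print('alamode' in dag_bag.dags)      # expect True once the file is picked up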