Apache Airflow not running command or sending output of command properly? - airflow

I'm running Apache Airflow, and whenever I run an aws command, no output is displayed. The command works in the bash shell on the worker. Right now the job just hangs, waiting. Is there something I have to do to tell Airflow that the command has completed?
Logs
[2019-10-07 15:48:13,098] {bash_operator.py:91} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=lambda_async
AIRFLOW_CTX_TASK_ID=aws_lambda
AIRFLOW_CTX_EXECUTION_DATE=2019-10-07T15:48:01.310140+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2019-10-07T15:48:01.310140+00:00
[2019-10-07 15:48:13,099] {bash_operator.py:105} INFO - Temporary script location: /tmp/airflowtmpj4r2x2xg/aws_lambda2qu0f2vl
[2019-10-07 15:48:13,099] {bash_operator.py:115} INFO - Running command: aws lambda --region us-east-1 list-functions
[2019-10-07 15:48:13,103] {bash_operator.py:124} INFO - Output:
Code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
import datetime
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2019, 7, 30),
    'email': ['xxxxxxxxx'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
}
dag = DAG('lambda_async', default_args=default_args, schedule_interval=datetime.timedelta(days=1))
t1_command = "aws lambda --region us-east-1 list-functions"
t1 = BashOperator(
    task_id='aws_lambda',
    bash_command=t1_command,
    dag=dag)
t1

Related

Airflow tasks not getting run

I am trying to run a simple BashOperator task in Airflow. When triggered manually, the DAG lists the tasks in the Tree and Graph views, but the tasks always stay in the "not started" state.
I have restarted my Airflow scheduler. I am running Airflow on localhost using a Kubectl image on Docker Compose.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['vijayraghunath21@gmail.com'],
    'email_on_success': True,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
}
with DAG(
    dag_id='bash_demo',
    default_args=default_args,
    description='Bash Demo',
    start_date=datetime(2021, 1, 1),
    # schedule_interval='0 2 * * *',
    schedule_interval=None,
    max_active_runs=1,
    catchup=False,
    tags=['bash_demo'],
) as dag:
    dag.doc_md = __doc__
    # Task 1
    dummy_task = DummyOperator(task_id='dummy_task')
    # Task 2
    bash_task = BashOperator(
        task_id='bash_task', bash_command="echo 'command executed from BashOperator'")
    dummy_task >> bash_task
DAG Image
As the image you added shows, the DAG is toggled off, so it is not running. Click the toggle button to turn it on.
This issue can be avoided in two ways:
Global solution: set dags_are_paused_at_creation = False in airflow.cfg. This will affect all DAGs in the system.
Local solution: use is_paused_upon_creation in the DAG constructor:
with DAG(
    dag_id='bash_demo',
    ...
    is_paused_upon_creation=False,
) as dag:
This parameter specifies whether the DAG is paused when it is created for the first time. If the DAG already exists, the parameter is ignored.
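Since the setup in the question runs under Docker Compose, the global setting is often passed as an environment variable instead of editing airflow.cfg. Airflow reads overrides named AIRFLOW__{SECTION}__{KEY}. The helper below only builds that name; `airflow_env_var` is a hypothetical name (not part of Airflow), and the sketch assumes the option lives in the [core] section.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow checks for a config override."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


# Assumption: dags_are_paused_at_creation belongs to the [core] section.
override_name = airflow_env_var("core", "dags_are_paused_at_creation")

# Line that could go into the environment of the Airflow containers.
print("{}=False".format(override_name))
```

The variable has to be visible to the scheduler and webserver processes, and like the airflow.cfg setting it affects every DAG.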

Airflow does not run dags

Context: I successfully installed Airflow on EC2 and changed a few settings: executor to LocalExecutor; sql_alchemy_conn to postgresql+psycopg2://postgres@localhost:5432/airflow; max_threads to 10.
My problem is that when I create a DAG that should run every day, everything is fine, but when I create a DAG that should run at 10am on Monday and Wednesday, Airflow does not run it. Does anybody know what I could be doing wrong and what I should do to fix this issue?
DAG for the script that runs properly:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG(
    'daily_script',
    default_args=args,
    description='daily_script',
    schedule_interval="0 10 * * *",
)
t1 = BashOperator(
    task_id='daily',
    bash_command='cd /root/ && python3 DAILY_WORK.py',
    dag=dag)
t1
DAG for the script that should run on Monday and Wednesday but does not run at all:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG(
    'monday_wednesday',
    default_args=args,
    description='monday_wednesday',
    schedule_interval="0 10 * * 1,3",
)
t1 = BashOperator(
    task_id='monday_wednesday',
    bash_command='cd /root/ && python3 not_daily_work.py',
    dag=dag)
t1
I also have some problems with the scheduler: it tends to die after running for more than 10 hours. Does anybody know why this happens?
Thank you in advance!
Can you try changing the start_date to a static datetime, e.g. datetime.datetime(2020, 3, 20), instead of using airflow.utils.dates.days_ago(1)?
Maybe read through the scheduling examples here to understand why your code didn't work. From that documentation:
Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
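To make the quoted rule concrete, here is a rough sketch of it in plain Python (no Airflow import). It is a simplified model, not the scheduler's actual code: `next_tick`, `first_trigger` and `ever_due_with_moving_start` are hypothetical helpers, and the model ignores that the real scheduler continues from the last DAG run once one exists. It assumes the first period starts at the first schedule tick on or after start_date and that the run fires at the following tick.

```python
from datetime import datetime, timedelta

MON_WED = {0, 2}            # datetime.weekday(): Monday=0, Wednesday=2 (cron "1,3")
EVERY_DAY = set(range(7))   # cron "*"


def next_tick(after, weekdays, hour=10):
    """First datetime strictly after `after` that is `hour`:00 on an allowed weekday."""
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= after:
        candidate += timedelta(days=1)
    while candidate.weekday() not in weekdays:
        candidate += timedelta(days=1)
    return candidate


def first_trigger(start_date, weekdays, hour=10):
    """Moment the first run fires: the END of the first period on or after start_date."""
    period_start = next_tick(start_date - timedelta(microseconds=1), weekdays, hour)
    return next_tick(period_start, weekdays, hour)


# Static start_date (a Friday): the first period starts on the next Monday
# and the run fires at the tick after that.
static_trigger = first_trigger(datetime(2020, 3, 20), MON_WED)


def ever_due_with_moving_start(weekdays, days=14):
    """Re-evaluate start_date as 'midnight yesterday' every day, like days_ago(1)."""
    day = datetime(2020, 3, 16)  # an arbitrary Monday at midnight
    for _ in range(days):
        end_of_day = day + timedelta(hours=23, minutes=59)
        start_date = day - timedelta(days=1)
        if first_trigger(start_date, weekdays) <= end_of_day:
            return True
        day += timedelta(days=1)
    return False


print(static_trigger)
print(ever_due_with_moving_start(EVERY_DAY), ever_due_with_moving_start(MON_WED))
```

Under these assumptions the daily schedule becomes due even with a moving start_date, while the Monday/Wednesday schedule keeps pushing its first trigger into the future, which would match the behaviour described in the question.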

Apache Airflow: How to schedule it on remote machines

I am new to Apache Airflow. Could you please help me understand where and what I should configure to run a DAG on remote machines? I am using the celery_executor to execute the code on worker nodes. I have not done any configuration on the worker nodes. I am using RabbitMQ as the queue service, and it seems that I have configured the Airflow cluster correctly.
My DAG file:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('sample_date_print', schedule_interval='*/1 * * * *', default_args=default_args)
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
templated_command = """
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)
The logs:
{
"host_name": "1f176162bc5e",
"full_command": "['/usr/local/bin/airflow', 'tasks', 'run', 'sample_date_print', 'print_date', '2015-06-04T00:00:00+00:00', '--local', '--pool', 'default_pool', '-sd', '/root/airflow/dags/sample_date_print.py']"
}
I am not sure how to change the default --local behavior so that the DAG's tasks execute on remote machines. Please help me.
Have you changed executor = SequentialExecutor to executor = CeleryExecutor in your airflow.cfg? I think you have not, or if you have, you forgot to mention it.
It's the first thing to change if you want to change the execution mode.
Other configuration steps you might have missed could be mentioned here or here.
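As a quick sanity check, the executor named in a config file can be read with the standard library. This is only a sketch: `SAMPLE_CFG` stands in for the relevant part of airflow.cfg, `configured_executor` is a hypothetical helper, and the SequentialExecutor fallback is an assumption taken from the default mentioned in the answer above.

```python
import configparser

# Stand-in for the relevant part of airflow.cfg (assumed layout).
SAMPLE_CFG = """
[core]
executor = CeleryExecutor
"""


def configured_executor(cfg_text: str) -> str:
    """Return the executor named in the [core] section, or the assumed default."""
    # interpolation=None so that '%' characters in other options are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("core", "executor", fallback="SequentialExecutor")


print(configured_executor(SAMPLE_CFG))
```

The same value has to be in effect on the scheduler and on every worker node, otherwise tasks keep running where the scheduler lives.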

Airflow does not run task

I've written a python task for Airflow. When I load my DAG into Airflow, it shows everything just fine. Once I trigger a run for my DAG, it will create a run and switch to succeeded without ever running my task. The webserver and scheduler are both running and there's nothing in the log.
The task doesn't even get a status while running (not even skipped).
If I run my task directly using airflow test update_dags work 2019-01-01 it runs just fine.
This is my DAG:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import os
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime.today(),
    'email': ['***redacted***'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'params': {
        'git_repository': '***redacted***',
        'git_ref': 'origin/master',
        'git_folder': '/opt/dag-repository'
    }
}
def command(cmd: str, *args):
    templated_command = cmd.format(*args)
    print('Running command: {}'.format(templated_command))
    os.system(templated_command)
def do_work(**kwargs):
    params = kwargs['params']
    dag_directory = kwargs['conf'].get('core', 'dags_folder')
    git_repository = params['git_repository']
    git_ref = params['git_ref']
    git_folder = params['git_folder']
    command('if [ ! -d {0} ]; then git clone {1} {0}; fi', git_folder, git_repository)
    command('cd {0}; git fetch -apt', git_folder)
    command('cd {0}; git reset --hard {1}', git_folder, git_ref)
    command('ln -sf {0} {1}', '{}/src'.format(git_folder), dag_directory)
with DAG('update_dags', default_args=default_args, schedule_interval=timedelta(minutes=10), max_active_runs=1) as dag:
    work_stage = PythonOperator(
        task_id="work",
        python_callable=do_work,
        provide_context=True
    )
I had the same problem. I solved it by replacing os.system(cmd) with subprocess.run(cmd, shell=True, check=True), but I'm not quite sure why it works. Hope this helps.
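For what it's worth, the two calls do differ in how they report failure, which may be part of the story. os.system only returns an exit status, and the command helper in the question throws it away, whereas subprocess.run with check=True raises CalledProcessError on a non-zero exit. This does not explain why the task never received a status, so treat the following as a sketch of the behavioural difference only; `exit 3` is just a stand-in for a failing git command.

```python
import os
import subprocess


def command(cmd: str, *args) -> None:
    """Format and run a shell command, raising if it exits with a non-zero status."""
    templated_command = cmd.format(*args)
    print('Running command: {}'.format(templated_command))
    subprocess.run(templated_command, shell=True, check=True)


# os.system reports failure only through its return value, which is easy to ignore.
ignored_status = os.system('exit 3')

# The subprocess-based helper surfaces the same failure as an exception.
try:
    command('exit {0}', 3)
    raised_code = None
except subprocess.CalledProcessError as exc:
    raised_code = exc.returncode

print(ignored_status != 0, raised_code)
```

Inside a PythonOperator callable, the raised exception is what lets Airflow mark the task as failed instead of carrying on after a broken step.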

Airflow DAG not getting scheduled

I am new to Airflow and have created my first DAG. Here is my DAG code. I want the DAG to start now and thereafter run once a day.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['aaaa@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(
    'alamode', default_args=default_args, schedule_interval=timedelta(1))
create_command = "/home/ubuntu/scripts/makedir.sh "
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
    task_id='create_directory',
    bash_command=create_command,
    dag=dag)
run_spiders = "/home/ubuntu/scripts/crawl_spiders.sh "
# t2 is the task which will invoke the spiders
t2 = BashOperator(
    task_id='web_scrawl',
    bash_command=run_spiders,
    dag=dag)
# To set dependency between tasks. 't1' should run before t2
t2.set_upstream(t1)
The DAG is not getting picked up by Airflow. I checked the log, and here is what it says.
[2017-09-12 18:08:20,220] {jobs.py:343} DagFileProcessor398 INFO - Started process (PID=7001) to work on /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,223] {jobs.py:1521} DagFileProcessor398 INFO - Processing file /home/ubuntu/airflow/dags/alamode.py for tasks to queue
[2017-09-12 18:08:20,223] {models.py:167} DagFileProcessor398 INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,262] {jobs.py:1535} DagFileProcessor398 INFO - DAG(s) ['alamode'] retrieved from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,291] {jobs.py:1169} DagFileProcessor398 INFO - Processing alamode
/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/default_comparator.py:161: SAWarning: The IN-predicate on "dag_run.dag_id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
'strategies for improved performance.' % expr)
[2017-09-12 18:08:20,317] {models.py:322} DagFileProcessor398 INFO - Finding 'running' jobs without a recent heartbeat
[2017-09-12 18:08:20,318] {models.py:328} DagFileProcessor398 INFO - Failing jobs without heartbeat after 2017-09-12 18:03:20.318105
[2017-09-12 18:08:20,320] {jobs.py:351} DagFileProcessor398 INFO - Processing /home/ubuntu/airflow/dags/alamode.py took 0.100 seconds
What exactly am I doing wrong? I have tried changing the schedule_interval to schedule_interval=timedelta(minutes=1) to see if it starts immediately, but still no use. I can see the tasks under the DAG as expected in Airflow UI but with schedule status as 'no status'. Please help me here.
This issue has been resolved by following the below steps:
1) I used a much older date for start_date and schedule_interval=timedelta(minutes=10). Also, used a real date instead of datetime.now().
2) Added catchup = True in DAG arguments.
3) Set up the environment variable with export AIRFLOW_HOME=`pwd`/airflow_home.
4) Deleted airflow.db
5) Moved the new code to DAGS folder
6) Ran the command 'airflow initdb' to create the DB again.
7) Turned the 'ON' switch of my DAG through UI
8) Ran the command 'airflow scheduler'
Here is the code which works now:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 12),
    'email': ['anjana@gapro.tech'],
    'retries': 0,
    'retry_delay': timedelta(minutes=15)
}
dag = DAG(
    'alamode', catchup=False, default_args=default_args, schedule_interval="@daily")
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
    task_id='create_directory',
    bash_command='/home/ubuntu/scripts/makedir.sh ',
    dag=dag)
# t2 is the task which will invoke the spiders
t2 = BashOperator(
    task_id='web_crawl',
    bash_command='/home/ubuntu/scripts/crawl_spiders.sh ',
    dag=dag)
# To set dependency between tasks. 't1' should run before t2
t2.set_upstream(t1)
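Step 1 is probably the change that matters most. A rough way to see it, as a simplified model rather than the scheduler's real logic: a run only becomes due once start_date plus one schedule_interval has passed, and datetime.now() is re-evaluated every time the DAG file is parsed. `is_due` is a hypothetical helper, and the parse times are made up for illustration.

```python
from datetime import datetime, timedelta


def is_due(start_date, interval, now):
    """A run is due once one full interval after start_date has elapsed."""
    return start_date + interval <= now


interval = timedelta(days=1)

# Pretend the DAG file is parsed every six hours over three days.
parse_times = [datetime(2017, 9, 12, 18, 0) + timedelta(hours=h) for h in range(0, 72, 6)]

# start_date=datetime.now(): at every parse the start date equals "now",
# so the end of the first period is always one interval away.
dynamic_due = [is_due(now, interval, now) for now in parse_times]

# A fixed start_date stays put, so the first period eventually ends.
static_due = [is_due(datetime(2017, 9, 12), interval, now) for now in parse_times]

print(any(dynamic_due), any(static_due))
```

With the moving start date no parse ever finds a due run, while the fixed date becomes due as soon as its first day has passed.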
