Apache Airflow SLA miss - does not send email

I use Apache Airflow and I want it to send me an email when there is an SLA miss. Here is my configuration:
I created a DAG run that misses the SLA for sure.
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 8, 17, 0, 0),
    'retries': 0,
    'sla': timedelta(seconds=15),
    'email': ['myemail@myemail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
}

dag = DAG('sla-email-test',
          default_args=args,
          max_active_runs=1,
          schedule_interval="@daily")

t1 = BashOperator(
    task_id='timeout',
    bash_command='sleep 60',
    retries=0,
    dag=dag,
)
Unfortunately, it did not send any email.
Output
What can be the cause, and can I use the GUI to see logs about it?

If you are using local sendmail, change your Airflow config to match the one below. You should not need smtp_user or smtp_password.
[email]
email_backend = airflow.utils.email.send_email_smtp
[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.email.send_email_smtp function, you have to configure an
# smtp server here
smtp_host = localhost
smtp_starttls = False
smtp_ssl = False
# Uncomment and set the user/pass settings if you want to use SMTP AUTH
#smtp_user = not used
#smtp_password = not used
smtp_port = 25
smtp_mail_from = SendingAlias@Company.com
You can tail the airflow worker to see if it attempts to send the email by using the command: journalctl -u airflow-worker -f
You can also see your sendmail logs by using: cat /var/log/maillog.
This should solve your problem / give you enough information to debug.
Here is my write-up on how we handled this problem when we ran into it: airflow email on failure.
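If you want to sanity-check the SMTP setup independently of the SLA machinery, a minimal sketch is below; run it with python on the host where the scheduler/worker runs, with the config above in place (the recipient address is a placeholder).
# Quick check that Airflow can send mail through the configured [smtp] backend.
from airflow.utils.email import send_email

send_email(
    to=['myemail@myemail.com'],   # placeholder recipient
    subject='Airflow SMTP test',
    html_content='If this arrives, the [smtp] config and sendmail relay work.',
)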

Related

MWAA not finding aws_default connection

I just set up AWS MWAA (managed Airflow) and I'm playing around with running a simple bash script in a DAG. I was reading the logs for the task and noticed that, by default, the task looks for the aws_default connection and tries to use it, but doesn't find it.
I went to the connections pane and set the aws_default connection, but it is still showing the same message in the logs.
Airflow Connection: aws_conn_id=aws_default
No credentials retrieved from Connection
*** Reading remote log from Cloudwatch log_group: airflow-mwaa-Task log_stream: dms-postgres-dialog-label-pg/start-replication-task/2021-11-22T13_00_00+00_00/1.log.
[2021-11-23 13:01:02,487] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,486] {{base_aws.py:368}} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-11-23 13:01:02,657] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,656] {{base_aws.py:179}} INFO - No credentials retrieved from Connection
[2021-11-23 13:01:02,678] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,678] {{base_aws.py:87}} INFO - Creating session with aws_access_key_id=None region_name=us-east-1
[2021-11-23 13:01:02,772] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,772] {{base_aws.py:157}} INFO - role_arn is None
How can I get MWAA to recognize this connection?
My DAG:
from datetime import datetime, timedelta, tzinfo
import pendulum
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

local_tz = pendulum.timezone("America/New_York")
start_date = datetime(2021, 11, 9, 8, tzinfo=local_tz)

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}

with DAG(
    'dms-postgres-dialog-label-pg-test',
    default_args=default_args,
    description='',
    schedule_interval=timedelta(days=1),
    start_date=start_date,
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='start-replication-task',
        bash_command="""
        aws dms start-replication-task --replication-task-arn arn:aws:dms:us-east-1:blah --start-replication-task-type reload-target
        """,
    )

    t1
Edit:
For now, I'm just importing a built-in function and using that to get the credentials. Example:
from airflow.hooks.base import BaseHook
conn = BaseHook.get_connection('aws_service_account')
...
print(conn.host)
print(conn.login)
print(conn.password)
Updating this as I just got off a call with AWS support.
The execution role MWAA creates is used instead of an access key ID and secret in aws_default. To use a custom access key ID and secret, do as @Jonathan Porter recommends in the answer to his own question:
from airflow.hooks.base import BaseHook
conn = BaseHook.get_connection('aws_service_account')
...
print(conn.host)
print(conn.login)
print(conn.password)
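If you go the custom-credentials route, those connection fields can be handed straight to boto3. A minimal sketch, assuming the connection stores the access key ID in the login field and the secret key in the password field (as in the example above), and that 'aws_service_account' exists:
# Build a boto3 session from the custom Airflow connection's fields.
import boto3
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection('aws_service_account')  # connection id from the example above
session = boto3.session.Session(
    aws_access_key_id=conn.login,         # assumes the key id is stored as the login
    aws_secret_access_key=conn.password,  # assumes the secret is stored as the password
    region_name='us-east-1',              # example region
)
print(session.client('sts').get_caller_identity())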
However, if one wants to use the execution role that MWAA provides, that is the default behavior within MWAA. Confusingly, the info messages state that no credentials were retrieved from the connection, yet the execution role is still used, for example by something like the KubernetesPodOperator.
[2021-11-23 13:01:02,487] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,486] {{base_aws.py:368}} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-11-23 13:01:02,657] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,656] {{base_aws.py:179}} INFO - No credentials retrieved from Connection
[2021-11-23 13:01:02,678] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,678] {{base_aws.py:87}} INFO - Creating session with aws_access_key_id=None region_name=us-east-1
[2021-11-23 13:01:02,772] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,772] {{base_aws.py:157}} INFO - role_arn is None
For example, the following automatically uses the .aws/credentials set by the execution role in the MWAA environment:
from datetime import timedelta
from airflow import DAG
from datetime import datetime
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

default_args = {
    'owner': 'aws',
    'depends_on_past': False,
    'start_date': datetime(2019, 2, 20),
    'provide_context': True
}

dag = DAG(
    'kubernetes_pod_example', default_args=default_args, schedule_interval=None
)

# use a kube_config stored in s3 dags folder for now
kube_config_path = '/usr/local/airflow/dags/kube_config.yaml'

podRun = KubernetesPodOperator(
    namespace="mwaa",
    image="ubuntu:18.04",
    cmds=["bash"],
    arguments=["-c", "ls"],
    labels={"foo": "bar"},
    name="mwaa-pod-test",
    task_id="pod-task",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=False,
    config_file=kube_config_path,
    in_cluster=False,
    cluster_context='aws',
    execution_timeout=timedelta(seconds=60)
)
Hope this helps anyone else stumbling around.
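If you want to confirm which identity the execution role actually resolves to, a minimal sketch is below; it assumes boto3 is available in the MWAA environment (it normally is) and uses a hypothetical DAG id.
# Minimal check of which AWS identity the MWAA execution role resolves to.
# boto3 picks up the role credentials from the environment automatically,
# even though the aws_default connection logs "No credentials retrieved".
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_caller_identity():
    # STS reports the account and assumed-role ARN actually in use.
    print(boto3.client("sts").get_caller_identity())

with DAG(
    "mwaa-whoami-test",                  # hypothetical DAG id
    start_date=datetime(2021, 11, 9),
    schedule_interval=None,
) as dag:
    PythonOperator(
        task_id="print-caller-identity",
        python_callable=print_caller_identity,
    )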

Apache Airflow not running command or sending output of command properly?

I'm running Apache Airflow, and whenever I run an aws command, no output is displayed. The command works in the bash shell on the worker. Right now the job just hangs, waiting. Is there something I have to do to send data to Airflow to tell it that the command has completed?
Logs
[2019-10-07 15:48:13,098] {bash_operator.py:91} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=lambda_async
AIRFLOW_CTX_TASK_ID=aws_lambda
AIRFLOW_CTX_EXECUTION_DATE=2019-10-07T15:48:01.310140+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2019-10-07T15:48:01.310140+00:00
[2019-10-07 15:48:13,099] {bash_operator.py:105} INFO - Temporary script location: /tmp/airflowtmpj4r2x2xg/aws_lambda2qu0f2vl
[2019-10-07 15:48:13,099] {bash_operator.py:115} INFO - Running command: aws lambda --region us-east-1 list-functions
[2019-10-07 15:48:13,103] {bash_operator.py:124} INFO - Output:
Code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
import datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2019, 7, 30),
    'email': ['xxxxxxxxx'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
}

dag = DAG('lambda_async', default_args=default_args, schedule_interval=datetime.timedelta(days=1))

t1_command = "aws lambda --region us-east-1 list-functions"

t1 = BashOperator(
    task_id='aws_lambda',
    bash_command=t1_command,
    dag=dag)

t1

Airflow schedule getting skipped if previous task execution takes more time

I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run at 10 minutes past every hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the schedules that fall while the previous run is still executing are run.
For example, say the job at 01:10 takes 10 hours to finish. The schedules at 02:10, 03:10, 04:10, ..., 11:10, etc., which are supposed to run, get skipped, and only the one at 12:10 is executed.
I am using the local executor. I am running the Airflow webserver and scheduler using the scripts below.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}

DAG_ID = 'reconciliation_job_pipeline'
MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'

recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")

dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__

spark_job_end_point = conf['sip_da']['spark_job_end_point']
fetch_index_record_count_config_key = conf['reconciliation']['fetch_index_record_count']

fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={"Content-Type": "application/json"}
)

job_endpoint = conf['sip_da']['job_resource_endpoint']

fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)

fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator and JobStatusSensor are my custom classes extending SimpleHttpOperator and HttpSensor.
If I set depends_on_past to true, will it work as expected? Another problem I have with this option is that sometimes the status check job will fail, but the next schedule should still get triggered. How can I achieve this behavior?
I think the main point here is that you set catchup=False; more detail can be found here. Because of that, the Airflow scheduler will skip those DAG runs, and you see the behavior you mentioned.
It sounds like you need the scheduler to catch up when the previous run took longer than expected. You can try changing it to catchup=True.
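A minimal sketch of the suggested change, reusing the names from the DAG in the question (DAG_ID, default_args, recon_schedule), with only catchup flipped:
# Let the scheduler backfill the runs missed while a previous run was still executing.
dag = DAG(DAG_ID,
          max_active_runs=1,                # still at most one active DAG run at a time
          default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=True)                     # queue the missed schedules instead of skipping them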

HttpError 400 when trying to run DataProcSparkOperator task from a local Airflow

I'm testing out, on a local install of Airflow, a DAG that I used to have running on Google Composer without error. The DAG spins up a Google Dataproc cluster, runs a Spark job (a JAR file located in a GCS bucket), then spins down the cluster.
The DataProcSparkOperator task fails immediately each time with the following error:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://dataproc.googleapis.com/v1beta2/projects//regions/global/jobs:submit?alt=json returned "Invalid resource field value in the request.">
It looks as though the URI is incorrect/incomplete, but I am not sure what is causing it. Below is the meat of my DAG. All the other tasks execute without error, and the only difference is that the DAG is no longer running on Composer:
default_dag_args = {
    'start_date': yesterday,
    'email': models.Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 0,
    'retry_delay': dt.timedelta(seconds=30),
    'project_id': models.Variable.get('gcp_project'),
    'cluster_name': 'susi-bsm-cluster-{{ ds_nodash }}'
}

def slack():
    '''Posts to Slack if the Spark job fails'''
    text = ':x: The DAG *{}* broke and I am not smart enough to fix it. Check the StackDriver and DataProc logs.'.format(DAG_NAME)
    s.post_slack(SLACK_URI, text)

with DAG(DAG_NAME, schedule_interval='@once',
         default_args=default_dag_args) as dag:
    # pylint: disable=no-value-for-parameter
    delete_existing_parquet = bo.BashOperator(
        task_id='delete_existing_parquet',
        bash_command='gsutil rm -r {}/susi/bsm/bsm.parquet'.format(GCS_BUCKET)
    )
    create_dataproc_cluster = dpo.DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        num_workers=num_workers_override or models.Variable.get('default_dataproc_workers'),
        zone=models.Variable.get('gce_zone'),
        init_actions_uris=['gs://cjones-composer-test/susi/susi-bsm-dataproc-init.sh'],
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )
    run_spark_job = dpo.DataProcSparkOperator(
        task_id='run_spark_job',
        main_class=MAIN_CLASS,
        dataproc_spark_jars=[MAIN_JAR],
        arguments=['{}/susi.conf'.format(CONF_DEST), DATE_CONST]
    )
    notify_on_fail = po.PythonOperator(
        task_id='output_to_slack',
        python_callable=slack,
        trigger_rule=trigger_rule.TriggerRule.ONE_FAILED
    )
    delete_dataproc_cluster = dpo.DataprocClusterDeleteOperator(
        task_id='delete_dataproc_cluster',
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )

    delete_existing_parquet >> create_dataproc_cluster >> run_spark_job >> delete_dataproc_cluster >> notify_on_fail
Any assistance with this would be much appreciated!
Unlike the DataprocClusterCreateOperator, the DataProcSparkOperator does not take project_id as a parameter. It gets it from the Airflow connection (if you do not specify the gcp_conn_id parameter, it defaults to google_cloud_default). You have to configure your connection.
The reason you don't see this while running the DAG in Composer is that Composer configures the google_cloud_default connection for you.
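A minimal sketch of the fix, reusing the operator from the DAG above; it assumes the google_cloud_default connection (or a custom one) has its Project Id and credentials filled in via Admin -> Connections, which Composer normally does for you:
run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    gcp_conn_id='google_cloud_default',   # or your own connection id; it must carry the project id
    main_class=MAIN_CLASS,
    dataproc_spark_jars=[MAIN_JAR],
    arguments=['{}/susi.conf'.format(CONF_DEST), DATE_CONST]
)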

PythonOperator with python_callable set gets executed constantly

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from workflow.task import some_task

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['jimin.park1@aig.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'start_date': airflow.utils.dates.days_ago(0)
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('JiminTest', default_args=default_args, schedule_interval='*/1 * * * *', catchup=False)

t1 = PythonOperator(
    task_id='Task1',
    provide_context=True,
    python_callable=some_task,
    dag=dag
)
The actual some_task itself simply appends a timestamp to a file. As you can see in the DAG definition, the task is configured to run every minute.
def some_task(ds, **kwargs):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')
I simply tail -f the output file and started up the webserver without the scheduler running. The function was being called and lines were being appended to the file as the webserver started up. When I start the scheduler, the file gets appended to on every scheduler loop.
What I want is for the function to be executed every minute as intended, not on every scheduler loop.
The scheduler runs each DAG file on every scheduler loop, including all import statements.
Is there anything running code at import time in the file you are importing the function from?
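If workflow/task.py (or anything it imports) does work at module level, that work runs on every parse. A minimal sketch of keeping the side effect inside the callable, based on the some_task shown in the question:
# workflow/task.py (sketch)
import datetime

def some_task(ds, **kwargs):
    # Only executed when the task instance actually runs.
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')

# Avoid module-level calls such as some_task(...) here: this file is imported,
# and therefore executed top to bottom, every time the scheduler or webserver
# parses the DAG file.
if __name__ == "__main__":
    some_task(ds=None)   # safe place for ad-hoc manual testing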
Try checking the scheduler_heartbeat_sec config parameter in your config file. For your case, it should be smaller than 60 seconds.
If you want the scheduler not to catch up on previous runs, set catchup_by_default to False (I am not sure if this is relevant to your question, though).
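For reference, a sketch of where those two settings live in airflow.cfg (the values shown are just examples):
[scheduler]
# Scheduler heartbeat; the answer above suggests keeping this well under the 1-minute schedule.
scheduler_heartbeat_sec = 5
# Default catchup behavior for DAGs that do not set catchup explicitly.
catchup_by_default = False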
Please indicate which Apache Airflow version you are using.
