I have a job with three tasks:
1) Get a token using a POST request
2) Read the token value and store it in a variable
3) Make a GET request, passing the token from step 2 as a bearer token
The problem is that step 3 fails with an HTTP error. I was able to print the token value in step 2 and verified it in the code.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
token = "mytoken"  # defined with some value which will be updated later
get_token = SimpleHttpOperator(
    task_id='get_token',
    method='POST',
    headers={"Authorization": "Basic xxxxxxxxxxxxxxx=="},
    endpoint='/token?username=user&password=pass&grant_type=password',
    http_conn_id='test_http',
    trigger_rule="all_done",
    xcom_push=True,
    dag=dag
)
def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='get_token')
    print("printing token")
    print(value)
    wjdata = json.loads(value)
    print(wjdata['access_token'])
    token = wjdata['access_token']
    print(token)
run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=pull_function,
    dag=dag,
)
get_config = SimpleHttpOperator(
    task_id='get_config',
    method='GET',
    headers={"Authorization": "Bearer " + token},
    endpoint='someendpoint',
    http_conn_id='test_conn',
    trigger_rule="all_done",
    xcom_push=True,
    dag=dag
)
get_token >> run_this >> get_config
Storing the token in a "global" variable won't work. The DAG definition file (the script where you defined the tasks) does not share a runtime context with the execution of each task. Every task can run in a separate thread, process, or even on another machine, depending on the executor. You pass data between tasks not through global variables but through XCom, which you already partially do.
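A minimal plain-Python sketch of part of the problem, with no Airflow involved (the token values are made up for illustration): the header dict is built once, when the file is parsed, and the assignment inside the function only binds a local name.

```python
# Evaluated once, when the DAG file is parsed.
token = "mytoken"
headers = {"Authorization": "Bearer " + token}


def pull_function():
    # Without a `global` statement this binds a new local name;
    # the module-level `token` is left untouched.
    token = "real-token-from-xcom"
    return token


returned = pull_function()

print(token)                     # still the placeholder
print(headers["Authorization"])  # built from the placeholder at parse time
print(returned)                  # the only place the real value is visible
```

Even with a `global` statement the new value would not reliably reach the next task, because that task may run in a different process that parses the file again.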
Try the following:
- remove the global token variable
- in pull_function, instead of print token, do return token - this will push the value to XCom again, so the next task can access it
- access the token from XCom in your next task.
The last step is a bit tricky since you are using the SimpleHttpOperator, and its only templated fields are endpoint and data, but not headers.
For example, if you wanted to pass in some data from the XCom of a previous task, you would do something like this:
get_config = SimpleHttpOperator(
    task_id='get_config',
    endpoint='someendpoint',
    http_conn_id='test_conn',
    dag=dag,
    data='{{ task_instance.xcom_pull(task_ids="print_the_context", key="some_key") }}'
)
But you can't do the same with the headers unfortunately, so you have to either do it "manually" via a PythonOperator, or you could inherit SimpleHttpOperator and create your own, something like:
class HeaderTemplatedHttpOperator(SimpleHttpOperator):
    template_fields = ('endpoint', 'data', 'headers')  # added 'headers'
then use that one, something like:
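get_config = HeaderTemplatedHttpOperator(
    task_id='get_config',
    method='GET',
    endpoint='someendpoint',
    http_conn_id='test_conn',
    dag=dag,
    # headers must stay a dict; only the token inside the value is templated
    headers={"Authorization": "Bearer {{ task_instance.xcom_pull(task_ids='print_the_context') }}"}
)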
Keep in mind I did no testing on this, it's just for the purpose of explaining the concept. Play around with the approach and you should get there.
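For the "manual" route via a PythonOperator, a rough sketch could look like the following. It uses only the standard library, the URL is a hypothetical placeholder, and the request is built but not sent; `build_config_request` and `get_config` are names made up for this illustration.

```python
import urllib.request


def build_config_request(token, url):
    """Build (but do not send) a GET request carrying a bearer token."""
    return urllib.request.Request(
        url,
        method="GET",
        headers={"Authorization": "Bearer " + token},
    )


def get_config(**context):
    # Pull the value returned by the upstream task from XCom.
    token = context["task_instance"].xcom_pull(task_ids="print_the_context")
    # Hypothetical URL; in a real DAG it would come from the connection.
    request = build_config_request(token, "https://example.com/someendpoint")
    # Sending is left out of the sketch, e.g.:
    # with urllib.request.urlopen(request) as response: ...
    return request
```

The callable would be wired up as python_callable=get_config on a PythonOperator placed after print_the_context, in the same way as pull_function in the question.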
I'm trying to add SLA Slack alerts for our Airflow. Our slack_failure_notification works fine, but I've been unable to get Airflow to call sla_miss_callback.
I've seen people in multiple threads say to put sla_miss_callback in the DAG definition and NOT in default_args. If I do this, the DAG runs fine, and my test_task gets inserted into the SLA miss table, but it says notification sent = False and the sla_miss_callback function never seems to trigger. Looking at the task instance details, it has sla = 0:01:00 and has an on_failure_callback attribute but no sla_miss_callback attribute (not sure if it is supposed to have that or not).
def slack_failure_notification(context):
    failed_alert = SlackWebhookOperator(
        task_id='slack_failure_notification',
        http_conn_id='datascience_alerts_slack',
        message="failed task")
    return failed_alert.execute(context=context)
def slack_sla_notification(dag, task_list, blocking_task_list, slas, blocking_tis):
    alert = SlackWebhookOperator(
        task_id='slack_sla_notification',
        http_conn_id='datascience_alerts_slack',
        message="testing SLA function")
    return alert.execute()
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2019, 1, 1),
    'depends_on_past': False,
    'retries': 0,
    'on_failure_callback': slack_failure_notification
}
dag = DAG('template_dag', default_args=default_args, catchup=False, schedule_interval="10 13 * * *", sla_miss_callback=slack_sla_notification)
test_task = SnowflakeOperator(
    task_id='test_task',
    dag=dag,
    snowflake_conn_id='snowflake-static-datascience_airflow',
    sql=somelongrunningcodehere,
    sla=datetime.timedelta(minutes=1)
)
If I instead put sla_miss_callback in default_args, the task still gets put into the sla miss table, and it says notification sent = True, but the sla_miss_callback function still never triggers. I also see nothing in our log files.
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2019, 1, 1),
    'depends_on_past': False,
    'retries': 0,
    'on_failure_callback': slack_failure_notification,
    'sla_miss_callback': slack_sla_notification
}
I have also tried defining the function using def slack_sla_notification(*args, **kwargs): with no change in behavior.
I know the airflow developers say the SLA stuff is a bit of a mess and will be reworked at some point, but I'd love to get this to work in the meantime if anyone has any ideas of things to try.
It looks like you forgot to pass a context argument to the execute method. execute() requires one, and the SLA callback does not receive a context the way on_failure_callback does.
If you just want to see your message, I would suggest something like this:
def slack_sla_notification(dag, task_list, blocking_task_list, slas, blocking_tis):
    message = "testing SLA function"
    alert = SlackWebhookOperator(
        task_id='slack_sla_notification',
        http_conn_id='datascience_alerts_slack',
        message=message)
    # execute() needs one positional argument; the message stands in for the context
    return alert.execute(message)
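To make the alert more useful than a fixed test string, the callback arguments can be turned into the message text. This is only a sketch under assumptions: it treats task_list and blocking_task_list as printable values, and `send` is a hypothetical hook standing in for whatever actually posts to Slack.

```python
import logging

log = logging.getLogger(__name__)


def format_sla_message(dag_id, task_list, blocking_task_list):
    """Build a readable alert text from the values passed to the callback."""
    lines = ["SLA missed in DAG {}".format(dag_id), "Tasks:", str(task_list)]
    if blocking_task_list:
        lines += ["Blocking tasks:", str(blocking_task_list)]
    return "\n".join(lines)


def slack_sla_notification(dag, task_list, blocking_task_list, slas, blocking_tis,
                           send=print):
    # `send` is a hypothetical hook: swap in a function that posts to Slack.
    message = format_sla_message(dag.dag_id, task_list, blocking_task_list)
    try:
        send(message)
    except Exception:
        # Log instead of re-raising so a failure shows up in the logs.
        log.exception("SLA notification failed")
    return message
```

Logging inside the callback gives you something to search for when the function appears never to run.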
Airflow version: 2.0.2
I am trying to create an EMR cluster using data retrieved from AWS Secrets Manager.
I am writing an Airflow DAG, and my task is to get data from this get_secret function and use it in SPARK_STEPS.
def get_secret():
    secret_name = Variable.get("secret_name")
    region_name = Variable.get("region_name")
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(service_name='secretsmanager', region_name=region_name)
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    try:
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        if 'SecretString' in get_secret_value_response:
            secret_str = get_secret_value_response['SecretString']
            secret = json.loads(secret_str)
            airflow_path = secret["airflow_path"]
            return airflow_path
    ...
I need to use "airflow_path" return value in below spark_steps
Spark_steps:
SPARK_STEPS = [
    {
        'Name': 'Spark-Submit Command',
        "ActionOnFailure": "CONTINUE",
        'HadoopJarStep': {
            "Jar": "command-runner.jar",
            "Args": [
                'spark-submit',
                '--py-files',
                's3://'+airflow_path+'-pyspark/pitchbook/config.zip,s3://'+airflow_path+'-pyspark/pitchbook/jobs.zip,s3://'+airflow_path+'-pyspark/pitchbook/DDL.zip',
                's3://'+airflow_path+'-pyspark/pitchbook/main.py'
            ],
        },
    },
]
I read that I need to use XCom. Is that right? Do I need to run this function in a PythonOperator first and then pull the value? Please provide an example, as I am a newbie.
Thanks for your help.
Xi
Yes, if you want to pass dynamic values, XCom push/pull is probably the easiest way.
Use a PythonOperator to push the data into XCom.
See the reference implementations:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
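As a rough sketch of the idea, the step list can be built by a function that takes airflow_path as an argument instead of reading it at module level. `build_spark_steps` is a name made up for this illustration; the step layout is copied from the question.

```python
def build_spark_steps(airflow_path):
    """Return the EMR step definitions for a given bucket prefix."""
    base = "s3://{}-pyspark/pitchbook".format(airflow_path)
    py_files = ",".join(
        base + "/" + name for name in ("config.zip", "jobs.zip", "DDL.zip")
    )
    return [
        {
            "Name": "Spark-Submit Command",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--py-files", py_files, base + "/main.py"],
            },
        }
    ]
```

A PythonOperator callable could then return build_spark_steps(get_secret()); the returned value is pushed to XCom, and a downstream task can pull it from there.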
I'm using Airflow in Google Composer. By default all the tasks in a DAG use the default connection to communicate with Storage, BigQuery, etc. Obviously we can specify another connection configured in Airflow, ie:
task_custom = bigquery.BigQueryInsertJobOperator(
    task_id='task_custom_connection',
    gcp_conn_id='my_gcp_connection',
    configuration={
        "query": {
            "query": 'SELECT 1',
            "useLegacySql": False
        }
    }
)
Is it possible to use a specific connection as the default for all tasks in the entire DAG?
Thanks in advance.
UPDATE:
Specifying gcp_conn_id via default_args in the DAG (as Javier Lopez Tomas recommended) doesn't fully work. Operators that expect gcp_conn_id as a parameter work fine, but in my case most interactions with GCP components unfortunately go through clients or hooks inside PythonOperators.
For example: if I call DataflowHook (inside a function called by a PythonOperator) without specifying the connection, it internally uses "google_cloud_default" and not "gcp_conn_id" :(
def _dummy_func(**context):
    df_hook = DataflowHook()
default_args = {
    'gcp_conn_id': 'my_gcp_connection'
}
with DAG(default_args=default_args) as dag:
    dummy = PythonOperator(python_callable=_dummy_func)
You can use default args:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#default-arguments
In your case it would be:
default_args = {
    "gcp_conn_id": "my_gcp_connection"
}
with DAG(..., default_args=default_args) as dag:
    ...
I'm new to Airflow. Can someone please help me with this? I'm unable to access 'db_conn' inside my custom operator, even though this argument is defined in default_args.
**Dag details:**
default_args = {
    'owner': 'airflow',
    'email': ['example@hotmail.com'],
    'db_conn': 'database_connection'
}
dag = DAG(dag_id='my_custom_operator_dag', description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2020, 8, 6),
          catchup=False,
          default_args=default_args)
operator_task_start = MyOperator(
    task_id='my_custom_operator', dag=dag
)
**Operator details:**
class MyOperator(BaseOperator):
    @apply_defaults
    def __init__(self,
                 *args,
                 **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
    def execute(self, context):
        log.info('owner: %s', self.owner)
        log.info('email: %s', self.email)
        log.info('db_conn: %s', self.db_conn)
        # Error here, AttributeError: 'MyOperator' object has no attribute 'db_conn'
You seem to have misunderstood default_args. default_args is just a shorthand (for brevity and cleaner code) for passing common arguments (those with the same value for every operator in the DAG, such as owner) to all your operators: you set them once as defaults on the DAG itself. Quoting the docstring for the DAG params:
:param default_args: A dictionary of default parameters to be used
as constructor keyword parameters when initialising operators.
Note that operators have the same hook, and precede those defined
here, meaning that if your dict contains `'depends_on_past': True`
here and `'depends_on_past': False` in the operator's call
`default_args`, the actual value will be `False`.
:type default_args: dict
So clearly for default_args to work, any keys that you are passing there should be an argument of your Operator classes.
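The mechanism can be pictured with a simplified plain-Python stand-in. This is not Airflow's actual implementation; `apply_default_args` and both classes are invented for illustration. A default is copied only when the constructor names a matching parameter.

```python
import inspect


def apply_default_args(cls, default_args, **kwargs):
    """Simplified stand-in: copy a default only if the constructor names it."""
    params = inspect.signature(cls.__init__).parameters
    for name, value in default_args.items():
        if name in params and name not in kwargs:
            kwargs[name] = value
    return cls(**kwargs)


class WithoutParam:
    def __init__(self, task_id, **kwargs):
        self.task_id = task_id


class WithParam:
    def __init__(self, task_id, db_conn, **kwargs):
        self.task_id = task_id
        self.db_conn = db_conn


defaults = {"db_conn": "database_connection"}
a = apply_default_args(WithoutParam, defaults, task_id="t1")
b = apply_default_args(WithParam, defaults, task_id="t2")

print(hasattr(a, "db_conn"))  # the class never asked for db_conn
print(b.db_conn)              # filled in from the defaults
```

An explicit keyword argument still wins over the default, which matches the precedence described in the docstring above.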
Not just that, do note that passing invalid (non-existent) arguments to Operator constructor(s) will be penalized in Airflow 2.0 (so better not pass any)
'Invalid arguments were passed to {c} (task_id: {t}). '
'Support for passing such arguments will be dropped in '
'future. ..
Hopefully it is now clear that, to make this work, you must add a db_conn parameter to the constructor of your MyOperator class:
**Operator details:**
class MyOperator(BaseOperator):
    @apply_defaults
    def __init__(self,
                 db_conn: str,
                 *args,
                 **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.db_conn: str = db_conn
And while we are at it, may I offer a suggestion: for something like a connection, prefer the Connection feature offered by Airflow, which eases your interaction with external services:
- makes them manageable (view / edit via the UI)
- secure (they are stored encrypted in the db)
- support for load balancing (define multiple connections with the same conn_id, and Airflow will randomly distribute calls to one of them)
    When there is more than one connection with the same conn_id, the
    get_connection() method on BaseHook will choose one connection
    randomly. This can be used to provide basic load balancing and
    fault tolerance, when used in conjunction with retries.
- nice integration with built-in hooks
    They also use the airflow.models.connection.Connection model to
    retrieve hostnames and authentication information. Hooks keep
    authentication code and information out of pipelines, centralized in
    the metadata database.
I use Apache Airflow and I would like it to send email notifications on SLA misses. I store the email addresses as an Airflow variable, and I have a DAG in which one task sends email using EmailOperator.
Here is the issue: when my send-email task runs, the email goes to all the recipients, but the SLA miss notification is sent only to the first address in the list, which in my example is test1@test.com.
Is this a bug, or why is it not working?
Here's my DAG and Airflow variable:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.email_operator import EmailOperator
from airflow.models import Variable
from airflow.operators.slack_operator import SlackAPIPostOperator
email = Variable.get("test_recipients")
args = {
    'owner': 'airflow'
    , 'depends_on_past': False
    , 'start_date': datetime(2018, 8, 20, 0, 0)
    , 'retries': 0
    , 'email': email
    , 'email_on_failure': True
    , 'email_on_retry': True
    , 'sla': timedelta(seconds=1)
}
dag = DAG('sla-email-test'
          , default_args=args
          , max_active_runs=1
          , schedule_interval="@daily")
....
t2 = EmailOperator(
    dag=dag,
    task_id="send-email",
    to=email,
    subject="Testing",
    html_content="<h3>Welcome to Airflow</h3>"
)
Yes, there is currently a bug in Airflow when it comes to sending the SLA emails: that code path doesn't correctly split a string by "," the way task failure emails do.
The short workaround right now is to make your variable a list (i.e. with a value of ["test1@test.com","test2@test.com"]) and access it like:
email = Variable.get("test_recipients", deserialize_json=True)
That should work in both cases (SLA, and task emails)
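A small standalone sketch of the two shapes involved: a JSON list, which is what deserialize_json=True would hand back, and a comma-separated string normalized by hand. `to_address_list` is a helper name made up for this example, not an Airflow function.

```python
import json
import re


def to_address_list(value):
    """Normalize a recipients value to a list of address strings."""
    if isinstance(value, str):
        # Split on commas or semicolons and drop surrounding whitespace.
        return [part.strip() for part in re.split(r"[,;]", value) if part.strip()]
    return [str(item).strip() for item in value]


# What a JSON-encoded Variable holds, and what deserializing it yields.
stored = '["test1@test.com", "test2@test.com"]'
from_json = json.loads(stored)

# The same recipients when the Variable is a plain comma-separated string.
from_string = to_address_list("test1@test.com, test2@test.com")
```

Either way, the value passed as email ends up as a real Python list with one address per element.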
I don't see this as a bug; I would say it is expected behavior. It happens because you are passing a string "email1@example.com, email2@example.com" to the email argument. You can do the following if you want to use this string as a Python list:
test_recipients = "email1@example.com, email2@example.com"
email = Variable.get("test_recipients")
args = {
    'owner': 'airflow'
    , 'depends_on_past': False
    , 'start_date': datetime(2018, 8, 20, 0, 0)
    , 'retries': 0
    , 'email': [email]  # This wraps the string in a Python list
    , 'email_on_failure': True
    , 'email_on_retry': True
    , 'sla': timedelta(seconds=10)
}
In general, if you want to send email to multiple email addresses, you should always use a list and not string.
Update: I have fixed this in https://github.com/apache/incubator-airflow/pull/3869 . This will be available in Airflow 1.10.1