Airflow - task_instance.try_number is not working

I am trying to send a parameter to an Airflow task in order to identify the last execution.
The following code always sends {"try_number": "1"} as POST data.
Airflow version: 1.10.2
Thanks
xxx = SimpleHttpOperator(
    task_id='XXX',
    endpoint='/backend/XXX',
    http_conn_id='backend_url',
    data=json.dumps({"try_number": "{{ti.try_number}}"}),
    headers={"Content-Type": "application/json"},
    response_check=lambda response: response.json().get('status') == 'ok',
    dag=dag,
)

The problem was with the rendered view:
I was looking at the rendered result instead of the actual value used by the operator.
I pushed the response to a new XCom (xcom_push=True) and now I can see the right value:
{ "status": "ok", "try_number": "15" }

Related

How to retrieve data from a Python function and use it in an EMR operator

Airflow version: 2.0.2
I am trying to create an EMR cluster, retrieving data from AWS Secrets Manager.
I am writing an Airflow DAG, and my task is to get data from this get_secret function and use it in SPARK_STEPS:
import json

import boto3
from airflow.models import Variable

def get_secret():
    secret_name = Variable.get("secret_name")
    region_name = Variable.get("region_name")
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(service_name='secretsmanager', region_name=region_name)
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    try:
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        if 'SecretString' in get_secret_value_response:
            secret_str = get_secret_value_response['SecretString']
            secret = json.loads(secret_str)
            airflow_path = secret["airflow_path"]
            return airflow_path
    ...
I need to use the "airflow_path" return value in the SPARK_STEPS below:
SPARK_STEPS = [
    {
        'Name': 'Spark-Submit Command',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--py-files',
                's3://'+airflow_path+'-pyspark/pitchbook/config.zip,s3://'+airflow_path+'-pyspark/pitchbook/jobs.zip,s3://'+airflow_path+'-pyspark/pitchbook/DDL.zip',
                's3://'+airflow_path+'-pyspark/pitchbook/main.py'
            ],
        },
    },
]
I saw on the internet that I need to use XCom. Is this right? And do I need to run this function in a PythonOperator first and then get the value? Please provide an example, as I am a newbie.
Thanks for your help.
Xi
Yes, if you would like to pass dynamic values around, leveraging XCom push/pull might be easier.
Use a PythonOperator to push the data into XCom; a sketch follows the reference implementations below:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
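And here is a minimal, untested sketch of the idea. The task ids (get_airflow_path, create_emr_cluster) are made up for illustration, and it assumes the get_secret function from the question is importable in the DAG file:

from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator

get_path = PythonOperator(
    task_id='get_airflow_path',
    python_callable=get_secret,  # the return value is pushed to XCom automatically
    dag=dag,
)

# Rendered at runtime: `steps` is a templated field of EmrAddStepsOperator,
# and templating is applied to nested lists/dicts as well.
PATH = "{{ ti.xcom_pull(task_ids='get_airflow_path') }}"

SPARK_STEPS = [
    {
        'Name': 'Spark-Submit Command',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--py-files',
                's3://' + PATH + '-pyspark/pitchbook/config.zip,'
                's3://' + PATH + '-pyspark/pitchbook/jobs.zip,'
                's3://' + PATH + '-pyspark/pitchbook/DDL.zip',
                's3://' + PATH + '-pyspark/pitchbook/main.py',
            ],
        },
    },
]

add_steps = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
    steps=SPARK_STEPS,
    aws_conn_id='aws_default',
    dag=dag,
)

get_path >> add_steps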

Apache Airflow variables on startup

I'm learning Airflow and am planning to set some variables to use across different tasks. These are in my dags folder, saved as configs.json, like so:
{
    "vars": {
        "task1_args": {
            "something": "This is task 1"
        },
        "task2_args": {
            "something": "this is task 2"
        }
    }
}
I get that we can go to Admin --> Variables --> upload the file. But I have two questions:
1) What if I want to adjust some of the variables while Airflow is running? I can adjust my code easily and it updates in real time, but that doesn't seem to work for variables.
2) Is there a way to auto-import this specific file on startup? I don't want to have to add it every time I'm testing my project.
I don't see this mentioned in the docs, but it seems like a pretty trivial thing to want.
What you are looking for is essentially "With code, how do you update an Airflow variable?"
Here's an untested snippet that should help:
from airflow.models import Variable
Variable.set(key="my_key", value="my_value")
So basically you can write a bootstrap Python script to do this setup for you.
In our team, we use such scripts to set up all Connections and Pools too.
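A sketch of such a bootstrap script for the configs.json in the question (untested; the file path is an assumption):

import json

from airflow.models import Variable

CONFIG_FILE = "/usr/local/airflow/dags/configs.json"  # adjust to wherever configs.json lives

with open(CONFIG_FILE) as f:
    config = json.load(f)

for key, value in config["vars"].items():
    # serialize_json=True stores the nested dict as JSON, so it can be read
    # back later with Variable.get(key, deserialize_json=True)
    Variable.set(key=key, value=value, serialize_json=True)

Depending on your Airflow version, the airflow variables CLI (airflow variables import <file> on recent versions) can also import a JSON file of key/value pairs, which covers the auto-import-on-startup part.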
In case you are wondering, here's the set(..) method from source
@classmethod
@provide_session
def set(
    cls,
    key: str,
    value: Any,
    serialize_json: bool = False,
    session: Session = None
):
    """
    Sets a value for an Airflow Variable with a given Key
    :param key: Variable Key
    :param value: Value to set for the Variable
    :param serialize_json: Serialize the value to a JSON string
    :param session: SQL Alchemy Sessions
    """
    if serialize_json:
        stored_value = json.dumps(value, indent=2)
    else:
        stored_value = str(value)

    Variable.delete(key, session=session)
    session.add(Variable(key=key, val=stored_value))
    session.flush()
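For completeness, reading one of those variables back inside a DAG would look roughly like this (assuming it was stored with serialize_json=True, as in the bootstrap sketch above):

from airflow.models import Variable

# Returns the dict stored for this key, e.g. {"something": "This is task 1"}
task1_args = Variable.get("task1_args", deserialize_json=True)
print(task1_args["something"])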

How to pass a bearer token in Airflow

I have a job with 3 tasks:
1) Get a token using a POST request
2) Get the token value and store it in a variable
3) Make a GET request using the token from step 2, passed as a bearer token
The issue is that step 3 is not working and I am getting an HTTP error. I was able to print the value of the token in step 2 and verified it in the code.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

token = "mytoken"  # defined with some value which will be updated later

get_token = SimpleHttpOperator(
    task_id='get_token',
    method='POST',
    headers={"Authorization": "Basic xxxxxxxxxxxxxxx=="},
    endpoint='/token?username=user&password=pass&grant_type=password',
    http_conn_id='test_http',
    trigger_rule="all_done",
    xcom_push=True,
    dag=dag
)

def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='get_token')
    print("printing token")
    print(value)
    wjdata = json.loads(value)
    print(wjdata['access_token'])
    token = wjdata['access_token']
    print(token)

run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=pull_function,
    dag=dag,
)

get_config = SimpleHttpOperator(
    task_id='get_config',
    method='GET',
    headers={"Authorization": "Bearer " + token},
    endpoint='someendpoint',
    http_conn_id='test_conn',
    trigger_rule="all_done",
    xcom_push=True,
    dag=dag
)

get_token >> run_this >> get_config
The way you are storing token as a "global" variable won't work. The Dag definition file (the script where you defined the tasks) is not the same runtime context as the one for executing each task. Every task can be run in a separate thread, process, or even on another machine, depending on the executor. The way you pass data between the tasks is not by global variables, but rather using the XCom - which you already partially do.
Try the following:
- remove the global token variable
- in pull_function, instead of print token, do return token - this will push the value to XCom, so the next task can access it (see the sketch below)
- access the token from XCom in your next task.
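A sketch of the revised callable for the second point, based on the code in the question (untested):

def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='get_token')
    wjdata = json.loads(value)
    token = wjdata['access_token']
    # Returning the value pushes it to XCom (key 'return_value'),
    # so the downstream task can pull it.
    return token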
The last step is a bit tricky since you are using the SimpleHttpOperator, and its only templated fields are endpoint and data, not headers.
For example, if you wanted to pass in some data from the XCom of a previous task, you would do something like this:
get_config = SimpleHttpOperator(
    task_id='get_config',
    endpoint='someendpoint',
    http_conn_id='test_conn',
    dag=dag,
    data='{{ task_instance.xcom_pull(task_ids="print_the_context", key="some_key") }}'
)
But you can't do the same with the headers unfortunately, so you have to either do it "manually" via a PythonOperator, or you could inherit SimpleHttpOperator and create your own, something like:
class HeaderTemplatedHttpOperator(SimpleHttpOperator):
    template_fields = ('endpoint', 'data', 'headers')  # added 'headers'
then use that one, something like:
get_config = HeaderTemplatedHttpOperator(
    task_id='get_config',
    endpoint='someendpoint',
    http_conn_id='test_conn',
    dag=dag,
    headers='{{ task_instance.xcom_pull(task_ids="print_the_context") }}'
)
Keep in mind I did no testing on this, it's just for the purpose of explaining the concept. Play around with the approach and you should get there.

Authorisation error when running Airflow via Cloud Composer

I get an error when trying to run a DAG from Cloud Composer using the GoogleCloudStorageToBigQueryOperator.
The final error was: {'reason': 'invalid', 'location': 'gs://xxxxxx/xxxx.csv',
and when I follow the URL link to the error ...
{
    "error": {
        "code": 401,
        "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
        "errors": [
            {
                "message": "Login Required.",
                "domain": "global",
                "reason": "required",
                "location": "Authorization",
                "locationType": "header"
            }
        ],
        "status": "UNAUTHENTICATED"
    }
}
I have configured the Cloud Storage connection:
Conn Id: My_Cloud_Storage
Conn Type: Google Cloud Platform
Project Id: xxxxxx
Keyfile Path: /home/airflow/gcs/data/xxx.json
Keyfile JSON:
Scopes (comma separated): https://www.googleapis.com/auth/cloud-platform
Code:
from __future__ import print_function

import datetime

from airflow import models
from airflow import DAG
from airflow.operators import bash_operator
from airflow.operators import python_operator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime.datetime(2019, 4, 15),
}

with models.DAG(
        'Ian_gcs_to_BQ_Test',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    load_csv = GoogleCloudStorageToBigQueryOperator(
        task_id='gcs_to_bq_test',
        bucket='xxxxx',
        source_objects=['xxxx.csv'],
        destination_project_dataset_table='xxxx.xxxx.xxxx',
        google_cloud_storage_conn_id='My_Cloud_Storage',
        schema_fields=[
            {'name': 'AAAA', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            {'name': 'BBB_NUMBER', 'type': 'INTEGER', 'mode': 'NULLABLE'},
        ],
        write_disposition='WRITE_TRUNCATE',
        dag=dag)
OK, now it's fixed.
It turns out it wasn't working because of the header row in the file; once I removed that, it worked fine.
Pretty annoying: completely misleading error messages about invalid locations and authorization.
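For what it's worth, instead of stripping the header out of the file itself, GoogleCloudStorageToBigQueryOperator accepts a skip_leading_rows argument, so something like this should have the same effect (untested sketch of the relevant change):

load_csv = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_test',
    bucket='xxxxx',
    source_objects=['xxxx.csv'],
    destination_project_dataset_table='xxxx.xxxx.xxxx',
    google_cloud_storage_conn_id='My_Cloud_Storage',
    skip_leading_rows=1,  # ignore the CSV header row instead of deleting it
    schema_fields=[
        {'name': 'AAAA', 'type': 'INTEGER', 'mode': 'NULLABLE'},
        {'name': 'BBB_NUMBER', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    ],
    write_disposition='WRITE_TRUNCATE',
    dag=dag)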
I had the exact same-looking error. What fixed it for me was adding the location of my dataset to my operator. So first, check the dataset information if you are not sure of the location, then add it as a parameter in your operator. For example, my dataset was in us-west1 and I was using an operator that looked like this:
check1 = BigQueryCheckOperator(
    task_id='check_my_event_data_exists',
    sql="""
        select count(*) > 0
        from my_project.my_dataset.event
    """,
    use_legacy_sql=False,
    location="us-west1")  # THIS WAS THE FIX IN MY CASE
GCP error messages don't seem to be very good.

Firebase Realtime Database currently gives TRIGGER_PAYLOAD_TOO_LARGE error

Since this morning, our Firebase application has a problem when writing data to the Realtime Database instance. Even the simplest task, such as adding one key-value pair to an object, triggers:
Error: TRIGGER_PAYLOAD_TOO_LARGE: This request would cause a function payload exceeding the maximum size allowed.
It is especially strange since nothing in our code or database has changed for more than 24 hours.
Even something as simple as
Database.ref('environments/' + envkey).child('orders/' + orderkey).ref.set({a:1})
triggers the error.
Apparently, the size of the payload is not the problem, but what could be causing this?
Database structure, as requested
environments
  env1
  env2
    orders
      223344
        customer: "Peters"
        country: "NL"
        items
          item1
            code: "a"
            value: "b"
          item2
            code: "x"
            value: "2"
OK, I figured this out. The issue is not related to your write function, but to one of the Cloud Functions that the write action would trigger.
For example, we have a structure like:
/collections/data/abcd/items/a
in JSON:
"collections": {
"data": {
"abc": {
"name": "example Col",
"itemCount": 5,
"items": {
"a": {"name": "a"},
"b": {"name": "b"},
"c": {"name": "c"},
"d": {"name": "d"},
"e": {"name": "e"},
}
}
}
}
Any write into an item was failing, full stop: API, JavaScript, even a basic write in the console.
I decided to look at our cloud functions and found this:
const countItems = (collectionId) => {
  return firebaseAdmin.database().ref(`/collections/data/${collectionId}/items`).once('value')
    .then(snapshot => {
      const items = snapshot.val();
      const filtered = Object.keys(items).filter(key => {
        const item = items[key];
        return (item && !item.trash);
      });
      return firebaseAdmin.database().ref(`/collections/meta/${collectionId}/itemsCount`)
        .set(filtered.length);
    });
};

export const onCollectionItemAdd = functions.database.ref('/collections/data/{collectionId}/items/{itemId}')
  .onCreate((change, context) => {
    const { collectionId } = context.params;
    return countItems(collectionId);
  });
On its own it's nothing, but that trigger reads ALL items, and by default Firebase Cloud Functions sends the entire snapshot to the function even if we don't use it. In fact it sends the previous and new values too, so if you (like us) have a TON of items at that point, my guess is that the payload it tries to send to the Cloud Function is way too big.
I removed the count functions from our CF and boom, back to normal. Not sure of the "correct" way to do the count if we can't have the trigger at all, but I'll update this if we do...
The TRIGGER_PAYLOAD_TOO_LARGE error is part of a new feature Firebase is rolling out, where our existing RTDB limits are being strictly enforced. The reason for the change is to make sure that we aren't silently dropping any Cloud Functions triggers, since any event exceeding those limits can't be sent to Functions.
You can turn this feature off yourself by making this REST call:
curl -X PUT -d "false" https://<namespace>.firebaseio.com/.settings/strictTriggerValidation/.json?auth=<SECRET>
Where <SECRET> is your DB secret
Note that if you disable this, the requests that are currently failing may go through, but any Cloud Functions you have that trigger on the requests exceeding our limits will fail to run. If you are using database triggers for your functions, I would recommend you re-structure your requests so that they stay within the limits.
