Using dag_run.conf in an operator - airflow

I am passing the following as the json
{"name": "john"}
when triggering, with this operator:
do_stuff = DatabricksSubmitRunOperator(
    task_id="do_stuff",
    spark_python_task={"python_file": "…",
                       "parameters": f"{uid}", '\'{{ dag_run.conf["name"] if dag_run else "" }}\''},
    existing_cluster_id=cluster_id
)
I’m getting a syntax error, but I am thinking it’s related to escaping characters… I have not used templating before.

The value of parameters needs to be a list of strings, and the tojson filter handles the quoting of the conf value:

do_stuff = DatabricksSubmitRunOperator(
    task_id="do_stuff",
    spark_python_task={"python_file": "…",
                       "parameters": [f"{uid}", '{{ (dag_run.conf["name"] if dag_run else "") | tojson }}']},
    existing_cluster_id=cluster_id
)
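If you have not used templating before, it can help to see what the Jinja expression renders to on its own. Below is a minimal sketch outside Airflow; FakeDagRun is just a stand-in for the real DagRun object, not an Airflow class:

from jinja2 import Template

class FakeDagRun:
    """Stand-in for Airflow's DagRun, only providing .conf."""
    conf = {"name": "john"}

tmpl = Template('{{ (dag_run.conf["name"] if dag_run else "") | tojson }}')
print(tmpl.render(dag_run=FakeDagRun()))  # -> "john"  (a JSON-quoted string)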

Related

Airflow Broken DAG error during dynamic task creation with variables

I am trying to create dynamic tasks depending on an Airflow variable.
My code is:
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

default_args = {
    'start_date': datetime(year=2021, month=6, day=20),
    'provide_context': True
}

with DAG(
    dag_id='Target_DIF',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:

    iterable_list = Variable.get("num_table")

    for index, table in enumerate(iterable_list):

        read_src1 = PythonOperator(
            task_id=f'read_src_{table}',
            python_callable=read_src,
        )

        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'upload_file_to_directory_bulk_{table}',
            python_callable=upload_file_to_directory_bulk
        )

        write_Snowflake1 = PythonOperator(
            task_id=f'write_Snowflake_{table}',
            python_callable=write_Snowflake
        )

        # TaskGroup level dependencies

        # DAG level dependencies
        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> end
I am facing the below error:
Broken DAG: [/home/dif/airflow/dags/target_dag.py] Traceback (most recent call last):
airflow.exceptions.AirflowException: The key (read_src_[) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
The code works perfectly with this change in the code:
#iterable_list = Variable.get("num_table")
iterable_list = ['inventories', 'products']
Start and End are dummy operators.
The Airflow variable has data as shown in the image.
My expected dynamic workflow:
I am able to achieve the above flow with a list but not with an Airflow variable.
Any leads on finding the cause of the error are appreciated. Thanks.
The Variable.get("num_table") returns string.
thus your loop is actually iterating over the chars of ['inventories, 'ptoducts'] which is why in the first iteration of the loop the task_id=f'read_src_{table}' is read_src_[ and [ is not a valid char for task_id.
You should convert the string into list.
Save your var as: "inventories,ptoducts" and then you can do:
iterable_string = Variable.get("num_table")
iterable_list = iterable_string.split(",")
for index, table in enumerate(iterable_list):
Note that using Variable.get("num_table") as top-level code is a very bad practice!
The problem is that, by default, Airflow reads variables as str. Try using this:
iterable_list = Variable.get("num_table", deserialize_json=True)
I was able to arrive at the solution with the following modifications:
import ast
...
...
iterable_string = Variable.get("num_table",default_var="[]")
iterable_list = ast.literal_eval(iterable_string)
...
Airflow variables are stored as strings.
So my data was stored as "[tab1,tab2]".
So I used literal_eval to convert the string back to a list.
I also added an empty list as the default, so that if no values are present in the variable num_table, I will not process further.

Passing JSON file as string in environment variable for composer airflow from terraform script

I am creating a Cloud Composer environment from Terraform, where I want to pass a JSON as an input variable.
Terraform code:
software_config {
  env_variables {
    AIRFLOW_VAR_MYJSON = "{'__comment1__': 'This the global section', 'project_id':'testproject', 'gce_zone':'us-east1-c', 'gce_region':'us-east1','networkname':'vpc1', 'subnetwork':'https://www.googleapis.com/compute/v1/projects/testproject/regions/us-east1/subnetworks/subnet1'}"
  }
}
I am trying to read the value of AIRFLOW_VAR_MYJSON in the DAG, but it is not working as the value is not recognized as JSON.
I tried converting it and then deserializing it with the following code:
JSONList = Variable.get("MYJSON")
jsonvar = json.dumps(JSONList)
setting_var = Variable.set("settings", jsonvar)
dag_config = Variable.get("settings", deserialize_json=True)
but it is not working.
I have also tried using
dag_config = json.loads(jsonvar)
and then reading the value as
project_id = dag_config["project_id"]
but I get the error: "string indices must be integers"
Please suggest a way to resolve this.
NOTE: I know the gcloud command to set variables from a JSON file, but that is not working in my case as the project is in a VPC and the Kubernetes clusters are giving timeout or handshake errors, so I have ruled out this option.
Valid JSON can only use " for strings, not '. Try switching the quotes.
A value can be a string in double quotes, or a number, or true or false or null, or an object or an array.
software_config {
  env_variables {
    AIRFLOW_VAR_MYJSON = "{\"__comment1__\": \"This the global section\", \"project_id\":\"testproject\", \"gce_zone\":\"us-east1-c\", \"gce_region\":\"us-east1\",\"networkname\":\"vpc1\", \"subnetwork\":\"https://www.googleapis.com/compute/v1/projects/testproject/regions/us-east1/subnetworks/subnet1\"}"
  }
}
Or a little nicer way:
software_config {
  env_variables {
    AIRFLOW_VAR_MYJSON = jsonencode({
      "__comment1__" = "This the global section",
      "project_id"   = "testproject",
      "gce_zone"     = "us-east1-c",
      "gce_region"   = "us-east1",
      "networkname"  = "vpc1",
      "subnetwork"   = "https://www.googleapis.com/compute/v1/projects/testproject/regions/us-east1/subnetworks/subnet1",
    })
  }
}
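On the DAG side, the environment variable AIRFLOW_VAR_MYJSON is exposed as the Airflow variable MYJSON and, since the value is now valid JSON, it can be deserialized directly. A minimal sketch (key names taken from the Terraform block above):

from airflow.models import Variable

# AIRFLOW_VAR_MYJSON is read as the Airflow variable "MYJSON"
dag_config = Variable.get("MYJSON", deserialize_json=True)

project_id = dag_config["project_id"]  # "testproject"
gce_zone = dag_config["gce_zone"]      # "us-east1-c"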

Correctly suppress comments

I want to filter out comments starting with a hash (#) from a text file before I run a larger parser over it.
For this I make use of suppress, as mentioned here.
pythonStyleComment does not work because it does not respect quoted strings and removes matching text inside them. A hash in a quoted string is not a comment; it is part of the string and should therefore be preserved.
Here is the pytest I have already implemented to test the expected behavior.
def test_filter_comment():
    teststrings = [
        '# this is comment', 'Option "sadsadlsad#this is not a comment"'
    ]
    expected = ['', 'Option "sadsadlsad#this is not a comment"']
    for i, teststring in enumerate(teststrings):
        result = filter_comments.transformString(teststring)
        assert result == expected[i]
My current implementation breaks somewhere in pyparsing. I am probably doing something that was not intended:
filter_comments = Regex(r"#.*")
filter_comments = filter_comments.suppress()
filter_comments = filter_comments.ignore(QuotedString)
fails with:
*****/lib/python3.7/site-packages/pyparsing.py:4480: in ignore
super(ParseElementEnhance, self).ignore(other)
*****/lib/python3.7/site-packages/pyparsing.py:2489: in ignore
self.ignoreExprs.append(Suppress(other.copy()))
E TypeError: copy() missing 1 required positional argument: 'self'
Any help on how to ignore comments correctly would be appreciated.
Ah, I was so close. I of course have to properly instantiate the QuotedString class. The following works as expected:
filter_comments = Regex(r"#.*")
filter_comments = filter_comments.suppress()
qs = QuotedString('"') | QuotedString("'")
filter_comments = filter_comments.ignore(qs)
Here are some more tests.
def test_filter_comment():
    teststrings = [
        '# this is comment', 'Option "sadsadlsad#this is not a comment"',
        "Option 'sadsadlsad#this is not a comment'",
        "Option 'sadsadlsad'#this is a comment"
    ]
    expected = [
        '', 'Option "sadsadlsad#this is not a comment"',
        "Option 'sadsadlsad#this is not a comment'",
        "Option 'sadsadlsad'"
    ]
    for i, teststring in enumerate(teststrings):
        result = filter_comments.transformString(teststring)
        assert result == expected[i]
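Putting the accepted fix together as one self-contained snippet (pyparsing 2.x names, as used in the question):

from pyparsing import Regex, QuotedString

filter_comments = Regex(r"#.*").suppress()
# Quoted strings are scanned over and left untouched, so a '#' inside
# quotes is not treated as the start of a comment.
filter_comments.ignore(QuotedString('"') | QuotedString("'"))

print(filter_comments.transformString('Option "keep#this"  # drop this'))
# the hash inside quotes survives; the trailing comment is stripped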
The regex you're using is not correct.
I think you meant:
^\#.*
or
^(?:.*)\#.*

Airflow: How to push xcom value from BigQueryOperator?

This is my operator:
bigquery_check_op = BigQueryOperator(
task_id='bigquery_check',
bql=SQL_QUERY,
use_legacy_sql = False,
bigquery_conn_id=CONNECTION_ID,
trigger_rule='all_success',
xcom_push=True,
dag=dag
)
When I check the Rendered page in the UI, nothing appears there.
When I run the SQL in the console it returns the value 1400, which is correct.
Why doesn't the operator push the XCom?
I can't use BigQueryValueCheckOperator. That operator is designed to FAIL when the value check does not pass. I don't want anything to fail; I simply want to branch the code based on the return value from the query.
Here is how you might be able to accomplish this with the BigQueryHook and the BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def big_query_check(**context):
    sql = context['templates_dict']['sql']
    bq = BigQueryHook(bigquery_conn_id='default_gcp_connection_id',
                      use_legacy_sql=False)
    conn = bq.get_conn()
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchone()
    # Do something with the result, return the task_id to branch to
    if result[0] == 0:
        return "task_a"
    else:
        return "task_b"

sql = "SELECT COUNT(*) FROM sales"

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=big_query_check,
    provide_context=True,
    templates_dict={"sql": sql},
    dag=dag,
)
First we create a Python callable that we can use to execute the query and select which task_id to branch to. Second, we create the BranchPythonOperator.
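A minimal sketch of wiring the branch downstream (the DummyOperator targets are illustrative; only the task_ids task_a and task_b come from the callable above):

from airflow.operators.dummy_operator import DummyOperator

task_a = DummyOperator(task_id="task_a", dag=dag)
task_b = DummyOperator(task_id="task_b", dag=dag)

# BranchPythonOperator returns the task_id to follow; the other branch is skipped.
branching >> [task_a, task_b]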
The simplest answer is that xcom_push is not one of the params of BigQueryOperator, BaseOperator, or LoggingMixin.
The BigQueryGetDataOperator does return (and thus push) some data, but it works by table and column name. You could chain this behavior by making the query you run write its output to a uniquely named table (maybe use {{ ds_nodash }} in the name), using that table as the source for this operator, and then branching with the BranchPythonOperator.
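A hedged sketch of that chaining idea (the dataset and table names here are made up for illustration, and it assumes the contrib BigQueryGetDataOperator available in that Airflow version):

from airflow.contrib.operators.bigquery_get_data import BigQueryGetDataOperator

get_check_value = BigQueryGetDataOperator(
    task_id='get_check_value',
    dataset_id='staging',                       # assumption: dataset the query writes to
    table_id='bigquery_check_{{ ds_nodash }}',  # uniquely named destination table
    max_results=1,
    bigquery_conn_id=CONNECTION_ID,
    dag=dag,
)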
You might instead try to use the BigQueryHook's get_conn().cursor() to run the query and work with some data inside the BranchPythonOperator.
Elsewhere we chatted and came up with something along the lines of this for putting in the callable of a BranchPythonOperator:
cursor = BigQueryHook(bigquery_conn_id='connection_name').get_conn().cursor()
# one of these two:
cursor.execute(SQL_QUERY) # if non-legacy
cursor.job_id = cursor.run_query(bql=SQL_QUERY, use_legacy_sql=False) # if legacy
result = cursor.fetchone()
return "task_one" if result[0] == 1400 else "task_two"  # depends on results format

How to handle failed XPATH lookup in MSXML from AutoIT?

I am parsing a piece of XML returned from a Web API. I am looking for a particular node. If that node does not exist, according to the MSXML documentation, it returns null.
The problem is, I don't know how to check for null in AutoIT. I have read the online API doc for Null, but when I run the script using AutoIt3Wrapper v.2.1.2.9, it does not recognize null.
Here is a sample script to show what I mean:
$oXMLDOM = ObjCreate("Msxml2.DOMDocument.3.0")
$xml = '<response><error code="1"><![CDATA[ Incorrect password or username ]]></error></response>'
$oXMLDOM.loadXML($xml)
$node = $oXMLDOM.selectSingleNode("/response/error")
MsgBox(0, "", $node.text) ;; No problems
$node = $oXMLDOM.selectSingleNode("/response/token")
;; $node should be 'null' now; how do I check that in AutoIT?
MsgBox(0, "", $node.text) ;; Fails horribly
You could use IsObj() to test if a valid object was returned:
If Not IsObj($oNode) Then
MsgBox(0, 'ERROR', 'Node is invalid!')
EndIf
I have kind of found a quick workaround for my problem.
By using ObjName(), I can check the name of the COM object returned, which is IXMLDOMElement if it is successful:
If ObjName($node) = "IXMLDOMElement" Then
MsgBox(0, "", "Success")
Else
MsgBox(0, "", "Failure")
EndIf
