Using xcom_pull in a FileSensor filepath argument? - airflow

I am really new to Airflow, so please forgive me if this is a dim question. I searched Stack Overflow unsuccessfully for a similar question.
I have a downstream task that waits for a file to download. I'd like to abstract the hardcoded filepath and instead retrieve the path, which is stored in an XCom.
t2 = FileSensor(
    task_id='waiting_for_file_download',
    poke_interval=60 * 5,
    timeout=60 * 10,
    mode='reschedule',
    filepath={{ ti.xcom_pull(task_ids='downloaded_file', key='file_path') }} + 'transformed' + '_new.csv.gz'
)
Is this possible?
Reading the official FileSensor documentation did not really help me as a newcomer.
I see there are two additional fields: 1) template_fields and 2) fs_conn_id.
UPDATE
Reading the docs I can see an XCom.get_one(); however, this is not working either:
filepath = XCom.get_one(
    execution_date=date.today(),
    dag_id='My_DAG',
    task_id='downloaded_file',
    key='file_path'
)
I see that other users use this in conjunction with **context; however, I do not know how you can use the context within the FileSensor().

Since filepath is declared as a templated field in the FileSensor class, it's possible to use Jinja templating and perform the xcom_pull() at runtime.
I think you were only missing the fact that the Jinja syntax goes within a string; try this:
filepath = "{{ ti.xcom_pull(task_ids = 'downloaded_file', key = 'file_path') }}" + 'transformed' + '_new.csv.gz'
Let me know if that worked for you.
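For context, here is a minimal sketch of how the two tasks could fit together. It assumes Airflow 1.x-style imports; the upstream callable, the pushed path value, and the DAG arguments are purely illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor  # import path varies by Airflow version

def _download_file(**context):
    # Push the base path of the downloaded file so the sensor can build on it.
    # The value here is just a placeholder.
    context['ti'].xcom_push(key='file_path', value='/tmp/downloads/my_file_')

with DAG('file_download_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    t1 = PythonOperator(
        task_id='downloaded_file',
        python_callable=_download_file,
        provide_context=True,
    )
    t2 = FileSensor(
        task_id='waiting_for_file_download',
        poke_interval=60 * 5,
        timeout=60 * 10,
        mode='reschedule',
        # The whole expression is an ordinary Python string; Jinja fills in the
        # XCom value when the task runs.
        filepath="{{ ti.xcom_pull(task_ids='downloaded_file', key='file_path') }}" + 'transformed' + '_new.csv.gz',
    )
    t1 >> t2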

Related

Creating dynamic tasks in Airflow (in Composer) based on BigQuery response

I am trying to create an Airflow DAG which generates tasks depending on the response from the server.
Here is my approach:
get the list of tables from BigQuery -> loop through the list and create tasks
This is my latest code, and I have tried all the suggestions I could find on Stack Overflow. Nothing seems to work. What am I doing wrong?
with models.DAG(dag_id="xt", default_args=default_args, schedule_interval="0 1 * * *", catchup=True) as dag:
    tables = get_tables_from_bq()
    bridge = DummyOperator(
        task_id='bridge',
        dag=dag
    )
    for t in tables:
        sql = ("SELECT * FROM `{project}.{dataset}.{table}` LIMIT 5;".format(
            project=project, dataset=dataset, table=t))
        materialize_t = BigQueryOperator(bql=sql,
            destination_dataset_table=dataset + '.' + table_prefix + t,
            task_id='x_' + t,
            bigquery_conn_id='bigquery_default',
            use_legacy_sql=False,
            write_disposition='WRITE_APPEND',
            create_disposition='CREATE_IF_NEEDED',
            query_params={},
            allow_large_results=True,
            dag=dag)
bridge >> materialize_t
Even the run option is not showing with this code. I have tried multiple versions and finally reached this one, but still no luck. Any help?
I don't know if it is a typo in the copy and paste of the DAG, but tables = get_tables_from_bq() should be before with models.DAG(...). Also, bridge >> materialize_t seems to be missing indentation and is therefore outside the with models.DAG(...) scope. On a side note, you do not need the bridge task.
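A minimal sketch of the fixed layout, reusing the question's names (get_tables_from_bq(), project, dataset, table_prefix, and default_args are assumed to exist as in the question):

tables = get_tables_from_bq()

with models.DAG(dag_id="xt", default_args=default_args, schedule_interval="0 1 * * *", catchup=True) as dag:
    for t in tables:
        sql = "SELECT * FROM `{project}.{dataset}.{table}` LIMIT 5;".format(
            project=project, dataset=dataset, table=t)
        # One task per table; no bridge task is needed.
        materialize_t = BigQueryOperator(
            bql=sql,
            destination_dataset_table=dataset + '.' + table_prefix + t,
            task_id='x_' + t,
            bigquery_conn_id='bigquery_default',
            use_legacy_sql=False,
            write_disposition='WRITE_APPEND',
            create_disposition='CREATE_IF_NEEDED',
            allow_large_results=True,
        )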

How to create operators from list in Airflow?

I need to copy tables from MySQL to BigQuery daily.
My workflow is:
MySqlToGoogleCloudStorageOperator
GoogleCloudStorageToBigQueryOperator
This works for a single process (say Categories).
Example:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
...
import_categories_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_categories',
    mysql_conn_id='c_mysql',
    google_cloud_storage_conn_id='gcp_a',
    approx_max_file_size_bytes=100000000,  # 100MB per file
    sql='import_categories.sql',
    bucket=GCS_BUCKET_ID,
    filename=file_name_categories,
    dag=dag)

gcs_to_bigquery_categories_op = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_categories_to_BigQuery',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_categories,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[uri_template_categories_read_from],
    schema_fields=Categories(),
    src_fmt_configs={'ignoreUnknownValues': True},
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    skip_leading_rows=1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID)

import_categories_op >> gcs_to_bigquery_categories_op
Now, say I want to scale it up and have it work with 20 more tables. Is there a way to do it without writing the same code 20 times?
I'm looking for a way to do something like:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
....
BQ_TABLE_NAME_ORDERS = Variable.get("tables_orders")
list = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS, BQ_TABLE_NAME_ORDERS]
for item in list:
    GENERATE THE OPERATORS PER TABLE
so that it will create import_categories_op, import_products_op, import_orders_op, etc.
Yes, in fact it's exactly what you described. Simply instantiate your operators in your for loop. Make sure your task ids are unique and you're set:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
list = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS]

for table in list:
    import_op = MySqlToGoogleCloudStorageOperator(
        task_id=f'import_{table}',
        mysql_conn_id='c_mysql',
        google_cloud_storage_conn_id='gcp_a',
        approx_max_file_size_bytes=100000000,  # 100MB per file
        sql=f'import_{table}.sql',
        bucket=GCS_BUCKET_ID,
        filename=file_name,
        dag=dag)
    gcs_to_bigquery_op = GoogleCloudStorageToBigQueryOperator(
        dag=dag,
        task_id=f'load_{table}_to_BigQuery',
        bucket=GCS_BUCKET_ID,
        destination_project_dataset_table=table_name_template,
        source_format='NEWLINE_DELIMITED_JSON',
        source_objects=[uri_template_read_from],
        schema_fields=Categories(),
        src_fmt_configs={'ignoreUnknownValues': True},
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        skip_leading_rows=1,
        google_cloud_storage_conn_id=CONNECTION_ID,
        bigquery_conn_id=CONNECTION_ID)

    import_op >> gcs_to_bigquery_op
You can simplify this if you store all tables in a single variable:
# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')
for table in BQ_TABLES:
    ...
Edit: Task references vs IDs
Luis asked about how only the task IDs need to change (and not the references to the tasks). Actually, you don't even need to refer to your tasks for anything except adding details to them after creation (like upstream and downstream dependencies), because they are stored in the DAG object on creation, and that is all the DAG parser looks for. Once the DAG parser finds a DAG object in the global scope, it uses it. It doesn't know what names the tasks were bound to in the global scope; it only knows that those tasks are listed on the DAG object and that they list each other as upstream or downstream.
I would have made this a comment on that answer, but I wanted to show the following code to explain what I mean a bit more clearly (here I use with DAG to avoid assigning each task to the dag, the bitshift operators for upstream/downstream assignment to avoid needing to refer to the tasks by name, and Python 3's f-strings):
# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')

with DAG('…dag_id…', …) as dag:
    for table in BQ_TABLES:
        MySqlToGoogleCloudStorageOperator(
            task_id=f'import_{table}',
            sql=f'import_{table}.sql',
            …  # all params except, notably, there's no `dag=dag` in here.
        ) >> GoogleCloudStorageToBigQueryOperator(  # Yup, …
            task_id=f'load_{table}_to_BigQuery',
            …  # again, all but `dag=dag` in here.
        )
Sure, it could have been t1=…; t2=…; t1>>t2; … but why name references?

How to trigger operator inside Python function using Airflow?

I have the following code:
def chunck_import(**kwargs):
    ...
    for i in range(1, num_pages + 1):
        start = lower + chunks * i
        end = start + chunks
        if i > 1:
            start = start + 1
        logging.info(start, end)
        if end > max_current:
            end = max_current
        where = 'where orders_id between {0} and {1}'.format(start, end)
        logging.info(where)
        import_orders_products_op = MySqlToGoogleCloudStorageOperator(
            task_id='import_orders_and_upload_to_storage_orders_products_{}'.format(i),
            mysql_conn_id='mysql_con',
            google_cloud_storage_conn_id='gcp_con',
            provide_context=True,
            approx_max_file_size_bytes=100000000,  # 100MB per file
            sql='import_orders.sql',
            params={'WHERE': where},
            bucket=GCS_BUCKET_ID,
            filename=file_name_orders_products,
            dag=dag)

start_task_op = DummyOperator(task_id='start_task', dag=dag)

chunck_import_op = PythonOperator(
    task_id='chunck_import',
    provide_context=True,
    python_callable=chunck_import,
    dag=dag)

start_task_op >> chunck_import_op
This code uses a PythonOperator to calculate how many runs of the MySqlToGoogleCloudStorageOperator I need and to build the WHERE clause of the SQL; then it needs to execute them.
The problem is that the MySqlToGoogleCloudStorageOperator isn't being executed.
I can't actually do
chunck_import_op >> import_orders_products_op
How can I make the MySqlToGoogleCloudStorageOperator be executed inside the PythonOperator?
I think at the end of your for loop you'll want to call import_orders_products_op.execute(context=kwargs), possibly preceded by import_orders_products_op.pre_execute(context=kwargs). This is a bit complicated in that it skips the render_templates() call of the task instance. Alternatively, if you instead made a TaskInstance to put each of these tasks in, you could call run or _raw_run_task, but both of those require information from the dag run (which you can get in the Python callable's context, e.g. kwargs['dag_run']).
Looking at what you've passed to the operators, it looks like, as is, you'll need the templating step to load the import_orders.sql file and fill in the WHERE parameter. Alternatively, it's fine within the callable itself to load the file into a string, replace the {{ params.WHERE }} part (and any others) manually without Jinja2 (or you could spend time figuring out the right Jinja2 calls), and then set import_orders_products_op.sql = the_string_you_loaded before calling import_orders_products_op.pre_execute(context=kwargs) and import_orders_products_op.execute(context=kwargs).
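A rough sketch of that second approach inside the callable; the location of import_orders.sql and the exact placeholder replacement are assumptions (adjust to wherever your SQL template actually lives), and the rest reuses the question's names:

def chunck_import(**kwargs):
    ...
    for i in range(1, num_pages + 1):
        ...
        where = 'where orders_id between {0} and {1}'.format(start, end)

        # Render the SQL ourselves, because calling execute() directly skips
        # Airflow's templating step (render_templates()).
        with open('import_orders.sql') as f:
            sql_text = f.read().replace('{{ params.WHERE }}', where)

        import_orders_products_op = MySqlToGoogleCloudStorageOperator(
            task_id='import_orders_and_upload_to_storage_orders_products_{}'.format(i),
            mysql_conn_id='mysql_con',
            google_cloud_storage_conn_id='gcp_con',
            approx_max_file_size_bytes=100000000,  # 100MB per file
            sql=sql_text,  # already-rendered SQL string, no template lookup needed
            bucket=GCS_BUCKET_ID,
            filename=file_name_orders_products,
            dag=dag)

        import_orders_products_op.pre_execute(context=kwargs)
        import_orders_products_op.execute(context=kwargs)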

XQuery (saxon) failing with a schema (XPath works)

In Saxon, I switched from XPath to XQuery, and on the selects where I have a schema I'm getting the error message:
A typed input document can only be used with a schema-aware query
My setup is:
InputSource xmlSource = new InputSource(xmlData);
SAXSource saxSource = new SAXSource(reader, xmlSource);
Source schemaSource = new StreamSource(schemaFile);
Configuration config = createEnterpriseConfiguration();
config.addSchemaSource(schemaSource);
Processor processor = new Processor(config);
SchemaValidator validator = new SchemaValidatorImpl(processor);
DocumentBuilder doc_builder = processor.newDocumentBuilder();
if (!preserveWhiteSpace)
    doc_builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL);
doc_builder.setSchemaValidator(validator);
XdmNode root_node = doc_builder.build(saxSource);
XQueryCompiler compiler = processor.newXQueryCompiler();
Is there something additional I need to do on queries where there is a schema?
thanks - dave
Call XQueryCompiler.setSchemaAware(true);
This isn't the default because it's good for the optimizer to know whether the data is likely to be typed or untyped, and it's inefficient to generate schema-aware code if the data is untyped (conversely, when the data is typed, schema-aware code is typically faster, though the savings can be eaten up by the extra cost of validating the input).

Revit Python Shell - Change Parameter Group

I'm trying to write a quick script to open a family document, change the parameter group of two specified parameters, and then close and save the document. I've done multiple tests, and I am able to change the parameter groups of the specified parameters, but the group changes don't save back to the family file. When I open the newly saved family, the parameter groups revert to their original group.
This is with Revit 2017.2.
The same script, when run in RPS in Revit 2018 will do as desired.
import clr
import os
clr.AddReference('RevitAPI')
clr.AddReference('RevitAPIUI')
from Autodesk.Revit.DB import *
from Autodesk.Revit.UI import UIApplication
from System.IO import Directory, SearchOption

searchstring = "*.rfa"
dir = r"C:\Users\dboghean\Desktop\vanity\2017"
docs = []

if Directory.Exists(dir):
    files = Directory.GetFiles(dir, searchstring, SearchOption.AllDirectories)
    for f in files:
        name, extension = os.path.splitext(f)
        name2, extension2 = os.path.splitext(name)
        if extension2:
            os.remove(f)
        else:
            docs.append(f)
else:
    print("Directory does not exist")

doc = __revit__.ActiveUIDocument.Document
app = __revit__.Application
uiapp = UIApplication(app)
currentPath = doc.PathName
pgGroup = BuiltInParameterGroup.PG_GRAPHICS

for i in docs:
    doc = app.OpenDocumentFile(i)
    paramList = [i for i in doc.FamilyManager.Parameters]
    t = Transaction(doc, "test")
    t.Start()
    for i in paramList:
        if i.Definition.Name in ["Right Sidesplash Edge line", "Left Sidesplash Edge line"]:
            i.Definition.ParameterGroup = pgGroup
    t.Commit()
    doc.Close(True)
Any ideas?
Thanks!
I can confirm that this happens in Revit 2017. Strange!
A simple way around it is to arbitrarily rename the parameter using doc.FamilyManager.RenameParameter, then rename it back to the original name.
So in your case this would be three additional lines of code after changing the Parameter group:
originalName = i.Definition.Name
doc.FamilyManager.RenameParameter(i, "temp")
doc.FamilyManager.RenameParameter(i, originalName)
Doesn't get to the root of the problem, but works around it.
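For clarity, here is roughly where those lines would sit inside the original loop (a sketch; the temporary name "temp" is arbitrary):

for i in paramList:
    if i.Definition.Name in ["Right Sidesplash Edge line", "Left Sidesplash Edge line"]:
        i.Definition.ParameterGroup = pgGroup
        # Revit 2017 workaround: rename the parameter away and back so the
        # group change is persisted when the family is saved.
        originalName = i.Definition.Name
        doc.FamilyManager.RenameParameter(i, "temp")
        doc.FamilyManager.RenameParameter(i, originalName)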
