I need to create a DAG that deletes and updates a few different tables. The updates happen by region. The database I work with does a table lock when doing any deletes or updates, so I would need to structure my dag like below, so that I avoid trying to update the same table at the same time.
--> equals dependent on
Florida_table_1 --> Carolina_table_1 --> Texas_table_1
Florida_table_2 --> Carolina_table_2 --> Texas_table_2
Florida_table_3 --> Carolina_table_3 --> Texas_table_3
Worst comes to worst, I can write out all the tasks separately, but I was wondering whether there is a smart way to do it dynamically.
I would do something like the following:
list_of_states = ["Alabama", "Alaska", "Arizona" ...] # I forgot the song...
def state_task(which_state):
    print(f"Working on {which_state}!")
    [...]
with DAG(dag_id="states_process", ...) as dag:
    prior_task = the_start = DummyOperator(task_id="the_start")
    for which_state in list_of_states:
        prior_task = prior_task >> PythonOperator(
            task_id=f"{which_state}_task",
            python_callable=state_task,
            op_args=(which_state,)
        )
This is off the top of my head, but the concept is to leverage Airflow's >> syntax, which declares the upstream relationship and also returns the downstream task. We save that task and use it as the upstream of the next one: prior_task = prior_task >> PythonOperator(...)
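To connect this back to the table-by-region layout in the question, here is a minimal sketch that runs without Airflow. The Task class is a hypothetical stand-in that only mimics the one behaviour the answer relies on (>> records the dependency and returns the right-hand task); the region and table names are taken from the question.

```python
class Task:
    """Hypothetical stand-in for an Airflow operator; it only records edges."""
    edges = []  # (upstream_id, downstream_id) pairs shared by all tasks

    def __init__(self, task_id):
        self.task_id = task_id

    def __rshift__(self, other):
        # Record the dependency, then hand back the right-hand task
        # so it can become the upstream of the next one.
        Task.edges.append((self.task_id, other.task_id))
        return other


regions = ["Florida", "Carolina", "Texas"]
tables = ["table_1", "table_2", "table_3"]

start = Task("the_start")
for table in tables:
    prior_task = start  # every table gets its own chain hanging off the start
    for region in regions:
        prior_task = prior_task >> Task(f"{region}_{table}")
```

Each table ends up with its own Florida, Carolina, Texas chain, so two regions never touch the same table at once, while the three tables can still proceed in parallel.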
I am trying to create dynamic tasks depending on an Airflow variable.
My code is:
default_args = {
    'start_date': datetime(year=2021, month=6, day=20),
    'provide_context': True
}
with DAG(
    dag_id='Target_DIF',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:
    iterable_list = Variable.get("num_table")
    for index, table in enumerate(iterable_list):
        read_src1 = PythonOperator(
            task_id=f'read_src_{table}',
            python_callable=read_src,
        )
        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'upload_file_to_directory_bulk_{table}',
            python_callable=upload_file_to_directory_bulk
        )
        write_Snowflake1 = PythonOperator(
            task_id=f'write_Snowflake_{table}',
            python_callable=write_Snowflake
        )
        # TaskGroup level dependencies
        # DAG level dependencies
        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> end
I am facing the below error :
Broken DAG: [/home/dif/airflow/dags/target_dag.py] Traceback (most recent call last):
airflow.exceptions.AirflowException: The key (read_src_[) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
The code works perfectly with the following change:
#iterable_list = Variable.get("num_table")
iterable_list = ['inventories', 'products']
Start and End are dummy operators.
Airflow variable has data as shown in the image.
My expected dynamic workflow:
I am able to achieve the above flow with a list but not with an Airflow variable.
Any leads on the cause of the error are appreciated. Thanks.
Variable.get("num_table") returns a string.
Thus your loop is actually iterating over the characters of the string "['inventories', 'products']", which is why in the first iteration of the loop task_id=f'read_src_{table}' becomes read_src_[, and [ is not a valid character for a task_id.
You should convert the string into a list.
Save your variable as "inventories,products" and then you can do:
iterable_string = Variable.get("num_table")
iterable_list = iterable_string.split(",")
for index, table in enumerate(iterable_list):
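To see why the first task_id ends in [, here is a small sketch in plain Python with no Airflow involved. The string values are assumed stand-ins for what Variable.get would return in each case.

```python
# What the variable returns when a Python-style list was typed into the UI:
raw = "['inventories', 'products']"

# Iterating over a string walks its characters, so the first "table" is "[".
first_item = next(iter(raw))

# Storing a plain comma-separated value makes the split unambiguous.
stored = "inventories,products"
iterable_list = stored.split(",")
task_ids = [f"read_src_{table}" for table in iterable_list]
```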
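You should note that calling Variable.get("num_table") in top-level code is a very bad practice: top-level code runs every time the DAG file is parsed, so each parse queries the metadata database.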
The problem is that by default, Airflow reads the variables as str. Try using this:
iterable_list = Variable.get("num_table", deserialize_json=True)
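As a rough illustration of what JSON deserialization expects, the sketch below uses json.loads directly as a stand-in for that step. The stored values are assumptions; the point is that the variable has to hold valid JSON, which means double-quoted strings.

```python
import json

# Assumed variable contents: a JSON array with double-quoted strings.
stored_value = '["inventories", "products"]'
iterable_list = json.loads(stored_value)

# A Python-style literal with single quotes is not valid JSON.
try:
    json.loads("['inventories', 'products']")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
```

If the variable was saved with single quotes, it would need to be re-saved with double quotes before this approach works.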
I was able to arrive at a solution with the following modifications:
import ast
...
...
iterable_string = Variable.get("num_table", default_var="[]")
iterable_list = ast.literal_eval(iterable_string)
...
Airflow variables are stored as strings.
So my data was stored as "[tab1,tab2]".
So I used literal_eval to convert the string back to a list.
I have also added an empty list as default so that if no values are present in the variable num_table, I will not process further.
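A short sketch of how ast.literal_eval behaves on such strings, under the assumption that the stored value is a Python-style list literal. The helper name parse_table_list is made up for illustration.

```python
import ast


def parse_table_list(raw):
    """Parse a Python-style list literal (hypothetical helper)."""
    return ast.literal_eval(raw)


tables = parse_table_list("['inventories', 'products']")
fallback = parse_table_list("[]")  # the default used when the variable is unset

# Unquoted names are not literals, so literal_eval rejects them.
try:
    parse_table_list("[tab1,tab2]")
    unquoted_ok = True
except ValueError:
    unquoted_ok = False
```

So the names inside the stored list need quotes for this to work; an empty default simply produces a loop with no iterations.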
I'm new to Airflow and working on making my ETL pipeline more re-usable. Originally, I had a few lines of top-level code that would determine the job_start based on a few user input parameters, but I found through much searching that this would trigger at every heartbeat which was causing some unwanted behavior in truncating the table.
Now I am investigating wrapping this top level code into a Python Callable so it is secure from the refresh, but I am unsure of the best way to pass the output to my other tasks. The gist of my code is below:
def get_job_dts():
    #Do something to determine the appropriate job_start_dt and job_end_dt
    #Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    return job_params
t0 = PythonOperator(
    task_id = 'get_dates',
    python_callable = get_job_dts,
    dag=dag
)
t1 = PythonOperator(
    task_id = 'task_1'
    ,python_callable=first_task
    ,op_args=job_params #<-- How do I send job_params to op_args??
    ,dag=dag
)
t0 >> t1
I've searched around and heard mentions of Jinja templates, variables, or XComs, but I'm fuzzy on how to implement them. Does anyone have an example I could look at where I can save that list into a variable that can be used by my other tasks?
The best way to do this is to push your value into XCom in get_job_dts, and pull the value back from XCom in first_task.
def get_job_dts(**kwargs):
    #Do something to determine the appropriate job_start_dt and job_end_dt
    #Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Push job_params into XCom
    kwargs['ti'].xcom_push(key='job_params', value=job_params)
    return job_params
def first_task(ti, **kwargs):
    # Pull job_params from XCom (pushed by the 'get_dates' task)
    job_params = ti.xcom_pull(task_ids='get_dates', key='job_params')
    # And then do the rest
t0 = PythonOperator(
    task_id = 'get_dates',
    provide_context=True,  # so that kwargs['ti'] is available in the callable
    python_callable = get_job_dts,
    dag=dag
)
t1 = PythonOperator(
    task_id = 'task_1',
    provide_context=True,
    python_callable=first_task,
    # no op_args here: job_params does not exist when the DAG file is parsed
    dag=dag
)
t0 >> t1
As RyantheCoder mentioned, XCom is the way to go. My implementation follows the tutorial, where the push happens implicitly from the return value of the Python callable.
I am still confused by the difference in passing in (ti, **kwargs) vs. using (**context) to the function that is pulling. Also, where does "ti" come from?
Any clarifications appreciated.
def get_job_dts(**kwargs):
    #Do something to determine the appropriate job_start_dt and job_end_dt
    #Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Automatically pushes to XCOM, refer to: Airflow XCOM tutorial: https://airflow.apache.org/concepts.html?highlight=xcom#xcoms
    return job_params
def first_task(**context):
    # Change task_ids to whatever task pushed the XCOM vars you need, rest are standard notation
    job_params = context['task_instance'].xcom_pull(task_ids='get_dates')
    # And then do the rest
t0 = PythonOperator(
    task_id = 'get_dates',
    python_callable = get_job_dts,
    dag=dag
)
t1 = PythonOperator(
    task_id = 'task_1',
    provide_context=True,
    python_callable=first_task,
    dag=dag
)
t0 >> t1
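On the question of (ti, **kwargs) versus (**context): as far as I understand it, with provide_context=True the callable is invoked with the context passed as keyword arguments, so the two signatures receive the same data and differ only in how Python binds it. The sketch below shows that binding in plain Python. FakeTaskInstance, the dictionary contents, and the date values are all made-up stand-ins, not Airflow objects.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for the task instance found in the context."""

    def xcom_pull(self, task_ids):
        return ["2021-01-01", "2021-01-02"]  # placeholder dates


def pull_with_named_ti(ti, **kwargs):
    # 'ti' is bound by name; every other context key lands in kwargs.
    return ti.xcom_pull(task_ids="get_dates")


def pull_with_context(**context):
    # Same data, looked up by key instead of by parameter name.
    return context["task_instance"].xcom_pull(task_ids="get_dates")


instance = FakeTaskInstance()
# The context exposes the task instance under both 'ti' and 'task_instance'.
context = {"ti": instance, "task_instance": instance, "ds": "2021-01-01"}

named_result = pull_with_named_ti(**context)
context_result = pull_with_context(**context)
```

So "ti" is not special syntax: it is simply one of the keys in the context, and naming it as a parameter pulls it out of the keyword arguments.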
Since you mentioned changing the task start time and end time dynamically, I suppose what you need is to create a dynamic DAG rather than just pass the arguments to the DAG. In particular, changing the start time and interval without changing the DAG name will cause unexpected results, so it is strongly recommended not to do so. You can refer to this link to see if this strategy helps.
I need to copy tables from MySQL to BigQuery daily.
My workflow is:
MySqlToGoogleCloudStorageOperator
GoogleCloudStorageToBigQueryOperator
This works for a single process (say Categories).
Example:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
...
import_categories_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_categories',
    mysql_conn_id='c_mysql',
    google_cloud_storage_conn_id='gcp_a',
    approx_max_file_size_bytes = 100000000, #100MB per file
    sql = 'import_categories.sql',
    bucket=GCS_BUCKET_ID,
    filename=file_name_categories,
    dag=dag)
gcs_to_bigquery_categories_op = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_categories_to_BigQuery',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_categories,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[uri_template_categories_read_from],
    schema_fields=Categories(),
    src_fmt_configs={'ignoreUnknownValues': True},
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    skip_leading_rows = 1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID)
import_categories_op >> gcs_to_bigquery_categories_op
Now, say I want to scale it up and have it work with 20 more tables. Is there a way to do it without writing the same code 20 times?
I'm looking for a way to do something like:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
....
BQ_TABLE_NAME_ORDERS = Variable.get("tables_orders")
list = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS, BQ_TABLE_NAME_ORDERS]
for item in list:
    ...  # GENERATE THE OPERATORS PER TABLE
so that it creates import_categories_op, import_products_op, import_orders_op, etc.
Yes, in fact it's exactly what you described. Simply instantiate your operators in your for loop. Make sure your task ids are unique and you're set:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
tables = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS]
for table in tables:
    import_op = MySqlToGoogleCloudStorageOperator(
        task_id=f'import_{table}',
        mysql_conn_id='c_mysql',
        google_cloud_storage_conn_id='gcp_a',
        approx_max_file_size_bytes = 100000000, #100MB per file
        sql = f'import_{table}.sql',
        bucket=GCS_BUCKET_ID,
        filename=file_name,
        dag=dag)
    gcs_to_bigquery_op = GoogleCloudStorageToBigQueryOperator(
        dag=dag,
        task_id=f'load_{table}_to_BigQuery',
        bucket=GCS_BUCKET_ID,
        destination_project_dataset_table=table_name_template,
        source_format='NEWLINE_DELIMITED_JSON',
        source_objects=[uri_template_read_from],
        schema_fields=Categories(),
        src_fmt_configs={'ignoreUnknownValues': True},
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        skip_leading_rows = 1,
        google_cloud_storage_conn_id=CONNECTION_ID,
        bigquery_conn_id=CONNECTION_ID)
    import_op >> gcs_to_bigquery_op
You can simplify this if you store all tables in a single variable:
# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')
for table in BQ_TABLES:
    ...
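Since the task ids are now derived from whatever is in the variable, it may be worth validating them before building operators. The following is only a sketch: build_task_ids is a hypothetical helper, and the character check is a conservative approximation of the rule quoted in the Broken DAG error message earlier (alphanumeric characters, dashes, dots and underscores).

```python
import re

# Conservative approximation of the allowed task id characters.
VALID_ID = re.compile(r"^[A-Za-z0-9_.-]+$")


def build_task_ids(raw_tables, prefix="import_"):
    """Turn 'a,b,c' into unique, valid task ids (hypothetical helper)."""
    # Strip stray spaces and drop empty entries such as a trailing comma.
    tables = [name.strip() for name in raw_tables.split(",") if name.strip()]
    task_ids = [f"{prefix}{table}" for table in tables]
    if len(set(task_ids)) != len(task_ids):
        raise ValueError(f"duplicate task ids in {task_ids}")
    for task_id in task_ids:
        if not VALID_ID.match(task_id):
            raise ValueError(f"invalid task id: {task_id!r}")
    return task_ids


task_ids = build_task_ids("table_products, table_orders")
```

Failing early with a clear message is usually easier to debug than a Broken DAG banner pointing at a generated id.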
Edit: Task references vs IDs
Luis asked about how only the task IDs need to change (and not the references to the tasks). Actually, you don't even need to refer to your tasks for anything but adding some details to them after creation (like upstream and downstream dependencies), because they're stored in the DAG object on creation, and that's all the DAG parser is looking for. Once the DAG parser finds a DAG object in the global scope, it uses it. It doesn't know what names the tasks were referred to as in the global scope, it only knows that those tasks are listed on the DAG object, and that they list each other upstream or downstream.
I would have made this a comment on this answer, but I wanted to show the following code to explain what I mean a bit more obviously (in which I use with DAG to avoid assigning each task to the dag, and the bitwise-shift operator upstream/downstream assignment to avoid needing to even refer to the tasks by a reference, and python3's formatted f-strings):
# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')
with DAG('…dag_id…', …) as dag:
    for table in BQ_TABLES:
        MySqlToGoogleCloudStorageOperator(
            task_id=f'import_{table}',
            sql=f'import_{table}.sql',
            … # all params except notably there's no `dag=dag` in here.
        ) >> GoogleCloudStorageToBigQueryOperator( # Yup, …
            task_id=f'load_{table}_to_BigQuery',
            … # again all but `dag=dag` in here.
        )
Sure, it could have been t1=…; t2=…; t1>>t2; … but why name references?
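The point that the DAG object, rather than any variable name, is what holds the tasks can be modelled in a few lines of plain Python. Both classes below are hypothetical stand-ins written for this illustration; they are not Airflow's classes and only imitate the register-on-creation behaviour described above.

```python
class Dag:
    """Hypothetical stand-in for a DAG: it just keeps tasks by id."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}


class Task:
    """Hypothetical stand-in for an operator that registers itself on creation."""

    def __init__(self, task_id, dag):
        self.task_id = task_id
        self.downstream_ids = set()
        dag.tasks[task_id] = self  # the DAG keeps the task, whatever we name it

    def __rshift__(self, other):
        self.downstream_ids.add(other.task_id)
        return other


dag = Dag("example")
for table in ["table_products", "table_orders"]:
    # No variable ever holds these tasks; the DAG still ends up with all four.
    Task(f"import_{table}", dag) >> Task(f"load_{table}_to_BigQuery", dag)
```

After the loop, dag.tasks contains every task along with its downstream links, even though the loop never assigned any of them to a name.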
I have the following code:
def chunck_import(**kwargs):
    ...
    for i in range(1, num_pages + 1):
        start = lower + chunks * i
        end = start + chunks
        if i>1:
            start = start + 1
        logging.info("%s %s", start, end)  # logging takes a format string plus args
        if end > max_current:
            end = max_current
        where = 'where orders_id between {0} and {1}'.format(start,end)
        logging.info(where)
        import_orders_products_op = MySqlToGoogleCloudStorageOperator(
            task_id='import_orders_and_upload_to_storage_orders_products_{}'.format(i),
            mysql_conn_id='mysql_con',
            google_cloud_storage_conn_id='gcp_con',
            provide_context=True,
            approx_max_file_size_bytes = 100000000, #100MB per file
            sql = 'import_orders.sql',
            params={'WHERE': where},
            bucket=GCS_BUCKET_ID,
            filename=file_name_orders_products,
            dag=dag)
start_task_op = DummyOperator(task_id='start_task', dag=dag)
chunck_import_op = PythonOperator(
    task_id='chunck_import',
    provide_context=True,
    python_callable=chunck_import,
    dag=dag)
start_task_op >> chunck_import_op
This code uses a PythonOperator to calculate how many runs of MySqlToGoogleCloudStorageOperator I need and to build the WHERE clause of the SQL for each run; it then needs to execute them.
The problem is that the MySqlToGoogleCloudStorageOperator isn't being executed.
I can't actually do
chunck_import_op >> import_orders_products_op
How can I make the MySqlToGoogleCloudStorageOperator be executed inside the PythonOperator?
I think at the end of your for loop, you'll want to call import_orders_products_op.execute(context=kwargs), possibly preceded by import_orders_products_op.pre_execute(context=kwargs). This is a bit complicated in that it skips the render_templates() call of the task_instance. If you instead made a task_instance to put each of these tasks in, you could call run or _raw_run_task, but both require information from the dagrun (which you can get in the Python callable's context, like kwargs['dag_run']).
Looking at what you've passed to the operators it looks like as is you'll need the templating step to load the import_orders.sql file and fill in the WHERE parameter. Alternatively it's okay within the callable itself to load the file into a string, replace the {{ params.WHERE }} part (and any others) manually without Jinja2 (or you could spend time to figure out the right jinja2 calls), and then set the import_orders_products_op.sql=the_string_you_loaded before calling import_orders_products_op.pre_execute(context=kwargs) and import_orders_products_op.execute(context=kwargs).
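The manual replacement route could look roughly like the sketch below. The helper name fill_where and the template text are assumptions (the real contents of import_orders.sql are not shown in the question); only the {{ params.WHERE }} placeholder and the WHERE format come from the thread.

```python
def fill_where(sql_text, where_clause):
    """Replace the {{ params.WHERE }} placeholder by hand, without Jinja2."""
    placeholder = "{{ params.WHERE }}"
    if placeholder not in sql_text:
        # Plain string replacement is sensitive to spacing inside the braces.
        raise ValueError("placeholder not found in SQL text")
    return sql_text.replace(placeholder, where_clause)


# Hypothetical file contents; in practice this would be read from import_orders.sql.
template = "SELECT * FROM orders {{ params.WHERE }}"
where = "where orders_id between {0} and {1}".format(1, 1000)
rendered = fill_where(template, where)
```

The rendered string is what would then be assigned to import_orders_products_op.sql before calling pre_execute and execute, as described above.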
I have been using Airflow for the last 6 months, and I have been happy defining workflows in it.
I have the scenario below, where I am not able to get the XCom value (highlighted in yellow).
Please find the sample code below:
Work Flow
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')
dummy_operator = DummyOperator(
    task_id='Start',
    dag=main_dag
)
push_function_task = PythonOperator(
    task_id='push_function',
    provide_context=True,
    python_callable=push_function,
    op_kwargs={},
    dag=main_dag)
push_function_task.set_upstream(dummy_operator)
custom_task = CustomOperator(
    dag=main_dag,
    task_id='import_data',
    provide_context=True,
    url="http://www.google.com/{}".format("{{task_instance.xcom_pull(task_ids='push_function')}}")
)
custom_task.set_upstream(push_function_task)
Notes:
1. CustomOperator is my own operator, written for downloading the data from the given URL
Please help me.
Thanks,
Samanth
I believe you have a mismatch in keys when pushing and pulling the XCom. Each XCom value is tied to a DAG ID, task ID, and key. If you are pushing with the reportid key, then you need to pull with it as well.
Note, if a key is not specified to xcom_pull(), it uses the default of return_value. This is because if a task returns a result, Airflow will automatically push it to XCom under the return_value key.
This gives you two options to fix your issue:
1) Continue to push to the reportid key and make sure you pull from it as well
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')
...
custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function', key='reportid') }}")
)
2) Have push_function() return the value you want to push to XCom, then pull from the default key.
def push_function(**context):
    return 'xyz'
...
custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)
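The key matching described above can be modelled with a plain dictionary. FakeXCom below is a hypothetical in-memory stand-in, not Airflow's XCom class, and it leaves out the DAG ID part of the lookup for brevity; it only illustrates why a pull without a key misses a value pushed under reportid.

```python
class FakeXCom:
    """Hypothetical in-memory model of XCom lookups keyed by (task_id, key)."""

    DEFAULT_KEY = "return_value"

    def __init__(self):
        self._store = {}

    def push(self, task_id, value, key=DEFAULT_KEY):
        self._store[(task_id, key)] = value

    def pull(self, task_ids, key=DEFAULT_KEY):
        # A missing entry yields None rather than raising an error.
        return self._store.get((task_ids, key))


xcom = FakeXCom()
xcom.push("push_function", "xyz", key="reportid")

mismatched = xcom.pull(task_ids="push_function")               # default key
matched = xcom.pull(task_ids="push_function", key="reportid")  # same key as the push
```

In the original template the pull used the default key, which corresponds to the mismatched case here; a None result would then be rendered into the URL instead of the report id.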