I am triggering the task manually from the UI and it shows the task as a success, but nothing happens in the database. Basically, I am calling a simple stored procedure (with no parameters) that copies values from the staging table to the main table and deletes the contents of the staging table.
from airflow import DAG
from airflow.operators.mssql_operator import MsSqlOperator
from datetime import datetime
dag = DAG("sql_proc_0", "Testing running of SQL procedures",
schedule_interval = None, catchup = False,
start_date = datetime(2019, 1, 1))
# [dbo].[LoadData] is the name of the procedure
sql_command = """
EXECUTE [dbo].[LoadData]
"""
task = MsSqlOperator(task_id='run_test_proc', mssql_conn_id='mssql_azure_test',
                     sql=sql_command, dag=dag, database='TestDB')
It turns out autocommit=False is the default. When it is set to True, it works (it took me two hours to figure this out, and only after posting it on SO!):
task = MsSqlOperator(task_id='run_test_proc', mssql_conn_id='mssql_azure_test',
                     sql=sql_command, dag=dag,
                     database='TestDB',
                     autocommit=True)
I am completely new to Airflow and I am trying to create 8 tasks which are pretty similar.
I've read about the expand() method, though I am not quite sure how to use it with PostgresOperator.
So I have this task:
t1 = PostgresOperator(
    task_id='load_something_1',
    postgres_conn_id="postgres_default",
    sql="SELECT somefunction_1()",
    dag=dag)
I need to create similar tasks, only they need to have load_something_2, load_something_3, etc. and
SELECT somefunction_2(), SELECT somefunction_3(), etc.
How do I do this using dynamic task mapping?
Thanks in advance!
It's hard to say whether you need expand() without knowing what your iterator looks like and how the data is made available to the DAG, but here's how this could be accomplished with a simple iterator in a full example DAG:
from datetime import datetime
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.decorators import dag, task
@dag(
    default_args={
        'owner': 'me'
    },
    dag_id='example-dag',
    start_date=datetime(2023, 1, 6),
    schedule_interval=None,
)
def workflow():
    @task
    def load_something(i):
        t1 = PostgresOperator(
            task_id='load_something',
            postgres_conn_id="postgres_default",
            sql=f"SELECT somefunction_{i}()",
        )
    my_tasks = [load_something(i) for i in range(1, 9)]
    # my_tasks = [load_something.override(task_id=f'load_something_{i}')(i) for i in range(1, 9)]
    my_tasks
workflow()
Note: just calling your task like my_tasks = [load_something(i) for i in range(1, 9)] with the @task decorator will automatically enumerate your task names for you. If you want to name the tasks explicitly, you can do so using the override() method: uncomment my_tasks = [load_something.override(task_id=f'load_something_{i}')(i) for i in range(1, 9)].
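If you do want dynamic task mapping with expand(), classic operators support it via partial()/expand() in Airflow 2.3+. Here is a minimal sketch, assuming the same connection and function names as above and that it is placed inside a DAG definition (an illustration, not the answer's code):
from airflow.providers.postgres.operators.postgres import PostgresOperator

# each element of the sql list becomes one mapped task instance,
# shown in the UI as load_something [0] ... load_something [7]
PostgresOperator.partial(
    task_id='load_something',
    postgres_conn_id='postgres_default',
).expand(sql=[f"SELECT somefunction_{i}()" for i in range(1, 9)])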
I am trying to create dynamic tasks depending on an Airflow variable.
My code is:
default_args = {
    'start_date': datetime(year=2021, month=6, day=20),
    'provide_context': True
}

with DAG(
    dag_id='Target_DIF',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:
    iterable_list = Variable.get("num_table")
    for index, table in enumerate(iterable_list):
        read_src1 = PythonOperator(
            task_id=f'read_src_{table}',
            python_callable=read_src,
        )
        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'upload_file_to_directory_bulk_{table}',
            python_callable=upload_file_to_directory_bulk
        )
        write_Snowflake1 = PythonOperator(
            task_id=f'write_Snowflake_{table}',
            python_callable=write_Snowflake
        )
        # TaskGroup level dependencies
        # DAG level dependencies
        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> end
I am facing the below error:
Broken DAG: [/home/dif/airflow/dags/target_dag.py] Traceback (most recent call last):
airflow.exceptions.AirflowException: The key (read_src_[) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
The code works perfectly with this change in the code:
#iterable_list = Variable.get("num_table")
iterable_list = ['inventories', 'products']
Start and End are dummy operators.
The Airflow variable has data as shown in the image.
My expected dynamic workflow:
I am able to achieve the above flow with a list, but not with the Airflow variable.
Any leads on the cause of the error are appreciated. Thanks.
The Variable.get("num_table") returns string.
thus your loop is actually iterating over the chars of ['inventories, 'ptoducts'] which is why in the first iteration of the loop the task_id=f'read_src_{table}' is read_src_[ and [ is not a valid char for task_id.
You should convert the string into list.
Save your var as: "inventories,ptoducts" and then you can do:
iterable_string = Variable.get("num_table")
iterable_list = iterable_string.split(",")
for index, table in enumerate(iterable_list):
You should note that calling Variable.get("num_table") in top-level code is bad practice!
The problem is that, by default, Airflow reads variables as str. Try this instead:
iterable_list = Variable.get("num_table", deserialize_json=True)
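For deserialize_json=True to return a list, the variable value has to be stored as valid JSON. A small sketch with an illustrative value (not necessarily the exact value from the original post):
from airflow.models import Variable

# assumes the Airflow variable "num_table" is set in the UI to: ["inventories", "products"]
iterable_list = Variable.get("num_table", deserialize_json=True)
print(iterable_list)  # ['inventories', 'products']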
I was able to arrive at the solution with the following modifications:
import ast
...
...
iterable_string = Variable.get("num_table",default_var="[]")
iterable_list = ast.literal_eval(iterable_string)
...
Airflow variables are stored as strings,
so my data was stored as "[tab1,tab2]".
I have therefore used literal_eval to convert the string back to a list.
I have also added an empty list as the default, so that if no values are present in the variable num_table, I do not process further.
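As a small illustration (the stored value below is hypothetical), note that the element names need quotes inside the string for literal_eval to parse them as strings:
import ast

stored = '["tab1", "tab2"]'        # hypothetical value of the num_table variable
tables = ast.literal_eval(stored)  # -> ['tab1', 'tab2']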
I am trying to create an Airflow DAG which generates tasks depending on the response from a server.
Here is my approach:
get the list of tables from BigQuery -> loop through the list and create tasks
This is my latest code and I have tried all of the approaches found on Stack Overflow. Nothing seems to work. What am I doing wrong?
with models.DAG(dag_id="xt", default_args=default_args, schedule_interval="0 1 * * *", catchup=True) as dag:
tables = get_tables_from_bq()
bridge = DummyOperator(
task_id='bridge',
dag=dag
)
for t in tables:
sql = ("SELECT * FROM `{project}.{dataset}.{table}` LIMIT 5;".format(
project=project, dataset=dataset, table=t))
materialize_t = BigQueryOperator(bql=sql,
destination_dataset_table=dataset+'.' + table_prefix + t,
task_id = 'x_' + t,
bigquery_conn_id = 'bigquery_default',
use_legacy_sql = False,
write_disposition = 'WRITE_APPEND',
create_disposition = 'CREATE_IF_NEEDED',
query_params = {},
allow_large_results = True,
dag = dag)
bridge >> materialize_t
Even the run option is not showing with this code. I have tried multiple versions and finally reached this point, but still no luck. Any help?
I don't know if it is a typo from copying and pasting the DAG, but tables = get_tables_from_bq() should come before with models.DAG(...). Also, bridge >> materialize_t seems to be missing indentation and is therefore outside the with models.DAG(...) scope. On a side note, you do not need the bridge task.
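For clarity, here is a sketch of the structure described above, with the table list fetched before the DAG and the operators created inside the context manager (get_tables_from_bq, default_args, project, dataset and table_prefix are assumed to be defined elsewhere in the original file):
tables = get_tables_from_bq()  # fetched before the DAG context manager

with models.DAG(dag_id="xt", default_args=default_args,
                schedule_interval="0 1 * * *", catchup=True) as dag:
    for t in tables:
        sql = "SELECT * FROM `{project}.{dataset}.{table}` LIMIT 5;".format(
            project=project, dataset=dataset, table=t)
        materialize_t = BigQueryOperator(
            bql=sql,
            destination_dataset_table=dataset + '.' + table_prefix + t,
            task_id='x_' + t,
            bigquery_conn_id='bigquery_default',
            use_legacy_sql=False,
            write_disposition='WRITE_APPEND',
            create_disposition='CREATE_IF_NEEDED',
            allow_large_results=True,
        )  # picked up by the active DAG context, so no dag= argument is needed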
I need to copy tables from MySQL to BigQuery daily.
My workflow is:
MySqlToGoogleCloudStorageOperator
GoogleCloudStorageToBigQueryOperator
This works for a single process (say Categories).
Example:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
...
import_categories_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_categories',
    mysql_conn_id='c_mysql',
    google_cloud_storage_conn_id='gcp_a',
    approx_max_file_size_bytes=100000000,  # 100 MB per file
    sql='import_categories.sql',
    bucket=GCS_BUCKET_ID,
    filename=file_name_categories,
    dag=dag)
gcs_to_bigquery_categories_op = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_categories_to_BigQuery',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_categories,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[uri_template_categories_read_from],
    schema_fields=Categories(),
    src_fmt_configs={'ignoreUnknownValues': True},
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    skip_leading_rows=1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID)
import_categories_op >> gcs_to_bigquery_categories_op
Now, say I want to scale it up and have it work with 20 more tables. Is there a way to do it without writing the same code 20 times?
I'm looking for a way to do something like:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
....
BQ_TABLE_NAME_ORDERS = Variable.get("tables_orders")
list = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS, BQ_TABLE_NAME_ORDERS]
for item in list:
    GENERATE THE OPERATORS PER TABLE
so that will create import_categories_op , import_products_op , import_orders_op etc..
Yes, in fact it's exactly what you described: simply instantiate your operators in your for loop. Make sure your task IDs are unique and you're set:
BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
list = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS]
for table in list:
    import_op = MySqlToGoogleCloudStorageOperator(
        task_id=f'import_{table}',
        mysql_conn_id='c_mysql',
        google_cloud_storage_conn_id='gcp_a',
        approx_max_file_size_bytes=100000000,  # 100 MB per file
        sql=f'import_{table}.sql',
        bucket=GCS_BUCKET_ID,
        filename=file_name,
        dag=dag)
    gcs_to_bigquery_op = GoogleCloudStorageToBigQueryOperator(
        dag=dag,
        task_id=f'load_{table}_to_BigQuery',
        bucket=GCS_BUCKET_ID,
        destination_project_dataset_table=table_name_template,
        source_format='NEWLINE_DELIMITED_JSON',
        source_objects=[uri_template_read_from],
        schema_fields=Categories(),
        src_fmt_configs={'ignoreUnknownValues': True},
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        skip_leading_rows=1,
        google_cloud_storage_conn_id=CONNECTION_ID,
        bigquery_conn_id=CONNECTION_ID)
    import_op >> gcs_to_bigquery_op
You can simplify this if you store all tables in a single variable:
// bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')
for table in BQ_TABLES:
...
Edit: Task references vs IDs
Luis asked how only the task IDs need to change (and not the references to the tasks). Actually, you don't even need to refer to your tasks for anything except adding details to them after creation (like upstream and downstream dependencies), because they're stored in the DAG object on creation, and that's all the DAG parser is looking for. Once the DAG parser finds a DAG object in the global scope, it uses it. It doesn't know what names the tasks were referred to by in the global scope; it only knows that those tasks are listed on the DAG object and that they list each other as upstream or downstream.
I would have made this a comment on that answer, but I wanted to show the following code to explain what I mean a bit more clearly (it uses with DAG to avoid assigning each task to the dag, the bitshift operators for upstream/downstream assignment to avoid having to refer to the tasks by name, and Python 3's f-strings):
# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')
with DAG('…dag_id…', …) as dag:
    for table in BQ_TABLES:
        MySqlToGoogleCloudStorageOperator(
            task_id=f'import_{table}',
            sql=f'import_{table}.sql',
            …  # all params except notably there's no `dag=dag` in here
        ) >> GoogleCloudStorageToBigQueryOperator(  # Yup, …
            task_id=f'load_{table}_to_BigQuery',
            …  # again, all but `dag=dag` in here
        )
Sure, it could have been t1=…; t2=…; t1>>t2; … but why name references?
This is my operator:
bigquery_check_op = BigQueryOperator(
    task_id='bigquery_check',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    bigquery_conn_id=CONNECTION_ID,
    trigger_rule='all_success',
    xcom_push=True,
    dag=dag
)
When I check the Rendered page in the UI, nothing appears there.
When I run the SQL in the console it returns the value 1400, which is correct.
Why doesn't the operator push the XCom?
I can't use BigQueryValueCheckOperator. That operator is designed to FAIL when a value check does not pass. I don't want anything to fail. I simply want to branch the code based on the value returned by the query.
Here is how you might be able to accomplish this with the BigQueryHook and the BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def big_query_check(**context):
    sql = context['templates_dict']['sql']
    bq = BigQueryHook(bigquery_conn_id='default_gcp_connection_id',
                      use_legacy_sql=False)
    conn = bq.get_conn()
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchone()
    # Do something with the result, return the task_id to branch to
    if result[0] == 0:
        return "task_a"
    else:
        return "task_b"

sql = "SELECT COUNT(*) FROM sales"

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=big_query_check,
    provide_context=True,
    templates_dict={"sql": sql},
    dag=dag,
)
First, we create a Python callable that executes the query and selects which task_id to branch to. Second, we create the BranchPythonOperator.
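To complete the picture, the branch targets returned by the callable are just ordinary downstream tasks. A minimal sketch using dummy placeholders for task_a and task_b (these are assumptions, not tasks from the original post):
from airflow.operators.dummy_operator import DummyOperator

# placeholder tasks matching the task_ids returned by big_query_check
task_a = DummyOperator(task_id='task_a', dag=dag)
task_b = DummyOperator(task_id='task_b', dag=dag)

# only the branch whose task_id the callable returns will run; the other is skipped
branching >> [task_a, task_b]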
The simplest answer is that xcom_push is not one of the parameters of BigQueryOperator, BaseOperator, or LoggingMixin.
The BigQueryGetDataOperator does return (and thus push) some data, but it works by table and column name. You could chain this behavior by making the query you run write its output to a uniquely named table (maybe use {{ ds_nodash }} in the name), then using that table as the source for this operator, and then branching with the BranchPythonOperator.
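A rough sketch of that chaining, reusing SQL_QUERY, CONNECTION_ID and dag from the question; the dataset and table names are illustrative assumptions, and task_a / task_b are placeholders:
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_get_data import BigQueryGetDataOperator
from airflow.operators.python_operator import BranchPythonOperator

# write the check result to a uniquely named table per run
write_check_result = BigQueryOperator(
    task_id='write_check_result',
    bql=SQL_QUERY,
    destination_dataset_table='my_dataset.check_result_{{ ds_nodash }}',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,
    bigquery_conn_id=CONNECTION_ID,
    dag=dag)

# BigQueryGetDataOperator returns (and thus pushes to XCom) the rows of that table
get_check_result = BigQueryGetDataOperator(
    task_id='get_check_result',
    dataset_id='my_dataset',
    table_id='check_result_{{ ds_nodash }}',
    max_results='1',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag)

def choose_branch(**context):
    rows = context['ti'].xcom_pull(task_ids='get_check_result')  # list of rows
    return 'task_a' if rows and int(rows[0][0]) == 1400 else 'task_b'

branch_on_result = BranchPythonOperator(
    task_id='branch_on_result',
    python_callable=choose_branch,
    provide_context=True,
    dag=dag)

write_check_result >> get_check_result >> branch_on_result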
You might instead try to use the BigQueryHook's get_conn().cursor() to run the query and work with some data inside the BranchPythonOperator.
Elsewhere we chatted and came up with something along the lines of this for putting in the callable of a BranchPythonOperator:
cursor = BigQueryHook(bigquery_conn_id='connection_name').get_conn().cursor()
# one of these two:
cursor.execute(SQL_QUERY)  # if non-legacy
cursor.job_id = cursor.run_query(bql=SQL_QUERY, use_legacy_sql=False)  # if legacy
result = cursor.fetchone()
return "task_one" if result[0] == 1400 else "task_two"  # depends on results format