How to run a SQL script via BigQueryInsertJobOperator in DAG? - airflow

I am looking for a way to run a SQL script via the BigQueryInsertJobOperator.
There are very few examples online for this, and the ones I have tried have failed so far.
Mainly I keep getting a jinja2.exceptions.TemplateNotFound error.
I have the following folder where I would like to save all my SQL scripts:
my_bucket/dags/my_other_folder/sql_scripts
I have used the template_searchpath attribute in the DAG's configuration:
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/dags'
) as dag:
and I have specified the filename in the BigQueryInsertJobOperator:
Transform = BigQueryInsertJobOperator(
    task_id='insert_data',
    configuration={
        'query': {
            'query': "{% include 'my_other_folder/test.sql' %}",
            'useLegacySql': False
        }
    },
    location='EU',
)
No matter what I do, I keep getting jinja2.exceptions.TemplateNotFound: my_other_folder/test.sql. What am I doing wrong?

You can try:
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/dags/my_other_folder'  # you can provide a list of paths
) as dag:
    Transform = BigQueryInsertJobOperator(
        task_id='insert_data',
        configuration={
            'query': {
                'query': "{% include 'test.sql' %}",  # you should provide the file name in one of the template searchpaths
                'useLegacySql': False
            }
        },
        location='EU',
    )

I have finally managed to find the answer: the problem was the value set for the template_searchpath attribute of the DAG.
It should be
template_searchpath='/home/airflow/gcs/dags/my_other_folder'
or just
template_searchpath='/home/airflow/gcs/dags'
instead of
template_searchpath='/home/airflow/dags/my_other_folder'
Basically, the path was missing the /gcs/ subfolder.
I am still not sure why that is. Initially I thought that the path to the folder with the SQL scripts would have to mirror the path in the GCS bucket, which does not contain a /gcs/ subfolder.
If anybody could explain why it is different and why it needs the /gcs/ subfolder, I would appreciate it.
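As far as I understand, this comes from how Cloud Composer exposes its bucket to the workers: the bucket's dags/ folder is synced to the local path /home/airflow/gcs/dags on each worker, so template_searchpath has to point at that local mount rather than at the bucket layout. A minimal sketch of the working setup, assuming that mapping:

# sketch assuming the Cloud Composer mount /home/airflow/gcs/dags <-> gs://my_bucket/dags
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/gcs/dags',  # local path on the worker, not the bucket path
) as dag:
    Transform = BigQueryInsertJobOperator(
        task_id='insert_data',
        configuration={
            'query': {
                # resolved relative to template_searchpath
                'query': "{% include 'my_other_folder/test.sql' %}",
                'useLegacySql': False
            }
        },
        location='EU',
    )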

Related

How to pass default values for a runtime input variable in Airflow for scheduled execution

I came across an issue while running a DAG in Airflow. My code works in two scenarios but fails in one.
Below are my scenarios:
Manual trigger with input - running fine
Manual trigger without input - running fine
Scheduled run - failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
    print("IP is :", cleanup)
    return cleanup
I am getting the error below:
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default values like this:
default_dag_args = {
    'start_date': days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it doesn't work.
I am using BranchPythonOperator to call this function.
Scheduling: (screenshot omitted)
Can anyone please guide me here? What am I missing?
As a workaround I am using the code below:
try:
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
except:
    cleanup = "N"
You can access the parameters from the context dict params, because Airflow fills in the default values on this dict after copying dag_run.conf and checking whether anything is missing:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup


with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )

    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")

    branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, it works fine.
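If you want to see what each kind of run actually receives, a small debug task can be dropped into the DAG above and the two dicts compared directly. This is purely a sketch (the task id is illustrative, and it assumes it is placed inside the with DAG(...) block shown above):

from airflow.operators.python import PythonOperator

def show_run_inputs(**kwargs):
    # 'params' always carries the DAG-level defaults; with the default
    # dag_run_conf_overrides_params setting, a manually passed conf can override them
    print("params:", kwargs["params"])
    # 'dag_run.conf' only has a payload for manually triggered runs that passed one
    print("dag_run.conf:", kwargs["dag_run"].conf)

# inside the `with DAG(...)` block:
debug_task = PythonOperator(
    task_id="show_run_inputs",
    python_callable=show_run_inputs,
)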

How to access dag run config inside dag code

I am using a DAG (s3_sensor_dag) to trigger another DAG (advanced_dag), and I pass the tag_names configuration to the triggered DAG (advanced_dag) using the conf argument. It looks something like this:
s3_sensor_dag.py:
trigger_advanced_dag = TriggerDagRunOperator(
    task_id="trigger_advanced_dag",
    trigger_dag_id="advanced_dag",
    wait_for_completion="True",
    conf={"tag_names": "{{ task_instance.xcom_pull(key='tag_names', task_ids='get_file_paths') }}"}
)
In the advanced_dag, I am trying to access the dag_conf (tag_names) like this:
advanced_dag.py:
with DAG(
    dag_id="advanced_dag",
    start_date=datetime(2020, 12, 23),
    schedule_interval=None,
    is_paused_upon_creation=False,
    catchup=False,
    dagrun_timeout=timedelta(minutes=60),
) as dag:
    dag_parser = DagParser(
        home=HOME,
        env=env,
        global_cli_flags=GLOBAL_CLI_FLAGS,
        tag=dag_run.conf["tag_names"]
    )
But I get an error stating that dag_run does not exist. From "Accessing configuration parameters passed to Airflow through CLI" I realized that dag_run is a runtime variable.
So I tried a solution mentioned in the comments there, which uses dag.get_dagrun(execution_date=dag.latest_execution_date).conf and goes something like this:
dag_parser = DagParser(
    home=HOME,
    env=env,
    global_cli_flags=GLOBAL_CLI_FLAGS,
    tag=dag.get_dagrun(execution_date=dag.latest_execution_date).conf['tag_names']
)
But it looks like it didn't fetch the value either.
I was able to solve this issue by using Airflow Variables, but I wanted to know whether there is a way to use the dag_conf (which obviously only gets data at runtime) inside the DAG code and read its value.
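The usual workaround (a sketch, distinct from the Variables approach mentioned above) is to defer reading dag_run.conf until a task actually runs, for example inside a Python callable that receives the task context; the names below reuse the question's code and the task id is illustrative:

from airflow.operators.python import PythonOperator

def build_parser(**context):
    # dag_run exists here because this code runs inside an actual DAG run,
    # not at DAG-parse time
    tag_names = context["dag_run"].conf.get("tag_names")
    dag_parser = DagParser(
        home=HOME,
        env=env,
        global_cli_flags=GLOBAL_CLI_FLAGS,
        tag=tag_names,
    )
    ...

# inside the `with DAG(...)` block of advanced_dag:
parse_task = PythonOperator(
    task_id="build_parser",
    python_callable=build_parser,
)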

How do I attach a file on email operator in Airflow

Whenever I add the files argument to the email_task I get a failed run.
email_task = EmailOperator(
    task_id='email_sample_data',
    to='sample@sample.com',
    subject='Forecast for the day',
    html_content="Sample",
    files=['/home/airflow/sample.html'],
    dag=dag)
I'm getting an error that the file is not found. Where does airflow pick my file, where do I need to upload a file, and what is the correct syntax for the 'files' argument?
Airflow expects the path to be relative to the folder where the DAG file is stored.
However, since files is a templated field, you can use template_searchpath to provide additional paths that Airflow will look in:
with DAG(
    ...
    template_searchpath=['/home/airflow/'],
) as dag:
    email_task = EmailOperator(
        task_id='email_sample_data',
        to='sample@sample.com',
        subject='Forecast for the day',
        html_content="Sample",
        files=['/home/airflow/sample.html']
    )

Airflow task setup with execution date

I want to customize the task to be weekday-dependent in the DAG file. It seems Airflow macros like {{ next_execution_date }} are not directly available in the Python DAG file. This is my DAG definition:
RG_TASKS = {
    'asia': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
        'tz': 'Asia/Tokyo',
        'files': [
            '/path/%Y%m%d/asia_file1.%Y%m%d.csv',
            '/path/%Y%m%d/asia_file2.%Y%m%d.csv',
            ...],
    },
    'euro': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Europe/London'),
        'tz': 'Europe/London',
        'files': [
            '/path/%Y%m%d/euro_file1.%Y%m%d.csv',
            '/path/%Y%m%d/euro_file2.%Y%m%d.csv',
            ...],
    },
}

dag = DAG(..., start_date=pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
          schedule='00 16 * * 0-6')
for rg, t in RG_TASKS.items():
    tz = t['tz']
    h = t['start_date'].hour
    m = t['start_date'].minute
    target_time = f'{{{{ next_execution_date.replace(tzinfo="{tz}", hour={h}, minute={m}) }}}}'
    time_sensor = DateTimeSensor(dag=dag, task_id=f'wait_for_{rg}', target_time=target_time)

    bash_task = BashOperator(dag=dag, task_id=f'load_{rg}', trigger_rule='all_success',
                             depends_on_past=True, bash_command=...)

    for fname in t['files']:
        fpath = f'{{{{ next_execution_date.strftime("{fname}") }}}}'
        task_id = os.path.basename(fname).split('.')[0]
        file_sensor = FileSensor(dag=dag, task_id=task_id, filepath=fpath, ...)
        file_sensor.set_upstream(time_sensor)
        file_sensor.set_downstream(bash_task)
The above works: the bash_task will be triggered if all files are available, and it is set with depends_on_past=True. However, the files have slightly different schedules: {rg}_file1 is available 6 days a week (every day except Saturday), while the rest are available 7 days a week.
One option is to create two DAGs, one scheduled to run Sun-Fri and the other scheduled to run Sat only. But with that option, depends_on_past=True is broken on Saturday.
Is there any better way to keep depends_on_past=True 7 days a week? Ideally, in the files loop I could do something like:
for fname in t['files']:
    dt = ...
    if dt.weekday() == 5 and task_id == f'{rg}_file1':
        continue
Generally I think it's better to accomplish things in a single task when it is easy enough to do, and in this case it seems to me you can.
I'm not entirely sure why you are using a datetime sensor, but it does not seem necessary. As far as I can tell, you just want your process to run every day (ideally after the file is there) and skip once per week.
I think we can do away with file sensor too.
Option 1: everything in bash
Check for existence in your bash script and fail (with retries) if the file is missing: just return a non-zero exit code when the file isn't there.
Then in your bash script you could silently do nothing on the skip day.
On skip days, your bash task will be green even though it did nothing.
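A rough sketch of what that could look like for one of the regions in the question (the task id and retry count are illustrative, and the skip check uses the templated next_execution_date rather than the wall clock):

load_euro = BashOperator(
    dag=dag,
    task_id='load_euro',
    retries=3,                  # retry while the file has not arrived yet
    depends_on_past=True,
    bash_command=(
        # silently succeed on the skip day, otherwise fail if the file is missing
        '{% if next_execution_date.strftime("%A") == "Saturday" %}'
        'echo "skip day, nothing to load"'
        '{% else %}'
        'test -f "/path/{{ next_ds_nodash }}/euro_file1.{{ next_ds_nodash }}.csv" && echo "loading..."'
        '{% endif %}'
    ),
)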
Option 2: subclass bash operator
Subclass BashOperator and add a skip_day parameter. Then your execute is like this:
def execute(self, context):
    next_execution_date = context['next_execution_date']
    if next_execution_date.day_of_week == self.skip_day:
        raise AirflowSkipException(f'we skip on day {self.skip_day}')
    super().execute(context)
With this option your bash script still needs to fail if file missing, but doesn't need to deal with the skip logic. And you'll be able to see that the task skipped in the UI.
Either way, no sensors.
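For completeness, a fuller version of that subclass could look like the sketch below; the class name, constructor, and imports are illustrative additions around the execute shown above:

from airflow.exceptions import AirflowSkipException
from airflow.operators.bash import BashOperator


class SkipDayBashOperator(BashOperator):
    """Run a bash command, but skip on one day of the week."""

    def __init__(self, *, skip_day, **kwargs):
        super().__init__(**kwargs)
        # skip_day should match pendulum's day_of_week numbering for next_execution_date
        self.skip_day = skip_day

    def execute(self, context):
        next_execution_date = context['next_execution_date']
        if next_execution_date.day_of_week == self.skip_day:
            raise AirflowSkipException(f'we skip on day {self.skip_day}')
        super().execute(context)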
Other note
You can simplify your filename templating.
files=[
    '/path/{{ next_ds_nodash }}/euro_file2.{{ next_ds_nodash }}.csv',
    ...
]
Then you don't need to mess with strftime.

Problem Running MySQL Script with Airflow MySQL Operator

I'm not understanding something about using the MySQL operator to call a MySQL script with Apache Airflow.
When I run this task...
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='/home/user/DBScripts/MySQLScript/SampleMySQLScript.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments'
)
I get this error in traceback...
jinja2.exceptions.TemplateNotFound: /home/user/DBScripts/MySQLScript/SampleMySQLScript.sql
The DAG task runs fine if I provide the entire SQL script as a parameter.
I'm not familiar with Jinja templating.
Is it easier to learn to write my scripts as a Jinja template? Should I import the text of the script and assign it to a variable that I pass? Is there a way to write the Airflow task so that it isn't expecting a Jinja template?
This error message means that the .sql file is not found.
Using:
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='test.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments',
    dag=dag
)
where test.sql is located in the same folder as the DAG file, works fine.
If the path of the .sql file is not relative to the DAG file, you can use template_searchpath to define a list of (absolute) folders where Jinja will look for the templates.
So your code could look like:
default_args = {  # pylint: disable=invalid-name
    'owner': 'airflow',
    'start_date': datetime(2020, 12, 3),
}

with DAG(
    dag_id='my_sql_dag',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=['/home/user/DBScripts/MySQLScript']
) as dag:
    MySQLTest = MySqlOperator(
        task_id='MySQLTest',
        sql='SampleMySQLScript.sql',
        mysql_conn_id='mysql_db_connect',
        autocommit=True,
        database='segments'
    )
