I'm not understanding something about using the MySQL operator to call a MySQL script with Apache Airflow.
When I run this task...
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='/home/user/DBScripts/MySQLScript/SampleMySQLScript.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments'
)
I get this error in traceback...
jinja2.exceptions.TemplateNotFound: /home/user/DBScripts/MySQLScript/SampleMySQLScript.sql
The DAG task runs fine if I provide the entire SQL script as a parameter.
I'm not familiar with Jinja templating.
Is it easier to learn to write my scripts as a Jinja template? Should I import the text of the script and assign it to a variable that I pass? Is there a way to write the Airflow task so that it isn't expecting a Jinja template?
This error message means that the .sql file is not found.
Using:
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='test.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments',
    dag=dag
)
where test.sql is located in the same folder as the DAG file works fine.
If the path of the .sql file is not relative to the DAG file, you can use template_searchpath to define the list of (absolute) folders where Jinja will look for the templates.
So your code could look like:
default_args = {  # pylint: disable=invalid-name
    'owner': 'airflow',
    'start_date': datetime(2020, 12, 3),
}

with DAG(
    dag_id='my_sql_dag',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=['/home/user/DBScripts/MySQLScript']
) as dag:
    MySQLTest = MySqlOperator(
        task_id='MySQLTest',
        sql='SampleMySQLScript.sql',
        mysql_conn_id='mysql_db_connect',
        autocommit=True,
        database='segments'
    )
Related
Whenever I add the files argument to the email_task I get a failed run.
email_task = EmailOperator(
    task_id='email_sample_data',
    to='sample@sample.com',
    subject='Forecast for the day',
    html_content="Sample",
    files=['/home/airflow/sample.html'],
    dag=dag)
I'm getting an error that the file is not found. Where does Airflow pick up my file, where do I need to upload the file, and what is the correct syntax for the files argument?
Airflow expects the path to be relative to the folder where the DAG file is stored.
However, since files is a templated field, you can use template_searchpath to provide additional paths that Airflow will look in:
with DAG(
    ...
    template_searchpath=['/home/airflow/'],
) as dag:
    email_task = EmailOperator(
        task_id='email_sample_data',
        to='sample@sample.com',
        subject='Forecast for the day',
        html_content="Sample",
        files=['/home/airflow/sample.html']
    )
I have an Airflow DAG which I use to submit a Spark job, and for that I use the SparkSubmitOperator. In the DAG, I have to specify the application JAR that needs to be run. At the moment it is hardcoded to spark-job-1.0.jar, as follows:
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

args = {
    'owner': 'joe',
    'start_date': datetime(2020, 5, 23)
}

dag = DAG('spark_job', default_args=args)

operator = SparkSubmitOperator(
    task_id='spark_sm_job',
    conn_id='spark_submit',
    java_class='com.mypackage',
    application='/home/ubuntu/spark-job-1.0.jar',  # <--------- hardcoded
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='500M',
    num_executors='1',
    name='airflow-spark-job',
    verbose=False,
    driver_memory='500M',
    application_args=["yarn", "10.11.21.12:9092"],
    dag=dag,
)
The problem is that the release version will increment, and I tried using a wildcard, spark-job-*.jar, but it didn't work. Is it possible to use a wildcard, or is there any other way to get around this?
Assuming that there will always be only one file matching the glob /home/ubuntu/spark-job-*.jar, you can do the following:
import glob
...
application=glob.glob('/home/ubuntu/spark-job-*.jar')[0],
...
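Put together with the operator from the question, a minimal sketch could look like the code below. The glob is resolved each time the scheduler parses the DAG file; if more than one JAR could ever match, you may want to sort the matches and pick the newest instead of taking the first.

import glob
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {
    'owner': 'joe',
    'start_date': datetime(2020, 5, 23)
}

dag = DAG('spark_job', default_args=args)

# Resolved at DAG-parse time; raises IndexError if nothing matches the glob.
jar_path = glob.glob('/home/ubuntu/spark-job-*.jar')[0]

operator = SparkSubmitOperator(
    task_id='spark_sm_job',
    conn_id='spark_submit',
    java_class='com.mypackage',
    application=jar_path,
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='500M',
    num_executors='1',
    name='airflow-spark-job',
    verbose=False,
    driver_memory='500M',
    application_args=["yarn", "10.11.21.12:9092"],
    dag=dag,
)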
I am trying to use a Jinja template variable instead of Variable.get('sql_path'), so as to avoid hitting the database on every parse of the DAG file.
Original code:
import datetime
import os
from functools import partial
from datetime import timedelta

from airflow.models import DAG, Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email

SNOWFLAKE_CONN_ID = 'etl_conn'

tmpl_search_path = []
for subdir in ['business/', 'audit/', 'business/transform/']:
    tmpl_search_path.append(os.path.join(Variable.get('sql_path'), subdir))


def get_db_dag(
    *,
    dag_id,
    start_date,
    schedule_interval,
    max_taskrun,
    max_dagrun,
    proc_nm,
    load_sql
):
    default_args = {
        'owner': 'airflow',
        'start_date': start_date,
        'provide_context': True,
        'execution_timeout': timedelta(minutes=max_taskrun),
        'retries': 0,
        'retry_delay': timedelta(minutes=3),
        'retry_exponential_backoff': True,
        'email_on_retry': False,
    }

    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule_interval,
        dagrun_timeout=timedelta(hours=max_dagrun),
        template_searchpath=tmpl_search_path,
        default_args=default_args,
        max_active_runs=1,
        catchup='{{var.value.dag_catchup}}',
        on_failure_callback=alert_email_callback,
    )

    load_table = SnowflakeOperator(
        task_id='load_table',
        sql=load_sql,
        snowflake_conn_id=SNOWFLAKE_CONN_ID,
        autocommit=True,
        dag=dag,
    )

    load_vcc_svc_recon

    return dag


# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
    dag_id='edw_table_A',
    start_date=datetime.datetime(2020, 5, 21),
    schedule_interval='0 5 * * *',
    max_taskrun=3,  # Minutes
    max_dagrun=1,  # Hours
    load_sql='recon/extract.sql',
)
When I replaced Variable.get('sql_path') with the Jinja template '{{var.value.sql_path}}' as below and ran the DAG, it threw the error below; it was not able to find the file in the subdirectory of the SQL folder.
tmpl_search_path = []
for subdir in ['bus/', 'audit/', 'business/snflk/']:
    tmpl_search_path.append(os.path.join('{{var.value.sql_path}}', subdir))
I got the error below:
jinja2.exceptions.TemplateNotFound: extract.sql
Templates are not rendered everywhere in a DAG script; usually they are rendered only in the templated parameters of operators. So, unless you pass the elements of tmpl_search_path to some templated parameter, {{var.value.sql_path}} will not be rendered.
The template_searchpath of DAG is not templated. That is why you cannot pass Jinja templates to it.
The options I can think of are:
Hardcode the value of Variable.get('sql_path') in the pipeline script.
Save the value of Variable.get('sql_path') in a configuration file and read it from there in the pipeline script.
Move the Variable.get() call out of the for-loop (see the sketch below). This will result in three times fewer requests to the database.
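For example, the third option could look like the following minimal sketch, based on the code in the question; the single Variable.get() call now runs once per parse of the DAG file instead of once per subdirectory:

import os

from airflow.models import Variable

# One database hit per DAG-file parse instead of one per subdirectory.
sql_path = Variable.get('sql_path')

tmpl_search_path = [
    os.path.join(sql_path, subdir)
    for subdir in ['business/', 'audit/', 'business/transform/']
]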
More info about templating in Airflow.
I'm new to BI and I need some help. I'm using the Windows Job Scheduler to execute tasks, but it sometimes has bugs, so I am moving to Apache Airflow. I already have a .bat file that runs the job, and I want to run it from an Apache Airflow DAG. This is my .bat file:
cd /d D:\EXMOOV\Scripts
call RunDTS.bat EXMOOV Js_002_MOOV_AIR
I want to put it into a DAG file so it gets executed, so I took an example DAG and tried to paste it in, but the file became unreadable and Apache Airflow didn't read it. This is my try:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'Brahim',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
}

dag = DAG(
    'My_first_test_code',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)
cd /d D:\EXMOOV\Scripts
call RunDTS.bat EXMOOV Js_002_MOOV_AIR
t2 = BashOperator(
    task_id='sleep',
    depends_on_past=False,
    bash_command='sleep 5',
    retries=3,
    dag=dag,
)
dag.doc_md = __doc__
t1.doc_md = """\
templated_command = """
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    depends_on_past=False,
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag,
)
t1 >> [t2, t3]
I just want a DAG file that runs those two lines, that's all, in order to execute the .bat script, which is used to run ETL jobs in IBM DataStage.
You haven't mentioned your architecture, but it seems your Airflow is on a Linux machine, or it is installed on Windows Subsystem for Linux. Either way, I think you can use the Python library for Windows Remote Management (pywinrm), which should allow you to access resources on the Windows server. Besides, the syntax for running a Windows command from a Python script is not correct; check this out:
Running windows shell commands with python
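One way to wire that idea into Airflow could be a PythonOperator, as in the minimal sketch below. It assumes WinRM is enabled on the Windows server and uses a hypothetical hostname and credentials (in practice you would pull them from an Airflow connection rather than hardcoding them):

import winrm  # pip install pywinrm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago


def run_dts_job():
    # Hypothetical host and credentials; WinRM must be enabled on the Windows server.
    session = winrm.Session('windows-host.example.com', auth=('svc_user', 'secret'))
    # Change to the scripts folder and run the batch job, mirroring the original .bat file.
    result = session.run_ps(r'Set-Location D:\EXMOOV\Scripts; .\RunDTS.bat EXMOOV Js_002_MOOV_AIR')
    if result.status_code != 0:
        raise RuntimeError(result.std_err.decode(errors='replace'))
    print(result.std_out.decode(errors='replace'))


dag = DAG(
    'run_dts_via_winrm',
    start_date=days_ago(2),
    schedule_interval=None,
)

run_dts = PythonOperator(
    task_id='run_dts_job',
    python_callable=run_dts_job,
    dag=dag,
)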
I am able to run a Spark job using the BashOperator, but I want to use the SparkSubmitOperator for it with Spark standalone mode.
Here's my DAG for SparkSubmitOperator and stack-trace
args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 24)
}

dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    conf={'master': 'spark://xx.xx.xx.xx:7077'},
    dag=dag,
)
Looking at the source for spark_submit_hook, it seems _resolve_connection() always sets master=yarn. How can I change the master property to the Spark standalone master URL? Which properties can I set to run the Spark job in standalone mode?
You can either create a new connection using the Airflow Web UI or change the spark-default connection.
The master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT, or k8s://https://<HOST>:<PORT>.
You can also supply the following options in the Extra field of the connection:
{"queue": "root.default", "deploy_mode": "cluster", "spark_home": "", "spark_binary": "spark-submit", "namespace": "default"}
Either the "spark-submit" binary should be in the PATH, or spark-home should be set in the Extra field of the connection.
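For illustration, assuming you created a connection named spark_standalone in the Web UI (Admin -> Connections) with Conn Type Spark, Host spark://xx.xx.xx.xx and Port 7077, the DAG from the question would only need to point at that connection. This is a sketch, and the connection id is hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 24)
}

dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_standalone',  # hypothetical connection pointing at the standalone master
    application='/home/ubuntu/test.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    dag=dag,
)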