Whenever I add the files argument to the email_task I get a failed run.
email_task = EmailOperator(
    task_id='email_sample_data',
    to='sample@sample.com',
    subject='Forecast for the day',
    html_content="Sample",
    files=['/home/airflow/sample.html'],
    dag=dag)
I'm getting an error that the file is not found. Where does Airflow look for my file, where do I need to upload it, and what is the correct syntax for the 'files' argument?
Airflow expects the path to be relative to the folder where the DAG file is stored.
However, since files is a templated field, you can use template_searchpath to provide additional paths that Airflow will look in:
with DAG(
    ...
    template_searchpath=['/home/airflow/'],
) as dag:
    email_task = EmailOperator(
        task_id='email_sample_data',
        to='sample@sample.com',
        subject='Forecast for the day',
        html_content="Sample",
        files=['/home/airflow/sample.html']
    )
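To make the lookup order concrete, here is a minimal sketch of how a file name can be resolved against a list of search folders, trying each one in turn. It is a simplified stand-in written for illustration, not Airflow's or Jinja's actual code; the function name resolve_template and the temporary folders are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def resolve_template(name: str, search_paths: Sequence[str]) -> Optional[Path]:
    """Return the first existing file called `name` under `search_paths`.

    Simplified stand-in for a loader that tries each search path in order;
    this is not Airflow's or Jinja's actual code.
    """
    for base in search_paths:
        candidate = Path(base) / name
        if candidate.is_file():
            return candidate
    return None


# Temporary folders stand in for the DAG folder and one extra search path.
with tempfile.TemporaryDirectory() as dag_folder, tempfile.TemporaryDirectory() as extra:
    (Path(extra) / "sample.html").write_text("<p>Sample</p>")
    found = resolve_template("sample.html", [dag_folder, extra])
    missing = resolve_template("other.html", [dag_folder, extra])
    found_name = found.name if found else None
```

A name that exists in none of the folders comes back as None here; in the real loader the equivalent outcome is a TemplateNotFound error.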
Related
I am looking for a solution to run a sql script via the BigQueryInsertJobOperator operator.
There are very few examples to be found online for that and the ones I tried have failed so far.
Mostly I get a jinja2.exceptions.TemplateNotFound error.
I have the following folder where I would like to save all my SQL scripts:
my_bucket/dags/my_other_folder/sql_scripts
I have used the template_searchpath attribute in the DAG's configuration:
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/dags'
) as dag:
and I have specified the filename in the BigQueryInsertJobOperator:
Transform = BigQueryInsertJobOperator(
    task_id='insert_data',
    configuration={
        'query': {
            'query': "{% include 'my_other_folder/test.sql' %}",
            'useLegacySql': False
        }
    },
    location='EU',
)
No matter what I do I keep getting jinja2.exceptions.TemplateNotFound: my_other_folder/test.sql error. What am I doing wrong?
You can try:
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/dags/my_other_folder'  # you can provide a list of paths
) as dag:
    Transform = BigQueryInsertJobOperator(
        task_id='insert_data',
        configuration={
            'query': {
                'query': "{% include 'test.sql' %}",  # the file name must exist in one of the template search paths
                'useLegacySql': False
            }
        },
        location='EU',
    )
I finally found the answer: the problem was the value of the DAG's template_searchpath attribute.
It should be
template_searchpath='/home/airflow/gcs/dags/my_other_folder'
or just
template_searchpath='/home/airflow/gcs/dags'
instead of
template_searchpath='/home/airflow/dags/my_other_folder'
Basically, the /gcs/ subfolder was missing from the folder path.
I am still not sure why that is. Initially I thought that the path to the folder with the SQL scripts would have to mirror the path to the folder in the GCP bucket, which does not contain the /gcs/ subfolder.
If anybody could explain why it is different and why it needs the /gcs/ subfolder, I would appreciate it.
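On the /gcs/ question: assuming this is a Cloud Composer environment (the thread does not say so explicitly), the environment's bucket is mounted on the Airflow workers under /home/airflow/gcs, so an object under dags/ in the bucket appears locally under /home/airflow/gcs/dags/. The sketch below only illustrates that mapping; the helper name bucket_to_local is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: a Cloud Composer environment, where the bucket is mounted at this local path.
COMPOSER_MOUNT = PurePosixPath("/home/airflow/gcs")


def bucket_to_local(object_path: str) -> str:
    """Map a path inside the environment's bucket to the path seen by Airflow.

    `object_path` is relative to the bucket root, e.g. "dags/my_other_folder".
    """
    return str(COMPOSER_MOUNT / object_path.strip("/"))


local_dir = bucket_to_local("dags/my_other_folder")
print(local_dir)
```

Under that assumption, the resulting local path is the value to use for template_searchpath rather than the bucket path itself.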
I am using a DAG (s3_sensor_dag) to trigger another DAG (advanced_dag) and I pass the tag_names configurations to the triggered DAG (advanced_dag) using the conf argument. It looks something like this:
s3_sensor_dag.py:
trigger_advanced_dag = TriggerDagRunOperator(
    task_id="trigger_advanced_dag",
    trigger_dag_id="advanced_dag",
    wait_for_completion=True,  # boolean, not the string "True"
    conf={"tag_names": "{{ task_instance.xcom_pull(key='tag_names', task_ids='get_file_paths') }}"}
)
In the advanced_dag, I am trying to access the dag_conf (tag_names) like this:
advanced_dag.py:
with DAG(
    dag_id="advanced_dag",
    start_date=datetime(2020, 12, 23),
    schedule_interval=None,
    is_paused_upon_creation=False,
    catchup=False,
    dagrun_timeout=timedelta(minutes=60),
) as dag:
    dag_parser = DagParser(
        home=HOME,
        env=env,
        global_cli_flags=GLOBAL_CLI_FLAGS,
        tag=dag_run.conf["tag_names"]
    )
But I get an error stating that dag_run does not exist. I realized from the question "Accessing configuration parameters passed to Airflow through CLI" that this is a runtime variable.
So I tried a solution mentioned in a comment there, which uses dag.get_dagrun(execution_date=dag.latest_execution_date).conf, like this:
dag_parser = DagParser(
    home=HOME,
    env=env,
    global_cli_flags=GLOBAL_CLI_FLAGS,
    tag=dag.get_dagrun(execution_date=dag.latest_execution_date).conf['tag_names']
)
But that did not seem to fetch the value either.
I was able to solve this by using Airflow Variables, but I would like to know whether there is a way to use the DAG run conf (which obviously only has data at run time) inside the DAG definition code and get the value.
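Since dag_run only exists while a task is running, one option is to read the conf inside a task callable instead of in the DAG definition. The sketch below shows the idea without importing Airflow: a SimpleNamespace object stands in for the real dag_run, and the function name read_tag_names and the sample values are hypothetical.

```python
from types import SimpleNamespace


def read_tag_names(**context):
    """Task callable: read `tag_names` from the triggering run's conf at run time."""
    dag_run = context["dag_run"]
    conf = dag_run.conf or {}  # conf may be None when the run was not given one
    return conf.get("tag_names")


# Stand-in for the context Airflow passes to a Python callable; hypothetical values.
fake_context = {"dag_run": SimpleNamespace(conf={"tag_names": "tag_a,tag_b"})}
tag_names = read_tag_names(**fake_context)
```

In a real DAG a function like this would be the python_callable of a PythonOperator. For templated operator fields, the same value can be referenced as "{{ dag_run.conf['tag_names'] }}", which is also rendered at run time.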
I am trying to take data from a BigQuery dataset and pass the result to bash_command so that it executes commands to remove files in Cloud Storage.
When I execute 'SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1', the result is: gsutil rm -r gs://A/loop_member_*.csv
I want to take the result of the code below and pass it to bash_command in the next task.
Thank you.
DAG Code
with DAG(
    dag_id='kodz_Automation',
    description='kodz_Automation',
    schedule_interval=None,
    catchup=False,
    default_args=DEFAULT_DAG_ARGS) as dag:
    def get_data_from_bq(**kwargs):
        hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
        conn = hook.get_conn()
        cursor = conn.cursor()
        cursor.execute('SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1')
        result = cursor.fetchall()
        print('result', result)
        return result
    fetch_data = PythonOperator(
        task_id='fetch_data_public_dataset',
        provide_context=True,
        python_callable=get_data_from_bq,
        dag=dag
    )
    also_run_this = bash_operator.BashOperator(
        task_id='also_run_this',
        bash_command=python_callable
    )
    fetch_data >> also_run_this
To send data from one task to another you can use Airflow's XCom feature.
With PythonOperator, the returned value is stored in XCom by default, so all you need to do is add an xcom_pull in the BashOperator, something like this:
also_run_this = bash_operator.BashOperator(
    task_id='also_run_this',
    bash_command="<your command> {{ ti.xcom_pull(task_ids='fetch_data_public_dataset') }}"
)
See the Airflow documentation on XComs to learn more.
If you need to return a lot of data, though, I recommend saving it in some storage (S3, GCS, etc.) and passing the link to the bash command instead.
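One detail worth checking: cursor.fetchall() returns a list of rows, so the value pulled from XCom renders as a nested list rather than a bare command string. A minimal sketch of unwrapping it before returning, assuming the query yields a single row with a single column as in the question; the helper name first_cell is hypothetical.

```python
from typing import Optional, Sequence


def first_cell(rows: Sequence[Sequence[object]]) -> Optional[str]:
    """Return the first column of the first row as a string, or None if empty."""
    if not rows or not rows[0]:
        return None
    return str(rows[0][0])


# Hypothetical fetchall() result shaped like the one in the question.
rows = [["gsutil rm -r gs://A/loop_member_*.csv"]]
command = first_cell(rows)
```

Returning first_cell(result) from the Python callable, instead of the raw result, would let the template in bash_command render just the command text.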
I have a process where I wait for a file every week, but the file name is timestamped with the measurement date. So I know I will get something this week, and the name can be anything from 2020-05-25*.csv to 2020-05-31*.csv.
The only way I have found to start my processes with Airflow is to run a sensor on an @daily schedule and use the execution date to check whether there is a file.
The thing is, since I don't know which day the file will be uploaded, I will have 6 failed sensors, so 6 failed DAG runs, and 1 success.
SFTP sensor example:
with DAG(
    "geometrie-sftp-to-safe",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=True,
) as dag:
    starting_sensor = DummyOperator(
        task_id="starting_sensor"
    )
    sensor_sftp_A = SFTPSensor(
        task_id="sensor_sftp_A",
        path="/input/geometrie/prod/Track_Geometry-{{ ds_nodash }}_A.csv",
        sftp_conn_id="ssh_ftp_landing",
        poke_interval=60,
        soft_fail=True,
        mode="reschedule"
    )
Second, with the GCS sensor:
with DAG(
    "geometrie-preprocessing",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=True
) as dag:
    # File A
    sensor_gcs_A = GoogleCloudStorageObjectSensor(
        task_id="gcs-sensor_A",
        bucket="lisea-mesea-sea-cloud-safe",
        object="geometrie/original/track_geometry_{{ ds_nodash }}_A.csv",
        google_cloud_conn_id="gcp_conn",
        poke_interval=50
    )
That's why I would like the DAG runs to be marked as skipped if, and only if, the sensor has failed. If anything else goes wrong, I would like a real failure.
Airflow has several sensors that watch a location for a given file. Setting schedule_interval to None will work for your use case, since you want the DAG to run only when the file is received (and the file can arrive at any time within the week).
The example below uses a GCS sensor to watch the bucket for files with a given prefix and prints the file names. I am fairly sure that the SFTP sensor works the same way.
dag = DAG(
    dag_id='sensing-bucket',
    schedule_interval=None,
    default_args=args)
def new_file_detection(**context):
    value = context['ti'].xcom_pull(task_ids='list_Files')
    print('value is : ' + str(value))
File_sensor = GoogleCloudStoragePrefixSensor(
    task_id='gcs_polling',
    bucket='lisea-mesea-sea-cloud-safe',
    prefix='geometrie/original/track_geometry_',
    dag=dag
)
GCS_File_list = GoogleCloudStorageListOperator(
    task_id='list_Files',
    bucket='lisea-mesea-sea-cloud-safe',
    prefix='geometrie/original/track_geometry_',
    delimiter='.csv',
    google_cloud_storage_conn_id='google_cloud_default',
    dag=dag
)
File_detection = PythonOperator(
    task_id='print_detected_filename',
    provide_context=True,
    python_callable=new_file_detection,
    dag=dag
)
File_sensor >> GCS_File_list >> File_detection
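Because the prefix matches every file ever uploaded, the detection step may need to keep only the files stamped within the current week. A minimal sketch of that filter, assuming the names carry a YYYYMMDD stamp as in the question's object path; the function name files_in_week and the sample listing are hypothetical.

```python
import re
from datetime import date, datetime, timedelta
from typing import List

# Assumption: names carry a YYYYMMDD stamp, as in track_geometry_{{ ds_nodash }}_A.csv.
STAMP = re.compile(r"track_geometry_(\d{8})_")


def files_in_week(names: List[str], week_start: date) -> List[str]:
    """Keep the names whose date stamp falls in the 7 days starting at `week_start`."""
    week_end = week_start + timedelta(days=6)
    matches = []
    for name in names:
        found = STAMP.search(name)
        if not found:
            continue
        try:
            stamp = datetime.strptime(found.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a valid date
        if week_start <= stamp <= week_end:
            matches.append(name)
    return matches


# Hypothetical listing, shaped like the list pulled from the list_Files task.
listing = [
    "geometrie/original/track_geometry_20200520_A.csv",
    "geometrie/original/track_geometry_20200527_A.csv",
]
this_week = files_in_week(listing, date(2020, 5, 25))
```

A function like this could be called from new_file_detection on the value pulled from XCom.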
I'm not understanding something about using the MySQL operator to call a MySQL script with Apache Airflow.
When I run this task...
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='/home/user/DBScripts/MySQLScript/SampleMySQLScript.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments'
)
I get this error in traceback...
jinja2.exceptions.TemplateNotFound: /home/user/DBScripts/MySQLScript/SampleMySQLScript.sql
The DAG task runs fine if I provide the entire SQL script as a parameter.
I'm not familiar with Jinja templating.
Is it easier to learn to write my scripts as a Jinja template? Should I import the text of the script and assign it to a variable that I pass? Is there a way to write the Airflow task so that it isn't expecting a Jinja template?
This error message means that the .sql file is not found.
Using:
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='test.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments',
    dag=dag
)
where test.sql is located in the same folder as the DAG file, works fine.
If the path of the .sql file is not relative to the DAG file, you can use template_searchpath to define the list of absolute folders where Jinja will look for the templates.
So your code could look like:
default_args = {  # pylint: disable=invalid-name
    'owner': 'airflow',
    'start_date': datetime(2020, 12, 3),  # no leading zero: 03 is a syntax error in Python 3
}
with DAG(
    dag_id='my_sql_dag',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=['/home/user/DBScripts/MySQLScript']
) as dag:
    MySQLTest = MySqlOperator(
        task_id='MySQLTest',
        sql='SampleMySQLScript.sql',
        mysql_conn_id='mysql_db_connect',
        autocommit=True,
        database='segments'
    )
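The question also asked whether the script text could simply be read into a variable and passed along. That should work too: a string that does not end in .sql is not treated as a template file name, although any Jinja expressions inside the text are still rendered. A minimal sketch, with a temporary file standing in for the real script; the helper name load_sql is hypothetical.

```python
import tempfile
from pathlib import Path


def load_sql(path: str) -> str:
    """Read a SQL script from disk and return its text."""
    return Path(path).read_text(encoding="utf-8")


# A temporary file stands in for SampleMySQLScript.sql.
with tempfile.TemporaryDirectory() as folder:
    script = Path(folder) / "SampleMySQLScript.sql"
    script.write_text("SELECT 1;\n", encoding="utf-8")
    sql_text = load_sql(str(script))
```

The returned text would then be passed as sql=sql_text. Note that reading the file in the DAG file means it is read every time the DAG is parsed, so template_searchpath is probably the cleaner choice for larger scripts.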