I would like to get the execution hour inside a DAG context. I checked and found that {{ ds }} provides only the execution date and not the time. Is there any way to get the hour at which the DAG gets executed on any given day?
with DAG(dag_id="dag_name", schedule_interval="30 * * * *", max_active_runs=1) as dag:
    features_hourly = KubernetesPodOperator(
        task_id="task-name",
        name="task-name",
        cmds=[
            "python", "-m", "sql_library.scripts.sql_executor",
            "--template", "format",
            "--env-names", "'" + json.dumps(["SCHEMA"]) + "'",
            "--vars", "'" + json.dumps({
                "EXECUTION_DATE": "{{ ds }}",
                "PREDICTION_HOUR": ??,
            }) + "'",
            "sql_filename.sql",
        ],
        **default_task_params,
    )
execution_date is a pendulum.DateTime object which has an hour attribute (docs):
{{ execution_date.hour }}
You can find examples and more details about the template variables in the docs.
Note that execution_date is deprecated since Airflow 2.2. The equivalent is now {{ dag_run.logical_date }}.
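For the operator in the question, the --vars entry could then look like the sketch below (it keeps the asker's cmds structure; cmds is a templated field of KubernetesPodOperator, so the Jinja expressions are rendered at runtime):
            "--vars", "'" + json.dumps({
                "EXECUTION_DATE": "{{ ds }}",
                # hour of the run; use execution_date.hour on Airflow < 2.2
                "PREDICTION_HOUR": "{{ dag_run.logical_date.hour }}",
            }) + "'",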
I have a DAG with schedule interval None. I want to trigger this DAG with a TriggerDagRunOperator multiple times in a day.
I created a PreDag with schedule_interval "* 1/12 * * *".
Inside PreDag, a TriggerDagRunOperator task runs that triggers the main DAG.
As scheduled, PreDag runs twice a day. The first time PreDag runs it triggers the DAG, but the second time the TriggerDagRunOperator task shows this error:
"A Dag Run already exists for dag id {{ dag_id }} at {{ execution_date }} with run id {{ trigger_run_id }}"
trigger_run = TriggerDagRunOperator(
    task_id="main_dag_trigger",
    trigger_dag_id=str('DW_Test_TriggerDag'),
    pool='branch_pool_limit',
    wait_for_completion=True,
    poke_interval=20,
    trigger_run_id='trig__' + str(datetime.now()),
    execution_date='{{ ds }}',
    # reset_dag_run=True,
    dag=predag,
)
Is it possible to trigger a DAG multiple times in a day using TriggerDagRunOperator?
Airflow uses dag_id and execution_date as the unique key of the dag run table, so when the DAG is triggered the second time, a run with the same execution_date as the first run already exists.
Why do you have this problem? Because you are using {{ ds }} as the execution_date for the run:
The DAG run’s logical date as YYYY-MM-DD. Same as {{ dag_run.logical_date | ds }}.
This is the date of your run, not the datetime, and the date of two runs triggered on the same day is the same.
You can fix it by replacing {{ ds }} with {{ ts }}.
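Applied to the operator in the question, a minimal sketch of the fix (same arguments as above, with trigger_run_id left to its default) could look like this:
trigger_run = TriggerDagRunOperator(
    task_id="main_dag_trigger",
    trigger_dag_id='DW_Test_TriggerDag',
    pool='branch_pool_limit',
    wait_for_completion=True,
    poke_interval=20,
    # {{ ts }} renders the full timestamp (date and time), so each PreDag run
    # produces a distinct execution_date for the triggered DAG
    execution_date='{{ ts }}',
    dag=predag,
)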
I have a PythonOperator and a BigQueryInsertJobOperator in my DAG. The result returned by the PythonOperator should be passed to the BigQueryInsertJobOperator in the params field.
Below is the script I am running.
def get_columns():
    field = "name"
    return field

with models.DAG(
    "xcom_test",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    tags=["xcom"],
) as dag:
    t1 = PythonOperator(task_id="get_columns", python_callable=get_columns, do_xcom_push=True)
    t2 = BigQueryInsertJobOperator(
        task_id="bigquery_insert",
        project_id=project_id,
        configuration={
            "query": {
                "query": "{% include 'xcom_query.sql' %}",
                "useLegacySql": False,
            }
        },
        force_rerun=True,
        provide_context=True,
        params={
            "columns": "{{ ti.xcom_pull(task_ids='get_columns') }}",
            "project_id": project_id,
        },
    )
The xcom_query.sql looks like this:
INSERT INTO `{{ params.project_id }}.test.xcom_test`
{{ params.columns }}
select 'Andrew' from `{{ params.project_id }}.test.xcom_test`
While running this, the columns param is passed through as a literal string, resulting in an error. Below is how the query was rendered:
INSERT INTO `project.test.xcom_test`
{{ ti.xcom_pull(task_ids='get_columns') }}
select 'Andrew' from `project.test.xcom_test`
Any idea what I am missing?
I found the reason why my DAG is failing.
The "params" field of BigQueryInsertJobOperator is not a templated field, so calling "task_instance.xcom_pull" there will not work this way.
Instead, you can access the 'task_instance' variable directly from the Jinja template:
INSERT INTO `{{ params.project_id }}.test.xcom_test`
({{ task_instance.xcom_pull(task_ids='get_columns') }})
select
'Andrew' from `{{ params.project_id }}.test.xcom_test`
https://marclamberti.com/blog/templates-macros-apache-airflow/ - This article explains how to identify templated parameters in Airflow.
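Putting it together, a sketch of the adjusted operator (the columns entry is dropped from params because the value now comes straight from XCom inside the template; only project_id is still referenced as {{ params.project_id }}):
t2 = BigQueryInsertJobOperator(
    task_id="bigquery_insert",
    project_id=project_id,
    configuration={
        "query": {
            # xcom_query.sql now pulls the columns via task_instance.xcom_pull
            "query": "{% include 'xcom_query.sql' %}",
            "useLegacySql": False,
        }
    },
    force_rerun=True,
    params={"project_id": project_id},
)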
How do I reuse a value that is calculated on the DAG run between tasks?
I'm trying to generate a timestamp in my DAG and use it in several tasks. So far I have tried setting a Variable and a params value - nothing works; the value is different for each task run.
Here is my code:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator

default_args = {
    "sla": timedelta(hours=1),
}

config = Variable.get("config", deserialize_json=True)
athena_output_bucket = config["athena_output_bucket"]
glue_db = config["glue_db"]
bucket = config["bucket"]
region = config["region"]


def get_snapshot_timestamp(time_of_run=None):
    if not time_of_run:
        time_of_run = datetime.now()
    timestamp = time_of_run.timestamp() * 1000
    return str(int(timestamp))


class TemplatedArgsGlueOperator(AwsGlueJobOperator):
    template_fields = ("script_args",)


table = "my_table"

with DAG(
    "my-table-export",
    default_args=default_args,
    description="Export my table from DynamoDB to S3",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    params={
        "snapshot_ts": get_snapshot_timestamp(),
        "athena_output_location": f"s3://{athena_output_bucket}/{table}",
        "table": table,
    },
) as dag:
    my_table_export_to_s3 = TemplatedArgsGlueOperator(
        task_id="my_table_export_to_s3",
        job_name="my-table-export-to-s3",
        num_of_dpus=2,
        region_name=region,
        script_args={"--snapshot_ts": "{{ params.snapshot_ts }}"},
    )

    add_new_partition = AWSAthenaOperator(
        task_id="add_new_partition",
        query="""
            ALTER TABLE {{ params.table }} ADD PARTITION (snapshot_ts = '{{ params.snapshot_ts }}')
            LOCATION 's3://{{ var.json.config.bucket }}/{{ params.table }}/snapshot_ts={{ params.snapshot_ts }}'
        """,
        database=glue_db,
        output_location="{{ params.athena_output_location }}",
    )

    update_latest_view = AWSAthenaOperator(
        task_id="update_latest_view",
        query="""
            CREATE OR REPLACE VIEW {{ params.table }}_latest AS
            SELECT * from {{ params.table }}
            WHERE snapshot_ts = '{{ params.snapshot_ts }}'
        """,
        database=glue_db,
        output_location="{{ params.athena_output_location }}",
    )

    my_table_export_to_s3 >> add_new_partition >> update_latest_view
I want snapshot_ts to be the same across all three tasks, but it's different. What am I doing wrong?
This should be possible via XCom. XCom is used precisely for exchanging information between tasks. To quote:
XComs let tasks exchange messages, allowing more nuanced forms of
control and shared state. The name is an abbreviation of
“cross-communication”. XComs are principally defined by a key, value,
and timestamp, but also track attributes like the task/DAG that
created the XCom and when it should become visible. Any object that
can be pickled can be used as an XCom value, so users should make sure
to use objects of appropriate size.
With XCom, a PythonOperator is used to call a function. That function pushes some values into a table called xcom inside the Airflow metadata DB. The same values are then accessed by other DAGs or tasks.
An example of how to do it all is here - https://www.cloudwalker.io/2020/01/28/airflow-xcom/
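For the DAG above, a sketch of that approach could look like the following (it reuses the question's names; the tasks are assumed to live inside the same with DAG(...) block, with the import at module level, and the Athena operators would need the same xcom_pull expression in place of {{ params.snapshot_ts }} in their query fields):
from airflow.operators.python_operator import PythonOperator

def push_snapshot_ts():
    # runs once per DAG run; the return value is pushed to XCom automatically
    return get_snapshot_timestamp()

set_snapshot_ts = PythonOperator(
    task_id="set_snapshot_ts",
    python_callable=push_snapshot_ts,
)

# every downstream templated field pulls the same value for this run
my_table_export_to_s3 = TemplatedArgsGlueOperator(
    task_id="my_table_export_to_s3",
    job_name="my-table-export-to-s3",
    num_of_dpus=2,
    region_name=region,
    script_args={
        "--snapshot_ts": "{{ task_instance.xcom_pull(task_ids='set_snapshot_ts') }}"
    },
)

set_snapshot_ts >> my_table_export_to_s3 >> add_new_partition >> update_latest_view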
I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate a date parameter based on the DAG run date - I try to achieve that with a PythonOperator.
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using a 'TODO' label.
...

def set_date_param(dag_run_time):
    # a business logic applied here
    ...
    return "2020-05-28"  # example result

# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------

# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_date":  # TODO - how to pass the DAG run date as a function input parameter
    },
    dag=dag,
)

# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
           FROM `my_project.dataset1.table1`
           WHERE date_key = {date_key_param}
        """.format(
        date_key_param=  # TODO - how to get the python operator result from the previous step
    ),
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag,
)
set_data_param >> load_data_to_bq_table
For the PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as you have already done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically receive the execution date as a string in the format YYYY-MM-DD.
XCom allows task instances to exchange messages. To use the date returned by set_date_param() inside the SQL query string of the BigQueryOperator, you can combine XCom with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_args = {'start_date': days_ago(1)}


def calculate_date(ds, execution_date, **kwargs):
    return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'


def log_date(date_string):
    logging.info(date_string)


with DAG(
    'a_dag',
    schedule_interval='*/5 * * * *',
    default_args=default_args,
    catchup=False,
) as dag:
    get_date = PythonOperator(
        task_id='get_date',
        python_callable=calculate_date,
        provide_context=True,
    )

    use_date = PythonOperator(
        task_id='use_date',
        python_callable=log_date,
        op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
    )

    get_date >> use_date
I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
          schedule_interval='@daily',
          user_defined_macros={
              'last_dag_run_execution_date': get_last_dag_run
          }
          )

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the string for the date format, and what you want output if there is no previous run.
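Since the question is about DAGs that don't run on a schedule, note that get_last_dagrun also accepts an include_externally_triggered flag; a sketch of the macro that counts manually or externally triggered runs as well:
def get_last_dag_run(dag):
    # include runs created by triggers / the UI, not only scheduled ones
    last_dag_run = dag.get_last_dagrun(include_externally_triggered=True)
    if last_dag_run is None:
        return "no prev run"
    return last_dag_run.execution_date.strftime("%Y-%m-%d")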
You can also write your own custom macro function and use the Airflow models to search the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search the metadata DB
    return xxx

dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the key in your template: {{ last_dag_run_execution_date(dag.dag_id) }}.
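A minimal sketch of such a lookup, assuming the DagRun.find API of the Airflow models (function and variable names here are illustrative):
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    # query the metadata DB for all runs of the given DAG
    dag_runs = DagRun.find(dag_id=dag_id)
    if not dag_runs:
        return "no prev run"
    # newest run first
    dag_runs.sort(key=lambda run: run.execution_date, reverse=True)
    return dag_runs[0].execution_date.strftime("%Y-%m-%d")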