In one of my Airflow tasks, I want to pass the current month, the current hour, and the last hour (all in UTC) as variables.
I know the macros are there, but Airflow is running on an IST timestamp; how can I get these values in UTC? Any sample code?
execution_date is a Pendulum object, so you can use in_tz():
{{ execution_date.in_tz('UTC') }}
You can then apply whatever format pattern you need to extract the string.
For example to get the month:
op = BashOperator(
    task_id="example",
    bash_command=(
        "echo the month extracted from {{ execution_date.in_tz('UTC') }}"
        " is {{ execution_date.in_tz('UTC').strftime('%m') }}"
    ),
)
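And since the question asks for the hour and the last hour as well, here is a minimal sketch covering all three values (task id hypothetical, Airflow 2 import path assumed); Pendulum's subtract() handles the previous hour:

from airflow.operators.bash import BashOperator

utc_parts = BashOperator(
    task_id="utc_parts",
    bash_command=(
        "echo month={{ execution_date.in_tz('UTC').strftime('%m') }}"
        " hour={{ execution_date.in_tz('UTC').strftime('%H') }}"
        # subtract(hours=1) shifts the Pendulum datetime back one hour
        " last_hour={{ execution_date.in_tz('UTC').subtract(hours=1).strftime('%H') }}"
    ),
)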
I have a DAG with schedule interval None. I want to trigger this DAG with TriggerDagRunOperator multiple times in a day.
I created a PreDag with schedule_interval "* 1/12 * * *".
Inside PreDag, a TriggerDagRunOperator task runs that triggers the main DAG.
As scheduled, PreDag runs twice a day. The first time PreDag runs, it triggers the DAG, but the second time the TriggerDagRunOperator task fails with:
"A Dag Run already exists for dag id {{ dag_id }} at {{ execution_date }} with run id {{ trigger_run_id }}"
trigger_run = TriggerDagRunOperator(
    task_id="main_dag_trigger",
    trigger_dag_id="DW_Test_TriggerDag",
    pool="branch_pool_limit",
    wait_for_completion=True,
    poke_interval=20,
    trigger_run_id="trig__" + str(datetime.now()),
    execution_date="{{ ds }}",
    # reset_dag_run=True,
    dag=predag,
)
Is it possible to trigger a DAG multiple times in a day using TriggerDagRunOperator?
Airflow uses (dag_id, execution_date) as the unique key for the dag run table, so when the DAG is triggered the second time, a run with the same execution_date already exists from the first trigger.
Why do you have this problem? Because you are using {{ ds }} as the execution_date for the run. Per the docs:
The DAG run's logical date as YYYY-MM-DD. Same as {{ dag_run.logical_date | ds }}.
That is the date of your run, not the datetime, and two runs triggered on the same day share the same date.
You can fix it by replacing {{ ds }} with {{ ts }}.
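A minimal sketch of the trigger task with that one change applied (other arguments as in the question, Airflow 2 import path assumed):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_run = TriggerDagRunOperator(
    task_id="main_dag_trigger",
    trigger_dag_id="DW_Test_TriggerDag",
    wait_for_completion=True,
    poke_interval=20,
    # {{ ts }} renders the full timestamp, so two runs triggered on the
    # same day get distinct execution dates.
    execution_date="{{ ts }}",
    dag=predag,
)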
I would like to get the execution hour inside a DAG context. I checked and found that {{ ds }} provides only the execution date, not the time. Is there any way to get the hour at which the DAG gets executed on any given day?
with DAG(dag_id="dag_name", schedule_interval="30 * * * *", max_active_runs=1) as dag:
    features_hourly = KubernetesPodOperator(
        task_id="task-name",
        name="task-name",
        cmds=[
            "python", "-m", "sql_library.scripts.sql_executor",
            "--template", "format",
            "--env-names", "'" + json.dumps(["SCHEMA"]) + "'",
            "--vars", "'" + json.dumps({
                "EXECUTION_DATE": "{{ ds }}",
                "PREDICTION_HOUR": ??,
            }) + "'",
            "sql_filename.sql",
        ],
        **default_task_params,
    )
execution_date is a pendulum.DateTime object, which has an hour attribute (docs):
{{ execution_date.hour }}
You can find examples and more details about the template variables in the docs.
Note that execution_date is deprecated since Airflow 2.2. The equivalent is now {{ dag_run.logical_date }}.
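For instance, a minimal sketch (task id hypothetical, Airflow 2 import path assumed) that echoes the run's hour using the non-deprecated spelling:

from airflow.operators.bash import BashOperator

print_hour = BashOperator(
    task_id="print_hour",
    # .hour renders as an integer 0-23 on the run's logical date
    bash_command="echo run hour is {{ dag_run.logical_date.hour }}",
)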
When I do something like:
some_value = "{{ dag_run.get_task_instance('start').start_date }}"
print(f"some interpolated value: {some_value}")
I see this in the airflow logs:
some interpolated value: {{ dag_run.get_task_instance('start').start_date }}
but not the actual value itself. How can I easily see what the value is?
Jinja templates are only rendered in an operator's templated fields, not in arbitrary Python strings, so read the value from the task context instead. Everything in the DAG task run comes through as kwargs (before 1.10.12 you needed to add provide_context, but since version 2 the full context is always provided).
To get something out of kwargs, do something like this in your Python callable:
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
Additional info:
To get the kwargs out, add them to your callable, so:
def my_func(**kwargs):
    run_id = kwargs['run_id']
    print(f'run_id = {run_id}')
And just call this from your DAG task like:
my_task = PythonOperator(
    task_id='my_task',
    dag=dag,
    python_callable=my_func,
)
I'm not sure what your current code structure is, as you haven't provided more info, I'm afraid.
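That said, for the specific value in the question, a sketch of a callable (name hypothetical) that pulls the dag_run object out of the context and reads the task's start date:

def show_start(**kwargs):
    # dag_run is part of the context passed to the callable
    ti = kwargs["dag_run"].get_task_instance("start")
    print(f"some interpolated value: {ti.start_date}")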
I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(
    dag_id="my_test_dag",
    schedule_interval='@daily',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    },
)

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag,
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the date format string, and what is output when there is no previous run.
You can make your own custom macro function and use the Airflow models to search the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search the metadata DB
    return xxx
dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    },
)
Then use the key (last_dag_run_execution_date) in your template.
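One possible implementation of that lookup, a sketch assuming the 1.10-era models API linked above and direct access to the metadata DB through Airflow's SQLAlchemy session:

from airflow import settings
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    session = settings.Session()
    # most recent run for this dag_id, regardless of schedule
    last_run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id)
        .order_by(DagRun.execution_date.desc())
        .first()
    )
    session.close()
    return last_run.execution_date if last_run else "no prev run"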
I have a DAG with three bash tasks that is scheduled to run every day.
I would like to access a unique ID of the DAG run (maybe a PID) in all the bash scripts.
Is there any way to do this?
I am looking for functionality similar to Oozie, where we can access WORKFLOW_ID in the workflow XML or Java code.
Can somebody point me to Airflow documentation on how to use built-in and custom variables in an Airflow DAG?
Many thanks,
Pari
An object's attributes can be accessed with dot notation in Jinja2 (see https://airflow.apache.org/code.html#macros). In this case, it would simply be:
{{ dag.dag_id }}
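For example, a minimal sketch (task id hypothetical) passing the DAG id to a bash command; run_id, also available in the template context, is unique per run and may be closer to Oozie's WORKFLOW_ID:

from airflow.operators.bash import BashOperator

show_ids = BashOperator(
    task_id="show_ids",
    # dag.dag_id identifies the DAG; run_id identifies this particular run
    bash_command="echo dag_id={{ dag.dag_id }} run_id={{ run_id }}",
    dag=dag,
)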
I made use of the fact that the string representation of the DAG object contains the name of the current DAG, so I just use Jinja2 filters to strip the wrapper:
{{ dag | replace( '<DAG: ', '' ) | replace( '>', '' ) }}
A bit of a hack, but it works. For example:
clear_upstream = BashOperator(
    task_id='clear_upstream',
    trigger_rule='all_failed',
    bash_command="""
        echo airflow clear -t upstream_task -c -d -s {{ ts }} -e {{ ts }} {{ dag | replace( '<DAG: ', '' ) | replace( '>', '' ) }}
    """,
)