How to get end time of previous job - airflow

I have a task which is scheduled every few minutes.
I want to implement the logic where new task starts where previous successfully executed task left off.
More concretely, I use these time intervals to query the database, so I don't want to miss any data between executions.
How can this be achieved?

Have a look at the documentation around macros. There are two variables you can use in your SQL files: {{ execution_date }} and {{ next_execution_date }} - use these to bound the query to the time interval, for example:
select *
from table
where
    timestamp_column >= '{{ execution_date }}'
    and timestamp_column < '{{ next_execution_date }}'
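For context, here is a minimal sketch of how such a templated query might be wired into a DAG, assuming the Postgres provider is installed; the dag id, connection id, table, and column names are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Sketch only: dag_id, connection id, table and column names are placeholders.
with DAG(
    dag_id="incremental_extract",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/10 * * * *",  # "every few minutes"
    catchup=False,
) as dag:
    extract = PostgresOperator(
        task_id="extract_interval",
        postgres_conn_id="postgres_default",
        # The sql field is templated, so the macros are rendered for each run;
        # the same query could also live in a .sql file referenced here.
        sql="""
            select *
            from my_table
            where timestamp_column >= '{{ execution_date }}'
              and timestamp_column < '{{ next_execution_date }}'
        """,
    )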

Related

Passing Trigger DAG Id value xcom

I am building a parent DAG that basically does a conditional trigger of another DAG depending on an XCom value.
Say I have the following DAGs in my system:
Load Contracts
Load Budget
Load Sales
Trigger Data Loads (Master)
In my master DAG, I have two tasks - (1) Check File Name (2) Trigger the appropriate DAG based on the file name. If the file name is Sales, then trigger 'Load Sales', and likewise for Budget / Contracts.
Here is how my TriggerDagRunOperator is configured:
def _check_file_version(**context):
    context["task_instance"].xcom_push(key="file_name", value="Load Sales")

with dag:
    completion_file_check = PythonOperator(
        task_id="completion_file_check",
        python_callable=_check_file_version,
        provide_context=True
    )
    trigger_dependent_dag = TriggerDagRunOperator(
        task_id="trigger_dependent_dag",
        provide_context=True,
        trigger_dag_id={{ task_instance.xcom_pull(task_ids='completion_file_check', key='file_name') }},
        wait_for_completion=True
    )
I want to modify the trigger_dag_id value based on the file name. Is there a way to pull an XCom value into it? Looking at this link - DAG Dependencies - I see that this value is Jinja templated, i.e. it can be modified using template variables. However, my use case is to have it configured via xcom_pull. Can it be done?
When I put in the xcom_pull as suggested by the replies, I get a syntax error as shown below.
You can use ti.xcom_pull to get the XCom value that you pushed in a previous task:
{{ ti.xcom_pull(task_ids='sales_taskid', key='filename_key') }}
XCom is identified by key, task_id and dag_id. For your requirement, you can use xcom_pull, where you provide the task id and key. XCom is meant for small amounts of data; larger values are not allowed. If the task auto-pushes its result into the XCom key called return_value, then xcom_pull will use that as the default key. For more information, you can check this link.
If you are auto-pushing the results into that key, you can use the code below:
value = task_instance.xcom_pull(task_ids='pushing_task')
For the trigger_dag_id value based on the file name, you can use it in a template as below:
{{ task_instance.xcom_pull(task_ids='your_task_id', key='file_name') }}
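Putting the answers together, here is a rough sketch of how the operator from the question might look - note that the whole Jinja expression is passed as a quoted string, which avoids the Python syntax error mentioned above (this assumes a recent Airflow where trigger_dag_id is a templated field):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    # The template is a plain string here; it is only rendered at runtime.
    trigger_dag_id="{{ task_instance.xcom_pull(task_ids='completion_file_check', key='file_name') }}",
    wait_for_completion=True,
)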

DAG backfilling for many dates

I have a DAG that I need to backfill for many dates. Due to limited resources, I think the best option is to create a list of dates between a start and an end date and run the task in a for loop.
Should I have a function that returns all the dates (after formatting) to a variable and run the DAG task in a for loop, or should the list of dates be part of a function that runs as a task and then somehow uses XCom to send the list of dates? How can this be assigned to a variable with xcom_pull without requiring a task?
Let me focus on the first part of the question, and not on the way you are suggesting to solve it:
I have a DAG that I need to backfill for many dates.
But you don't want to run all the dates at the same time because of limited resources!
I would look into utilizing the max_active_runs_per_dag configuration variable that Airflow provides out of the box. You don't need to create any additional logic.
This way you can limit the backfill process to a certain number of DAG runs in parallel. For example, to allow only 2 DAG runs at a time:
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2; airflow backfill -s {start_date} -e {end_date} --reset-dagruns {dag_id}
Run with --dry-run if you want to get a feel of what's happening before you execute the actual backfill command.
Hope that helps.
1) You don't want your DAG to exhaust limited resources:
Limit your DAG with max_active_runs or max_active_tasks:
DAG(
    "MyDAG",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    tags=["benchmark"],
    max_active_runs=1,
    max_active_tasks=1
)
Limit your resource utilisation using pools
MyOperator(
    task_id="MyTask",
    ...
    pool="MyPool"
)
In admin/pool set the capacity to 1 or more
2) You want to programmatically control your DAG:
Create a second DAG.
In this new DAG add a TriggerDagRunOperator
When using the TriggerDagRunOperator, use the conf argument to carry information to your main DAG (in your case, you would carry dates) - see the sketch below.
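For illustration, a minimal sketch of such a controller DAG, assuming the main DAG is called my_backfill_target and reads a run_date key from dag_run.conf (all names and dates here are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Hypothetical list of dates to backfill; in practice it could be generated
# from a start and an end date.
BACKFILL_DATES = ["2022-01-01", "2022-01-02", "2022-01-03"]

with DAG(
    dag_id="backfill_controller",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
) as dag:
    previous = None
    for run_date in BACKFILL_DATES:
        trigger = TriggerDagRunOperator(
            task_id=f"trigger_{run_date}",
            trigger_dag_id="my_backfill_target",  # the DAG being backfilled
            conf={"run_date": run_date},          # read via dag_run.conf in the target DAG
            wait_for_completion=True,
        )
        # Chain the triggers so only one target run is active at a time.
        if previous:
            previous >> trigger
        previous = trigger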

Simple Examples to Use Insert_Rows DB hook in Airflow

Can someone help me with simple examples to use Insert_Rows DB hook in Airflow?
I have a requirement to make an insert into a table.
How do I do that and make commit to the database.
Starting with airflow, so simple examples will help understand in a better way.
There are many ways; it depends on what your preferred mode is.
Based on your description, I think the simplest is to use a DB operator + SQL. It needs solid database admin experience plus a bit of Airflow experience. For example:
process_order_fact = PostgresOperatorWithTemplatedParams(
    task_id='process_order_fact',
    postgres_conn_id='postgres_dwh',
    sql='process_order_fact.sql',
    parameters={"window_start_date": "{{ ds }}", "window_end_date": "{{ tomorrow_ds }}"},
    dag=dag,
    pool='postgres_dwh'
)
The above code was copied from https://gtoonstra.github.io/etl-with-airflow/etlexample.html
Good luck.
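Note that PostgresOperatorWithTemplatedParams is a custom operator from that tutorial. With a recent Airflow and the Postgres provider installed, the stock PostgresOperator covers much the same ground, since its sql field is templated and the referenced .sql file can use macros such as {{ ds }} directly. A rough equivalent (the connection id and file name are placeholders):

from airflow.providers.postgres.operators.postgres import PostgresOperator

process_order_fact = PostgresOperator(
    task_id="process_order_fact",
    postgres_conn_id="postgres_dwh",   # connection defined under Admin -> Connections
    sql="process_order_fact.sql",      # the .sql file can reference {{ ds }} and {{ tomorrow_ds }}
    dag=dag,
    pool="postgres_dwh",
)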
Here is another method. For example, say you have a database A that you read rows from and want to insert them into a similar database B.
Here is an example of INSERT:
cursor.execute("SELECT * FROM A WHERE ID > 5")
connection_to_B.insert_rows(table="B", rows=cursor)
UPSERT:
cursor.execute("SELECT * FROM A WHERE Id > 5")
connection_to_B.insert_rows(
    table="B",
    rows=cursor,
    replace=True,
    replace_index="id",
    target_fields=['Id', 'memberId']
)
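For a self-contained sketch of where cursor and connection_to_B could come from, assuming two Postgres connections registered in Airflow under the hypothetical ids source_db and target_db:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def copy_rows():
    # Hook for the source database; get_conn() returns a DB-API connection.
    source_hook = PostgresHook(postgres_conn_id="source_db")
    cursor = source_hook.get_conn().cursor()
    cursor.execute("SELECT id, member_id FROM A WHERE id > 5")

    # Hook for the target database; insert_rows commits the inserts in batches.
    target_hook = PostgresHook(postgres_conn_id="target_db")
    target_hook.insert_rows(
        table="B",
        rows=cursor,
        target_fields=["id", "member_id"],
    )

Such a function can then be wrapped in a PythonOperator task.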

How to write/read time stamp from a variable in airflow?

I'm working with EXEC_DATE = '{{ macros.ds_add(ds, 1) }}'. This gives me the execution date but not the hour.
I want to be able to save this value as YYYY-MM-DD HH:MM into a variable called process_last_run.
Basically, read the variable at the beginning of the run and write to it at the end of the DAG. This variable indicates what the running time of the last DAG was.
How can I do that?
You can do this with the macro execution_date. However, be advised that this is a poorly named concept in Airflow: it represents the beginning of a scheduled interval period. It will not change within the same DAG run, even if the task is re-run manually. It's there to support idempotent data updates, which frankly is the optimal way to approach data pipelines. In your case, though, you said elsewhere that your data-fetching API takes a start date and provides all the data up to the current moment, which isn't conducive to being processed idempotently, though you could throw away data after a specified cut-off.
So instead you might just take the date after your processing of data has completed, and store that for later. You can store it in an Airflow Variable. Note, though, that the time you get out of the date command shown below will be later than the last timestamp of the data you actually fetched from your process_data API call. So it might be better if your processing step outputs the actual last date and time of the data processed as the last line of stdout (which is captured by the BashOperator for XCom).
E.g.:
from datetime import datetime

from airflow.models import DAG, Variable
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def pyop_fun(**context):
    # You could have used execution_date here and in the next operator
    # to make the operator rerun-safe.
    # date_string = context['execution_date'].strftime('%Y-%m-%d %H:%M')
    # But elsewhere you said your api is always giving you the up-to-the-minute data,
    # so maybe getting the date from the prior task would work better for you.
    Variable.set(
        'process_last_run',
        context['task_instance'].xcom_pull(task_ids='process_data'))

with DAG(…) as dag:
    pyop = PythonOperator(
        task_id='set_process_last_run',
        python_callable=pyop_fun,
        provide_context=True, …)
    shop = BashOperator(
        task_id='process_data',
        bash_command='''
            process_data "{{ var.value.process_last_run }}";
            date -u +%Y-%m-%d\ %H:%M''',
        xcom_push=True, …)
    shop >> pyop

# Because the last output line of a BashOperator is pushed into XCom for that
# task id with the default key, it can be pulled by the PythonOperator and
# stored in a Variable.
There's an {{ execution_date }} variable in Jinja you can use to get the execution date for the current DAG run.
More info: Airflow - Macros
If you're looking to track something like start time or end time of execution or duration (in seconds) of a particular task instance, that info is stored in the TaskInstance model.
class TaskInstance(Base, LoggingMixin):
    ...
    start_date = Column(UtcDateTime)
    end_date = Column(UtcDateTime)
    duration = Column(Float)
https://github.com/apache/incubator-airflow/blob/4c30d402c4cd57dc56a5d9dcfe642eadc77ec3ba/airflow/models.py#L877-L879
Also, if you wanted to compute the running time of the entire DAG, you could get that from querying the Airflow metadata database around these fields for a particular DAG run.
If you're doing this in your Python code already, you can access the execution_date field on the task instance itself as well instead of using the template layer.
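For example, here is a rough sketch of pulling a DAG run's wall-clock duration out of the metadata database with the ORM session Airflow itself uses (the dag_id is a placeholder, and this needs to run where Airflow is installed and configured):

from airflow import settings
from airflow.models import DagRun

session = settings.Session()
dag_run = (
    session.query(DagRun)
    .filter(DagRun.dag_id == "my_dag")            # hypothetical dag_id
    .order_by(DagRun.execution_date.desc())
    .first()
)
if dag_run and dag_run.end_date:
    # Wall-clock time of the whole DAG run.
    print("DAG run took", dag_run.end_date - dag_run.start_date)
session.close()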
Variables
You can write to and read from Airflow Variables like so:
Variable.set('my_key', 'some value')
my_val = Variable.get('my_key')
You can also perform CRUD operations on variables with the CLI.
Stats
Another thing you might keep in mind if you find yourself working with stats like task duration a lot is Airflow's StatsD integration which gathers metrics on Airflow itself at execution time. You can have these metrics go into a push-based system like StatsD itself, or into a pull-based system like Prometheus / Grafana by using statsd_exporter.

Does airflow have something like `yesterday_ds` / `tomorrow_ds` but for `@monthly` jobs?

I have a job that's using the ds variable to coordinate the amount of work that it processes, and it is scheduled to run daily with @daily.
select * from events
where date = '{{ ds }}';
However, I'd like to write a new version of it to be @monthly. I don't have to change much, but I need access to the datestamp of the next run in order to cleanly port it over.
select * from events
where
date >= '{{ ds }}'
and
date < '{{ macros.ds_add(ds, 32) }}';
I can sort of get by by making a DAG run's end date be {{ macros.ds_add(ds, 32) }}, because my job is able to handle overlaps between runs, but I was hoping there was a way to get a datestamp that would be the first of the next month or the first of the previous month.
select * from events
where
date >= '{{ ds }}'
and
date < '{{ next_month }}';
How could I implement this?
If you're running a recent version of Airflow and you set your schedule interval to be @monthly, then I think {{ data_interval_start }} and {{ data_interval_end }} are what you're looking for. You can see all the macros here.
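For instance, the query from the question could plausibly be rewritten like this (assuming date is a DATE column; the interval macros render as timestamps, so they are formatted down to the day here):

select * from events
where
date >= '{{ data_interval_start.strftime("%Y-%m-%d") }}'
and
date < '{{ data_interval_end.strftime("%Y-%m-%d") }}';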
You can use {{ prev_execution_date }} as well as {{ next_execution_date }} IF you are running an @monthly schedule interval.
If you are not, you may want to develop a custom macro via the plugin system. Such macros let you pass a function into the template context that meets your exact needs regardless of schedule_interval. You can use the existing ds_add() and ds_format() macros as guidance.
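As a lighter-weight alternative to a full plugin, a DAG can also register helpers through its user_defined_macros argument. Here is a rough sketch of a hypothetical first_of_next_month helper used the same way as the built-in macros:

from datetime import datetime

from airflow import DAG

def first_of_next_month(ds):
    """Return the first day of the month after the given 'YYYY-MM-DD' datestamp."""
    date = datetime.strptime(ds, "%Y-%m-%d")
    year = date.year + date.month // 12
    month = date.month % 12 + 1
    return datetime(year, month, 1).strftime("%Y-%m-%d")

with DAG(
    dag_id="monthly_events",                 # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    user_defined_macros={"first_of_next_month": first_of_next_month},
) as dag:
    # In any templated field of this DAG you can now write, for example:
    #   date >= '{{ ds }}' and date < '{{ first_of_next_month(ds) }}'
    ...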
