According to the official documentation, prev_ds and related variables are being deprecated and need to be replaced with other variables. Is there a variable that provides the previous date directly?
We do get ds and ds_nodash, which we can subtract from, but I was wondering whether there is another variable that just gives me the previous date.
You can get dag_run in a Jinja template and use it to reach the previous DAG run.
For example, for prev_ds you can use:
"{{ dag_run.get_previous_dagrun().execution_date | ds }}"
For prev_ds_nodash:
"{{ dag_run.get_previous_dagrun().execution_date | ds_nodash }}"
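If the schedule is strictly daily, subtracting from ds as the question suggests also works. Below is a minimal sketch of that arithmetic in plain Python, assuming a daily schedule and the YYYY-MM-DD format of ds. The helper names shift_ds and to_nodash are made up for illustration and are not Airflow APIs; inside a template, the built-in macros.ds_add(ds, -1) does the same date arithmetic.

```python
from datetime import datetime, timedelta


def shift_ds(ds: str, days: int) -> str:
    """Return the YYYY-MM-DD string that lies `days` days away from `ds`."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


def to_nodash(ds: str) -> str:
    """Convert YYYY-MM-DD into YYYYMMDD, the shape of ds_nodash."""
    return ds.replace("-", "")


prev_ds = shift_ds("2021-03-01", -1)
print(prev_ds)             # 2021-02-28
print(to_nodash(prev_ds))  # 20210228
```

Note that this only matches the previous run's date when runs are exactly one day apart; for other schedules the dag_run-based templates are the safer choice.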
I am new to Airflow and have been reading around to try and code my DAGs to fit the standards for the tool. Thanks to plenty of warnings I got the gist around execution_date being at the start of a time slice. Where I have been less sure is in how to handle the end of the time slice.
If I'm running a daily task to process records based on a timestamp, and especially if I want this to be idempotent, then I will need to bound the time slice at both ends. The clearest way to do this is to use execution_date and next_execution_date variables, as in the example below:
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

dag = DAG(
    dag_id='time_slice_example',
    start_date=datetime(year=2021, month=2, day=1),
    schedule_interval='0 0 * * *',
)

# Copy one day's rows; the slice is half-open: [execution_date, next_execution_date)
copy_data = PostgresOperator(
    owner='airflow',
    task_id='copy_time_slice_data',
    # The rendered timestamps are quoted so Postgres reads them as literals
    sql='''
        INSERT INTO pipeline_tbl (id, text, other)
        SELECT id, text, other FROM daily_tbl
        WHERE data_ts >= '{{ execution_date }}'
        AND data_ts < '{{ next_execution_date }}'
    ''',
    postgres_conn_id='my_db_conn',
    dag=dag,
)
(I've used a postgres query to illustrate the example but the same variables and principle would apply to any time slice operation)
So my question is whether this is normal? For all of the references to Airflow time slices, I have seen almost no examples of this approach. I can appreciate that it is arguably outside of the scope of Airflow itself, but I wanted to check that this is a standard approach and indeed that I'm not missing something more appropriate.
Yes, using the interval bounded by [execution_date, next_execution_date) is exactly the right behaviour.
In Airflow 2.1 or 2.2 we are investigating making this clearer, possibly by renaming these parameters to something like data_interval_start and data_interval_end.
More detail is being discussed at https://lists.apache.org/thread.html/rb4e004e68574e5fb77ee5b51f4fd5bfb4b3392d884c178bc767681bf%40%3Cdev.airflow.apache.org%3E
(Source: I am an Airflow core developer.)
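To see why the half-open interval keeps a daily job idempotent, here is a small sketch in plain Python (no Airflow needed). The function name in_slice and the sample timestamps are assumptions for illustration only. A row stamped exactly at midnight belongs to the next slice, so no row is picked up by two consecutive runs.

```python
from datetime import datetime, timedelta


def in_slice(ts: datetime, start: datetime, end: datetime) -> bool:
    """True when ts falls in the half-open interval [start, end)."""
    return start <= ts < end


execution_date = datetime(2021, 2, 1)
next_execution_date = execution_date + timedelta(days=1)

rows = [
    datetime(2021, 2, 1, 0, 0, 0),    # first instant of the slice: included
    datetime(2021, 2, 1, 23, 59, 59),  # last second of the slice: included
    datetime(2021, 2, 2, 0, 0, 0),    # start of the next slice: excluded
]

selected = [r for r in rows if in_slice(r, execution_date, next_execution_date)]
print(len(selected))  # 2
```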
How can I pull an XCom value from a previous run in Airflow? Is it possible?
I want to use the value from the same task_id in the previous run_id as a Jinja variable for the data argument in SimpleHttpOperator.
I'm looking at the macros docs https://airflow.apache.org/docs/stable/macros.html
and can't find any documented way to do this.
UPD
Example:
select_expired = SimpleHttpOperator(
    task_id='select_expired',
    http_conn_id='clickhouse_http',
    endpoint='/',
    method='POST',
    data=REQUESTED_EXPIRED_FLIGHTS,
    xcom_push=True,
    pool='clickhouse_select',
    dag=dag
)
where REQUESTED_EXPIRED_FLIGHTS is:
insert into table where column = '{{ ??????? (value returned in previous task) }}'
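You should be able to access the previous task instance with previous_ti (a property on the task instance in Airflow 1.10; later releases provide get_previous_ti() instead).
Then you can check its state with current_state() and perform actions based on that.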
I am new to Airflow's XCom feature. I tried it out with PythonOperator and it worked fine (i.e., I can push and pull values through the context), but when I tried it with BashOperator, it didn't work. 1) I can pull only the final stdout line, by adding the xcom_push=True attribute when creating the task. That's one thing. 2) I would also like to push and pull values by key (to and from the BashOperator), the way we do with PythonOperator. It would be really helpful, since I need to pass tons of variables from one script to another.
Is this what you want?
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_bash_operator_1",
    schedule_interval=None,
    start_date=datetime(2018, 12, 31),
)

t1 = BashOperator(
    task_id="t1",
    bash_command='echo "{{ ti.xcom_push(key="k1", value="v1") }}" "{{ti.xcom_push(key="k2", value="v2") }}"',
    dag=dag,
)

t2 = BashOperator(
    task_id="t2",
    bash_command='echo "{{ ti.xcom_pull(key="k1") }}" "{{ ti.xcom_pull(key="k2") }}"',
    dag=dag,
)

t1 >> t2
@SpaceyBot and @Lucas answered your first question.
Regarding the second question you raised:
2) But I also wish to push and pull the values based on their keys (to and from the BashOperator), the way we do it with PythonOperator. It would be really helpful since I need to pass tons of variables from one script to another.
This is not advisable. All XCom push/pull actions are translated into INSERT/SELECT statements against the Airflow DB.
Over time this will degrade scheduler performance and slow down the whole pipeline, either because of the high number of pull queries being run or because of the large number of rows retrieved, which will be read through full table scans instead of index-based scans.
So it's better to consider a different mechanism here, such as storing the information in external files (JSON, CSV, plain text, etc.).
Bottom line: XCom is designed for transferring small amounts of data only, mostly counters and status variables.
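As a rough illustration of the external-file approach, here is a minimal sketch in plain Python. It assumes both tasks can see the same filesystem (for example a shared volume), and the helper names write_payload and read_payload are hypothetical. The producing task writes all its variables to one JSON file, so the only thing that would need to travel through XCom is the short file path.

```python
import json
import os
import tempfile


def write_payload(payload: dict, directory: str) -> str:
    """Write the payload to a JSON file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json", dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    return path


def read_payload(path: str) -> dict:
    """Load the payload back in the consuming task."""
    with open(path) as fh:
        return json.load(fh)


# A temporary directory stands in for the shared location in this sketch
with tempfile.TemporaryDirectory() as workdir:
    path = write_payload({"k1": "v1", "k2": "v2"}, workdir)
    restored = read_payload(path)

print(restored["k1"], restored["k2"])  # v1 v2
```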
In addition to @Ryan Yuan's answer, you can use the env parameter of BashOperator to set environment variables for your bash script or command.
my_task = BashOperator(
    task_id='my_task',
    bash_command='echo $VAR1 $VAR2',
    env={
        "VAR1": '{{ ti.xcom_pull(key="var1")}}',
        "VAR2": '{{ ti.xcom_pull(key="var2")}}'
    },
    dag=dag
)
The problem: Airflow's execution_date is defined as the beginning of the period between runs. For example, a DAG on a weekly schedule would run on 2018-01-08T11:00:00, but the execution_date would be 2018-01-01T11:00:00.
The objective: I receive a file once a week, with the file date in the file's name. To identify the file, I'd like to use Airflow's execution_date. But I cannot seem to find a way to use the date of the run, versus using the earliest possible execution_date for a period.
Possible solutions:
Modify the execution_date on the fly. Something like: context['execution_date'] + timedelta(days=7). This seems hacky.
Run the DAG daily, insert a ShortCircuitOperator at the beginning of the DAG execution graph, exit if execution_date is not the expected date.
All suggestions or recommendations are welcomed. It's a nuanced problem, but causing some issues with my ETL pipeline.
Another possible solution?
Have the DAG run once a week just after you "think" the file will arrive. Parse the names of the files in the landing area, which will give you a bunch of dates. Check which of these dates fall between execution_date and execution_date + schedule_interval (or next_execution_date if you're using Airflow version >= 1.8). Then ingest the file(s) that match.
I think using execution_date + timedelta(days=7) is a bit hacky; instead use execution_date + schedule_interval, so that if the interval changes there shouldn't be any issues (I do this for one of my DAGs). If you're using a newer Airflow version, you can use next_execution_date, which is better.
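A minimal sketch of that file-matching step in plain Python. The file naming pattern (an 8-digit YYYYMMDD date somewhere in the name), the function name files_in_window and the half-open window are all assumptions for illustration; adjust the end bound if a file dated exactly at the end of the interval should be included.

```python
import re
from datetime import datetime, timedelta


def files_in_window(filenames, execution_date, schedule_interval):
    """Return the files whose embedded date lies in
    [execution_date, execution_date + schedule_interval)."""
    window_end = execution_date + schedule_interval
    matched = []
    for name in filenames:
        found = re.search(r"(\d{8})", name)
        if not found:
            continue  # no date in the name, skip it
        file_date = datetime.strptime(found.group(1), "%Y%m%d")
        if execution_date <= file_date < window_end:
            matched.append(name)
    return matched


landing_area = [
    "report_20171231.csv",
    "report_20180103.csv",
    "report_20180108.csv",
    "readme.txt",
]
print(files_in_window(landing_area, datetime(2018, 1, 1), timedelta(days=7)))
# ['report_20180103.csv']
```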
I'm using a macro for this issue.
This function (used as a macro) can handle manual triggers, too.
import pendulum

def weekly_today(execution_date, run_id, years=0, months=0, days=0, fmt="%Y%m%d"):
    d = pendulum.instance(execution_date)
    # Scheduled runs start one interval after execution_date; manual runs do not
    if run_id.startswith('scheduled_'):
        d = d.add(days=7)
    return d.add(years=years, months=months, days=days).strftime(fmt)
This function should be added to DAG as user_defined_macros
from datetime import timedelta

from airflow import DAG
from airflow.utils import timezone

dag = DAG(
    dag_id='test',
    start_date=timezone.datetime(2019, 6, 24, 6),
    schedule_interval=timedelta(days=7),
    user_defined_macros={
        'weekly_today': weekly_today,
    },
)
I also needed to set the data range from one year ago to today.
Here is a sample macro usage:
from_macro = '{{ weekly_today(execution_date, run_id, years=-1) }}'
to_macro = '{{ weekly_today(execution_date, run_id) }}'
bad naming.. but works.
In Airflow, I would like to run a DAG each Monday at 8am (the execution_date should of course be "current day, Monday 8am"). The relevant parameters to set up this workflow are:
start_date : "2018-03-19"
schedule_interval : "0 8 * * MON"
I expect to see a DAG run every Monday at 8am, the first one on 19-03-2018 at 8am with execution_date = 2018-03-19-08-00-00, and so on each Monday.
However, that's not what happens: the DAG is not started on 19/03/18 at 8am. The real behaviour is explained here, for example: https://stackoverflow.com/a/39620901/1510109 or https://stackoverflow.com/a/48213964/1510109
The behaviour is: at the end of each interval (weekly in my case) the DAG is run with execution_date = beginning of the interval (i.e. the previous week). This behaviour is apparently motivated by an "ETL way of thinking" (see the links above). But it's absolutely not what I want.
How can I run my DAG each Monday at 08:00am with execution_date = trigger_date = now (= current Monday 8am)?
Thanks
Take a quick look at my answer with start times and execution_date examples.
You want to run every Monday at 8am.
So this part is going to stay the same:
schedule_interval: '0 8 * * MON',
You want its first run to happen on 2018-03-19. Since the first run occurs at the end of the first full schedule period after the start date, you should change your start date to:
start_date: datetime(2018, 3, 12),
You will have to live with the fact that Airflow will name your DagRuns with the start of each period and pass in macros based on the execution_date set to the start of the interval period. Adjust your logic accordingly.
Your first run will start after 2018-03-19T08:00:00.0Z and the execution_date, every other macro that depends on it, and name of the DagRun will be 2018-03-12T08:00:00.0Z
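To make the timing concrete, here is a simplified sketch in plain Python. It replaces the cron expression with a fixed one-week timedelta and assumes the first schedule point is Monday 2018-03-12 at 08:00; the function name first_runs is made up for illustration. Each run is labelled with the start of its interval and can only be triggered once that interval has ended.

```python
from datetime import datetime, timedelta


def first_runs(first_schedule_point, interval, count):
    """Return (execution_date, earliest_trigger_time) pairs for a fixed interval."""
    runs = []
    for i in range(count):
        execution_date = first_schedule_point + i * interval
        # The run for this interval starts only after the interval is over
        runs.append((execution_date, execution_date + interval))
    return runs


for execution_date, triggers_at in first_runs(
    datetime(2018, 3, 12, 8), timedelta(weeks=1), 2
):
    print(execution_date.isoformat(), "->", triggers_at.isoformat())
# 2018-03-12T08:00:00 -> 2018-03-19T08:00:00
# 2018-03-19T08:00:00 -> 2018-03-26T08:00:00
```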
So long as you understand what to expect from the execution_date and you don't try to base your time off of datetime.now() your DAGs will be able to be idempotent in operation. Feel free to make a new variable like my_execution_date = execution_date + datetime.timedelta(7) within any PythonOperator or custom operator (you get execution_date from the context of the task), use template statements like {{ (execution_date + macros.timedelta(7)).strftime('%Y%m%d') }} or {{ macros.ds_add(ds, 7) }}, or use the next_execution_date.
You can even add a dag level user_defined_macros like {'dt':lambda d: d+datetime.timedelta(days=7)} to enable {{ dt(execution_date) }}. And recently user_defined_filters were added like {'dt':lambda d: d+datetime.timedelta(days=7)} enabling {{ execution_date | dt }}. The next_ds and next_execution_date would be easier for your purposes.
While thinking about templating, you may as well read up on the built-in stuff out there: http://jinja.pocoo.org/docs/2.10/templates/#builtin-filters
That is how Airflow behaves: it always runs once the interval has completed. Detailed behavior here and airflow faq.
But to make it run for the current week, we can manipulate the execution_date of the DAG, either by adding 7 days to a datetime object (for a weekly schedule) or by using the {{ next_execution_date }} macro.
Agreed that this is only possible if dates are actually used in your DAG or dependencies are triggered by them.
Just to be clear again, the DAG still runs as per its normal behavior. The only thing we are doing is manipulating the date in the program/DAG.
args = { ....
    'start_date': datetime.datetime(2018,3,18)
}
dag = DAG(...
    schedule_interval = "@weekly"
)
# DAG would run on 3/25/2018 for week of 18th March
# but lets say we manipulate here
# {{ next_execution_date }} macro
# or add 7 days
# So basically we are running with date 3/25/2018 instead of 3/18/2018 for the week of 18th March
For me I solved it in this way:
{{ ds if dag_run.external_trigger or dag_run.is_backfill else macros.ds_add(ds, 1) }}
If DAG was run by external trigger we shouldn't change ds.
If DAG was run by backfilling we shouldn't change ds.
If DAG was scheduled we use macros to increment it by one day.
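The same decision written out as a plain Python function, as a minimal sketch for a daily schedule. The name effective_ds and its boolean parameters are hypothetical stand-ins for the dag_run attributes used in the template.

```python
from datetime import datetime, timedelta


def effective_ds(ds: str, external_trigger: bool = False, is_backfill: bool = False) -> str:
    """Keep ds for manual or backfill runs; shift scheduled runs forward by one day."""
    if external_trigger or is_backfill:
        return ds
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=1)
    return shifted.strftime("%Y-%m-%d")


print(effective_ds("2018-03-18"))                         # 2018-03-19
print(effective_ds("2018-03-18", external_trigger=True))  # 2018-03-18
print(effective_ds("2018-03-18", is_backfill=True))       # 2018-03-18
```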