I created a DAG that is scheduled to run once a day; the DAG doesn't receive any parameters. Each day the DAG runs, I need to calculate yesterday's date (current date - 1) and pass the same date to all the operators in the DAG.
I saw that I can use Airflow macros to compute the date, but the problem is that the operator I use in t2 (I must use this operator and can't change it) passes the value to the dictionary (default_param_dict) as a plain string and doesn't render the macro.
Is there any other way to compute the date without macros? Using XCom is not relevant because I can only use the operators and can't change their code.
Need your help :)
Adding my DAG example:
t1 = SimpleHttpOperator(
    task_id='check_if_daily_report_ready',
    method='GET',
    endpoint="/bla/bla?date={date}".format(
        date='{{ (execution_date - macros.timedelta(days=1)).strftime("%Y-%m-%d") }}'),
    http_conn_id="conn1",
    headers={"Content-Type": "application/json"},
    response_check=lambda response: response.status_code == 200,
    dag=dag,
)

t2 = Queryperator(
    task_id='cal',
    query_file='ca.sql',
    query_folder='include/sql_files/bla',
    token='Token',
    default_param_dict={"date": '{{ (execution_date - macros.timedelta(days=1)).strftime("%Y-%m-%d") }}'},
    dag=dag
)
If I understood the question correctly, you want to add yesterday's date to default_param_dict. If so, I would recommend using the datetime module, something like below:
import datetime

t2 = Queryperator(
    task_id='cal',
    query_file='ca.sql',
    query_folder='include/sql_files/bla',
    token='Token',
    default_param_dict={"date": (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')},
    dag=dag
)
I am trying to pass the execution date to an HttpSensor operator.
is_api_available = HttpSensor(
    task_id='is_api_available',
    http_conn_id='data_available',
    endpoint='api/3/action/date={{ I want to set the date here }}'
)
I can get the execution date parameter in a PythonOperator like this:
print("my start date : ", kwargs['execution_date'])
It works, but how can I get it in other operators?
Thanks in advance.
You can use a Jinja template with the variable {{ ds }}; it formats the date as YYYY-MM-DD.
For more macros, see https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html
is_api_available = HttpSensor(
    task_id='is_api_available',
    http_conn_id='data_available',
    endpoint='api/3/action/date={{ ds }}')
which renders to, for example:
api/3/action/date=2022-06-25
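If you need an offset from the run date instead (for example, yesterday's date as in the first question above), the same templates reference documents macros.ds_add, which shifts the ds string by a number of days. A minimal sketch, assuming Airflow 2 with the HTTP provider installed (the connection ID and endpoint mirror the question):

from airflow.providers.http.sensors.http import HttpSensor

is_api_available = HttpSensor(
    task_id='is_api_available',
    http_conn_id='data_available',
    # macros.ds_add(ds, -1) renders the logical date minus one day, e.g. 2022-06-24
    endpoint='api/3/action/date={{ macros.ds_add(ds, -1) }}')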
I am working through a problem where I cannot get DAGs to run on the correct data interval.
The DAGs are of the form:
CYCLE_START_SECONDS = 240
CYCLE_END_SECONDS = 120
@dag(schedule_interval=timedelta(seconds=CYCLE_START_SECONDS - CYCLE_END_SECONDS), start_date=datetime(2021, 11, 16),
     catchup=True, default_args=DEFAULT_ARGS,
     on_failure_callback=emit_on_task_failure, max_active_runs=1, on_success_callback=emit_on_dag_success,
     render_template_as_native_obj=True)
def ETL_Workflow():
    """
    Workflow to post process raw data into metrics for ingest into ES
    :return:
    """
    @task()
    def get_start_date(start, end):
        print(start, end)
        end = int(end.timestamp())
        if isinstance(start, pendulum.DateTime):
            start = int(start.timestamp())
        else:
            start = end - CYCLE_START_SECONDS
        return start, end

    @task(execution_timeout=timedelta(seconds=CYCLE_START_SECONDS - CYCLE_END_SECONDS))
    def run_query(start_end: tuple, query_template, conn_str, redis_key, transforms):
        start, end = start_end
        query = query_template.format(start=start, end=end)
        return run_pipeline(query, conn_str, redis_key, transforms)

    @task()
    def store_period_end(start_end: tuple):
        _ = Variable.set(DAG_NAME + "_period_end", start_end[1])
        return

    start = '{{ prev_data_interval_end_success }}'
    end = '{{ dag_run.execution_date }}'
    conn_str = get_source_url(SECRET)
    start_end = get_start_date(start, end)
    t1 = run_query(start_end, QUERY, conn_str, REDIS_KEY, TRANSFORMS)
    t3 = store_period_end(start_end)
    start_end >> t1 >> t3

dag = ETL_Workflow()
Specifically, I get the desired data intervals using these templates:
start = '{{ prev_data_interval_end_success }}'
end = '{{ dag_run.execution_date }}'
But then for some reason those values resolve to the same datetime
[2021-12-04, 18:42:20 UTC] {logging_mixin.py:109} INFO - start: 2021-12-04T18:40:18.451905+00:00 end: 2021-12-04 18:40:18.451905+00:00
You can see, however, that the data intervals are correct in the run metadata.
I am stumped. The DAG execution date should be CYCLE_START_SECONDS after the previous run's data interval end. I have unit tested the logic in get_start_date and it is fine. Moreover, some workflows don't experience this problem. For those workflows, the execution datetime correctly works out to CYCLE_START_SECONDS after the previous data interval end. Am I using the templates incorrectly? Am I specifying the schedule incorrectly? Any pointers as to what might be the problem would be appreciated. Thanks.
I think you're misunderstanding execution_date (which is normal, because it's a freaking confusing concept). A DAG run's execution_date is NOT when the run happens, but when the data it should process started to come in. For an interval-scheduled DAG, execution_date almost always equals data_interval_start, which in turn almost always equals the previous DAG run's data_interval_end. This means that execution_date is the previous run's data_interval_end, not an interval after it. Therefore, if the previous run succeeds, you see prev_data_interval_end_success equal to execution_date. It's entirely normal.
Given you know about prev_data_interval_end_success, you likely also know that execution_date is deprecated, and that's exactly because the concept is way too confusing. Do not use it when writing new DAGs; you are probably looking for data_interval_end instead.
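For example, here is a minimal TaskFlow sketch that templates the data-interval bounds directly (the DAG and task names are illustrative, not from the question):

import pendulum
from airflow.decorators import dag, task

@dag(schedule_interval="@daily",
     start_date=pendulum.datetime(2021, 11, 16, tz="UTC"),
     catchup=False)
def interval_example():
    @task()
    def show_window(start, end):
        # Without render_template_as_native_obj=True these arrive as rendered strings
        print("processing data from", start, "to", end)

    # data_interval_start / data_interval_end are the Airflow 2.2+ template variables
    show_window('{{ data_interval_start }}', '{{ data_interval_end }}')

interval_example_dag = interval_example()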
I'm new to Airflow and working on making my ETL pipeline more reusable. Originally, I had a few lines of top-level code that would determine the job_start based on a few user input parameters, but I found through much searching that this top-level code gets evaluated at every heartbeat, which was causing some unwanted behavior in truncating the table.
Now I am investigating wrapping this top level code into a Python Callable so it is secure from the refresh, but I am unsure of the best way to pass the output to my other tasks. The gist of my code is below:
def get_job_dts():
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    return job_params

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    python_callable=first_task,
    op_args=job_params,  # <-- How do I send job_params to op_args??
    dag=dag
)

t0 >> t1
I've searched around and heard mentions of Jinja templates, Variables, and XComs, but I'm fuzzy on how to implement them. Does anyone have an example I could look at where I save that list into a variable that can be used by my other tasks?
The best way to do this is to push your value into XCom in get_job_dts, and pull the value back from Xcom in first_task.
def get_job_dts(**kwargs):
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Push job_params into XCom
    kwargs['ti'].xcom_push(key='job_params', value=job_params)
    return job_params

def first_task(ti, **kwargs):
    # Pull job_params from XCom
    job_params = ti.xcom_pull(key='job_params')
    # And then do the rest

t0 = PythonOperator(
    task_id='get_dates',
    provide_context=True,
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=first_task,
    # op_args is no longer needed: first_task pulls job_params from XCom itself
    dag=dag
)

t0 >> t1
As RyantheCoder mentioned, XCom is the way to go. My implementation is geared towards the tutorial, where I implicitly perform the push automatically by returning a value from the Python callable.
I am still confused by the difference between passing (ti, **kwargs) and using (**context) in the function that is pulling. Also, where does "ti" come from?
Any clarifications are appreciated.
def get_job_dts(**kwargs):
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Returning the list automatically pushes it to XCom, see the Airflow XCom docs:
    # https://airflow.apache.org/concepts.html?highlight=xcom#xcoms
    return job_params

def first_task(**context):
    # Change task_ids to whatever task pushed the XCom vars you need, rest are standard notation
    job_params = context['task_instance'].xcom_pull(task_ids='get_dates')
    # And then do the rest

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=first_task,
    dag=dag
)

t0 >> t1
As you mentioned changing the task start time and end time dynamically, I suppose what you need is to create a dynamic DAG rather than just pass the args to the DAG. In particular, changing the start time and interval without changing the DAG name will cause unexpected results, so it is highly recommended not to do so. You can refer to this link to see if that strategy can help.
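If it helps, here is a minimal sketch of that dynamic-DAG idea: one DAG object is generated per configuration at parse time and registered in the module's global namespace. The job names and start dates below are invented for illustration.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical per-job start dates; in practice these could come from a config file or an Airflow Variable
JOB_WINDOWS = {
    "job_a": datetime(2019, 1, 1),
    "job_b": datetime(2019, 6, 1),
}

for job_name, job_start in JOB_WINDOWS.items():
    dag_id = "etl_{}".format(job_name)
    dag = DAG(dag_id, start_date=job_start, schedule_interval="@daily", catchup=False)
    with dag:
        DummyOperator(task_id="start")
    # Airflow only picks up DAGs that end up in the module's global namespace
    globals()[dag_id] = dag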
This is my operator:
bigquery_check_op = BigQueryOperator(
    task_id='bigquery_check',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    bigquery_conn_id=CONNECTION_ID,
    trigger_rule='all_success',
    xcom_push=True,
    dag=dag
)
When I check the Rendered page in the UI, nothing appears there.
When I run the SQL in the console it returns the value 1400, which is correct.
Why doesn't the operator push the XCom?
I can't use BigQueryValueCheckOperator. That operator is designed to FAIL when a value check fails. I don't want anything to fail. I simply want to branch the code based on the return value from the query.
Here is how you might be able to accomplish this with the BigQueryHook and the BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def big_query_check(**context):
    sql = context['templates_dict']['sql']
    bq = BigQueryHook(bigquery_conn_id='default_gcp_connection_id',
                      use_legacy_sql=False)
    conn = bq.get_conn()
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchone()
    # Do something with the result, return the task_id to branch to
    if result[0] == 0:
        return "task_a"
    else:
        return "task_b"

sql = "SELECT COUNT(*) FROM sales"

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=big_query_check,
    provide_context=True,
    templates_dict={"sql": sql},
    dag=dag,
)
First, we create a Python callable that we can use to execute the query and select which task_id to branch to. Second, we create the BranchPythonOperator.
The simplest answer is that xcom_push is not one of the params of BigQueryOperator, BaseOperator, or LoggingMixin.
The BigQueryGetDataOperator does return (and thus push) some data, but it works by table and column name. You could chain this behavior by making the query you run write its output to a uniquely named table (maybe use {{ ds_nodash }} in the name), then use that table as the source for this operator, and then you can branch with the BranchPythonOperator.
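A hedged sketch of that chaining idea, using the old contrib import paths to match the question's operator (the dataset and table names are made up for illustration):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_get_data import BigQueryGetDataOperator
from airflow.operators.python_operator import BranchPythonOperator

# Write the check query's result to a per-run staging table
write_check_result = BigQueryOperator(
    task_id='write_check_result',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    bigquery_conn_id=CONNECTION_ID,
    destination_dataset_table='my_dataset.check_result_{{ ds_nodash }}',  # hypothetical dataset/table
    write_disposition='WRITE_TRUNCATE',
    dag=dag,
)

# Read it back; BigQueryGetDataOperator returns the rows, so they are pushed to XCom
get_check_result = BigQueryGetDataOperator(
    task_id='get_check_result',
    dataset_id='my_dataset',
    table_id='check_result_{{ ds_nodash }}',
    max_results='1',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag,
)

# Branch on the value pulled from XCom
def choose_branch(**context):
    rows = context['ti'].xcom_pull(task_ids='get_check_result')
    return 'task_a' if rows and int(rows[0][0]) == 1400 else 'task_b'

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=choose_branch,
    provide_context=True,
    dag=dag,
)

write_check_result >> get_check_result >> branching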
You might instead try to use the BigQueryHook's get_conn().cursor() to run the query and work with some data inside the BranchPythonOperator.
Elsewhere we chatted and came up with something along the lines of this for putting in the callable of a BranchPythonOperator:
cursor = BigQueryHook(bigquery_conn_id='connection_name').get_conn().cursor()
# one of these two:
cursor.execute(SQL_QUERY)  # if non-legacy
cursor.job_id = cursor.run_query(bql=SQL_QUERY, use_legacy_sql=False)  # if legacy
result = cursor.fetchone()
return "task_one" if result[0] == 1400 else "task_two"  # depends on results format
I have been using Airflow for the last 6 months, and I have been so happy defining my workflows in Airflow.
I have the scenario below where I am not able to get the XCom value (highlighted in yellow).
Please find the sample code below:
Workflow
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

dummy_operator = DummyOperator(
    task_id='Start',
    dag=main_dag
)

push_function_task = PythonOperator(
    task_id='push_function',
    provide_context=True,
    python_callable=push_function,
    op_kwargs={},
    dag=main_dag
)

push_function_task.set_upstream(dummy_operator)

custom_task = CustomOperator(
    dag=main_dag,
    task_id='import_data',
    provide_context=True,
    url="http://www.google.com/{}".format("{{task_instance.xcom_pull(task_ids='push_function')}}")
)

custom_task.set_upstream(push_function_task)
Notes:
1. CustomOperator is my own operator written to download the data from the given URL
Please help me.
Thanks,
Samanth
I believe you have a mismatch in keys when pushing and pulling the XCom. Each XCom value is tied to a DAG ID, task ID, and key. If you are pushing with the reportid key, then you need to pull with it as well.
Note that if a key is not specified to xcom_pull(), it uses the default key of return_value. This is because if a task returns a result, Airflow will automatically push it to XCom under the return_value key.
This gives you two options to fix your issue:
1) Continue to push to the reportid key and make sure you pull from it as well:
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function', key='reportid') }}")
)
2) Have push_function() return the value you want to push to XCom, then pull from the default key.
def push_function(**context):
    return 'xyz'

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)