TypeError: PostgresOperator.partial() got an unexpected keyword argument 'schedule_interval' - airflow

So I am trying to use partial() method in PostgresOperator, but I am getting this error apparently because I unintentionally pass schedule_interval to it. I looked up in airflow git there's no such parameter for partial() method of BasicOperator which I assume is parent for all the operators classes.
So I am confused I have to pass this parameter to DAG params yet there's no such parameter for .partial() so how am I supposed to create this dag and tasks? I haven't found any information on how to pull it off.
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2022, 9, 7),
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval' : '#daily'
}
with DAG(
'name',
default_args=default_args
) as dag:
#task
def generate_sql_queries(src_list: list) -> list:
queries = []
for i in src_list:
query = f'SELECT sql_epic_function()'
queries.append(query)
return queries
queries = generate_sql_queries([4,8])
task = PostgresOperator.partial(
task_id='name',
postgres_conn_id='postgres_default_id_connection' # don't forget to change
).expand(sql=queries)
task

The values in default_args dict are passed to each Operator, and you have 'schedule_interval' : '#daily' on it. Also, schedule_interval is not an operator argument but a DAG object argument. So, besides removing it from the default_args dict, you have to add it to the DAG definition like:
with DAG(
'name',
schedule_interval="#daily"
default_args=default_args
) as dag:

Related

Airflow DAG status is Success, but task states Dag has yet to run

I am using Airflow 2.3.4 ;
I am Triggering with Config. When I hardcode the config values, this DAG runs successfully.
But on Triggering after passing config
my tasks never start, but the status turn green(Success).
Please help me understand what's going wrong !
from datetime import datetime, timedelta
from airflow import DAG
from pprint import pprint
from airflow.operators.python import PythonOperator
from operators.jvm import JVMOperator
args = {
'owner': 'satyam',
'depends_on_past': False,
'start_date': datetime.utcnow(),
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag_params = {
'dag_id': 'synthea_etl_end_to_end_with_config',
'start_date': datetime.utcnow(),
'end_date': datetime(2025, 2, 5),
'default_args': args,
'schedule_interval': timedelta(hours=4)
}
dag = DAG(**dag_params)
# [START howto_operator_python]
def print_context(ds, **kwargs):
"""Print the Airflow context and ds variable from the context."""
pprint(kwargs)
pprint(ds)
return 'Whatever you return gets printed in the logs'
jvm_task = JVMOperator(
task_id='jvm_task',
correlation_id='123456',
jar='/home/i1136/Synthea/synthea-with-dependencies.jar',
options={
'java_args': [''],
'jar_args': ["-p {{ dag_run.conf['population_count'] }} --exporter.fhir.export {{ dag_run.conf['fhir'] }} --exporter.ccda.export {{ dag_run.conf['ccda'] }} --exporter.csv.export {{ dag_run.conf['csv'] }} --exporter.csv.append_mode {{ dag_run.conf['csv'] }} --exporter.baseDirectory /home/i1136/Synthea/output_dag_config" ]
})
print_context_task = PythonOperator(task_id='print_context_task', provide_context=True, python_callable=print_context, dag=dag)
jvm_task.set_downstream(print_context_task)
The problem is with 'start_date': datetime.utcnow(), which is always >= the dag_run start_date, in this case Airflow will mark the run as succeeded without running it. For this variable, it's better to choose the minimum date of your runs, if you don't have one, you can use the yesterday date, but the next day you will not be able to re-run the tasks failed on the previous day:
import pendulum
dag_params = {
...,
'start_date': pendulum.yesterday(),
...,
}
In my case it was a small bug in python script - not detected by Airflow after refreshing

Apache-Airflow - Task is in the none state when running DAG

Just started with airflow and wanted to run simple dag with BashOperator that outputs 'Hello' to console
I noticed that my status is indefinitely stuck in 'Running'
When I go on task details, I get this:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
Any suggestions or hints are much appreciated.
Dag:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
default_args = {
'owner': 'dude_whose_doors_open_like_this_-W-',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['yessure#gmail.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG(
'Test',
default_args=default_args,
description='Test',
schedule_interval=timedelta(days=1)
)
t1 = BashOperator(
task_id='ECHO',
bash_command='echo "Hello"',
dag=dag
)
t1
I've managed to solve it by adding 'start_date': dt(1970, 1, 1),
to default args object
and adding schedule_interval=None to my dag object
Could you remove the last line of t1- this isn't necessary. Also start_dateshouldn't be set dynamically - this can lead to problems with the scheduling.

Airflow sql_path not able to read the sql files when passed as Jinja Template Variable

I am trying to use Jinja template variable as against using Variable.get('sql_path'), So as to avoid hitting DB for every scan of the dag file
Original code
import datetime
import os
from functools import partial
from datetime import timedelta
from airflow.models import DAG,Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email
SNOWFLAKE_CONN_ID = 'etl_conn'
tmpl_search_path = []
for subdir in ['business/', 'audit/', 'business/transform/']:
tmpl_search_path.append(os.path.join(Variable.get('sql_path'), subdir))
def get_db_dag(
*,
dag_id,
start_date,
schedule_interval,
max_taskrun,
max_dagrun,
proc_nm,
load_sql
):
default_args = {
'owner': 'airflow',
'start_date': start_date,
'provide_context': True,
'execution_timeout': timedelta(minutes=max_taskrun),
'retries': 0,
'retry_delay': timedelta(minutes=3),
'retry_exponential_backoff': True,
'email_on_retry': False,
}
dag = DAG(
dag_id=dag_id,
schedule_interval=schedule_interval,
dagrun_timeout=timedelta(hours=max_dagrun),
template_searchpath=tmpl_search_path,
default_args=default_args,
max_active_runs=1,
catchup='{{var.value.dag_catchup}}',
on_failure_callback=alert_email_callback,
)
load_table = SnowflakeOperator(
task_id='load_table',
sql=load_sql,
snowflake_conn_id=SNOWFLAKE_CONN_ID,
autocommit=True,
dag=dag,
)
load_vcc_svc_recon
return dag
# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
dag_id='edw_table_A',
start_date=datetime.datetime(2020, 5, 21),
schedule_interval='0 5 * * *',
max_taskrun=3, # Minutes
max_dagrun=1, # Hours
load_sql='recon/extract.sql',
)
When I have replaced Variable.get('sql_path') with Jinja Template '{{var.value.sql_path}}' as below and ran the Dag, it threw an error as below, it was not able to get the file from the subdirectory of the SQL folder
tmpl_search_path = []
for subdir in ['bus/', 'audit/', 'business/snflk/']:
tmpl_search_path.append(os.path.join('{{var.value.sql_path}}', subdir))
Got below error as
inja2.exceptions.TemplateNotFound: extract.sql
Templates are not rendered everywhere in a DAG script. Usually they are rendered in the templated parameters of Operators. So, unless you pass the elements of tmpl_search_path to some templated parameter {{var.value.sql_path}} will not be rendered.
The template_searchpath of DAG is not templated. That is why you cannot pass Jinja templates to it.
The options of which I can think are
Hardcode the value of Variable.get('sql_path') in the pipeline script.
Save the value of Variable.get('sql_path') in a configuration file and read it from there in the pipeline script.
Move the Variable.get() call out of the for-loop. This will result in three times fewer requests to the database.
More info about templating in Airflow.

Is it possible to have a pipeline in Airflow that does not tie to any schedule?

I need to have pipeline that will be executed either manually or programmatically, is possible with Airflow? Looks like right now each workflow MUST be tied to a schedule.
Just set the schedule_interval to None when you create the DAG:
dag = DAG('workflow_name',
template_searchpath='path',
schedule_interval=None,
default_args=default_args)
From the Airflow Manual:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG arguments, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
The manual then goes on to list some cron 'presets' one of which is None.
Yes, this can be achieved by passing None to schedule_interval in default_args.
Check this documation on DAG run.
For example:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval': None, # Check this line
}
In Airflow, every DAG is required to have a start date and schedule interval*, for example hourly:
import datetime
dag = DAG(
dag_id='my_dag',
schedule_interval=datetime.timedelta(hours=1),
start_date=datetime(2018, 5, 23),
)
(Without a schedule how would it know when to run?)
Alternatively to a cron schedule, you can set the schedule to #once to only run once.
*One exception: You can omit the schedule for externally triggered DAGs because Airflow will not schedule them itself.
However, that said, if you omit the schedule, then you need to trigger the DAG externally somehow. If you want to be able to call a DAG programmatically, for instance, as a result of a separate condition occurring in another DAG, you can do that with the TriggerDagRunOperator. You might also hear this idea called externally triggered DAGs.
Here's a usage example from the Airflow Example DAGs:
File 1 - example_trigger_controller_dag.py:
"""This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - DAG being triggered (in example_trigger_target_dag.py)
This example illustrates the following features :
1. A TriggerDagRunOperator that takes:
a. A python callable that decides whether or not to trigger the Target DAG
b. An optional params dict passed to the python callable to help in
evaluating whether or not to trigger the Target DAG
c. The id (name) of the Target DAG
d. The python callable can add contextual info to the DagRun created by
way of adding a Pickleable payload (e.g. dictionary of primitives). This
state is then made available to the TargetDag
2. A Target DAG : c.f. example_trigger_target_dag.py
"""
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime
import pprint
pp = pprint.PrettyPrinter(indent=4)
def conditionally_trigger(context, dag_run_obj):
"""This function decides whether or not to Trigger the remote DAG"""
c_p = context['params']['condition_param']
print("Controller DAG : conditionally_trigger = {}".format(c_p))
if context['params']['condition_param']:
dag_run_obj.payload = {'message': context['params']['message']}
pp.pprint(dag_run_obj.payload)
return dag_run_obj
# Define the DAG
dag = DAG(dag_id='example_trigger_controller_dag',
default_args={"owner": "airflow",
"start_date": datetime.utcnow()},
schedule_interval='#once')
# Define the single task in this controller example DAG
trigger = TriggerDagRunOperator(task_id='test_trigger_dagrun',
trigger_dag_id="example_trigger_target_dag",
python_callable=conditionally_trigger,
params={'condition_param': True,
'message': 'Hello World'},
dag=dag)
File 2 - example_trigger_target_dag.py:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime
import pprint
pp = pprint.PrettyPrinter(indent=4)
# This example illustrates the use of the TriggerDagRunOperator. There are 2
# entities at work in this scenario:
# 1. The Controller DAG - the DAG that conditionally executes the trigger
# (in example_trigger_controller.py)
# 2. The Target DAG - DAG being triggered
#
# This example illustrates the following features :
# 1. A TriggerDagRunOperator that takes:
# a. A python callable that decides whether or not to trigger the Target DAG
# b. An optional params dict passed to the python callable to help in
# evaluating whether or not to trigger the Target DAG
# c. The id (name) of the Target DAG
# d. The python callable can add contextual info to the DagRun created by
# way of adding a Pickleable payload (e.g. dictionary of primitives). This
# state is then made available to the TargetDag
# 2. A Target DAG : c.f. example_trigger_target_dag.py
args = {
'start_date': datetime.utcnow(),
'owner': 'airflow',
}
dag = DAG(
dag_id='example_trigger_target_dag',
default_args=args,
schedule_interval=None)
def run_this_func(ds, **kwargs):
print("Remotely received value of {} for key=message".
format(kwargs['dag_run'].conf['message']))
run_this = PythonOperator(
task_id='run_this',
provide_context=True,
python_callable=run_this_func,
dag=dag)
# You can also access the DagRun object in templates
bash_task = BashOperator(
task_id="bash_task",
bash_command='echo "Here is the message: '
'{{ dag_run.conf["message"] if dag_run else "" }}" ',
dag=dag)

Airflow - Run each python function separately

I have the airflow below script that runs all python scripts as one function. I would like to have each the python functions to run individually so that I could keep track of each function and their status.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow#airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="#once")
#######################
## Login to DB
def db_log():
global db_con
try:
db_con = psycopg2.connect(
" dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
except:
print("Connection Failed.")
print('Connected successfully')
return (db_con)
def insert_data():
cur = db_con.cursor()
cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
def job_run():
db_log()
insert_data()
##########################################
t1 = PythonOperator(
task_id='DB_Connect',
python_callable=job_run,
# bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
The above script works just fine but would like to split this by function to keep better track. Could anyone assist on this. Tnx..
Updated Code (version 2):
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow#airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="#once")
#######################
## Login to DB
def db_log(**kwargs):
global db_con
try:
db_con = psycopg2.connect(
" dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
except:
print("Connection Failed.")
print('Connected successfully')
task_instance = kwargs['task_instance']
task_instance.xcom_push(value="db_con", key="db_log")
return (db_con)
def insert_data(**kwargs):
v1 = task_instance.xcom_pull(key="db_con", task_ids='db_log')
return (v1)
cur = db_con.cursor()
cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
#def job_run():
# db_log()
# insert_data()
##########################################
t1 = PythonOperator(
task_id='Connect',
python_callable=db_log,provide_context=True,
dag=dag)
t2 = PythonOperator(
task_id='Query',
python_callable=insert_data,provide_context=True,
dag=dag)
t1 >> t2
There are two possible solutions for this:
A) Create several tasks per function
The tasks in Airflow are being called in separate processes. Variables which get defined as global won't work since the second task can usually not see into the variables of the first task.
Introducing: XCOM. This is a feature of Airflow and we answered a few questions for this already, for example here (with examples): Python Airflow - Return result from PythonOperator
EDIT
You have to provide context and pass the context along as written in the examples. For your example, this would mean:
add provide_context=True, to your PythonOperator
change the signature of job_run to def job_run(**kwargs):
pass the kwargs to data_warehouse_login with data_warehouse_login(kwargs) inside the function
B) Create one complete function
In this very scenario I'd still remove the global (just call insert_data, call data_warehouse_login from within and return the connection) and use just one task.
If an error occurs, throw an exception. Airflow will handle these just fine. Just make sure to put appropriate messages in the exception and use the best exception type.

Resources