Airflow - Run each Python function separately

I have the Airflow script below, which runs all the Python functions as a single task. I would like each Python function to run individually so that I can keep track of each function and its status.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_log():
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (db_con)
def insert_data():
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
def job_run():
    db_log()
    insert_data()
##########################################
t1 = PythonOperator(
task_id='DB_Connect',
python_callable=job_run,
# bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
The above script works fine, but I would like to split it by function to keep better track of each step. Could anyone assist with this? Thanks.
Updated Code (version 2):
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(value="db_con", key="db_log")
    return (db_con)
def insert_data(**kwargs):
    v1 = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    return (v1)
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
#def job_run():
# db_log()
# insert_data()
##########################################
t1 = PythonOperator(
task_id='Connect',
python_callable=db_log,provide_context=True,
dag=dag)
t2 = PythonOperator(
task_id='Query',
python_callable=insert_data,provide_context=True,
dag=dag)
t1 >> t2

There are two possible solutions for this:
A) Create one task per function
Tasks in Airflow run in separate processes. Variables defined as global won't work, since the second task usually cannot see the variables of the first task.
This is what XCom is for. It is a feature of Airflow, and we have already answered a few questions about it, for example here (with examples): Python Airflow - Return result from PythonOperator
EDIT
You have to provide the context and pass it along as written in the examples. For your example, this would mean:
add provide_context=True, to your PythonOperator
change the signature of job_run to def job_run(**kwargs):
pass the kwargs on to data_warehouse_login by calling data_warehouse_login(**kwargs) inside the function
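As a minimal sketch of the XCom hand-off described in A), the two callables below exchange a small, plain value instead of relying on a global. FakeTaskInstance is a hypothetical stand-in written only so the snippet runs without Airflow; in a real DAG, Airflow supplies the task instance as kwargs['task_instance'] when provide_context=True is set. The key name 'dsn' and the placeholder connection string are assumptions made for illustration.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's task instance (illustration only)."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the XCom table

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        # Returns None when nothing was pushed under that task id and key
        return self._store.get((task_ids, key))


def db_log(**kwargs):
    # Push connection *parameters* (plain data), not the connection object
    kwargs['task_instance'].xcom_push(key='dsn', value="dbname='name' host='host'")


def insert_data(**kwargs):
    # task_ids must match the upstream task_id, key must match the pushed key
    return kwargs['task_instance'].xcom_pull(task_ids='Connect', key='dsn')


store = {}
db_log(task_instance=FakeTaskInstance('Connect', store))
pulled = insert_data(task_instance=FakeTaskInstance('Query', store))

# A pull with a task id / key that was never pushed yields None
missing = FakeTaskInstance('Query', store).xcom_pull(task_ids='db_log', key='db_con')

print(pulled)
print(missing)
```

Note the last pull: when the task id or key does not match what was pushed, the result is None, and calling a method on it fails with an AttributeError.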
B) Create one complete function
In this particular scenario I'd still remove the global (just call insert_data, have it call data_warehouse_login and use the returned connection) and use just one task.
If an error occurs, raise an exception. Airflow handles these just fine. Just make sure to put an appropriate message in the exception and to use the most fitting exception type.
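A rough sketch of option B), assuming an in-memory SQLite database as a stand-in for the PostgreSQL connection so that the snippet runs anywhere (the connect helper and the sample rows are hypothetical). The single callable opens the connection itself, does the insert, and raises on failure instead of printing and carrying on:

```python
import sqlite3


def connect():
    # Stand-in for psycopg2.connect(...): an in-memory SQLite database (assumption)
    con = sqlite3.connect(":memory:")
    con.execute("create table tbl_2 (id integer, bill_no text, status text)")
    con.execute("create table tbl_1 (id integer, bill_no text, status text)")
    con.executemany("insert into tbl_2 values (?, ?, ?)",
                    [(1, "a", "ok"), (2, "b", "ok"), (3, "c", "ok")])
    return con


def insert_data():
    """Single callable for one task: connect, insert, commit, close."""
    con = connect()  # an exception raised here propagates and fails the task
    try:
        con.execute("insert into tbl_1 select id, bill_no, status from tbl_2 limit 2")
        con.commit()
        return con.execute("select count(*) from tbl_1").fetchone()[0]
    except sqlite3.Error as exc:
        # Re-raise with a descriptive message so the task log explains the failure
        raise RuntimeError("insert into tbl_1 failed: {}".format(exc)) from exc
    finally:
        con.close()


rows = insert_data()
print(rows)
```

Because nothing is stored in a global, the function can be handed to a single PythonOperator as python_callable=insert_data.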

Related

Airflow + Sentry - no information from dags/tasks

I am trying to start using Sentry to collect information from Airflow. I am using the newest version of Airflow (Sentry has been integrated with Airflow since v1.10.6). However, I am not able to get any information about the DAG or task status.
I prepared a simple DAG which should fail, but I don't receive anything in Sentry. The connection is established, because when I make a typo, for example in the imports, the error information is captured by Sentry. For this example I used the SequentialExecutor.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.utils.dates import days_ago
from airflow import AirflowException
from datetime import timedelta  # needed for retry_delay below
################################################################################
# dag
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': timedelta(seconds=3),
}
dag = DAG(
'debug_sentry',
default_args=default_args,
schedule_interval=None,
)
################################################################################
# first_task
def _first_task_callable(*args, **kwargs):
    pass
first_task = PythonOperator(
task_id='first_task',
python_callable=_first_task_callable,
provide_context=True,
trigger_rule=TriggerRule.ONE_SUCCESS,
dag=dag
)
################################################################################
# second_task_which_fails
def _second_task_which_fails_callable(*args, **kwargs):
    a = 1
    b = 0
    c = a / b
    return c
second_task_which_fails = PythonOperator(
task_id='second_task_which_fails',
python_callable=_second_task_which_fails_callable,
provide_context=True,
trigger_rule=TriggerRule.ONE_SUCCESS,
dag=dag
)
################################################################################
# third_task
def _third_task_callable(*args, **kwargs):
    pass
third_task = PythonOperator(
task_id='third_task',
python_callable=_third_task_callable,
provide_context=True,
trigger_rule=TriggerRule.ONE_SUCCESS,
dag=dag
)
################################################################################
first_task >> second_task_which_fails >> third_task
What did I do wrong, or did I miss something in the configuration in airflow.cfg?
[sentry]
sentry_dsn = https://<my_dsn>
There was a recent fix to the Sentry integration in Airflow as per: https://github.com/apache/airflow/pull/7232. Try updating airflow to this commit?

Airflow, Connecting to MsSql error "Adaptive Server is unavailable or does not exist"

I'm getting this error when trying to use Airflow to get_records().
pymssql.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (localhost:None)\n')
I used this guide for the setup:
https://tech.marksblogg.com/mssql-sql-server-linux-install-tutorial-and-guide.html
Using Python REPL, I can connect and return a result.
with pymssql.connect(server="localhost",
                     user="SA",
                     password="password",
                     database="database_name") as conn:
    df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
    print(df)
with pymssql.connect(server="127.0.0.1",
                     user="SA",
                     password="password",
                     database="database_name") as conn:
    df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
    print(df)
I update my Airflow connection with either of these setups and then run a test.
airflow test run_test_db test_database 2015-06-01
The error is produced.
Any ideas, please? The whole setup is contained within a single Linux (Vagrant) machine, with no virtual environments, so it is using the same version of pymssql to try to connect.
EDIT UPDATE
What's really annoying is that if I use the same connection settings directly in a DAG, there is no error and it runs perfectly fine.
So the connection string that is retrieved from the database must be different.
Is there a way to debug/print the connection string or its properties?
Example working DAG
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.mssql_hook import MsSqlHook
from datetime import datetime, timedelta
import pymssql
import pandas as pd
default_args = {
'owner': 'airflow',
'depends_on_past': True,
'start_date': datetime(2019, 2, 13),
'email': ['example@email.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('run_test_db', default_args=default_args, schedule_interval="0 01 * * 2-6")
def test_db(**context):
    with pymssql.connect(server="localhost",
                         user="SA",
                         password="Password123",
                         database="database") as conn:
        df = pd.read_sql("SELECT TOP 1 * FROM champ_dw_dim_currency", conn)
        print(df)
test_database = PythonOperator(
task_id='test_database',
python_callable=test_db,
execution_timeout=timedelta(minutes=3),
dag=dag,
provide_context=True,
op_kwargs={
'extra_detail': 'nothing'
})

Airflow - Trying to execute a set of Python functions

I am trying to execute an Airflow script that consists of a couple of Python functions. These functions basically query a database and perform a few tasks. I am trying to execute this in Airflow so that I can monitor each of these functions separately. Given below is the code I am trying to execute; it produces the following error:
Subtask: NameError: name 'task_instance' is not defined
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(value="db_con", key="db_log")
    return (db_con)
def insert_data(**kwargs):
    v1 = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    return (v1)
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
#def job_run():
# db_log()
# insert_data()
##########################################
t1 = PythonOperator(
task_id='Connect',
python_callable=db_log,provide_context=True,
dag=dag)
t2 = PythonOperator(
task_id='Query',
python_callable=insert_data,provide_context=True,
dag=dag)
t1 >> t2
Could anyone assist with this? Thanks.
Update 1 :
Encountered an error
AttributeError: 'NoneType' object has no attribute 'execute'
pointing to the last line on the above piece of code
cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
Complete code:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 5, 29, 12),
'email': ['airflow@airflow.com']
}
dag = DAG('sample1', default_args=default_args)
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439'")
    except:
        print("I am unable to connect")
    print('Connection Task Complete')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key="dwh_connection" , value = "dwh_connection")
    return (dwh_connection)
t1 = PythonOperator(
task_id='DWH_Connect',
python_callable=data_warehouse_login,provide_context=True,
dag=dag)
#######################
def insert_data(**kwargs):
    task_instance = kwargs['task_instance']
    db_con_xcom = task_instance.xcom_pull(key="dwh_connection", task_ids='DWH_Connect')
    cur = db_con_xcom
    cur.execute("""insert into tbl_1 select limit 2 """)
##########################################
t2 = PythonOperator(
task_id='DWH_Connect1',
python_callable=insert_data,provide_context=True,dag=dag)
t1 >> t2
This is a basic error message from Python.
NameError: name 'task_instance' is not defined
tells you that task_instance does not exist at the point where you try to use it.
The task instance is provided in the context, which is already being passed to the function.
Airflow passes the context because of the setting
provide_context=True,
in the task. The function definition also accepts kwargs:
def insert_data(**kwargs):
which is correct as well.
Correction
You first need to take the task instance out of the context like so:
task_instance = kwargs['task_instance']
Then you can use the task instance to call xcom_pull. So it should look like this (I put in a few comments as well):
def insert_data(**kwargs):
    task_instance = kwargs['task_instance']
    db_con_xcom = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    #return (v1) # wrong, why return here?
    #cur = db_con.cursor() # wrong, db_con might not be available
    cur = db_con_xcom
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
Since the question is becoming bigger I think it is appropriate to add a second answer.
Even after the edit from the comment "I removed the indentation portion of the code" I am still not sure about this bit of code:
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439'")
    except:
        print("I am unable to connect")
    print('Connection Task Complete')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key="dwh_connection" , value = "dwh_connection")
    return (dwh_connection)
It should look like this:
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439'")
    except:
        print("I am unable to connect")
    print('Connection Task Complete')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key="dwh_connection" , value = "dwh_connection")
    #return (dwh_connection) # don't need a return here
Besides that the idea in your other question (Python - AttributeError: 'NoneType' object has no attribute 'execute') to use a PostgresHook seems interesting to me. You might want to pursue that thought in the other question.
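One point worth illustrating here: XCom values are serialized before they are stored, so a live database connection is not something that can be handed from one task to the next, and the code above pushes only the literal string "dwh_connection" anyway. The sketch below uses an in-memory SQLite connection purely as a stand-in (an assumption so the snippet runs without a database server; to my knowledge psycopg2 connections behave the same way with respect to pickling).

```python
import pickle
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for a psycopg2 connection (assumption)

# A live connection wraps sockets/file handles and cannot be serialized
try:
    pickle.dumps(con)
    picklable = True
except (TypeError, pickle.PicklingError):
    picklable = False

# What the code above actually pushes is a plain string, not a connection
pushed = "dwh_connection"
has_execute = hasattr(pushed, "execute")

print(picklable, has_execute)
con.close()
```

Either way the downstream task ends up without a usable connection, which is why opening the connection inside the task that runs the query (or using a hook, as mentioned above) is the more practical route.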

Airflow - Script does not execute when triggered

I have an Airflow script that tries to insert data from one table into another; I am using an Amazon Redshift DB. The script given below does not execute when triggered. The task status remains 'no status' in the Graph View and no other error is shown.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2017, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_login():
    global db_conn
    try:
        db_conn = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Connection Task Complete: Connected to DB')
    return (db_conn)
#######################
def insert_data():
    cur = db_conn.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2 ;""")
    db_conn.commit()
    print('ETL Task Complete')
def job_run():
    db_login()
    insert_data()
##########################################
t1 = PythonOperator(
task_id='DBConnect',
python_callable=job_run,
bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
Could anyone help me find where the problem could be? Thanks.
Updated Code (05/28)
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def data_warehouse_login():
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (dwh_connection)
def insert_data():
    cur = dwh_connection.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
    dwh_connection.commit()
    print('Task Complete: Insert success')
def job_run():
    data_warehouse_login()
    insert_data()
##########################################
t1 = PythonOperator(
task_id='DWH_Connect',
python_callable=job_run(),
# bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
Log message when running the script
[2018-05-28 11:36:45,300] {jobs.py:343} DagFileProcessor26 INFO - Started process (PID=26489) to work on /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:45,306] {jobs.py:534} DagFileProcessor26 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2018-05-28 11:36:45,310] {jobs.py:1521} DagFileProcessor26 INFO - Processing file /Users/user/airflow/dags/sample.py for tasks to queue
[2018-05-28 11:36:45,310] {models.py:167} DagFileProcessor26 INFO - Filling up the DagBag from /Users/user/airflow/dags/sample.py
/Users/user/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Task Complete: Insert success
[2018-05-28 11:36:50,964] {jobs.py:1535} DagFileProcessor26 INFO - DAG(s) dict_keys(['latest_only', 'example_python_operator', 'test_utils', 'example_bash_operator', 'example_short_circuit_operator', 'example_branch_operator', 'tutorial', 'example_passing_params_via_test_command', 'latest_only_with_trigger', 'example_xcom', 'example_http_operator', 'example_skip_dag', 'example_trigger_target_dag', 'example_branch_dop_operator_v3', 'example_subdag_operator', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'example_trigger_controller_dag', 'insert_data2']) retrieved from /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:51,159] {jobs.py:1169} DagFileProcessor26 INFO - Processing example_subdag_operator
[2018-05-28 11:36:51,167] {jobs.py:566} DagFileProcessor26 INFO - Skipping SLA check for <DAG: example_subdag_operator> because no tasks in DAG have SLAs
[2018-05-28 11:36:51,170] {jobs.py:1169} DagFileProcessor26 INFO - Processing sample_dag
[2018-05-28 11:36:51,174] {jobs.py:354} DagFileProcessor26 ERROR - Got an exception! Propagating...
Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 346, in helper
pickle_dags)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1581, in process_file
self._process_dags(dagbag, dags, ti_keys_to_schedule)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1171, in _process_dags
dag_run = self.create_dag_run(dag)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 776, in create_dag_run
if next_start <= now:
TypeError: '<=' not supported between instances of 'NoneType' and 'datetime.datetime'
Log from the Graph View
* Log file isn't local.
* Fetching here: http://:8793/log/sample_dag/DWH_Connect/2018-05-28T12:23:57.595234
*** Failed to fetch log file from worker.
* Reading remote logs...
* Unsupported remote log location.
Instead of a single PythonOperator, you need a BashOperator and a PythonOperator.
You are getting the error because PythonOperator doesn't have a bash_command argument:
t1 = PythonOperator(
task_id='DBConnect',
python_callable=db_login,
dag=dag
)
t2 = BashOperator(
task_id='Run_Python_File',  # task_id may not contain spaces
bash_command='python3 ~/airflow/dags/sample.py',
dag=dag
)
t1 >> t2
To extend the answer kaxil provided: you should be using an IDE to develop for Airflow. PyCharm works fine for me.
That being said, please make sure to look up the available parameters in the docs next time. For PythonOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.PythonOperator
Signature looks like:
class airflow.operators.PythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)
and for BashOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.BashOperator
Signature is:
class airflow.operators.BashOperator(bash_command, xcom_push=False, env=None, output_encoding='utf-8', *args, **kwargs)
Highlights are from me to show the parameters you have been using.
Make sure to dig through the documentation a bit before using an Operator is my recommendation.
EDIT
After seeing the code update there is one thing left:
Make sure that when you define python_callable in a task you pass the function without parentheses; otherwise the function is called right away, while the DAG file is being parsed (which is very unintuitive if you don't know about it). So your code should look like this:
t1 = PythonOperator(
task_id='DWH_Connect',
python_callable=job_run,
dag=dag)
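To make the difference concrete, here is a small sketch that runs without Airflow. SketchOperator is a hypothetical stand-in for PythonOperator that only stores the callable and rejects anything that is not callable; the real operator's error type and message may differ between versions.

```python
calls = []


def job_run():
    calls.append("ran")


class SketchOperator:
    """Hypothetical stand-in for PythonOperator: stores the callable, runs it later."""

    def __init__(self, task_id, python_callable):
        if not callable(python_callable):
            raise TypeError("python_callable must be callable")
        self.task_id = task_id
        self.python_callable = python_callable

    def execute(self):
        return self.python_callable()


# Correct: pass the function object; nothing runs at definition time
t1 = SketchOperator(task_id='DWH_Connect', python_callable=job_run)
ran_at_definition = list(calls)

# Incorrect: job_run() executes immediately and its return value (None) is passed
try:
    SketchOperator(task_id='DWH_Connect', python_callable=job_run())
    rejected = False
except TypeError:
    rejected = True

print(ran_at_definition, calls, rejected)
```

This matches the log in the question, where "Task Complete: Insert success" is printed while the DAG file is still being processed.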

Airflow - Broken DAG - Timeout

I have a DAG that executes a function that connects to a Postgres DB, deletes the contents in the table and then inserts a new data set.
I am trying this on my local machine, and when I try to run it, the web server takes a long time to connect and in most cases doesn't succeed. However, as part of the connecting process it seems to be executing the queries in the background. Since I have a delete function, I see the data getting deleted from the table (basically one of the functions gets executed) even though I have not scheduled the script or started it manually. Could someone advise as to what I am doing wrong here?
One error that pops out in the UI is
Broken DAG: [/Users/user/airflow/dags/dwh_sample23.py] Timeout
I also see an info icon (i) next to the DAG id in the UI that says: This DAG isn't available in the web server's DAG object.
Given below is the code I am using:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'admin',
'depends_on_past': False,
'start_date': datetime(2018, 5, 21),
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)
def tbl1_del():
    ''' This function takes clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()
def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
t1 = BashOperator(
task_id='DB_Connect',
python_callable=db_login(),
bash_command='python3 ~/airflow/dags/dwh_sample23.py',
dag=dag)
t2 = BashOperator(
task_id='del',
python_callable=tbl1_del(),
bash_command='python3 ~/airflow/dags/dwh_sample23.py',
dag=dag)
t3 = BashOperator(
task_id='populate',
python_callable=pop_tbl1(),
bash_command='python3 ~/airflow/dags/dwh_sample23.py',
dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)
Could anyone assist? Thanks.
Instead of using BashOperator, you can use PythonOperator and run db_login, tbl1_del and pop_tbl1 as PythonOperator tasks:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'admin',
'depends_on_past': False,
'start_date': datetime(2018, 5, 21),
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)
def tbl1_del():
    ''' This function takes clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()
def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
# Pass the function objects (no parentheses) so they run when the task executes
t1 = PythonOperator(
task_id='DB_Connect',
python_callable=db_login,
dag=dag)
t2 = PythonOperator(
task_id='del',
python_callable=tbl1_del,
dag=dag)
t3 = PythonOperator(
task_id='populate',
python_callable=pop_tbl1,
dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)
This is really old by now, but we got this error in prod and I found this question, and I think it would be nice for it to have an answer.
Some of the code is executed during DAG load, i.e. you actually run
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
inside the webserver and the scheduler loop, whenever they load the DAG definition from the file.
I believe you didn't intend that to happen.
Everything should work just fine if you just remove these 4 lines.
In general, don't call functions that you want the executors to run at file/module level, because when the interpreter of the scheduler/webserver loads the file to get the DAG definition, it will invoke them.
Just try putting this in your DAG file and check the webserver logs to see what happens.
from time import sleep
def do_some_printing():
    print(1111111)
    sleep(60)
do_some_printing()
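The same effect can be observed without Airflow and without waiting 60 seconds. The sketch below writes a small, hypothetical DAG-like file to a temporary directory and imports it by path, which is roughly what any loader has to do to read a DAG definition; the file name and function names are made up for illustration.

```python
import importlib.util
import os
import tempfile

# Hypothetical DAG-like file: one top-level call, one function that is only defined
source = (
    "events = []\n"
    "def load():\n"
    "    events.append('load ran')\n"
    "def not_called():\n"
    "    events.append('not_called ran')\n"
    "load()  # module level: runs on every import/parse of the file\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sketch_dag.py")
    with open(path, "w") as fh:
        fh.write(source)

    # Import the file by path; all module-level statements execute here
    spec = importlib.util.spec_from_file_location("sketch_dag", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

print(module.events)
```

Only the function that is called at module level has run; the one that is merely defined has not. In a DAG file, the work therefore belongs inside the callables handed to the operators.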
