Airflow: connecting to MsSql, error "Adaptive Server is unavailable or does not exist"

I'm getting this error when trying to use Airflow to get_records().
pymssql.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (localhost:None)\n')
I used this guide for the setup:
https://tech.marksblogg.com/mssql-sql-server-linux-install-tutorial-and-guide.html
Using the Python REPL, I can connect and return a result:
with pymssql.connect(server="localhost",
user="SA",
password="password",
database="database_name") as conn:
df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
print(df)
with pymssql.connect(server="127.0.0.1",
user="SA",
password="password",
database="database_name") as conn:
df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
print(df)
I update my Airflow Connections with either of these setups and then run a test.
airflow test run_test_db test_database 2015-06-01
The same error is produced.
Any ideas, please? The whole setup is contained within the one Linux (Vagrant) machine, with no virtual environments, so it's using the same version of pymssql to try to connect.
EDIT UPDATE
What's really annoying is that if I use the same connection string in a DAG there is no error and it runs perfectly fine.
So the connection string that is retrieved from the Airflow database must be different.
Is there a way to debug/print the connection string/properties?
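One way to see exactly what the hook will receive is to load the connection through Airflow itself. A minimal sketch, assuming Airflow 1.x import paths and using 'mssql_default' as a stand-in for whatever conn_id you configured:
```
# Sketch: print what Airflow actually stores for the connection.
# 'mssql_default' is a placeholder for your configured conn_id.
from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection('mssql_default')
# The "(localhost:None)" in the traceback suggests the port resolves to None,
# so check host, port, schema and login here.
print(conn.host, conn.port, conn.schema, conn.login)
```
If conn.port prints None, filling in the port field of the Airflow connection (1433 is the SQL Server default) is worth trying.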
Example working DAG
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.mssql_hook import MsSqlHook
from datetime import datetime, timedelta
import pymssql
import pandas as pd

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2019, 2, 13),
    'email': ['example@email.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('run_test_db', default_args=default_args, schedule_interval="0 01 * * 2-6")

def test_db(**context):
    with pymssql.connect(server="localhost",
                         user="SA",
                         password="Password123",
                         database="database") as conn:
        df = pd.read_sql("SELECT TOP 1 * FROM champ_dw_dim_currency", conn)
        print(df)

test_database = PythonOperator(
    task_id='test_database',
    python_callable=test_db,
    execution_timeout=timedelta(minutes=3),
    dag=dag,
    provide_context=True,
    op_kwargs={
        'extra_detail': 'nothing'
    })
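For comparison, a rough sketch of the failing get_records() path routed through MsSqlHook rather than a raw pymssql.connect(); the conn_id and query are placeholders, not the exact values from the broken setup:
```
# Sketch: same query, but through the Airflow connection via MsSqlHook.
# 'mssql_default' and the table name are placeholders.
from airflow.hooks.mssql_hook import MsSqlHook

def test_db_via_hook(**context):
    hook = MsSqlHook(mssql_conn_id='mssql_default')
    records = hook.get_records("SELECT TOP 1 * FROM currency")
    print(records)
```
If this version fails while the raw pymssql version works, the difference almost has to be in the stored connection fields (host, port, schema, login).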

Related

Getting error while using LivyBatchOperator in Airflow, DAG getting crashed

Can someone help me with this? I'm using LivyBatchOperator in Airflow; below is my code.
Apart from that, what's another way to run a Spark job in Airflow besides the Spark operator? Spark is installed on a different machine in my case.
I'm getting this error in the Airflow UI: "No module named 'airflow_livy'".
```
from datetime import datetime, timedelta
from airflow_livy.batch import LivyBatchOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import DAG

default_args = {
    'owner': 'airflow',
    'start-date': datetime(2020, 8, 4),
    'retires': 0,
    'catchup': False,
    'retry-delay': timedelta(minutes=5),
}

dag_config: DAG = DAG(
    'Airflow7', description='Hello world example', schedule_interval='0 12 * * *',
    start_date=datetime(2020, 8, 4), catchup=False)

livy_Operator_SubmitTask = LivyBatchOperator(
    task_id='spark-submit_job_livy',
    class_name='Class name ',
    file='File path of my jar',
    arguments=['Test'],
    verify_in='spark',
    dag=dag_config
)

livy_Operator_SubmitTask
```
Try importing this namespace instead:
from airflow.providers.apache.livy.operators.livy import LivyOperator
Found in:
https://github.com/apache/airflow/blob/master/airflow/providers/apache/livy/example_dags/example_livy.py
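A rough sketch of what the provider-package operator could look like for the job above, assuming apache-airflow-providers-apache-livy is installed and a Livy connection exists; the conn_id, jar path and class name are placeholders:
```
# Sketch using the provider package; 'livy_default', the jar path and the class
# name are placeholders, not values from the original DAG.
from airflow.providers.apache.livy.operators.livy import LivyOperator

livy_submit = LivyOperator(
    task_id='spark_submit_job_livy',
    file='path/to/my.jar',
    class_name='com.example.Main',
    args=['Test'],
    livy_conn_id='livy_default',
    polling_interval=30,  # poll the Livy batch until it completes
    dag=dag_config,
)
```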

Querying Hive using Airflow

I am trying to execute a query in Hive using the Airflow HiveOperator. My code is below:
import datetime as dt
from airflow.models import DAG
from airflow.operators.hive_operator import HiveOperator

default_args = {
    'owner': 'dime',
    'start_date': dt.datetime(2020, 3, 24),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

hql_query = """USE testdb;
CREATE TABLE airflow-test-table LIKE testtable;"""

load_hive = DAG(dag_id='load_hive', default_args=default_args, schedule_interval='0 * * * *')

hive_copy = HiveOperator(task_id="hive_copy", hql=hql_query, hive_cli_conn_id="hive_cli_default", dag=load_hive)

hive_copy
While executing I am getting the error:
No such file or directory: 'hive': 'hive'
P.S. The Airflow installation is on a different machine than the one where Hive is installed.
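HiveOperator with a hive_cli connection shells out to a local hive binary, so this error usually means the Hive CLI simply isn't installed on the machine (or worker) running Airflow. A quick way to confirm that, sketched as a BashOperator added to the same DAG:
```
# Sketch: confirm whether the `hive` binary is on the PATH of the Airflow worker.
from airflow.operators.bash_operator import BashOperator

check_hive_cli = BashOperator(
    task_id='check_hive_cli',
    bash_command='which hive || echo "hive CLI not found on this machine"',
    dag=load_hive,
)
```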

Airflow does not run DAGs

Context: I successfully installed Airflow on EC2 and changed things like executor to LocalExecutor, sql_alchemy_conn to postgresql+psycopg2://postgres@localhost:5432/airflow, and max_threads to 10.
My problem is that when I create a DAG which I indicate should run every day, everything is fine, but when I create a DAG that should run at, say, 10am on Monday and Wednesday, Airflow does not run it. Does anybody know what I could be doing wrong and what I should do to fix this issue?
DAG for the script which runs fine:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta

args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'daily_script',
    default_args=args,
    description='daily_script',
    schedule_interval="0 10 * * *",
)

t1 = BashOperator(
    task_id='daily',
    bash_command='cd /root/ && python3 DAILY_WORK.py',
    dag=dag)

t1
DAG for the script which should run on Monday and Wednesday, but does not run at all:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta

args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'monday_wednesday',
    default_args=args,
    description='monday_wednesday',
    schedule_interval="0 10 * * 1,3",
)

t1 = BashOperator(
    task_id='monday_wednesday',
    bash_command='cd /root/ && python3 not_daily_work.py',
    dag=dag)

t1
I also have some problems with the scheduler: it tends to die after running for more than 10 hours. Does anybody know why that happens?
Thank you in advance!
Can you try changing the start_date to a static datetime, e.g. datetime(2020, 3, 20), instead of using airflow.utils.dates.days_ago(1)?
Maybe read through the scheduling examples here to understand why your code didn't work. From that documentation:
Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
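Putting those two points together, a sketch of the default arguments with a static start_date (the specific date is just an example):
```
# Sketch: static start_date instead of days_ago(1), so the scheduler has a
# completed "0 10 * * 1,3" interval to trigger after.
from datetime import datetime, timedelta

args = {
    'owner': 'arseniyy123',
    'start_date': datetime(2020, 3, 20),  # fixed date in the past
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
```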

PythonOperator with python_callable set gets executed constantly

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from workflow.task import some_task

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['jimin.park1@aig.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'start_date': airflow.utils.dates.days_ago(0)
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('JiminTest', default_args=default_args, schedule_interval='*/1 * * * *', catchup=False)

t1 = PythonOperator(
    task_id='Task1',
    provide_context=True,
    python_callable=some_task,
    dag=dag
)
The some_task function itself simply appends a timestamp to a file. As you can see in the DAG file, the task is configured to run every 1 minute.
def some_task(ds, **kwargs):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')
I simply tail -f the output file and started up the webserver without the scheduler running. The function was being called and things were being appended to the file when the webserver started up. When I start up the scheduler, the file gets appended to on each execution loop.
What I want is for the function to be executed every minute as intended, not on every execution loop.
The scheduler will run each DAG file on every scheduler loop, including all import statements.
Is there any code running at module level in the file from which you are importing the function?
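In other words, anything at module level in workflow/task.py executes every time the file is imported during DAG parsing; only the body of the callable runs when a task instance is executed. A sketch of the distinction (the module name comes from the import in the question):
```
# workflow/task.py (sketch)
import datetime

def some_task(ds, **kwargs):
    # Runs only when a worker executes the task instance.
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')

# Anything out here at module level runs on EVERY parse of the DAG file
# (scheduler loop, webserver, CLI), which matches the "appended on each
# execution loop" behaviour described above.
# some_task(None)  # <- a stray module-level call like this is the usual culprit
```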
Try checking the scheduler_heartbeat_sec config parameter in your config file. For your case it should be smaller than 60 seconds.
If you want the scheduler not to catch up on previous runs, set catchup_by_default to False (I am not sure if this is relevant to your question, though).
Please indicate which Apache Airflow version you are using.

Airflow - Run each python function separately

I have the Airflow script below that runs all the Python functions as one task. I would like to have each of the Python functions run individually so that I can keep track of each function and its status.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
# from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 23, 12),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")

#######################
## Login to DB
def db_log():
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (db_con)

def insert_data():
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")

def job_run():
    db_log()
    insert_data()

##########################################
t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=job_run,
    # bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag)

t1
The above script works just fine, but I would like to split it by function to keep better track. Could anyone assist with this? Thanks.
Updated code (version 2):
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
# from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 23, 12),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")

#######################
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(value="db_con", key="db_log")
    return (db_con)

def insert_data(**kwargs):
    v1 = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    return (v1)
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")

# def job_run():
#     db_log()
#     insert_data()

##########################################
t1 = PythonOperator(
    task_id='Connect',
    python_callable=db_log, provide_context=True,
    dag=dag)

t2 = PythonOperator(
    task_id='Query',
    python_callable=insert_data, provide_context=True,
    dag=dag)

t1 >> t2
There are two possible solutions for this:
A) Create several tasks, one per function
Tasks in Airflow are invoked in separate processes. Variables defined as global won't work, since the second task usually cannot see into the variables of the first task.
Introducing: XCom. This is a feature of Airflow and we have answered a few questions about it already, for example here (with examples): Python Airflow - Return result from PythonOperator
EDIT
You have to provide the context and pass it along as written in the examples. For your example, this would mean (see the sketch after this list):
add provide_context=True, to your PythonOperator
change the signature of job_run to def job_run(**kwargs):
pass the kwargs to data_warehouse_login with data_warehouse_login(kwargs) inside the function
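A minimal sketch of that pattern, using the names from this answer (job_run, data_warehouse_login) as placeholders and pushing a plain dict over XCom rather than a live connection object, since connection objects aren't serialisable:
```
# Sketch of the provide_context / **kwargs pattern described above.
# Names and the pushed payload are placeholders.
from airflow.operators.python_operator import PythonOperator

def data_warehouse_login(kwargs):
    ti = kwargs['task_instance']
    ti.xcom_push(key='db_settings', value={'host': 'host', 'dbname': 'name'})

def job_run(**kwargs):            # signature changed to accept the context
    data_warehouse_login(kwargs)  # pass the kwargs along

t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=job_run,
    provide_context=True,         # exposes the context as **kwargs
    dag=dag,
)
```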
B) Create one complete function
In this scenario I'd still remove the global (just call insert_data, call data_warehouse_login from within it and return the connection) and use just one task.
If an error occurs, throw an exception; Airflow will handle these just fine. Just make sure to put an appropriate message in the exception and use the most suitable exception type.
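A sketch of option B under those assumptions: one callable, no global, with connection failures surfacing as exceptions that fail the task (credential placeholders as in the question):
```
# Sketch of option B: a single task, no globals, exceptions left to Airflow.
import psycopg2
from airflow.operators.python_operator import PythonOperator

def db_login():
    # Raises psycopg2.OperationalError on failure, which fails the task cleanly.
    return psycopg2.connect(
        "dbname='name' user='user' password='pass' host='host' port='port' sslmode='require'")

def insert_data():
    db_con = db_login()
    try:
        with db_con, db_con.cursor() as cur:
            cur.execute("insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;")
    finally:
        db_con.close()

t1 = PythonOperator(
    task_id='insert_data',
    python_callable=insert_data,
    dag=dag,
)
```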
