Getting an error while using LivyBatchOperator in Airflow, DAG crashing - livy

Can someone help me with this? I'm using LivyBatchOperator in Airflow; my code is below.
Apart from that, what other ways are there to run a Spark job from Airflow besides the Spark operator? In my case Spark is installed on a different machine.
I'm getting this error in the Airflow UI: "No module named 'airflow_livy'".
```
from datetime import datetime, timedelta
from airflow_livy.batch import LivyBatchOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 8, 4),
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

dag_config: DAG = DAG(
    'Airflow7', description='Hello world example', schedule_interval='0 12 * * *',
    start_date=datetime(2020, 8, 4), catchup=False, default_args=default_args)

livy_Operator_SubmitTask = LivyBatchOperator(
    task_id='spark-submit_job_livy',
    class_name='Class name ',
    file='File path of my jar',
    arguments=['Test'],
    verify_in='spark',
    dag=dag_config,
)

livy_Operator_SubmitTask
```

Try importing from this namespace instead:
from airflow.providers.apache.livy.operators.livy import LivyOperator
Found in:
https://github.com/apache/airflow/blob/master/airflow/providers/apache/livy/example_dags/example_livy.py
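For reference, a minimal sketch of a DAG using the provider's LivyOperator (the jar path, class name and connection id below are placeholders, not values taken from the question):
```python
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

# Minimal sketch: submits a Spark jar through a Livy server configured
# under a (placeholder) Airflow connection id 'livy_default'.
with DAG(
    dag_id='livy_example',
    start_date=datetime(2020, 8, 4),
    schedule_interval='0 12 * * *',
    catchup=False,
) as dag:
    submit_job = LivyOperator(
        task_id='spark_submit_job_livy',
        livy_conn_id='livy_default',              # points at the Livy REST endpoint
        file='/path/to/your/app.jar',             # placeholder jar path
        class_name='com.example.YourMainClass',   # placeholder main class
        args=['Test'],
        polling_interval=30,                      # poll the batch state every 30 s
    )
```
Because LivyOperator only talks to the Livy REST API, Spark does not need to be installed on the Airflow machine, which also addresses the "Spark is on a different machine" part of the question.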

Related

TypeError: PostgresOperator.partial() got an unexpected keyword argument 'schedule_interval'

I am trying to use the partial() method of PostgresOperator, but I am getting this error, apparently because I unintentionally pass schedule_interval to it. I looked it up in the Airflow repo and there is no such parameter for the partial() method of BaseOperator, which I assume is the parent of all operator classes.
So I am confused: I have to pass this parameter to the DAG, yet there is no such parameter for .partial(), so how am I supposed to create this DAG and its tasks? I haven't found any information on how to pull it off.
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 9, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@daily'
}

with DAG(
    'name',
    default_args=default_args
) as dag:
    @task
    def generate_sql_queries(src_list: list) -> list:
        queries = []
        for i in src_list:
            query = f'SELECT sql_epic_function()'
            queries.append(query)
        return queries

    queries = generate_sql_queries([4, 8])

    task = PostgresOperator.partial(
        task_id='name',
        postgres_conn_id='postgres_default_id_connection'  # don't forget to change
    ).expand(sql=queries)

    task
The values in the default_args dict are passed to every operator, and you have 'schedule_interval': '@daily' in it. Also, schedule_interval is not an operator argument but a DAG argument. So, besides removing it from the default_args dict, you have to add it to the DAG definition, like:
with DAG(
    'name',
    schedule_interval="@daily",
    default_args=default_args
) as dag:
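Putting it together, a minimal sketch of the corrected DAG, reusing the task id and connection id from the question (the query text is just the question's own example):
```python
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

# schedule_interval is removed from default_args ...
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 9, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# ... and passed to the DAG object instead.
with DAG(
    'name',
    schedule_interval='@daily',
    default_args=default_args,
) as dag:

    @task
    def generate_sql_queries(src_list: list) -> list:
        # one query per source id; the query text is the question's example
        return ['SELECT sql_epic_function()' for _ in src_list]

    queries = generate_sql_queries([4, 8])

    PostgresOperator.partial(
        task_id='name',
        postgres_conn_id='postgres_default_id_connection',
    ).expand(sql=queries)
```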

Airflow DAG status is Success, but the tasks state the DAG has yet to run

I am using Airflow 2.3.4.
I am triggering the DAG with config. When I hardcode the config values, this DAG runs successfully.
But when I trigger it after passing the config, my tasks never start, yet the status turns green (Success).
Please help me understand what's going wrong!
from datetime import datetime, timedelta
from airflow import DAG
from pprint import pprint
from airflow.operators.python import PythonOperator
from operators.jvm import JVMOperator

args = {
    'owner': 'satyam',
    'depends_on_past': False,
    'start_date': datetime.utcnow(),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    'start_date': datetime.utcnow(),
    'end_date': datetime(2025, 2, 5),
    'default_args': args,
    'schedule_interval': timedelta(hours=4)
}

dag = DAG(**dag_params)

# [START howto_operator_python]
def print_context(ds, **kwargs):
    """Print the Airflow context and ds variable from the context."""
    pprint(kwargs)
    pprint(ds)
    return 'Whatever you return gets printed in the logs'

jvm_task = JVMOperator(
    task_id='jvm_task',
    correlation_id='123456',
    jar='/home/i1136/Synthea/synthea-with-dependencies.jar',
    options={
        'java_args': [''],
        'jar_args': ["-p {{ dag_run.conf['population_count'] }} --exporter.fhir.export {{ dag_run.conf['fhir'] }} --exporter.ccda.export {{ dag_run.conf['ccda'] }} --exporter.csv.export {{ dag_run.conf['csv'] }} --exporter.csv.append_mode {{ dag_run.conf['csv'] }} --exporter.baseDirectory /home/i1136/Synthea/output_dag_config"]
    })

print_context_task = PythonOperator(
    task_id='print_context_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

jvm_task.set_downstream(print_context_task)
The problem is 'start_date': datetime.utcnow(), which is always >= the dag_run start date, in which case Airflow marks the run as succeeded without running it. For this value it is better to choose the earliest date you want runs for; if you don't have one, you can use yesterday's date, but then the next day you will not be able to re-run tasks that failed the previous day:
import pendulum

dag_params = {
    ...,
    'start_date': pendulum.yesterday(),
    ...,
}
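Alternatively, a static, timezone-aware start_date avoids the issue entirely. A minimal sketch; the date itself is just an example, not taken from the question:
```python
from datetime import timedelta
import pendulum

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    # a fixed date in the past, evaluated once instead of at every DAG parse
    'start_date': pendulum.datetime(2022, 9, 1, tz="UTC"),
    'schedule_interval': timedelta(hours=4),
}
```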
In my case it was a small bug in the Python script, not detected by Airflow after refreshing.

Airflow sql_path not able to read the sql files when passed as Jinja Template Variable

I am trying to use a Jinja template variable instead of Variable.get('sql_path'), so as to avoid hitting the database on every scan of the DAG file.
Original code
import datetime
import os
from functools import partial
from datetime import timedelta
from airflow.models import DAG, Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email

SNOWFLAKE_CONN_ID = 'etl_conn'

tmpl_search_path = []
for subdir in ['business/', 'audit/', 'business/transform/']:
    tmpl_search_path.append(os.path.join(Variable.get('sql_path'), subdir))

def get_db_dag(
    *,
    dag_id,
    start_date,
    schedule_interval,
    max_taskrun,
    max_dagrun,
    proc_nm,
    load_sql
):
    default_args = {
        'owner': 'airflow',
        'start_date': start_date,
        'provide_context': True,
        'execution_timeout': timedelta(minutes=max_taskrun),
        'retries': 0,
        'retry_delay': timedelta(minutes=3),
        'retry_exponential_backoff': True,
        'email_on_retry': False,
    }

    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule_interval,
        dagrun_timeout=timedelta(hours=max_dagrun),
        template_searchpath=tmpl_search_path,
        default_args=default_args,
        max_active_runs=1,
        catchup='{{var.value.dag_catchup}}',
        on_failure_callback=alert_email_callback,
    )

    load_table = SnowflakeOperator(
        task_id='load_table',
        sql=load_sql,
        snowflake_conn_id=SNOWFLAKE_CONN_ID,
        autocommit=True,
        dag=dag,
    )

    return dag

# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
    dag_id='edw_table_A',
    start_date=datetime.datetime(2020, 5, 21),
    schedule_interval='0 5 * * *',
    max_taskrun=3,  # Minutes
    max_dagrun=1,   # Hours
    load_sql='recon/extract.sql',
)
When I replaced Variable.get('sql_path') with the Jinja template '{{var.value.sql_path}}' as below and ran the DAG, it threw the error shown further down; it was not able to find the file in the subdirectory of the SQL folder.
tmpl_search_path = []
for subdir in ['bus/', 'audit/', 'business/snflk/']:
    tmpl_search_path.append(os.path.join('{{var.value.sql_path}}', subdir))
The error:
jinja2.exceptions.TemplateNotFound: extract.sql
Templates are not rendered everywhere in a DAG script; usually they are rendered only in the templated parameters of operators. So, unless you pass the elements of tmpl_search_path to some templated parameter, {{var.value.sql_path}} will not be rendered.
The template_searchpath of DAG is not templated, which is why you cannot pass Jinja templates to it.
The options I can think of are:
1. Hardcode the value of Variable.get('sql_path') in the pipeline script.
2. Save the value of Variable.get('sql_path') in a configuration file and read it from there in the pipeline script.
3. Move the Variable.get() call out of the for-loop, which cuts the database requests by a factor of three (see the sketch below).
More info about templating in Airflow.
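For the third option, a minimal sketch, using the same variable name and subdirectories as in the question:
```python
import os
from airflow.models import Variable

# Fetch the variable once per DAG-file parse instead of once per subdirectory.
sql_path = Variable.get('sql_path')

tmpl_search_path = [
    os.path.join(sql_path, subdir)
    for subdir in ['business/', 'audit/', 'business/transform/']
]
```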

Querying hive using Airflow

I am trying to execute a query in Hive using the Airflow HiveOperator. My code is below:
import datetime as dt
from airflow.models import DAG
from airflow.operators.hive_operator import HiveOperator

default_args = {
    'owner': 'dime',
    'start_date': dt.datetime(2020, 3, 24),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

hql_query = """USE testdb;
CREATE TABLE airflow-test-table LIKE testtable;"""

load_hive = DAG(dag_id='load_hive', default_args=default_args, schedule_interval='0 * * * *')

hive_copy = HiveOperator(
    task_id="hive_copy",
    hql=hql_query,
    hive_cli_conn_id="hive_cli_default",
    dag=load_hive,
)

hive_copy
While executing it I am getting the error:
No such file or directory: 'hive': 'hive'
P.S. The Airflow installation is on a different machine from the one where Hive is installed.

Airflow, Connecting to MsSql error "Adaptive Server is unavailable or does not exist"

I'm getting this error when trying to call get_records() through Airflow:
pymssql.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (localhost:None)\n')
I used this guide for the setup:
https://tech.marksblogg.com/mssql-sql-server-linux-install-tutorial-and-guide.html
Using the Python REPL, I can connect and return a result:
import pymssql
import pandas as pd

with pymssql.connect(server="localhost",
                     user="SA",
                     password="password",
                     database="database_name") as conn:
    df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
    print(df)

with pymssql.connect(server="127.0.0.1",
                     user="SA",
                     password="password",
                     database="database_name") as conn:
    df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
    print(df)
I updated my Airflow connection with either of these setups and then ran a test:
airflow test run_test_db test_database 2015-06-01
The error is produced.
Any ideas, please? The whole setup is contained within one Linux (Vagrant) machine, with no virtual environments, so it is using the same version of pymssql to try to connect.
EDIT / UPDATE
What's really annoying is that if I use the same connection string in a DAG, there is no error and it runs perfectly fine.
So the connection string retrieved from the database must be different.
Is there a way to debug/print the connection string/properties? (See the sketch after the example DAG below.)
Example working DAG
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.mssql_hook import MsSqlHook
from datetime import datetime, timedelta
import pymssql
import pandas as pd

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2019, 2, 13),
    'email': ['example@email.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('run_test_db', default_args=default_args, schedule_interval="0 01 * * 2-6")

def test_db(**context):
    with pymssql.connect(server="localhost",
                         user="SA",
                         password="Password123",
                         database="database") as conn:
        df = pd.read_sql("SELECT TOP 1 * FROM champ_dw_dim_currency", conn)
        print(df)

test_database = PythonOperator(
    task_id='test_database',
    python_callable=test_db,
    execution_timeout=timedelta(minutes=3),
    dag=dag,
    provide_context=True,
    op_kwargs={
        'extra_detail': 'nothing'
    })
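Regarding printing the connection properties: a minimal sketch, assuming the connection id configured in the Airflow UI is 'mssql_test' (a placeholder, not taken from the question). Fetching the connection through BaseHook shows exactly what MsSqlHook would receive; the 'localhost:None' in the error message suggests the port field of the stored connection may be empty.
```python
from airflow.hooks.base_hook import BaseHook

# 'mssql_test' is a hypothetical conn_id -- replace with the one configured in the UI.
conn = BaseHook.get_connection('mssql_test')

# Print the fields Airflow hands to the hook.
print(conn.host, conn.port, conn.schema, conn.login)
print(conn.get_uri())
```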
