Not able to use JdbcOperator in Airflow

I am trying to connect to a Hive table using JdbcOperator. My code is below:
import datetime as dt
from datetime import timedelta

import airflow
from airflow.models import DAG
from airflow.operators.jdbc_operator.JdbcOperator import JdbcOperator

args = {
    'owner': 'Airflow',
    'start_date': dt.datetime(2020, 3, 24),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

dag_hive = DAG(dag_id="import_hive", default_args=args, schedule_interval=" 0 * * * *",
               dagrun_timeout=timedelta(minutes=60))

hql_query = """USE testdb;
CREATE TABLE airflow-test-table LIKE testtable;"""

hive_task = JdbcOperator(sql=hql_query, task_id="hive_script_task",
                         jdbc_conn_id="hive_conn_default", dag=dag_hive)

hive_task
I am getting this error:
ModuleNotFoundError: No module named
'airflow.operators.jdbc_operator.JdbcOperator';
'airflow.operators.jdbc_operator' is not a package
I have cross-checked the package in the site-packages folder; it is available. I am not able to figure out why I am getting this error.

Install the dependencies for the JDBC operator by running the following command:
pip install 'apache-airflow[jdbc]'
and then import JdbcOperator in your DAG file as @mk_sta mentioned, as follows:
from airflow.operators.jdbc_operator import JdbcOperator

The correct way to import the JdbcOperator module is the following:
from airflow.operators.jdbc_operator import JdbcOperator
Keep in mind that JdbcOperator also requires the jaydebeapi Python package, which needs to be installed in the current Airflow environment.
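For reference, here is a minimal sketch of the question's DAG with the corrected import. The connection id, schedule, and default arguments are taken from the question; the SQL is a placeholder (the hyphenated table name in the original HQL would likely need backtick quoting in HiveQL, which is a separate issue).
```python
import datetime as dt
from datetime import timedelta

from airflow.models import DAG
# The module is airflow.operators.jdbc_operator; JdbcOperator is a class
# inside it, not a sub-module, so this is the import that works.
from airflow.operators.jdbc_operator import JdbcOperator

args = {
    'owner': 'Airflow',
    'start_date': dt.datetime(2020, 3, 24),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

dag_hive = DAG(
    dag_id="import_hive",
    default_args=args,
    schedule_interval="0 * * * *",
    dagrun_timeout=timedelta(minutes=60),
)

hive_task = JdbcOperator(
    task_id="hive_script_task",
    sql="SELECT 1",  # placeholder; the question's HQL statements go here
    jdbc_conn_id="hive_conn_default",
    dag=dag_hive,
)
```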

Related

How to integrate Great Expectations into an Airflow project

I'm trying to integrate Great Expectations into an Airflow project, but without success.
My question: is there some configuration I still need to do?
Here are the steps I followed:
1- I generated the Great Expectations project by following this tutorial: https://docs.greatexpectations.io/docs/tutorials/getting_started/tutorial_setup
2- I copied the great_expectations folder into /include
The Airflow project looks like this:
3- I created a DAG:
import os
import pathlib
from pathlib import Path
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

base_path = Path(__file__).parents[1]
ge_root_dir = os.path.join(base_path, "include", "great_expectations")
data_file = os.path.join(base_path, "include", "data/yellow_tripdata_sample_2019-01.csv")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('example_great_expectations_dag',
          schedule_interval='@once',
          default_args=default_args)

with dag:
    ge_task = GreatExpectationsOperator(
        task_id="ge_task",
        data_context_root_dir=ge_root_dir,
        checkpoint_name="getting_started_checkpoint")

    ge_task
Error:
[2022-04-17, 02:52:54 EDT] {great_expectations.py:122} INFO - Running validation with Great Expectations...
[2022-04-17, 02:52:54 EDT] {great_expectations.py:125} INFO - Ensuring data context is valid...
[2022-04-17, 02:52:54 EDT] {util.py:153} CRITICAL - Error The module: `great_expectations.data_context.store` does not contain the class: `ProfilerStore`.
- Please verify that the class named `ProfilerStore` exists. occurred while attempting to instantiate a store.
[2022-04-17, 02:52:54 EDT] {taskinstance.py:1718} ERROR - Task failed with exception
This might be a package dependency problem. Please check the following:
Notes on compatibility
=> This operator currently works with the Great Expectations V3 Batch Request API only. If you would like to use the operator in conjunction with the V2 Batch Kwargs API, you must use a version below 0.1.0.
=> Make sure that you use the same package versions in both environments; a quick version check is sketched below.
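A quick way to confirm which great_expectations release the Airflow environment actually sees is to try the exact import the traceback complains about. This is only a diagnostic sketch; ProfilerStore is the class named in the error, and its absence usually means the installed great_expectations library is older than the one that generated the project configuration.
```python
# Run inside the Airflow environment (e.g. the scheduler/worker image).
import great_expectations as ge

print(ge.__version__)  # compare with the version used to create the project

# The traceback says this class cannot be found; if this import fails here,
# the great_expectations package in Airflow is too old for the checkpoint.
from great_expectations.data_context.store import ProfilerStore  # noqa: F401
```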
I had the same problem

Getting error while using LivyBatchOperator in Airflow, DAG getting crashed

Can someone help me with this? I am using LivyBatchOperator in Airflow; below is my code.
Apart from that, what other ways are there to run a Spark job in Airflow besides the Spark operator? Spark is installed on a different machine in my case.
I'm getting this error in the Airflow UI: "No module named 'airflow_livy'".
```
from datetime import datetime, timedelta

from airflow_livy.batch import LivyBatchOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import DAG

default_args = {
    'owner': 'airflow',
    'start-date': datetime(2020, 8, 4),
    'retires': 0,
    'catchup': False,
    'retry-delay': timedelta(minutes=5),
}

dag_config: DAG = DAG(
    'Airflow7', description='Hello world example', schedule_interval='0 12 * * *',
    start_date=datetime(2020, 8, 4), catchup=False)

livy_Operator_SubmitTask = LivyBatchOperator(
    task_id='spark-submit_job_livy',
    class_name='Class name ',
    file='File path of my jar',
    arguments=['Test'],
    verify_in='spark',
    dag=dag_config
)

livy_Operator_SubmitTask
```
Try importing this namespace instead:
from airflow.providers.apache.livy.operators.livy import LivyOperator
Found in the official example DAG:
https://github.com/apache/airflow/blob/master/airflow/providers/apache/livy/example_dags/example_livy.py
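On Airflow 2.x this operator lives in the Apache Livy provider package (apache-airflow-providers-apache-livy), which needs to be installed. Below is a rough sketch of the task above rewritten with LivyOperator; the jar path, class name, and the livy_default connection are placeholders and assumptions, not values from the question.
```python
from datetime import datetime

from airflow.models import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

dag = DAG(
    'Airflow7',
    description='Hello world example',
    schedule_interval='0 12 * * *',
    start_date=datetime(2020, 8, 4),
    catchup=False,
)

# Submits the jar as a Livy batch over HTTP, so Spark can live on a
# different machine: only the Livy endpoint has to be reachable.
livy_submit_task = LivyOperator(
    task_id='spark_submit_job_livy',
    file='/path/to/my.jar',          # placeholder jar path
    class_name='com.example.MyJob',  # placeholder main class
    args=['Test'],
    livy_conn_id='livy_default',     # Airflow connection pointing at the Livy server
    polling_interval=30,             # poll the batch status every 30 seconds
    dag=dag,
)
```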

Querying Hive using Airflow

I am trying to execute a query in Hive using the Airflow HiveOperator. My code is below:
import datetime as dt

from airflow.models import DAG
from airflow.operators.hive_operator import HiveOperator

default_args = {
    'owner': 'dime',
    'start_date': dt.datetime(2020, 3, 24),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

hql_query = """USE testdb;
CREATE TABLE airflow-test-table LIKE testtable;"""

load_hive = DAG(dag_id='load_hive', default_args=default_args, schedule_interval='0 * * * *')

hive_copy = HiveOperator(task_id="hive_copy", hql=hql_query,
                         hive_cli_conn_id="hive_cli_default", dag=load_hive)

hive_copy
While executing, I am getting this error:
No such file or directory: 'hive': 'hive'
P.S. The Airflow installation is on a different machine than the one where Hive is installed.
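The traceback indicates that the hive command-line binary cannot be found on the machine executing the task, which matches the P.S.: with hive_cli_conn_id, HiveOperator shells out to the local hive CLI. A small diagnostic sketch (not part of the question's code) to confirm this on the Airflow worker:
```python
import shutil

# HiveOperator with hive_cli_conn_id launches the `hive` executable locally,
# so it must be installed on the machine that runs the Airflow task.
if shutil.which("hive") is None:
    print("The hive CLI is not on PATH here; the Hive CLI hook cannot run on this machine.")
```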

Cannot Create Extra Operator Link on DatabricksRunNowOperator in Airflow

I'm currently trying to build an extra link on the DatabricksRunNowOperator in Airflow so I can quickly access the Databricks run without having to rummage through the logs. As a starting point, I'm simply trying to add a link to Google in the task instance menu. I've followed the procedure shown in this tutorial, creating the following code and placing it within my Airflow home plugins folder:
from airflow.plugins_manager import AirflowPlugin
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator


class DBLogLink(BaseOperatorLink):
    name = 'run_link'
    operators = [DatabricksRunNowOperator]

    def get_link(self, operator, dttm):
        return "https://www.google.com"


class AirflowExtraLinkPlugin(AirflowPlugin):
    name = "extra_link_plugin"
    operator_extra_links = [DBLogLink(), ]
However, the extra link does not show up, even after restarting the webserver.
Here's the code I'm using to create the DAG:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator

DATABRICKS_CONN_ID = '____'

args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 2, 13),
    'retries': 0
}

dag = DAG(
    dag_id='testing_notebook',
    default_args=args,
    schedule_interval=timedelta(days=1)
)

DatabricksRunNowOperator(
    task_id='mail_reader',
    dag=dag,
    databricks_conn_id=DATABRICKS_CONN_ID,
    polling_period_seconds=1,
    job_id=____,
    notebook_params={____}
)
I feel like I'm missing something really basic, but I just can't figure it out.
Additional info
Airflow version 1.10.9
Running on Ubuntu 18.04.3
I've worked it out. You need to have your webserver running with the RBAC UI. This means setting up Airflow with authentication and adding users. RBAC can be turned on by setting rbac = True in your airflow.cfg file, as sketched below.
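For Airflow 1.10.x this is a webserver setting; a minimal sketch of the relevant airflow.cfg section (restart the webserver after changing it):
```
[webserver]
rbac = True
```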

Getting "*** Task instance did not exist in the DB" error when running gcs_to_bq in Composer

While executing the following Python script using Cloud Composer, I get "*** Task instance did not exist in the DB" under the gcs2bq task log in Airflow.
Code:
import datetime
import os
import csv

import pandas as pd
import pip

from airflow import models
# from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator

print('''/-------/--------/------/
-------/--------/------/''')

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    # models.Variable.get('gcp_project')
}

try:
    # [START composer_quickstart_schedule]
    with models.DAG(
            'composer_agg_quickstart',
            # Continue to run DAG once per day
            schedule_interval=datetime.timedelta(days=1),
            default_args=default_dag_args) as dag:
        # [END composer_quickstart_schedule]

        op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
        # op_readwrite = PythonOperator(task_id='ReadAggWriteFile', python_callable=read_data)

        op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
            task_id='gcs2bq',
            bucket='dr-mockup-data',
            source_objects=['sample.csv'],
            destination_project_dataset_table='data-rubrics.sample_bqtable',
            schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                           {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
            write_disposition='WRITE_TRUNCATE',
            dag=dag)

        # op_write = PythonOperator(task_id='AggregateAndWriteFile', python_callable=write_data)

        op_start >> op_load
UPDATE: Can you remove dag=dag from the gcs2bq task, since you are already using with models.DAG, and run your DAG again?
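In other words, inside a with models.DAG(...) as dag: block every operator instantiated in the block is attached to that DAG automatically, so an explicit dag=dag is redundant. A stripped-down sketch using the task from the question (the static start_date and trimmed default_args are simplifications for illustration):
```python
import datetime

from airflow import models
from airflow.contrib.operators import gcs_to_bq

default_dag_args = {
    'start_date': datetime.datetime(2019, 1, 1),  # placeholder static start date
    'retries': 1,
}

with models.DAG(
        'composer_agg_quickstart',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    # No dag=dag argument: the DAG context manager assigns this task to `dag`.
    op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
        task_id='gcs2bq',
        bucket='dr-mockup-data',
        source_objects=['sample.csv'],
        destination_project_dataset_table='data-rubrics.sample_bqtable',
        schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                       {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
        write_disposition='WRITE_TRUNCATE')
```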
It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now, as now() moves along.
Make your start_date static or use Airflow utils/macros:
import airflow

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
Okay, this was a silly question on my part, and apologies to everyone who spent time here. I had another DAG running, due to which the one I was kicking off was always stuck in the queue. Also, I did not write the correct value in destination_project_dataset_table. Thanks, and apologies to all who spent time on this.

Resources