How to integrate Great Expectations into an Airflow project - airflow

I'm trying to integrate Great Expectations into an Airflow project, but without success.
My question: is there some configuration I need to do?
Here are the steps I followed:
1- I generated the Great Expectations project by following this tutorial: https://docs.greatexpectations.io/docs/tutorials/getting_started/tutorial_setup
2- I copied the great_expectations folder into /include
The Airflow project looks like:
3- I created a DAG:
import os
import pathlib
from pathlib import Path
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

base_path = Path(__file__).parents[1]
ge_root_dir = os.path.join(base_path, "include", "great_expectations")
data_file = os.path.join(base_path, "include", "data/yellow_tripdata_sample_2019-01.csv")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('example_great_expectations_dag',
          schedule_interval='@once',
          default_args=default_args)

with dag:
    ge_task = GreatExpectationsOperator(
        task_id="ge_task",
        data_context_root_dir=ge_root_dir,
        checkpoint_name="getting_started_checkpoint")

    ge_task
Error:
[2022-04-17, 02:52:54 EDT] {great_expectations.py:122} INFO - Running validation with Great Expectations...
[2022-04-17, 02:52:54 EDT] {great_expectations.py:125} INFO - Ensuring data context is valid...
[2022-04-17, 02:52:54 EDT] {util.py:153} CRITICAL - Error The module: `great_expectations.data_context.store` does not contain the class: `ProfilerStore`.
- Please verify that the class named `ProfilerStore` exists. occurred while attempting to instantiate a store.
[2022-04-17, 02:52:54 EDT] {taskinstance.py:1718} ERROR - Task failed with exception

This might be a package dependency problem. Please check the notes on compatibility:
=> This operator currently works with the Great Expectations V3 Batch Request API only. If you would like to use the operator in conjunction with the V2 Batch Kwargs API, you must use a version below 0.1.0.
=> Make sure that you use the same package versions in both environments (the one where you generated the Great Expectations project and the one Airflow runs in).
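The ProfilerStore error usually means the great_expectations package installed in the Airflow environment is older than the one that generated the project. A minimal sanity check you could run with the worker's Python interpreter (a sketch; it only assumes the two packages are installed there):

import great_expectations as gx
# ProfilerStore only exists in newer Great Expectations releases; on an old
# version this import fails, matching the error from the operator's log.
from great_expectations.data_context.store import ProfilerStore

print("great_expectations version:", gx.__version__)

If the version printed here differs from the one you used for the tutorial setup, pin them to the same release in both environments.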
I had the same problem

Related

MWAA not finding aws_default connection

I just set up AWS MWAA (Managed Workflows for Apache Airflow) and I'm playing around with running a simple bash script in a DAG. I was reading the logs for the task and noticed that, by default, the task looks for the aws_default connection and tries to use it, but doesn't find it.
I went to the connections pane and set up the aws_default connection, but it is still showing the same message in the logs.
Airflow Connection: aws_conn_id=aws_default
No credentials retrieved from Connection
*** Reading remote log from Cloudwatch log_group: airflow-mwaa-Task log_stream: dms-postgres-dialog-label-pg/start-replication-task/2021-11-22T13_00_00+00_00/1.log.
[2021-11-23 13:01:02,487] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,486] {{base_aws.py:368}} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-11-23 13:01:02,657] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,656] {{base_aws.py:179}} INFO - No credentials retrieved from Connection
[2021-11-23 13:01:02,678] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,678] {{base_aws.py:87}} INFO - Creating session with aws_access_key_id=None region_name=us-east-1
[2021-11-23 13:01:02,772] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,772] {{base_aws.py:157}} INFO - role_arn is None
How can I get MWAA to recognize this connection?
My dag:
from datetime import datetime, timedelta, tzinfo
import pendulum
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
local_tz = pendulum.timezone("America/New_York")
start_date = datetime(2021, 11, 9, 8, tzinfo=local_tz)
# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
# 'wait_for_downstream': False,
# 'dag': dag,
# 'sla': timedelta(hours=2),
# 'execution_timeout': timedelta(seconds=300),
# 'on_failure_callback': some_function,
# 'on_success_callback': some_other_function,
# 'on_retry_callback': another_function,
# 'sla_miss_callback': yet_another_function,
# 'trigger_rule': 'all_success'
}
with DAG(
'dms-postgres-dialog-label-pg-test',
default_args=default_args,
description='',
schedule_interval=timedelta(days=1),
start_date=start_date,
tags=['example'],
) as dag:
t1 = BashOperator(
task_id='start-replication-task',
bash_command="""
aws dms start-replication-task --replication-task-arn arn:aws:dms:us-east-1:blah --start-replication-task-type reload-target
""",
)
t1
Edit:
For now, I'm just importing a built-in function and using that to get the credentials. Example:
from airflow.hooks.base import BaseHook
conn = BaseHook.get_connection('aws_service_account')
...
print(conn.host)
print(conn.login)
print(conn.password)
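If you need an actual boto3 session from that connection (for example to call AWS APIs directly from a PythonOperator), here is a minimal sketch, assuming the access key id and secret are stored in the connection's login and password fields and that any region is in the connection's Extra JSON:

import boto3
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection('aws_service_account')
session = boto3.session.Session(
    aws_access_key_id=conn.login,
    aws_secret_access_key=conn.password,
    # region_name in the Extra field is an assumption; adjust to your setup
    region_name=conn.extra_dejson.get('region_name', 'us-east-1'),
)
print(session.client('sts').get_caller_identity())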
Updating this as I just got off a call with AWS support.
The execution role MWAA creates is used instead of an access key id and secret in aws_default. To use a custom access key id and secret, do as @Jonathan Porter recommends in the answer to his own question:
from airflow.hooks.base import BaseHook
conn = BaseHook.get_connection('aws_service_account')
...
print(conn.host)
print(conn.login)
print(conn.password)
However, if you want to use the execution role that MWAA provides, that is already the default behavior within MWAA. Confusingly, the INFO messages state that no credentials were retrieved from the connection, yet the execution role is still used, for example by something akin to the KubernetesPodOperator.
[2021-11-23 13:01:02,487] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,486] {{base_aws.py:368}} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-11-23 13:01:02,657] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,656] {{base_aws.py:179}} INFO - No credentials retrieved from Connection
[2021-11-23 13:01:02,678] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,678] {{base_aws.py:87}} INFO - Creating session with aws_access_key_id=None region_name=us-east-1
[2021-11-23 13:01:02,772] {{logging_mixin.py:104}} INFO - [2021-11-23 13:01:02,772] {{base_aws.py:157}} INFO - role_arn is None
For example, the following automatically uses the .aws/credentials set up by the execution role in the MWAA environment:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

default_args = {
    'owner': 'aws',
    'depends_on_past': False,
    'start_date': datetime(2019, 2, 20),
    'provide_context': True
}

dag = DAG(
    'kubernetes_pod_example', default_args=default_args, schedule_interval=None
)

# use a kube_config stored in the s3 dags folder for now
kube_config_path = '/usr/local/airflow/dags/kube_config.yaml'

podRun = KubernetesPodOperator(
    namespace="mwaa",
    image="ubuntu:18.04",
    cmds=["bash"],
    arguments=["-c", "ls"],
    labels={"foo": "bar"},
    name="mwaa-pod-test",
    task_id="pod-task",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=False,
    config_file=kube_config_path,
    in_cluster=False,
    cluster_context='aws',
    execution_timeout=timedelta(seconds=60)
)
Hope this helps for anyone else stumbling around.

Not able to use JdbcOperator Airflow

I am trying to connect to a Hive table using JdbcOperator. My code is below:
import datetime as dt
from datetime import timedelta

import airflow
from airflow.models import DAG
from airflow.operators.jdbc_operator.JdbcOperator import JdbcOperator

args = {
    'owner': 'Airflow',
    'start_date': dt.datetime(2020, 3, 24),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

dag_hive = DAG(dag_id="import_hive", default_args=args, schedule_interval=" 0 * * * *", dagrun_timeout=timedelta(minutes=60))

hql_query = """USE testdb;
CREATE TABLE airflow-test-table LIKE testtable;"""

hive_task = JdbcOperator(sql=hql_query, task_id="hive_script_task", jdbc_conn_id="hive_conn_default", dag=dag_hive)

hive_task
I am getting this error:
ModuleNotFoundError: No module named
'airflow.operators.jdbc_operator.JdbcOperator';
'airflow.operators.jdbc_operator' is not a package
I have cross-checked the package in the site-packages folder; it's available. I am not able to figure out why I am getting this error.
Install the dependencies for using the JDBC operator by running the following command:
pip install 'apache-airflow[jdbc]'
and then import JdbcOperator in your DAG file like @mk_sta mentioned, as follows:
from airflow.operators.jdbc_operator import JdbcOperator
The correct way to import the JdbcOperator module is the following:
from airflow.operators.jdbc_operator import JdbcOperator
Keep in mind that JdbcOperator also requires the jaydebeapi Python package, which needs to be installed in the current Airflow environment.
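A quick way to confirm both pieces are present is to run this with the same Python interpreter that Airflow uses (a minimal sketch; it only checks that the imports resolve):

# Run with the interpreter behind your Airflow installation.
import jaydebeapi  # required by the JDBC hook/operator
from airflow.operators.jdbc_operator import JdbcOperator

print("jaydebeapi and JdbcOperator import OK")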

Cannot Create Extra Operator Link on DatabricksRunNowOperator in Airflow

I'm currently trying to build an extra link on the DatabricksRunNowOperator in Airflow so I can quickly access the Databricks run without having to rummage through the logs. As a starting point I'm simply trying to add a link to Google in the task instance menu. I've followed the procedure shown in this tutorial, creating the following code placed in my Airflow home plugins folder:
from airflow.plugins_manager import AirflowPlugin
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator


class DBLogLink(BaseOperatorLink):
    name = 'run_link'
    operators = [DatabricksRunNowOperator]

    def get_link(self, operator, dttm):
        return "https://www.google.com"


class AirflowExtraLinkPlugin(AirflowPlugin):
    name = "extra_link_plugin"
    operator_extra_links = [DBLogLink(), ]
However, the extra link does not show up, even after restarting the webserver, etc.:
Here's the code I'm using to create the DAG:
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
from datetime import datetime, timedelta

DATABRICKS_CONN_ID = '____'

args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 2, 13),
    'retries': 0
}

dag = DAG(
    dag_id='testing_notebook',
    default_args=args,
    schedule_interval=timedelta(days=1)
)

DatabricksRunNowOperator(
    task_id='mail_reader',
    dag=dag,
    databricks_conn_id=DATABRICKS_CONN_ID,
    polling_period_seconds=1,
    job_id=____,
    notebook_params={____}
)
I feel like I'm missing something really basic, but I just can't figure it out.
Additional info
Airflow version 1.10.9
Running on Ubuntu 18.04.3
I've worked it out. You need to have your webserver running with the RBAC UI. This means setting up Airflow with authentication and adding users. RBAC can be turned on by setting rbac = True in your airflow.cfg file.
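Once the link shows up, you can point it at the actual Databricks run page instead of Google. A minimal sketch, assuming the Databricks operator pushes the run page URL to XCom under the key 'run_page_url' (which the contrib Databricks operators do when XCom pushing is enabled); the class name DBRunLink is illustrative:

from airflow.models import TaskInstance
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator


class DBRunLink(BaseOperatorLink):
    name = 'databricks_run'
    operators = [DatabricksRunNowOperator]

    def get_link(self, operator, dttm):
        # Pull the run page URL that the operator stored in XCom for this execution date.
        ti = TaskInstance(task=operator, execution_date=dttm)
        return ti.xcom_pull(task_ids=operator.task_id, key='run_page_url') or ""

Register it in operator_extra_links the same way as DBLogLink above.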

Getting "*** Task instance did not exist in the DB" as error when running gcs_to_bq in Composer

While executing the following Python script using Cloud Composer, I get "*** Task instance did not exist in the DB" under the gcs2bq task's log in Airflow.
Code:
import datetime
import os
import csv
import pandas as pd
import pip

from airflow import models
# from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator

print('''/-------/--------/------/
-------/--------/------/''')

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    # models.Variable.get('gcp_project')
}

try:
    # [START composer_quickstart_schedule]
    with models.DAG(
            'composer_agg_quickstart',
            # Continue to run DAG once per day
            schedule_interval=datetime.timedelta(days=1),
            default_args=default_dag_args) as dag:
        # [END composer_quickstart_schedule]

        op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
        # op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)

        op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
            task_id='gcs2bq',
            bucket='dr-mockup-data',
            source_objects=['sample.csv'],
            destination_project_dataset_table='data-rubrics.sample_bqtable',
            schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                           {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
            write_disposition='WRITE_TRUNCATE',
            dag=dag)
        # op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)

        op_start >> op_load
UPDATE:
Can you remove dag=dag from the gcs2bq task, since you are already using with models.DAG, and run your DAG again?
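In other words, inside the with models.DAG(...) as dag: block the operator is attached to the DAG automatically, so the explicit dag=dag argument is redundant. A sketch of just that task, reusing the same values as above:

from airflow.contrib.operators import gcs_to_bq

# Inside the `with models.DAG(...) as dag:` block; no dag=dag needed.
op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs2bq',
    bucket='dr-mockup-data',
    source_objects=['sample.csv'],
    destination_project_dataset_table='data-rubrics.sample_bqtable',
    schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                   {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
    write_disposition='WRITE_TRUNCATE')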
It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now, as now() moves along.
Make your start_date static or use Airflow utils/macros:
import airflow

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
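Or use a truly fixed date (a sketch; the specific date is illustrative):

import datetime

default_dag_args = {
    # A fixed date is evaluated the same way on every scheduler parse.
    'start_date': datetime.datetime(2018, 12, 1),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
}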
Okay, this was a stupid question on my part, and apologies to everyone who wasted time here. I had another DAG running, due to which the one I was kicking off was always in the queue. Also, I did not write the correct value in destination_project_dataset_table. Thanks and apologies to all who spent time on this.

How to perform S3 to BigQuery using Airflow?

Currently there is no S3ToBigQuery operator.
My choices are:
Use the S3ToGoogleCloudStorageOperator and then use the GoogleCloudStorageToBigQueryOperator.
This is not something I'm eager to do: it means paying double for storage. Even if I remove the file from one of the storage systems, that still involves payment.
Download the file from S3 to the local file system and load it into BigQuery from the file system - however, there is no S3DownloadOperator. That means writing the whole process from scratch without Airflow involvement, which misses the point of using Airflow.
Is there another option? What would you suggest doing?
This is what I ended up with.
This should be converted to an S3ToLocalFile operator (see the sketch after the function below).
import logging

from airflow.hooks.S3_hook import S3Hook
from airflow.models import Variable


def download_from_s3(**kwargs):
    hook = S3Hook(aws_conn_id='project-s3')
    result = hook.read_key(bucket_name='stage-project-metrics',
                           key='{}.csv'.format(kwargs['ds']))
    if not result:
        logging.info('no data found')
    else:
        outfile = '{}project{}.csv'.format(Variable.get("data_directory"), kwargs['ds'])
        f = open(outfile, 'w+')
        f.write(result)
        f.close()
    return result
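A minimal sketch of what that custom operator could look like; the class name S3ToLocalFileOperator and its constructor arguments are illustrative, not an existing Airflow class:

import logging

from airflow.hooks.S3_hook import S3Hook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class S3ToLocalFileOperator(BaseOperator):
    """Downloads a single S3 key to a local file (hypothetical helper, not a built-in)."""

    template_fields = ('key', 'outfile')

    @apply_defaults
    def __init__(self, bucket_name, key, outfile, aws_conn_id='project-s3', *args, **kwargs):
        super(S3ToLocalFileOperator, self).__init__(*args, **kwargs)
        self.bucket_name = bucket_name
        self.key = key
        self.outfile = outfile
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        hook = S3Hook(aws_conn_id=self.aws_conn_id)
        result = hook.read_key(bucket_name=self.bucket_name, key=self.key)
        if not result:
            logging.info('no data found')
            return None
        # Write the downloaded contents to the local file and return its path.
        with open(self.outfile, 'w+') as f:
            f.write(result)
        return self.outfile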
If the first option is cost-prohibitive, you could just use the S3Hook to download the file through a PythonOperator:
from datetime import timedelta, datetime

from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0
}


def download_from_s3(**kwargs):
    hook = S3Hook(aws_conn_id='s3_conn')
    hook.read_key(bucket_name='workflows-dev',
                  key='test_data.csv')


dag = DAG('s3_download',
          schedule_interval='@daily',
          default_args=default_args,
          catchup=False)

with dag:
    download_data = PythonOperator(
        task_id='download_data',
        python_callable=download_from_s3,
        provide_context=True
    )
What you can do instead is use the S3ToGoogleCloudStorageOperator and then use the GoogleCloudStorageToBigQueryOperator with the external_table flag, i.e. pass external_table=True.
This will create an external table that points to the GCS location and doesn't store your data in BigQuery, but you can still query it.
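A minimal sketch of that approach with the Airflow 1.10 contrib operators; the bucket names, connection ids, table name, and schema are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.s3_to_gcs_operator import S3ToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG('s3_to_bq_external',
         start_date=datetime(2018, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:

    # Copy the objects from S3 into GCS.
    s3_to_gcs = S3ToGoogleCloudStorageOperator(
        task_id='s3_to_gcs',
        bucket='my-s3-bucket',                    # placeholder S3 bucket
        prefix='metrics/',                        # placeholder key prefix
        aws_conn_id='s3_conn',
        dest_gcs_conn_id='google_cloud_default',
        dest_gcs='gs://my-gcs-bucket/metrics/',   # placeholder GCS destination
        replace=True)

    # Expose the GCS files to BigQuery as an external table (no load job, data stays in GCS).
    gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id='gcs_to_bq_external',
        bucket='my-gcs-bucket',
        source_objects=['metrics/*.csv'],
        destination_project_dataset_table='my_project.my_dataset.my_table',
        schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'}],  # placeholder schema
        source_format='CSV',
        external_table=True,
        bigquery_conn_id='google_cloud_default')

    s3_to_gcs >> gcs_to_bq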
