I'm using the MySqlHook to show results of my SQL query in Airflow logs. All is working except that I want to limit the results to a particular schema. Even though I'm passing in that schema, MySqlHook doesn't seem to be recognizing it.
Here's my function that uses the hook:
def func(mysql_conn_id, sql, schema):
    """Print results of sql query."""
    print("schema:", schema)
    hook = MySqlHook(mysql_conn_id=mysql_conn_id, schema=schema)
    df = hook.get_pandas_df(sql=sql)
    print("\n" + df.to_string())
The schema that I pass in does show up in the logs from my first print statement.
But when I look at the output of the second print statement (the df.to_string()), the connection that Airflow reports has no schema:
INFO - Using connection to: id: dbjobs_mysql. Host: kb-qa.local, Port: None, Schema: , Login: my_user, Password: ***, extra: {}
And the query is running for more than just the given schema.
Looking at the source code, it seems like what I did is what it expects:
class MySqlHook(DbApiHook):
    conn_name_attr = 'mysql_conn_id'
    default_conn_name = 'mysql_default'
    supports_autocommit = True

    def __init__(self, *args, **kwargs):
        super(MySqlHook, self).__init__(*args, **kwargs)
        self.schema = kwargs.pop("schema", None)
TLDR
In the Python callable for a SimpleHttpOperator response function, I am trying to push an XCom that combines information from two sources (a hash of the filename/path and an object lookup from a DB) to a specified key.
Longer Tale
I have a FileSensor written which grabs all new files and passes them to MultiDagRun to parallel-process the (scientific) information in the files as XCom. Works great. The SimpleHttpOperator POSTs filepath info to a submission API and receives back a task_id, which it must then use to read the result from another (slow-running) API. I have all of this working fine: files get scanned, multiple DAGs are launched to process them, and objects are returned.
But... I cannot puzzle out how to push the result to an XCom inside the Python response function for the SimpleHttpOperator.
My Google-, SO-, and Reddit-fu has failed me here (and it seems overkill to use the PythonOperator, though that's my next stop). I notice a lot of people asking similar questions, though.
How do you use context or ti or task_instance or context['task_instance'] with the response function? (I cannot use the "return value" XCom because I need to distinguish the XCom keys for parallel processing, afaik.) I have context set to True in the default_args.
I'm sure I am missing something simple here, but I'm stumped as to what it is (note: I did try **kwargs and ti = kwargs['ti'] below as well before hitting SO)...
def _handler_object_result(response, file):
    # Note: related to api I am calling not Airflow internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    # This is the part that is not working as I am unsure how
    # to access the task instance to do the xcom_push
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
and the Operator:
object_result = SimpleHttpOperator(
    task_id="object_result",
    method='POST',
    data=json.dumps({"file": "{{ dag_run.conf['file'] }}", "keyword": "object"}),
    http_conn_id="coma_api",
    endpoint="/api/v1/file/describe",
    headers={"Content-Type": "application/json"},
    extra_options={"verify": False},
    response_check=lambda response: _handler_object_result(response, "{{ dag_run.conf['file'] }}"),
    do_xcom_push=False,
    dag=dag,
)
I was really expecting the task_instance object to be available in some fashion, either by default or via configuration, but each variation that has worked elsewhere (FileSensor, PythonOperator, etc.) hasn't worked here, and I've been unable to google the magic words to make it accessible.
You can try using the get_current_context() function in your response_check function:
from airflow.operators.python import get_current_context

def _handler_object_result(response, file):
    # Note: related to api I am calling not Airflow internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    ti = get_current_context()["ti"]  # <- Try this
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
That function is a nice way of accessing the task's execution context when the context isn't explicitly handy, or when you don't want to pass context attributes around just to reach it deep in your logic stack.
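For illustration, a minimal sketch (the helper name is hypothetical): get_current_context() can be called from any function that runs while a task is executing, so nothing needs to be threaded through your call stack.

from airflow.operators.python import get_current_context

def _push_result(key, value):
    # Works from any helper invoked during task execution (Airflow 2+)
    ti = get_current_context()["ti"]
    ti.xcom_push(key=key, value=value)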
While writing async test code for FastAPI, I ran into a problem I cannot solve. This code is for the test DB. I'm using Postgres, and in order to use the DB for tests I created an is_testing branch, which drops and recreates the test database.
if is_testing:
    db_url = self._engine.url
    if db_url.host != "localhost":
        raise Exception("db host must be 'localhost' in test environment")
    except_schema_db_url = f"{db_url.drivername}://{db_url.username}@{db_url.host}"
    schema_name = db_url.database  # test
    temp_engine = create_engine(except_schema_db_url, echo=echo, pool_recycle=pool_recycle, pool_pre_ping=True)
    conn = temp_engine.connect()
    try:
        conn = conn.execution_options(autocommit=False)
        conn.execute("ROLLBACK")
        conn.execute(f"DROP DATABASE {schema_name}")
    except ProgrammingError:
        print("could not drop the database, probably does not exist.")
        conn.execute("ROLLBACK")
    except OperationalError:
        print("could not drop database because it's being accessed by other users (psql prompt open?)")
        conn.execute("ROLLBACK")
    print(f"test db dropped! about to create {schema_name}")
    conn.execute(f"CREATE DATABASE {schema_name}")
    try:
        conn.execute("create user test with encrypted password 'test'")
    except:
        print("User already exists")
    temp_engine.dispose()
This is conftest.py:
@pytest.fixture(scope="session")
def app():
    os.environ["API_ENV"] = "test"
    return create_app()

@pytest.fixture(scope="session")
def client(app):
    # Create tables
    Base.metadata.create_all(db.engine)
    client = AsyncClient(app=app, base_url="http://test")
    return client

@pytest.fixture(scope="function", autouse=True)
def session():
    sess = next(db.session())
    yield sess
    clear_all_table_data(
        session=sess,
        metadata=Base.metadata,
        except_tables=[]
    )
    sess.rollback()

def clear_all_table_data(session: Session, metadata, except_tables: List[str] = None):
    session.execute("SET session_replication_role = 'replica';")
    for table in metadata.sorted_tables:
        if table.name not in except_tables:
            session.execute(table.delete())
    session.execute("SET session_replication_role = 'origin';")
    session.commit()
I got the error sqlalchemy.exc.OperationalError: (psycopg2.errors.ObjectInUse) database "test" is being accessed by other users DETAIL: There is 1 other session using the database. in the elb check test.
And I got the error TypeError: 'AsyncClient' object is not callable in another API test.
So I modified the client fixture in conftest.py:
@pytest.fixture(scope="session")
def client(app):
    Base.metadata.create_all(db.engine)
    return AsyncClient(app=app, base_url="http://test")
That got one test to pass, but I received the following error from the second test:
ClientState.OPENED: "Cannot open a client instance more than once.",
ClientState.CLOSED: "Cannot reopen a client instance, once it has been closed.",
How can I fix this?
Thank you for reading this long question!
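As a point of reference, one common pattern for the "Cannot open a client instance more than once" error is to open a fresh AsyncClient per test with an async context manager instead of reusing a single session-scoped instance. A minimal sketch, assuming pytest-asyncio (in strict mode the fixture would need @pytest_asyncio.fixture) and reusing the app fixture above:

import pytest
from httpx import AsyncClient

@pytest.fixture(scope="function")
async def client(app):
    # A fresh client per test; the context manager opens and closes it cleanly,
    # so it is never reopened after being closed.
    async with AsyncClient(app=app, base_url="http://test") as ac:
        yield ac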
I have a DAG that needs to check whether a file has been uploaded to a specific directory in Azure Data Lake. If so, it should allow other DAGs to run.
I thought about using a FileSensor, but I assume the fs_conn_id parameter is not enough to authenticate against a Data Lake.
There is no AzureDataLakeSensor in the Azure provider, but you can easily implement one, since AzureDataLakeHook has a check_for_file function; all that's needed is to wrap that function in a sensor class implementing the poke() function of BaseSensorOperator. By doing so you can use the Microsoft Azure Data Lake connection directly.
I didn't test it but this should work:
from typing import Sequence

from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook
from airflow.sensors.base import BaseSensorOperator


class MyAzureDataLakeSensor(BaseSensorOperator):
    """
    Sense for files in Azure Data Lake.

    :param path: The Azure Data Lake path to find the objects. Supports glob
        strings (templated)
    :param azure_data_lake_conn_id: The Azure Data Lake connection id
    """

    template_fields: Sequence[str] = ('path',)
    ui_color = '#901dd2'

    def __init__(
        self, *, path: str, azure_data_lake_conn_id: str = 'azure_data_lake_default', **kwargs
    ) -> None:
        super().__init__(**kwargs)
        self.path = path
        self.azure_data_lake_conn_id = azure_data_lake_conn_id

    def poke(self, context: "Context") -> bool:
        hook = AzureDataLakeHook(azure_data_lake_conn_id=self.azure_data_lake_conn_id)
        self.log.info('Poking for file in path: %s', self.path)
        try:
            hook.check_for_file(file_path=self.path)
            return True
        except FileNotFoundError:
            pass
        return False
Usage example:
MyAzureDataLakeSensor(
    task_id='adls_sense',
    path='folder/file.csv',
    azure_data_lake_conn_id='azure_data_lake_default',
    mode='reschedule'
)
First of all, have a look at official Microsoft Operators for Airflow.
We can see that there are dedicated operators for Azure Data Lake Storage; unfortunately, only the ADLSDeleteOperator seems to be available at the moment.
This ADLSDeleteOperator uses an AzureDataLakeHook, which you should reuse in your own custom operator to check for file presence.
My advice is to create a child class of CheckOperator that uses the ADLS hook to check whether the file provided as input exists, via the hook's check_for_file function.
UPDATE: as pointed out in the comments, CheckOperator is tied to SQL queries and is deprecated. Using your own custom sensor or custom operator is the way to go.
I had severe issues using the proposed API, so I embedded the Microsoft API (azure.storage.filedatalake) into an Airflow sensor instead. This works fine. All you then need to do is use this operator and pass account_url and access_token.
from azure.storage.filedatalake import DataLakeServiceClient
from airflow.sensors.base import BaseSensorOperator


class AzureDataLakeSensor(BaseSensorOperator):
    def __init__(self, path, filename, account_url, access_token, **kwargs):
        super().__init__(**kwargs)
        self._client = DataLakeServiceClient(
            account_url=account_url,
            credential=access_token
        )
        self.path = path
        self.filename = filename

    def poke(self, context):
        container = self._client.get_file_system_client(file_system="raw")
        dir_client = container.get_directory_client(self.path)
        file = dir_client.get_file_client(self.filename)
        return file.exists()
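For completeness, a hypothetical usage example (the account_url and access_token values are placeholders, and note the sensor above is hard-wired to the "raw" file system):

check_file = AzureDataLakeSensor(
    task_id='adls_sense_raw',
    path='folder',
    filename='file.csv',
    account_url='https://<storage-account>.dfs.core.windows.net',
    access_token='<access-token>',
    mode='reschedule',
    dag=dag,
)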
I need to make html_content dynamic for a custom email operator, as the html_content differs from job to job.
Also, I need values such as the row count and filename to be dynamic.
The example below is one of the email body:
The `filename` has been delivered. `0 rows` for contact from 2020-06-14. If you have any questions or concerns regarding this feed please reply to this email
NOTE: The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use, or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.
Code:
def execute(self, context):
    if self.source_task_ids:
        ti = context['task_instance']
        self.s3_key = ti.xcom_pull(task_ids=self.source_task_ids, key='s3_key')[0]
        self.s3_key = self.get_s3_key(self.s3_key)
    s3_hook = S3Hook(self.s3_conn_id)
    try:
        if not s3_hook.check_for_key(self.s3_key, bucket_name=self.s3_bucket):
            logger.info(f'The source key {self.s3_key} does not exist in the {self.s3_bucket}')
            rowcount = 0
            self.subject = self.subject
            self.html_content = self.html_content
        else:
            filedata = s3_hook.read_key(self.s3_key, bucket_name=self.s3_bucket)
            rowcount = filedata.count('\n') - 1
            logger.info(f'rowcount: {rowcount}')
            self.subject = self.subject
            self.html_content = self.html_content
        self.snd_mail(self.send_from, self.send_to, self.subject, self.html_content, self.eml_server, files=self.files)
    except Exception as e:
        raise AirflowException(f'Error in sending the Email - {e}')
Airflow supports Jinja templating in operators. It is built into the BaseOperator and controlled by the template_fields and template_ext fields of the base operator, e.g.:
class CustomEmailOperator(BaseOperator):

    template_fields = ("html_content",)
    template_ext = (".html",)

    @apply_defaults
    def __init__(self, html_content, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.html_content = html_content

    def execute(self, context):
        # Rest of operator code; nothing special needs to happen to render the template
        ...
Now the html_content field can either be a path to a Jinja-templated file with the .html extension, or an HTML string directly. Parameters can be passed to the Jinja template using the params field of the operator:
task1 = CustomEmailOperator(
    task_id="task1",
    html_content="Hello, {{ params.name }}",
    params={
        "name": "John",
    },
    ...
)
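Since template_ext includes ".html", you could alternatively point html_content at a templated file; a hypothetical variant (the filename is a placeholder, typically resolved relative to the DAG file or template_searchpath, and other operator arguments are omitted):

task1 = CustomEmailOperator(
    task_id="task1",
    html_content="email_body.html",  # rendered as a Jinja template because of the .html extension
    params={"name": "John"},
)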
That is how you could pass the filename and row-count parameters. If you do not want to rely on the BaseOperator templating mechanism for your email content, e.g. because you need a bit more control, you can also use a helper function available in Airflow:
from airflow.utils.helpers import parse_template_string
html_content = "Hello, {{ params.name }}"
_, template = parse_template_string(html_content)
body = template.render({"name": "John"})
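Inside the custom operator's execute method this could look roughly like the following (a sketch reusing the names from the question's code, where rowcount is the value computed there):

_, template = parse_template_string(self.html_content)
body = template.render({"filename": self.s3_key, "rows": rowcount})
self.snd_mail(self.send_from, self.send_to, self.subject, body, self.eml_server, files=self.files)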
I am new to Airflow. I have written code to submit an HTTP POST using SimpleHttpOperator. In this case the POST request returns a token; I need help on how to read the response body.
get_templates = SimpleHttpOperator(
    task_id='get_templates',
    method='POST',
    endpoint='myendpoint',
    http_conn_id='myconnection',
    trigger_rule="all_done",
    headers={"Content-Type": "application/json"},
    xcom_push=True,
    dag=dag
)
Looks like the POST was successful. Now my question is how to read the response body.
This is the output of the code; there are no errors:
[2019-05-06 20:08:40,518] {http_hook.py:128} INFO - Sending 'POST' to url: https://auth.reltio.com/oauth//token?username=perf_api_user&password=perf_api_user!&grant_type=password
/usr/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
The execute function of the SimpleHttpOperator returns the response.text (source). By looking at the Airflow documentation for XCom, you can see that:
... if a task returns a value (either from its Operator’s execute() method, or from a PythonOperator’s python_callable function), then an XCom containing that value is automatically pushed.
meaning the response body is pushed to the XCom and is available for downstream tasks to access.
For example, you could have a PythonOperator fetching it via:
response_body = context['task_instance'].xcom_pull(task_ids='get_templates')
Additionally, if you just want to log the response instead of process it, you can just set the log_response of the SimpleHttpOperator constructor to True.
If you use Airflow 2, the xcom_push argument is not available in SimpleHttpOperator. In this case, let's say you call /my-url in a task called call_api; to get the response and pass it to another task, you need to read from the XCom return_value that is automatically set by SimpleHttpOperator:
call_api = SimpleHttpOperator(
    task_id='call_api',
    http_conn_id=api_connection,
    method='GET',
    endpoint='/my-url',
    response_filter=lambda response: json.loads(response.text),
    log_response=True,  # Shows response in the task log
    dag=dag
)

def _read_response(ti):
    val = ti.xcom_pull(
        task_ids='call_api',
        key='return_value'
    )
    print(val)

read_response = PythonOperator(
    task_id='read_response',
    python_callable=_read_response,
    dag=dag
)
You can also specify dag_id in ti.xcom_pull to select which DAG to pull the value from.
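For example (the dag_id value here is a placeholder; it defaults to the current DAG when omitted):

val = ti.xcom_pull(task_ids='call_api', dag_id='my_api_dag', key='return_value')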