Airflow - Custom XCom backend on Ubuntu - airflow

I'm trying to implement custom XCOM backend.
Those are the steps I did:
Created "include" directory at the main Airflow dir (AIRFLOW_HOME).
Created these "custom_xcom_backend.py" file inside:
from typing import Any
from airflow.models.xcom import BaseXCom
import pandas as pd
class CustomXComBackend(BaseXCom):
#staticmethod
def serialize_value(value: Any):
if isinstance(value, pd.DataFrame):
value = value.to_json(orient='records')
return BaseXCom.serialize_value(value)
#staticmethod
def deserialize_value(result) -> Any:
result = BaseXCom.deserialize_value(result)
result = df = pd.read_json(result)
return result
Set at config file:
xcom_backend = include.custom_xcom_backend.CustomXComBackend
When I restarted webserver I got:
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "xcom_backend" key in "core" section. Current value: "include.cust...
My guess is that it not recognizing the "include" folder
But how can I fix it?
*Note: There is no docker. It is installed on a Ubuntu machine.
Thanks!

So I solved it:
Put custom_xcom_backend.py into the plugins directory
set at config file:
xcom_backend = custom_xcom_backend.CustomXComBackend
Restart all airflow related services
*Note: Do not store DataFrames that way (bad practice).
Sources I used:
https://www.youtube.com/watch?v=iI0ymwOij88

Related

db.create_all() doesn't create a database in a desired directory

I am trying to create a database for my Flask application in the main directory of my project. This is my code for initializing a database:
app.config["SQLALCHEMY_DATABASE_URI"] = 'sqlite:///users.db'
app.config["SQLALCHEMY_TRACK_MODIFICATIONS"] = False
db = SQLAlchemy(app)
Flask requires application context, so this is how I create the database:
$ flask shell
>>> db.create_all()
I also tried doing it with:
$ python
>>> from app import app, db
>>> app.app_context().push()
>>> db.create_all()
Both of these options create the database in the /instance directory. Is there any way to get around this and create it in the main directory of the project?
The instance path is the preferred and default location for the database. I recommend you to use this one for security reasons. However, it is also possible to choose an alternative solution in which the full length of the path is specified in the configuration.
The following configuration corresponds to an outdated variant, where the database is created in the current working directory. Please don't use this anymore.
app.config['SQLALCHEMY_DATABASE_URI'] ='sqlite:///' + os.path.join(os.getcwd(), 'users.db')
This corresponds to the current solution.
app.config['SQLALCHEMY_DATABASE_URI'] ='sqlite:///' + os.path.join(app.instance_path, 'users.db')

Google Dataflow: Import custom Python module

I try to run a Apache Beam pipeline (Python) within Google Cloud Dataflow, triggered by a DAG in Google Cloud Coomposer.
The structure of my dags folder in the respective GCS bucket is as follows:
/dags/
dataflow.py <- DAG
dataflow/
pipeline.py <- pipeline
setup.py
my_modules/
__init__.py
commons.py <- the module I want to import in the pipeline
The setup.py is very basic, but according to the Apache Beam docs and answers on SO:
import setuptools
setuptools.setup(setuptools.find_packages())
In the DAG file (dataflow.py) I set the setup_file option and pass it to Dataflow:
default_dag_args = {
... ,
'dataflow_default_options': {
... ,
'runner': 'DataflowRunner',
'setup_file': os.path.join(configuration.get('core', 'dags_folder'), 'dataflow', 'setup.py')
}
}
Within the pipeline file (pipeline.py) I try to use
from my_modules import commons
but this fails. The log in Google Cloud Composer (Apache Airflow) says:
gcp_dataflow_hook.py:132} WARNING - b' File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11\n from my_modules import commons\n ^\nSyntaxError: invalid syntax'
The basic idea behind the setup.py file is documented here
Also, there are similar questions on SO which helped me:
Google Dataflow - Failed to import custom python modules
Dataflow/apache beam: manage custom module dependencies
I'm actually wondering why my pipelines fails with a Syntax Error and not a module not found kind of error...
I tried to reproduce your issue and then try to solve it, so I created the same folder structure you already have:
/dags/
dataflow.py
dataflow/
pipeline.py -> pipeline
setup.py
my_modules/
__init__.py
common.py
Therefore, to make it work, the change I made is to copy these folders to a place where the instance is running the code is able to find it, for example in the /tmp/ folder of the instance.
So, my DAG would be something like this:
1 - Fist of all I declare my arguments:
default_args = {
'start_date': datetime(xxxx, x, x),
'retries': 1,
'retry_delay': timedelta(minutes=5),
'dataflow_default_options': {
'project': '<project>',
'region': '<region>',
'stagingLocation': 'gs://<bucket>/stage',
'tempLocation': 'gs://<bucket>/temp',
'setup_file': <setup.py>,
'runner': 'DataflowRunner'
}
}
2- After this, I created the DAG and before running the Dataflow task, I copied the whole folder directory, above created, into the /tmp/ folder of the instance Task t1, and after this, I run the pipeline from the /tmp/ directory Task t2:
with DAG(
'composer_df',
default_args=default_args,
description='datflow dag',
schedule_interval="xxxx") as dag:
def copy_dependencies():
process = subprocess.Popen(['gsutil','cp', '-r' ,'gs://<bucket>/dags/*',
'/tmp/'])
process.communicate()
t1 = python_operator.PythonOperator(
task_id='copy_dependencies',
python_callable=copy_dependencies,
provide_context=False
)
t2 = DataFlowPythonOperator(task_id="composer_dataflow",
py_file='/tmp/dataflow/pipeline.py', job_name='job_composer')
t1 >> t2
That's how I created the DAG file dataflow.py, and then, in the pipeline.py the package to import would be like:
from my_modules import commons
It should work fine, since the folder directory is understandable for the VM.

How can I read a config file from airflow packaged DAG?

Airflow packaged DAGs seem like a great building block for a sane production airflow deployment.
I have a DAG with dynamic subDAGs, driven by a config file, something like:
config.yaml:
imports:
- project_foo
- project_bar`
which yields subdag tasks like imports.project_{foo|bar}.step{1|2|3}.
I've normally read in the config file using python's open function, a la config = open(os.path.join(os.path.split(__file__)[0], 'config.yaml')
Unfortunately, when using packaged DAGs, this results in an error:
Broken DAG: [/home/airflow/dags/workflows.zip] [Errno 20] Not a directory: '/home/airflow/dags/workflows.zip/config.yaml'
Any thoughts / best practices to recommend here?
It's a bit of a kludge, but I eventually just fell back on reading zip file contents via ZipFile.
import yaml
from zipfile import ZipFile
import logging
import re
def get_config(yaml_filename):
"""Parses and returns the given YAML config file.
For packaged DAGs, gracefully handles unzipping.
"""
zip, post_zip = re.search(r'(.*\.zip)?(.*)', yaml_filename).groups()
if zip:
contents = ZipFile(zip).read(post_zip.lstrip('/'))
else:
contents = open(post_zip).read()
result = yaml.safe_load(contents)
logging.info('Parsed config: %s', result)
return result
which works as you'd expect from the main dag.py:
get_config(os.path.join(path.split(__file__)[0], 'config.yaml'))

airflow plugins not getting picked up correctly

We are using Apache 1.9.0. I have written a snowflake hook plugin. I have placed the hook in the $AIRFLOW_HOME/plugins directory.
$AIRFLOW_HOME
+--plugins
+--snowflake_hook2.py
snowflake_hook2.py
# This is the base class for a plugin
from airflow.plugins_manager import AirflowPlugin
# This is necessary to expose the plugin in the Web interface
from flask import Blueprint
from flask_admin import BaseView, expose
from flask_admin.base import MenuLink
# This is the base hook for connecting to a database
from airflow.hooks.dbapi_hook import DbApiHook
# This is the snowflake provided Connector
import snowflake.connector
# This is the default python logging package
import logging
class SnowflakeHook2(DbApiHook):
"""
Airflow Hook to communicate with Snowflake
This is implemented as a Plugin
"""
def __init__(self, connname_in='snowflake_default', db_in='default', wh_in='default', schema_in='default'):
logging.info('# Connecting to {0}'.format(connname_in))
self.conn_name_attr = 'snowflake_conn_id'
self.connname = connname_in
self.superconn = super().get_connection(self.connname) #gets the values from Airflow
{SNIP - Connection stuff that works}
self.cur = self.conn.cursor()
def query(self,q,params=None):
"""From jmoney's db_wrapper allows return of a full list of rows(tuples)"""
if params == None: #no Params, so no insertion
self.cur.execute(q)
else: #make the parameter substitution
self.cur.execute(q,params)
self.results = self.cur.fetchall()
self.rowcount = self.cur.rowcount
self.columnnames = [colspec[0] for colspec in self.cur.description]
return self.results
{SNIP - Other class functions}
class SnowflakePluginClass(AirflowPlugin):
name = "SnowflakePluginModule"
hooks = [SnowflakeHook2]
operators = []
So I went ahead and put some print statements in Airflows plugin_manager to try and get a better handle on what is happening. After restarting the webserver and running airflow list_dags, these lines were showing the "new module name" (and no errors
SnowflakePluginModule [<class '__home__ubuntu__airflow__plugins_snowflake_hook2.SnowflakeHook2'>]
hook_module - airflow.hooks.snowflakepluginmodule
INTEGRATING airflow.hooks.snowflakepluginmodule
snowflakepluginmodule <module 'airflow.hooks.snowflakepluginmodule'>
As this is consistent with what the documentation says, I should be fine using this in my DAG:
from airflow import DAG
from airflow.hooks.snowflakepluginmodule import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
But the web throws this error
Broken DAG: [/home/ubuntu/airflow/dags/test_sf2.py] No module named 'airflow.hooks.snowflakepluginmodule'
So the question is, What am I doing wrong? Or have I uncovered a bug?
You need to import as below:
from airflow import DAG
from airflow.hooks import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
OR
from airflow import DAG
from airflow.hooks.SnowflakePluginModule import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
I don't think that airflow automatically goes through the folders in your plugins directory and runs everything underneath it. The way that I've set it up successfully is to have an __init__.py under the plugins directory which contains each plugin class. Have a look at the Astronomer plugins in Github, it provides some really good examples for how to set up your plugins.
In particular have a look at how they've set up the mysql plugin
https://github.com/airflow-plugins/mysql_plugin
Also someone has incorporated a snowflake hook in one of the later versions of airflow too which you might want to leverage:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/snowflake_hook.py

Django Settings Module

I installed Django and was able to double check that the module was in fact in Python, but when attempting to implement basic commands such as runserver or utilize manage.py; I get DJANGO_SETTEINGS_MODULE error. I already used "set DJANGO_SETTINGS_MODULE = mysite.settings" as advised and inserted mysite.settings into the PATH for Python as some documentation online directed me to.
Now instead of undefined it says no such module exists. I can't find anything else in the documentation and I used my test site name instead of "mysite" without any change. Does anyone know what am I missing? All I can find in the module library for Django in my Python is this code.
from future import unicode_literals
from django.utils.version import get_version
VERSION = (1, 11, 5, 'final', 0)
__version__ = get_version(VERSION)
def setup(set_prefix=True):
"""
Configure the settings (this happens as a side effect of accessing the first setting), configure logging and populate the app registry.
Set the thread-local urlresolvers script prefix if `set_prefix` is True.
"""
from django.apps import apps
from django.conf import settings
from django.urls import set_script_prefix
from django.utils.encoding import force_text
from django.utils.log import configure_logging
configure_logging(settings.LOGGING_CONFIG, settings.LOGGING)
if set_prefix:
set_script_prefix(
'/' if settings.FORCE_SCRIPT_NAME is None else force_text(settings.FORCE_SCRIPT_NAME)
)
apps.populate(settings.INSTALLED_APPS)
Are you sure you wrote properly the environment variable? I'm asking cause I see you get the error DJANGO_SETTEINGS_MODULE (settings has a misspelling)...

Resources