Configuring Airflow to work with CeleryExecutor

I am trying to configure Airbnb Airflow to use the CeleryExecutor like this:
I changed the executor in airflow.cfg from SequentialExecutor to CeleryExecutor:
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = CeleryExecutor
But I get the following error:
airflow.configuration.AirflowConfigException: error: cannot use sqlite with the CeleryExecutor
Note that the sql_alchemy_conn is configured like this:
sql_alchemy_conn = sqlite:////root/airflow/airflow.db
I looked at Airflow's GIT (https://github.com/airbnb/airflow/blob/master/airflow/configuration.py)
and found that the following code throws this exception:
def _validate(self):
    if (
            self.get("core", "executor") != 'SequentialExecutor' and
            "sqlite" in self.get('core', 'sql_alchemy_conn')):
        raise AirflowConfigException(
            "error: cannot use sqlite with the {}".format(
                self.get('core', 'executor')))
It seems from this _validate method that sql_alchemy_conn cannot contain sqlite.
Do you have any idea how to configure the CeleryExecutor without SQLite? Please note that I downloaded RabbitMQ for working with the CeleryExecutor, as required.

As the error says, Airflow's CeleryExecutor requires a backend other than the default SQLite database. You have to use MySQL or PostgreSQL, for example.
The sql_alchemy_conn in airflow.cfg must be changed to follow the SQLAlchemy connection string structure (see the SQLAlchemy documentation).
For example,
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@127.0.0.1:5432/airflow
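If you want to sanity-check the new connection string before running airflow initdb, a minimal sketch like the following works (assuming psycopg2 is installed and using the airflow:airflow@127.0.0.1 credentials from the example above; adjust them to your setup):
# Quick connectivity check for the metadata DB (sketch only; credentials are the example values)
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:airflow@127.0.0.1:5432/airflow")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the database is reachable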

To configure Airflow for MySQL:
First, install MySQL (this might help, or just search for instructions for your platform).
Go to the Airflow installation directory, usually /home/<user>/airflow.
Edit airflow.cfg and locate
sql_alchemy_conn = sqlite:////home/vipul/airflow/airflow.db
and add # in front of it so it looks like
#sql_alchemy_conn = sqlite:////home/vipul/airflow/airflow.db
(assuming you still have the default SQLite setting).
Add this line below it:
sql_alchemy_conn = mysql://<user>:<password>@localhost:3306/<database>
Save the file, run the command
airflow initdb
and you're done!
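If you want to confirm that airflow initdb actually wrote its tables to MySQL rather than the old SQLite file, a small sketch like this can help (assuming a MySQL driver such as mysqlclient is installed; user, password and database are the placeholders from the connection string above):
# Sketch: verify the Airflow metadata tables landed in MySQL (placeholders, not real credentials)
from sqlalchemy import create_engine, inspect

engine = create_engine("mysql://user:password@localhost:3306/airflow")
tables = inspect(engine).get_table_names()
print(tables)  # should include tables such as "dag" and "task_instance"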

As other answers have stated, you need to use a different database besides SQLite. Additionally, you need to install RabbitMQ, configure it appropriately, and change each of your airflow.cfg files to have the correct RabbitMQ information. For an excellent tutorial on this, see A Guide On How To Build An Airflow Server/Cluster.
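As a sanity check on the RabbitMQ side, you can verify that Celery can reach the broker before starting any workers. This is just a sketch; amqp://guest:guest@localhost:5672// is RabbitMQ's default URL and should match the broker_url you put in the [celery] section of airflow.cfg:
# Sketch: confirm the Celery broker (RabbitMQ) is reachable (broker URL is an assumption)
from celery import Celery

app = Celery(broker="amqp://guest:guest@localhost:5672//")
app.connection().ensure_connection(max_retries=1)  # raises if RabbitMQ is unreachable
print("Broker reachable")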

If you run it on a Kubernetes cluster, use the following config:
airflow:
  config:
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:airflow@airflow-postgresql:5432/airflow

Related

Airflow DeprecationWarning

I'm running a distributed Airflow 2.4.0 setup using the official Docker image. All the containers use the same .env file and same version of Airflow image. When I log into one of the Airflow containers I get this warning:
/home/airflow/.local/lib/python3.7/site-packages/airflow/configuration.py:545: DeprecationWarning: The sql_alchemy_conn option in [core] has been moved to the sql_alchemy_conn option in [database] - the old setting has been used, but please update your config.
option = self._get_environment_variables(deprecated_key, deprecated_section, key, section)
/home/airflow/.local/lib/python3.7/site-packages/airflow/configuration.py:545: DeprecationWarning: The auth_backend option in [api] has been renamed to auth_backends - the old setting has been used, but please update your config.
option = self._get_environment_variables(deprecated_key, deprecated_section, key, section)
/home/airflow/.local/lib/python3.7/site-packages/airflow/configuration.py:367: FutureWarning: The auth_backends setting in [api] has had airflow.api.auth.backend.session added in the running config, which is needed by the UI. Please update your config before Apache Airflow 3.0.
FutureWarning,
I checked the airflow.cfg inside the container and it has the up-to-date variables. Why do I still get the warning messages?
You are seeing these warnings because of the section names. airflow.cfg is a configuration file organized into sections, and each setting is expected to be in the proper section.
In your case, your airflow.cfg has sql_alchemy_conn where you override the default value. Prior to 2.3.0 this setting lived in the [core] section, and in 2.3.0 it was moved to the [database] section (see the PR).
What you need to do is simply open airflow.cfg and move the setting to the proper section. For example, change:
[core]
sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
to:
[database]
sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
The reason why it is like this is also explained in the docs. Airflow references settings by environment variables of the format AIRFLOW__{SECTION}__{KEY}, so in this case it will be AIRFLOW__DATABASE__SQL_ALCHEMY_CONN; the section is important for finding the variable.
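The mapping from a config section/key to its environment variable is purely mechanical. A tiny helper (just an illustration, not part of Airflow's API) makes the naming scheme explicit:
# Illustration of the AIRFLOW__{SECTION}__{KEY} naming convention (not an Airflow API)
def airflow_env_var(section: str, key: str) -> str:
    return f"AIRFLOW__{section.upper()}__{key.upper()}"

print(airflow_env_var("database", "sql_alchemy_conn"))  # AIRFLOW__DATABASE__SQL_ALCHEMY_CONN
print(airflow_env_var("api", "auth_backends"))          # AIRFLOW__API__AUTH_BACKENDS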

MWAA Airflow 2.0 in AWS Snowflake connection not showing

Snowflake is not showing in the connections dropdown.
I am using MWAA 2.0 and the providers are already in the requirements.txt
MWAA uses Python 3.7; I don't know if that could be a factor.
Requirements.txt:
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.7.txt"
asn1crypto
azure-common
azure-core
azure-storage-blob
boto3
botocore
certifi
cffi
chardet
cryptography
greenlet
idna
isodate
jmespath
msrest
numpy
oauthlib
oscrypto
pandas
pyarrow
pycparser
pycryptodomex
PyJWT
pyOpenSSL
python-dateutil
pytz
requests
requests-oauthlib
s3transfer
six
urllib3
apache-airflow-providers-http
apache-airflow-providers-snowflake
#apache-airflow-providers-snowflake[slack]
#apache-airflow-providers-slack
snowflake-connector-python >=2.4.1
snowflake-sqlalchemy >=1.1.0
If anyone else runs into this: instead of choosing Snowflake in the dropdown, you can choose Amazon Web Services as the connection type and it will work fine.
It took me a while to finally figure this one out after trying many different parameter combinations.
My full Snowflake URL is:
https://xx12345.us-east-2.aws.snowflakecomputing.com
The correct format for the Host field is:
xx12345.us-east-2.snowflakecomputing.com
For the Extra field, this is what worked for me:
{
  "account": "xx12345.us-east-2.aws",
  "warehouse": "my_warehouse_name",
  "database": "my_database_name"
}
Make sure you put Amazon Web Services for the Conn Type, like @AXI said.
Also, I have these modules defined in my requirements.txt file:
apache-airflow-providers-snowflake==1.3.0
snowflake-connector-python==2.4.5
snowflake-sqlalchemy==1.2.4
My Airflow version is 2.0.2.
According to the MWAA docs, it should be enough to add apache-airflow-providers-snowflake==1.3.0 to the requirements file. When I added it to the existing MWAA env, where I had already tried many different combinations of packages, it helped only partially: it was possible to create a connection using the CLI, but not with the UI.
But when I created a new, clean MWAA env with the requirements file as stated in the mentioned AWS doc, it worked well. The connection was available in the UI.
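If the dropdown still refuses to cooperate, another option is to create the connection programmatically instead of through the UI. This is only a sketch: the conn_id, login and password are placeholders, the host and extra values mirror the example above, and whether conn_type should be snowflake or (per the workaround above) aws depends on which providers your environment actually has installed:
# Sketch: create the Snowflake connection directly in the metadata DB (placeholder credentials)
import json

from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="snowflake_default",          # placeholder conn_id
    conn_type="snowflake",                # or "aws" if you follow the workaround above
    host="xx12345.us-east-2.snowflakecomputing.com",
    login="my_user",                      # placeholder
    password="my_password",               # placeholder
    extra=json.dumps({
        "account": "xx12345.us-east-2.aws",
        "warehouse": "my_warehouse_name",
        "database": "my_database_name",
    }),
)

session = settings.Session()
session.add(conn)
session.commit()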

How to output Airflow's scheduler log to stdout or S3 / GCS

We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. Airflow's scheduler container outputs its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated and disk usage increases until the disk gets full. A DAG for cleaning up the log directory is available, but the DAG runs on a worker node, so the log directory on the scheduler container is not cleaned up.
I'm looking for a way to output the scheduler log to stdout or to an S3/GCS bucket, but have been unable to find one. Is there any way to do this?
I finally managed to output the scheduler's log to stdout.
Here you can find how to use a custom logger in Airflow. The default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config at ${AIRFLOW_HOME}/config/log_config.py:
# Setting processor (scheduler, etc.) logs output to stdout
# Referring to https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
import sys
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set the logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg:
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable:
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set the path of logging_config_class to anything, as long as Python is able to load the package.
Setting handlers.processor to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.
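Before restarting the scheduler, you can check that the custom config is importable the same way Airflow will import it. A quick sketch (assumes $AIRFLOW_HOME is on PYTHONPATH, as in step 3):
# Sanity check: the custom logging config loads and the processor handler streams to stdout
import sys

from config.log_config import LOGGING_CONFIG

handler = LOGGING_CONFIG["handlers"]["processor"]
assert handler["class"] == "logging.StreamHandler"
assert handler["stream"] is sys.stdout
print("custom logging config loads correctly")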
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following, or do the same in airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The gcp_conn_id should have the correct permissions to create/delete objects in GCS.

Google Cloud Composer DataflowJavaOperator: 403 Forbidden When Creating Job in Another Project

I am trying to use DataflowJavaOperator on our testing Composer environment, but I am running into a 403 Forbidden error. My intention is to kick off a Dataflow Java job in a different project using the test Composer environment.
t2 = DataFlowJavaOperator(
    task_id="run-java-dataflow-job",
    jar="gs://path/to/dataflow-jar.jar",
    dataflow_default_options=config_params["dataflow_default_options"],
    gcp_conn_id=config_params["gcloud_config"]["conn_id"],
    dag=dag
)
My default options look like:
'dataflow_default_options': {
    'project': 'other-project',
    'input': 'other-project:dataset.table',
    'output': 'other-project:dataset.table',
    ...
}
I have tried creating a temporary Composer test environment in the same project as the Dataflow job, and this allows me to use DataflowJavaOperator as expected. Only when the Composer environment resides in a different project than the Dataflow job does DataflowJavaOperator not work as expected.
My current workaround is to use BashOperator, use "env" to set GOOGLE_APPLICATION_CREDENTIALS to the gcp_conn_id key file path, store the jar file in our test Composer bucket, and just run this bash command:
java -jar /path/to/dataflow-jar.jar \
[... all Dataflow job options]
Is it possible to use DataflowJavaOperator to kick off Dataflow jobs on another project?
You need a different GCP connection created for Composer to interact with your 2nd GCP project, and you need to pass that connection id to gcp_conn_id in DataFlowJavaOperator.
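In other words, keep the operator from the question as it is and only change which connection it uses. A hedged example (other_project_gcp is a hypothetical conn_id whose service account has Dataflow permissions in the second project; the rest mirrors the snippet above):
# Sketch: same operator as in the question, but with a dedicated connection for the 2nd project
# ("other_project_gcp" is a hypothetical conn_id created under Admin -> Connections)
t2 = DataFlowJavaOperator(
    task_id="run-java-dataflow-job",
    jar="gs://path/to/dataflow-jar.jar",
    dataflow_default_options=config_params["dataflow_default_options"],
    gcp_conn_id="other_project_gcp",
    dag=dag,
)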

Possible to set different executor for each Airflow DAG?

I am looking to add another DAG to an existing Airflow server. The server is currently using LocalExecutor but I might want my DAG to use CeleryExecutor. It seems like the configuration file airflow.cfg only allows one executor:
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = LocalExecutor
Is it possible to configure Airflow such that the existing DAGs can continue to use LocalExecutor and my new DAG can use CeleryExecutor or a custom executor class? I haven't found any examples of people doing this nor come across anything in the Airflow documentation.
If you have a SubDAG within your DAG, you can pass in a specific executor to that SubDagOperator. For instance, to use a SequentialExecutor:
# Imports for Airflow 1.x, which this answer targets
from airflow.executors.sequential_executor import SequentialExecutor
from airflow.operators.subdag_operator import SubDagOperator

bar_subdag = SubDagOperator(
    task_id='bar',
    subdag=my_subdag('foo', 'bar', default_args),
    default_args=default_args,
    dag=foo_dag,
    executor=SequentialExecutor()
)
This is on 1.8, not sure if 1.9 is different.
It seems the scheduler will only start one instance of the executor.
