Need to create a custom S3KeySensor - Airflow

I'm using airflow_docker and it seems it does not come with an S3KeySensor. It does come with a Sensor class. How can I create my own custom S3KeySensor? Do I just have to inherit from Sensor and override the poke method? Can I literally just copy the source code from the S3KeySensor? Here's the source code for the S3KeySensor.
The reason I am using airflow_docker is that each task runs in its own container, and I can pass AWS role credentials to the task container so it has the proper and exact permissions for its action, rather than relying on the worker container's role permissions.
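Here is roughly what I had in mind, assuming I can build on core Airflow's BaseSensorOperator and S3Hook (I'm not sure the airflow_docker Sensor class exposes the same interface, so treat the base class and hook here as assumptions):

from airflow.hooks.S3_hook import S3Hook
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class MyS3KeySensor(BaseSensorOperator):
    """Waits until a given key exists in an S3 bucket."""

    @apply_defaults
    def __init__(self, bucket_key, bucket_name, aws_conn_id='aws_default', *args, **kwargs):
        super(MyS3KeySensor, self).__init__(*args, **kwargs)
        self.bucket_key = bucket_key
        self.bucket_name = bucket_name
        self.aws_conn_id = aws_conn_id

    def poke(self, context):
        # The scheduler calls poke() on every poke_interval until it
        # returns True (key found) or the sensor times out.
        hook = S3Hook(aws_conn_id=self.aws_conn_id)
        return hook.check_for_key(self.bucket_key, bucket_name=self.bucket_name)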

I would recommend upgrading to the latest version of Airflow, or at least starting at Airflow 2.0.
In Airflow 2.0, a majority of the operators, sensors, and hooks that are not considered core functionality are categorized and separated into providers. Amazon has a provider, apache-airflow-providers-amazon, that includes the S3KeySensor.
If you want to stay on Airflow 1.10.5 (based on the links you sent), you can instead install the backport provider; the one you would need is apache-airflow-backport-providers-amazon.
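For example (the import path has moved between provider releases, so adjust it for the exact version you install):

# Airflow 2.x:    pip install apache-airflow-providers-amazon
# Airflow 1.10.x: pip install apache-airflow-backport-providers-amazon
# Recent provider releases expose the sensor here; older ones used
# airflow.providers.amazon.aws.sensors.s3_key instead.
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_key = S3KeySensor(
    task_id='wait_for_key',
    bucket_key='s3://my-bucket/path/to/file.csv',  # hypothetical bucket/key
    aws_conn_id='aws_default',
    poke_interval=60,
)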

Related

Airflow: How to disable users from editing Variables/Connections in the GUI in a production environment

We want to prevent users from manually editing or adding variables/connections through the Airflow GUI in production. Currently, we are using JSON files which load all the connections and variables for Airflow.
Can experts guide me on how to achieve this?
Sergiy's right, you will need to start looking at Airflow's Role-Based Access Control (RBAC) documentation and implement from there. Here's a place to start: https://airflow.readthedocs.io/en/latest/howto/add-new-role.html
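As a rough sketch of where that leads: with the RBAC UI enabled, make sure regular users only ever get a read-only role such as Viewer, which has no edit permissions on Connections or Variables. The settings below live in webserver_config.py (the Flask AppBuilder config that Airflow's RBAC UI reads); whether Viewer or a custom role fits depends on what else your users need to do, so treat this as an assumption:

# webserver_config.py - Flask AppBuilder settings used by the RBAC UI.
# On Airflow 1.10.x, also set rbac = True under [webserver] in airflow.cfg.
from flask_appbuilder.security.manager import AUTH_DB

AUTH_TYPE = AUTH_DB

# Users who are auto-registered (e.g. via LDAP/OAuth) land in the read-only
# Viewer role instead of a role that can edit Connections/Variables.
AUTH_USER_REGISTRATION_ROLE = "Viewer"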

How to organize your projects and DAGs with Airflow

I am considering using Apache-Airflow. I had a look at the documentation and now I am trying to implement an already existing pipeline (home made framework) using Airflow.
All given examples are simple one-module DAGs. But in real life you can have a versioned application that provides (complex) pipeline blocks, and DAGs use those blocks as tasks. Basically the application package is installed in a dedicated virtual environment with its dependencies.
OK, so now how do you plug that into Airflow? Should Airflow be installed in the application virtualenv? Then there is a dedicated Airflow instance for this application's pipelines. But in this case, if you have 100 applications you have to have 100 Airflow instances... On the other side, if you have one unique instance it means you have installed all your application packages in the same environment, and potentially you can have dependency conflicts...
Is there something I am missing? Are there best practices? Do you know internet resources that may help? Or GitHub repos using one pattern or the other?
Thanks
One instance with 100 pipelines. Each pipeline can easily be versioned, and Python dependencies can be packaged.
We have 200+ very different pipelines and use one central Airflow instance. Folders are organized as follows:
DAGs/
DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
DAGs/pipeline_1/v1/dependencies/
DAGs/pipeline_1/v2/pipeline_1_dag_2.0.py
DAGs/pipeline_1/v2/dependencies/
DAGs/pipeline_2/v5/pipeline_2_dag_5.0.py
DAGs/pipeline_2/v5/dependencies/
DAGs/pipeline_2/v6/pipeline_2_dag_6.0.py
DAGs/pipeline_2/v6/dependencies/
etc.
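With a layout like this, each DAG file can put its own dependencies directory on sys.path so it imports its private, versioned helpers rather than anything installed in the shared Airflow environment. A minimal sketch of the top of one DAG file (the dag_id and the transforms module are hypothetical):

# Hypothetical top of DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
import os
import sys
from datetime import datetime

# Make this pipeline version's private dependencies importable without
# installing them into the shared Airflow environment.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'dependencies'))

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

import transforms  # hypothetical module living in dependencies/

with DAG('pipeline_1_v1', start_date=datetime(2021, 1, 1), schedule_interval='@daily') as dag:
    extract = PythonOperator(task_id='extract', python_callable=transforms.extract)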

Flyway - Multiple DBs with Central Migration

Our company has about 30+ applications written in different languages (Java, C#, Visual Basic, Node.js, etc.).
Our aim is to have development teams keep the database change SQL scripts in their repositories and run the migrations from Jenkins by starting pipelines with a version number. Development teams don't have access to the Jenkins configuration; they can only run jobs that we created and configured.
How should we go about this? Do we have to keep a different Flyway instance for each application? And how about pre-production and production stages?
Basically, how should we, as the DevOps team, maintain Flyway to run migrations for different applications across different stages, without the development teams doing the migration part themselves?
This should be possible with the Flyway CLI. You can tell Flyway where to look for migrations and how to connect to the database. See the docs about configuring the CLI.
You can configure Flyway in various ways - environment variables, command line arguments, and config files.
What you could do is allow each development team to specify a migrations directory and connection details for the Jenkins task. The task can then call the Flyway CLI, overriding the relevant config items via command line arguments. For example, the command line call to migrate a database:
flyway -url=jdbc:oracle:thin:@//<host>:<port>/<service> -locations=filesystem:<some-location> migrate
Or you could allow your devs to specify environment variables, or provide a custom config file.
You can reuse a single Flyway instance since the commands are essentially stateless. The only bit of environmental state they need comes from the config file, which you have complete control over.
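For instance, a Jenkins job could call a small wrapper script that assembles the Flyway command from per-application parameters. This is only a sketch: it assumes the Flyway CLI is on the PATH and that each team's job passes its own JDBC URL, credentials, and migrations directory (the script name and argument order are made up):

# migrate.py - hypothetical wrapper a Jenkins job could invoke per application.
import subprocess
import sys


def run_flyway_migrate(jdbc_url, user, password, migrations_dir):
    """Run 'flyway migrate' for one application, overriding config on the command line."""
    cmd = [
        'flyway',
        '-url={}'.format(jdbc_url),
        '-user={}'.format(user),
        '-password={}'.format(password),
        '-locations=filesystem:{}'.format(migrations_dir),
        'migrate',
    ]
    subprocess.check_call(cmd)


if __name__ == '__main__':
    # e.g. python migrate.py "jdbc:oracle:thin:@//db-host:1521/APP1" app1_user app1_pass ./sql
    run_flyway_migrate(*sys.argv[1:5])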
Hope that helps

IBM BPM 8.6 upgrade to IBM Business Automation Workflow is not working?

I have upgraded IBM BPM version 8.6.0 to IBM Business Automation Workflow V18.0.0.2 by following the documentation below.
IBM BPM upgrade to IBM Business Automation Workflow V18.0.0.2
From the above documentation I executed all the commands; only one command, createProcedure_ProcessServer.sql, was not successful, and I did not execute the optional commands.
Now, after doing all this, IBM BPM appears upgraded: I can see the Process Portal/Admin/Center login page name has changed, and the additional REST APIs for sharing "saved searches" and RPA tasks are available. But when I try to access the Case Builder, it gives me the error below.
You mentioned skipping the optional steps in the upgrade instructions when performing your upgrade.
However, several of the optional steps specifically mention that they are needed in order to enable the new case management functionality:
Optional: To use case management, follow these instructions to enable it.
Note: Steps 22-24 are about configuring case management. If you have a Db2 for z/OS database or an AdvancedOnly deployment environment, or you want to configure case management later, or you do not intend to use case management, skip these steps.
Thus, following the optional steps 22-24 would most likely solve your issue.
As Jan said,
BAW uses an additional database/schema, the CPEDB. So if you have upgraded, you must export the current Dmgr profile configuration, check whether it has the option to set the CPEDB database and credentials, fill that in, and then update the Dmgr profile.
If after the export you don't have the CPEDB options, open the sample config files, look for the differences, and add them to the exported file.

Airflow DataprocClusterCreateOperator

In Airflow DataprocClusterCreateOperator settings:
Is there a way to set the primary disk type for the master and workers to pd-ssd?
The default setting is standard.
I was looking into the documentation but I don't find any parameter for it.
Unfortunately, there is no option to change the Disk Type in DataprocClusterCreateOperator.
In the Google API it is available via the DiskConfig field: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#diskconfig
I will try to add this feature; it should be available in Airflow 1.10.1 or Airflow 2.0.
For now, you can create an Airflow plugin that modifies the current DataprocClusterCreateOperator.
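For example, the plugin could subclass the operator and patch the cluster spec it generates. This sketch assumes the _build_cluster_data() method from the contrib operator's 1.10.x source and the REST API's diskConfig.bootDiskType field, so verify both against your Airflow version:

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator


class SsdDataprocClusterCreateOperator(DataprocClusterCreateOperator):
    """DataprocClusterCreateOperator that forces pd-ssd boot disks."""

    def _build_cluster_data(self):
        # Build the cluster spec as usual, then override the disk type on
        # both node groups to match the REST API's diskConfig.bootDiskType.
        cluster_data = super(SsdDataprocClusterCreateOperator, self)._build_cluster_data()
        for node in ('masterConfig', 'workerConfig'):
            cluster_data['config'][node].setdefault('diskConfig', {})['bootDiskType'] = 'pd-ssd'
        return cluster_data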
There seem to be two fields in regard to this:
master_machine_type: Compute engine machine type to use for the master node
worker_machine_type: Compute engine machine type to use for the worker nodes
I found this simply by looking into the source code here (this is for the latest version; no version was provided, so I assumed the latest):
https://airflow.readthedocs.io/en/latest/_modules/airflow/contrib/operators/dataproc_operator.html
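If machine type is what you need to change, those two fields are plain constructor arguments on the operator. A minimal usage sketch inside a DAG definition (project, cluster name, and zone are placeholders):

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    project_id='my-gcp-project',        # placeholder
    cluster_name='analytics-cluster',   # placeholder
    num_workers=2,
    zone='europe-west1-b',              # placeholder
    master_machine_type='n1-highmem-8',
    worker_machine_type='n1-standard-4',
    dag=dag,  # assumes a surrounding DAG object named `dag`
)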
