Triggering a Databricks job in one region from an Airflow job running in another region

We plan to have one common AWS MWAA cluster in the us-west region that triggers Databricks jobs in different regions.
Is there a way to trigger a Databricks job in one region from an Airflow job running in another region? I checked the Databricks connection documentation (linked here), but it does not list any region parameters. How can this be achieved?

The Airflow operator doesn't know anything about regions - it just needs the URL of the workspace and a personal access token. That workspace could be in the same region or in another one - it doesn't matter to the Databricks operator.
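For illustration, here is a minimal sketch of such a DAG, assuming the apache-airflow-providers-databricks package is installed; the connection ID, workspace URL, and job ID below are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# "databricks_eu_west" is a hypothetical Airflow connection whose host is the
# workspace URL (e.g. https://<workspace>.cloud.databricks.com) and whose
# password is a personal access token for that workspace. The MWAA environment
# itself can run in any other region.
with DAG(
    dag_id="trigger_remote_databricks_job",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    DatabricksRunNowOperator(
        task_id="run_remote_job",
        databricks_conn_id="databricks_eu_west",  # points at the remote workspace
        job_id=12345,  # placeholder ID of an existing job in that workspace
    )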

Related

Multi-tenant Airflow - access control and secrets management

Any recommendations on how to approach secrets management on a multi-tenant Airflow 2.0 instance?
We are aware of the alternative secrets backend, which is searched before environment variables and the metastore and is configured via airflow.cfg.
But how do we ensure security around this in a multi-tenant environment, e.g. how can we restrict users' and their DAGs' access to secrets/connections? Is it possible to use access control to restrict this?
We're envisaging putting connection IDs inside DAGs and, from what we understand, anybody who knows them will be able to access the connection and extract its secrets, as long as they can create DAGs of their own. How can we prevent this?
There is currently (Airflow 2.1) no way to prevent anyone who can write DAGs from accessing anything in the instance. Airflow does not (yet) have a true multi-tenant setup that provides this kind of isolation. This is in the works, but it will likely not arrive (fully) until Airflow 3; elements of it will appear in Airflow 2 in the coming months, so over time you will be able to configure more and more isolation if you want.
For now, Airflow 2 introduced partial isolation compared to 1.10:
Parsing the DAGs is separated from the scheduler, so erroneous/malicious DAGs cannot impact the scheduling process directly.
The webserver no longer executes DAG code at all.
Currently, whoever writes DAGs can:
access the Airflow DB directly and do anything in the database (including dropping the whole database)
read any configuration variables, connections, and secrets
dynamically change the definition of any DAGs/tasks running in Airflow by manipulating the DB
And there is no way to prevent it (by design).
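To illustrate the second point: any task code in a DAG can read an arbitrary connection, including its secrets, through Airflow's own hook API. A minimal sketch, with a hypothetical connection ID:

from airflow.hooks.base import BaseHook  # Airflow 2.x

# "other_teams_conn" is a hypothetical connection ID belonging to another
# tenant; nothing stops a DAG author's task code from reading it.
conn = BaseHook.get_connection("other_teams_conn")
print(conn.login, conn.password)  # the secret is now exposed to this task's code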
All of these are planned to be addressed in the coming months.
This basically means that you have to have a certain level of trust in the users who write the DAGs. Full isolation cannot be achieved, so you should rely on code reviews of DAGs submitted to production to prevent any kind of abuse (much as with any code submitted by developers to your code base).
The only "true" isolation you can currently achieve is by deploying several Airflow instances, each with its own database, scheduler, and webserver. This is actually not as bad as it sounds if you have a Kubernetes cluster and use the official Airflow Helm chart (https://airflow.apache.org/docs/helm-chart/stable/index.html). You can easily create several Airflow instances, each in a different namespace and each using its own database schema (so you can still use a single database server, but each instance will have to have its own separate schema). Each Airflow instance will then have its own workers, which can use different authentication (either via connections or via other mechanisms).
You can even provide common authentication mechanisms - for example, you could put Keycloak in front of Airflow and integrate OAuth/LDAP authentication with your common auth approach for all such instances (and, for example, authorize different groups of employees for different instances).
This gives you nice multi-tenant manageability and some level of resource re-use (database, K8s cluster nodes), and if you have - for example - Terraform scripts to manage your infrastructure, adding and removing tenants can be made quite easy. The isolation between tenants is even better, because you can separately manage the resources used by each tenant (number of workers, schedulers, etc.).
If you are serious about isolation and multi-tenant management, I heartily recommend that approach. Even when Airflow 3 achieves full isolation, you will still have to manage the "resource" isolation between tenants, and having multiple Airflow instances is one way that makes this very easy (so it will remain a valid and recommended way of implementing multi-tenancy in some scenarios).

From the cluster, we cannot download a Python wheel from the storage account

1) We upload a Python wheel to the storage account associated with the workspace successfully.
2) In the second step, we submit an experiment that runs on the cluster and needs to download and run the package from step 1.
The experiment is able to download the package and run when the storage account is not associated with any VNet. However, when we associate the storage account with VNets, the experiment hangs and eventually fails. This storage account is in two VNets, a and b. The cluster is also in VNet a.
I don't know why the cluster cannot download the wheel package when the storage account is in a VNet. It is our policy to have storage accounts in VNets.
Is there something else we are missing? I also checked the container registry setting and it is set to 'allow from everywhere' (we are using the standard SKU).
Let me know if any further information is required. Thanks.

Is there an API call that lists all currently running airflow jobs?

I would like to know if there is a simple API call I can make to list all currently running airflow jobs.
In the airflow flower dashboard, there is a column that lists all currently active jobs. I'd like to know if I can obtain this information via an API call.
In short: no. Airflow does have an experimental REST API, but there is no endpoint for the call you're after. See https://github.com/apache/airflow/blob/98e852219fc73c7ec049feeab7305bc7c0e89698/airflow/api/client/json_client.py#L26 for a list of the supported endpoints (some of them are also mentioned in the official documentation: https://airflow.apache.org/api.html#endpoints).
As far as I understand, a proper non-experimental REST API is on the roadmap for Airflow 2.
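For reference, once that stable REST API lands in Airflow 2, listing currently running DAG runs should look roughly like the sketch below (the host, credentials, and enabled auth backend are assumptions):

import requests

# "~" stands for "all DAGs"; the state filter keeps only running DAG runs.
resp = requests.get(
    "http://localhost:8080/api/v1/dags/~/dagRuns",
    params={"state": "running"},
    auth=("admin", "admin"),  # assumes the basic_auth API backend is enabled
)
resp.raise_for_status()
for run in resp.json()["dag_runs"]:
    print(run["dag_id"], run["dag_run_id"], run["state"])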

Cloud Firestore - selecting region to store data?

I'm working on a product which for legal reasons needs to store user data in a specific region.
I'm using Firebase, so I created a project selecting the region it needs to be in. However, looking at Firestore, where the user data is kept, I can't find anything pinpointing the region where the data actually resides. What worries me the most is that the Cloud Functions endpoints start with us-central1, but that could simply be because Cloud Functions doesn't exist in the specified region.
Given that this is an important matter, is there a way to confirm the location of the data and even force it to be in a specific region?
Update
Cloud Firestore now supports a number of regional GCP resource locations in addition to the two multi-region locations (nam5, eur3); see the documentation for an up-to-date list.
Original Answer
Cloud Firestore is currently only available in our US multi-region (Iowa, Oklahoma, South Carolina). As we approach GA we plan to roll it out to multiple locations across the globe and you'll be able to select which one at creation time. Not something you can do today though.

How to choose a zone for Google Cloud Shell Session?

Since the echo-back of typed characters is very slow for me in Japan, it seems that Google Cloud Shell session instances are in some US region.
How do I change the zone of an instance?
Cloud Shell is globally distributed across multiple GCP regions. When a user connects for the first time, the system assigns them to the geographically closest region that can accommodate new users. While users cannot manually choose their region, the system does its best to pick the closest region Cloud Shell operates in. If Cloud Shell does not initially pick the closest region, or if the user later connects from a location that is geographically closer to a different region, Cloud Shell will migrate the user to a closer region when the session ends.
Since Cloud Shell runs as a GCP VM, you can view your current zone by running
curl -H "Metadata-Flavor: Google" metadata/computeMetadata/v1/instance/zone
inside the Cloud Shell session. This is documented in https://cloud.google.com/compute/docs/storing-retrieving-metadata.
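The same lookup can be done from code running inside the session; a minimal Python sketch (the metadata host only resolves from inside GCP):

import requests

# Query the GCE metadata server; the response looks like
# "projects/<project-number>/zones/<zone>".
resp = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/instance/zone",
    headers={"Metadata-Flavor": "Google"},
)
resp.raise_for_status()
print(resp.text.rsplit("/", 1)[-1])  # just the zone name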
