I cannot find any information in the docs on what the owner variable is used for. The only mentions of owner I found were in this Stack Overflow question and in the Security section of the docs, neither of which helps in understanding the general concept.
As far as I can see in the Airflow repository, it's just information. It's shown as a column in the main DAG view in Airflow, and if you click on that name it will show you all the DAGs from that owner (at least in Airflow 2).
The origins of owner lie in legacy Airflow (v1).
Basically, it was used to denote the user on the Linux machine. Then, in the pre-RBAC days, it could be used to restrict access so that DAGs ran only for their owners.
Now that access control is handled entirely through RBAC and user signups, there is no need for the owner field. It only serves the cosmetic (albeit important) purpose of identifying the DAG owner.
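For illustration, a minimal sketch of how the field is typically set in a DAG file (the owner value "data-team" is a made-up example; nothing validates it against real users):

# A minimal sketch: owner is just a string attached to the DAG's tasks.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="owner_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    default_args={"owner": "data-team"},  # shown in the Owner column of the UI
) as dag:
    DummyOperator(task_id="noop")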
Related
Any recommendations on how to approach secrets management on a multi-tenant Airflow 2.0 instance?
We are aware of the alternative secrets backend, which is checked before environment variables and the metastore and is configured via airflow.cfg.
But how do we ensure security around this in a multi-tenant environment, e.g. how can we restrict users and their DAGs' access to secrets/connections? Is it possible to use access control to restrict this?
We're envisaging putting connection IDs inside DAGs and, from what we understand, anybody who knows them will be able to access the Connection and extract secrets, as long as they are able to create DAGs of their own. How can we prevent this?
There is currently (Airflow 2.1) no way to prevent anyone who can write DAGs from accessing anything in the instance. Airflow does not (yet) have a true multi-tenant setup that provides this kind of isolation. This is in the works, but it will likely not arrive (fully) until Airflow 3; elements of it will appear in Airflow 2 in the coming months, so you will gradually be able to configure more and more isolation if you want it.
For now, Airflow 2 introduced partial isolation compared to 1.10:
Parsing of the DAGs is separated from the Scheduler, so erroneous/malicious DAGs cannot impact the scheduling process directly.
The Webserver no longer executes DAG code at all.
Currently, whoever writes DAGs can:
access the Airflow DB directly and do anything in the database (including dropping the whole database)
read any configuration, Variables, Connections, and secrets (see the sketch below)
dynamically change the definition of any DAGs/tasks running in Airflow by manipulating the DB
And there is no way to prevent it (by design).
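To make the second point above concrete, here is a minimal sketch of what code in any DAG file can do (the connection and variable ids are hypothetical):

# A minimal sketch: DAG code runs with full access to Airflow's resources,
# so it can read any connection or variable, with no restriction.
from airflow.hooks.base import BaseHook
from airflow.models import Variable

conn = BaseHook.get_connection("shared_db")   # hypothetical conn id
print(conn.password)                          # decrypted password
print(Variable.get("some_variable"))          # any Variable's value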
All of those are planned to be addressed in the coming months.
This basically means that you have to have a certain level of trust in the users who write the DAGs. Since full isolation cannot be achieved, you should rely on code reviews of DAGs submitted to production to prevent any kind of abuse (much like with any code submitted by developers to your code base).
The only "true" isolation currently you can achieve by deploying several Airlfow instances - each with own database, scheduler, webserver. This is actually not as bad as it seems - if you have Kubernetes Cluster and use the official Helm Chart of Airflow https://airflow.apache.org/docs/helm-chart/stable/index.html. You can easily create several Airflow instances - each in a different namespace, and each using their own database schema (so you can still use single Database Server, but each instance will have to have their own separate schema). Each airflow instance will then have their own workers which can have different authentication (either via connections or via other mechanisms).
You can even provide common authentication mechanisms: for example, you could put Keycloak in front of Airflow and integrate OAuth/LDAP authentication with your common auth approach for all such instances (and, for example, authorize different groups of employees for different instances).
This provides nice multi-tenant manageability and some level of resource re-use (database, K8s cluster nodes), and if you have, for example, Terraform scripts to manage your infrastructure, adding and removing tenants can be made quite easy. The isolation between tenants is even better, because you can separately manage the resources used (number of workers, schedulers, etc.) for each tenant.
If you are serious about isolation and multi-tenant management, I heartily recommend that approach. Even when Airflow 3 achieves full isolation, you will still have to manage the "resource" isolation between tenants, and having multiple Airflow instances is one way that makes this very easy (so it will remain a valid and recommended way of implementing multi-tenancy in some scenarios).
How can I restrict the DAGs that are related to a particular owner in Airflow 2.0.1?
I have enabled the filter_by_owner option in my webserver configuration.
That configuration item was removed and it's not available in Airflow 2.
As mentioned here:
As part of this change, a few configuration items in [webserver] section are removed and no longer applicable, including authenticate, filter_by_owner, owner_mode, and rbac.
You can implement this using LDAP authentication with groups and permissions. You can find all the details in this video (I included the specific time where the role is configured).
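If you manage roles yourself (via LDAP or otherwise), Airflow 2 also supports per-DAG permissions through the DAG's access_control argument; a minimal sketch, assuming a role named "team-a" already exists in your setup:

# A minimal sketch: map an existing role to DAG-level permissions so only
# members of that role can see or edit this DAG. "team-a" is hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="team_a_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    access_control={"team-a": {"can_read", "can_edit"}},
) as dag:
    DummyOperator(task_id="noop")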
Airflow version: 2.1.0
I set FERNET_KEY and checked that the login/password fields are encrypted when I add connections via the Web UI.
However, when I add a connection via CLI:
airflow connections add 'site_account' \
--conn-type 'login' \
--conn-login 'my_user_id' \
--conn-password 'wowpassword'
And when I run airflow connections list, it shows everything as raw values (not encrypted at all).
I think this could be dangerous if I manage all connections using CLI commands. (I want to make my Airflow infrastructure restorable; that's why I tried to use the CLI to manage connections.)
How can I solve this?
Airflow decrypts the connection passwords while processing your CLI commands.
You can use airflow connections list --output yaml to see whether your record was actually encrypted in the database or not.
Furthermore, if you are able to access the CLI, you are also able to access the config, meaning you can always extract the database connection and fernet_key and recover the full password on your own.
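If you want to verify directly what is stored at rest, a minimal sketch that queries the metadata DB (it assumes it runs inside an initialized Airflow environment and peeks at the private _password column purely for illustration):

# A minimal sketch: check whether a connection's password is actually
# encrypted at rest. "site_account" matches the conn id added above.
from airflow.models import Connection
from airflow.settings import Session

session = Session()
conn = (
    session.query(Connection)
    .filter(Connection.conn_id == "site_account")
    .one()
)
print(conn.is_encrypted)  # True when a valid Fernet key was configured
print(conn._password)     # raw DB value: ciphertext if encrypted
print(conn.password)      # decrypted value, which is what the CLI prints
session.close()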
Jorrick's answer is correct; however, I want to elaborate on the background, as I feel it will bridge the gap between the question and the answer.
It's very understandable that Airflow needs to be able to decrypt the connection when a DAG or user asks it to. This is needed for normal operation of the app, so Airflow must assume that a user who can author DAGs is permitted to utilize the system resources (Connections, Variables).
The security measures are on a different level. If you utilize them (using Fernet), Airflow will encrypt sensitive information (like connection passwords), which means that in the database itself the value is encrypted. The security concerns here are: where is the fernet_key stored? Is it rotated? etc.
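As an illustration of what encryption at rest means here, a minimal sketch using the same cryptography library Airflow relies on:

# A minimal sketch of Fernet at rest: only the ciphertext reaches the DB
# column, and anyone holding the key can decrypt it back.
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # in Airflow, this is the fernet_key setting
f = Fernet(key)

token = f.encrypt(b"wowpassword")
print(token)             # what gets stored in the database
print(f.decrypt(token))  # b'wowpassword': recoverable with the key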
There are many other security layers that handle different aspects, like access control and hiding sensitive information in the UI, but that's a different topic.
I think the important thing to understand is that security handles two types of users:
A user who is permitted, but whose actions or visibility you want to limit. (This is more of what Airflow itself handles; see the security docs.)
A user who is malicious and wants to do damage. While Airflow does provide some features in that area, this is more a matter of where you set up Airflow and how well you protect it (IP allow-lists, etc.).
Keep in mind that if a malicious user has gained access to the Airflow server, there is little you can do about it. This user can simply use their admin privileges to do anything. This is no different from a user who has hacked into any other server that you own.
I am trying to integrate accounts functionality in my CorDapp and was going through the supply chain demo https://github.com/corda/accounts-demo-supplychain
Here is a list of my queries:
What is the purpose of the AccountBroadcast flow? It is not mentioned in the readme file. https://github.com/corda/accounts-demo-supplychain/blob/master/workflows/src/main/kotlin/com/accounts_SupplyChain/flows/AccountBroadcast.kt
Is the purpose of the Share Account flow only that the counterparty nodes know the account identity? What if I don't want to use it?
Since an account is a sub-vault of the node's vault, that data is visible to that node, right?
To share account info with other nodes in the zone. You need to do this so that they know which account belongs to which node.
Same as above. You don't have to use it.
Yes.
The Conduit API in Phabricator does not support setting the authorPHID parameter when calling maniphest.createtask. I imagine this is for security or some other logical reason.
But I am developing my own frontend for Maniphest where users (logged in through Phabricator, so they are Phabricator users and have a PHID) will add and edit tasks. What I need is that when a user creates a task, they are also the author of the task.
But the problem is that I can't connect to Conduit as any user other than "apibot", because I don't have the other users' certificates in my frontend. But if I log in as "apibot", then "apibot" is set as the author of the task.
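For context, a minimal sketch of the call in question, written against the modern token-based Conduit auth rather than the certificate flow described above (the instance URL and token are hypothetical); the authorship problem is the same either way:

# A minimal sketch: whichever credentials authenticate the request
# ("apibot" here) become the task's author; authorPHID cannot be set.
import requests

PHAB_URL = "https://phab.example.com"        # hypothetical instance URL
API_TOKEN = "api-apibot-token-goes-here"     # apibot's Conduit token

resp = requests.post(
    f"{PHAB_URL}/api/maniphest.createtask",
    data={
        "api.token": API_TOKEN,
        "title": "Task from my custom frontend",
        "description": "Author will be apibot, not the end user.",
    },
)
print(resp.json())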
Three possible solutions came to my mind:
1. Retrieve certificates directly from Phabricator's database.
2. Keep a list of certificates in some file in my frontend and update it manually every time somebody registers.
I guess none of them are really smart...
The third solution would be nice, but I haven't found a way to do it:
3. Log in as "apibot", get the certificate of userXY, and then log in as userXY.
What would you suggest?