Any recommendations on how to approach secrets management on a multi-tenant Airflow 2.0 instance?
We are aware that an alternative secrets backend, configured via airflow.cfg, can be used, and that it is checked before environment variables and the metastore.
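(For reference, this is the kind of configuration we mean; the AWS Secrets Manager backend and the prefixes below are just an example:)

```ini
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerSecretsBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```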
But how do we ensure security around this in a multi-tenant environment, e.g. how can we restrict users and their DAGs' access to secrets/connections? Is it possible to use access control to restrict this?
We're envisaging putting Connection Ids inside DAGs and, from what we understand, anybody who knows a Connection Id and can create DAGs of their own will be able to access the Connection and extract its secrets. How can we prevent this?
There is currently (as of Airflow 2.1) no way to prevent anyone who can write DAGs from accessing anything in the instance. Airflow does not (yet) have a true multi-tenant setup that provides this kind of isolation. This is in the works, but it will likely not arrive (fully) until Airflow 3; elements of it will appear in Airflow 2 in the coming months, so you will gradually be able to configure more and more isolation if you want.
For now, Airflow 2 has introduced partial isolation compared to 1.10:
Parsing the DAGs is separated from the scheduler, so erroneous/malicious DAGs cannot directly impact the scheduling process.
The webserver no longer executes DAG code at all.
Currently, whoever writes DAGs can:
access the Airflow DB directly and do anything in it (including dropping the whole database)
read any configuration variables, connections, and secrets
dynamically change the definition of any DAGs/tasks running in Airflow by manipulating the DB
And there is no way to prevent it (by design).
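To make the connection concern from the question concrete: any DAG author can fetch any connection by id from task code, so knowing (or guessing) an id is enough to read its secrets. A minimal sketch (the connection id here is hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.python import PythonOperator

def leak_secret():
    # Any connection defined in the instance (or its secrets backend) can be
    # fetched by id from task code; nothing restricts which DAG may read it.
    conn = BaseHook.get_connection("some_tenants_db")  # hypothetical id
    print(conn.login, conn.password)                   # secrets end up in task logs

with DAG("read_any_connection", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    PythonOperator(task_id="leak", python_callable=leak_secret)
```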
All of these are planned to be addressed in the coming months.
This basically means that you have to have a certain level of trust in the users who write the DAGs. Since full isolation cannot be achieved, you should rely on code reviews of DAGs submitted to production to prevent any kind of abuse (much as you would for any code submitted by developers to your code base).
The only "true" isolation you can currently achieve is by deploying several Airflow instances, each with its own database, scheduler, and webserver. This is actually not as bad as it seems if you have a Kubernetes cluster and use the official Helm chart for Airflow: https://airflow.apache.org/docs/helm-chart/stable/index.html. You can easily create several Airflow instances, each in a different namespace and each using its own database schema (so you can still use a single database server, but each instance must have its own separate schema). Each Airflow instance will then have its own workers, which can have different authentication (either via connections or via other mechanisms).
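As a rough sketch of what that looks like with the chart (the release names, namespaces, and database values below are made up; each release disables the bundled Postgres and points at its own schema on a shared server):

```bash
# Add the official chart repo (one-time setup).
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# One release per tenant, each in its own namespace, each pointing at its own
# database on a shared Postgres server. All values here are illustrative.
helm install airflow-tenant-a apache-airflow/airflow \
  --namespace tenant-a --create-namespace \
  --set postgresql.enabled=false \
  --set data.metadataConnection.host=shared-postgres.internal \
  --set data.metadataConnection.db=airflow_tenant_a
```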
You can even provide common authentication mechanisms - for example, you could put Keycloak in front of Airflow and integrate OAuth/LDAP authentication with your common auth approach for all such instances (and, for example, authorize different groups of employees for different instances).
This provides nice multi-tenant manageability and some level of resource re-use (database, K8S cluster nodes), and if you have - for example - Terraform scripts to manage your infrastructure, it can be made easily manageable so that you can add/remove tenants simply. The isolation between tenants is even better, because you can separately manage the resources used (number of workers, schedulers, etc.) for each tenant.
If you are serious about isolation and multi-tenant management, I heartily recommend that approach. Even when Airflow 3 achieves full isolation, you will still have to manage "resource" isolation between tenants, and having multiple Airflow instances is one way that makes that very easy (so it will also remain a valid and recommended way of implementing multi-tenancy in some scenarios).
Related
In a production environment, is it better to have dedicated RDS instances for Artifactory and Xray? Or is it okay to have a single RDS instance for both Artifactory and Xray?
It seems like Xray uses a lot of resources during the initial DB sync, indexing of artifacts, and when generating reports. I'm not sure how big a performance impact this has on Artifactory, so I would love to hear about other users' experiences.
In general, this depends on your organization's needs. It is essential to consider that there is continuous communication between Artifactory and the database which, in most cases, results in high database utilization. Hence, it is recommended to have dedicated database instances for the Artifactory and Xray applications. This also avoids a single point of failure for both applications (in case the database instance has issues).
If there is a requirement to use only a shared database instance, please make sure that it can handle heavy load and tune it accordingly for both applications.
I’m setting up a service to allow my clients to create and manage their cloud resources on my OpenStack setup. The network requirements of some clients are reasonably complex.
The trouble I’m having is deciding how to manage the resources. OpenStack provides APIs that allow me to CRUD each individual component as I need to. Yet there are also the stack create/update methods, which allow me to create the network/VMs/routers/rules all at once. The barrier to the second method, as I see it, is that I will have to maintain an increasingly complex template file as the client’s network and number of VMs grow. Yet it has the benefit of being only one API call for every change, compared to potentially 50 or so for a large environment.
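For reference, the stack-based approach I mean would look roughly like this with the openstacksdk client (the cloud, stack, and template names are made up):

```python
# A minimal sketch using openstacksdk: one call creates the whole environment
# from a Heat template, and one call updates it later.
import openstack

conn = openstack.connect(cloud="mycloud")  # cloud name from clouds.yaml (illustrative)

# Create the client's network/VMs/routers/rules in a single API call.
stack = conn.create_stack(
    "client-acme-env",
    template_file="acme_network.yaml",  # hypothetical template path
    rollback=True,
    wait=True,
)

# Growing the environment later is one update call against the revised template.
conn.update_stack(
    stack.id,
    template_file="acme_network.yaml",
    rollback=True,
    wait=True,
)
```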
Is there a preferred method for handling this scenario?
I am working with an e-commerce platform, and I have a task to synchronize with some remote accounting software. The task requires syncing orders, products, inventory, etc. With large amounts of data being synced, the process can take a while, so I don't think an ASP.NET application would be the best place to handle this. The requirements are:
To be able to schedule this process to run overnight
To be able to manually fire off this process and pass into it some variables like order numbers to export.
Possibly get back status info when fired off manually.
Has to work on .NET 3.5
Issues: Can't use a Windows service because the site is hosted remotely on a shared service, and the host won't allow a service.
Ideas: I'm having a really hard time finding the best way to handle this outside ASP.NET that fits all the requirements. I do have access to their FTP, and thought a console app that hosts a web service might work; I could put the Quartz scheduler in the global file to fire off the service from the site.
Anyway, please share any thoughts and experiences you have on which methods have worked for you.
Can't use a Windows service because the site is hosted remotely on a shared service, and the host won't allow a service.
That might be a problem. Does this hosting service provide any other kind of scheduling functionality? If not, you may need to consider changing hosting services.
You're correct in that ASP.NET is not the tool you'd use for scheduling tasks. A web application is a request/response system (and is very much at the mercy of the hosting process, usually IIS for ASP.NET). So you need some way to schedule the task to execute at regular intervals: a Windows service, Windows Task Scheduler, or some other task scheduling tool.
As for the requirement to be able to invoke the process manually, that's a simple matter of separating the invocation of the logic from the logic itself. Picture the following components:
A module which performs the logic, not bound to any UI or any way of invoking it. Basically a Class Library project (or part of one).
A Windows Service or Console Application which references the Class Library and invokes the logic.
A Web Application which references the Class Library and invokes the logic.
Once you've sorted out how to schedule the Console Application, you're all set. If the process returns some information, the Console Application can also perform any notifications necessary to inform people of that information.
The Web Application can then also have an interface somewhere to invoke the process manually. Since the process "can take a while", you of course won't want the interface to wait for it to complete; that can result in timeouts and leave the system in an unknown state. Instead, you'd want to return the UI to the user indicating that the process has started (or been queued) and that they will be notified of the results when it completes. There are a couple of options for this...
You can use a BackgroundWorker to actually invoke the process. When the process completes, send a notification to the user who invoked it.
You can write a record to a database table to "queue" the process and have something like a Windows Service or scheduled Console Application (same scenario as above) which regularly polls that table for queued tasks, performs the task, and sends the notification. (Of course updating the status in the table along the way so it doesn't perform it twice.)
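That second option is straightforward to sketch. Below is a minimal illustration of the queue-table pattern (in Python purely for brevity; the question targets .NET 3.5, and the table and function names are made up): the web app inserts a "queued" row and returns immediately, and the scheduled task claims and processes pending rows, updating the status as it goes.

```python
import sqlite3

def do_sync(params):
    print("syncing orders:", params)  # placeholder for the shared class-library logic

def enqueue(db, order_numbers):
    # Called from the web app: record the request and return immediately.
    db.execute("INSERT INTO sync_jobs (params, status) VALUES (?, 'queued')",
               (",".join(order_numbers),))
    db.commit()

def run_pending(db):
    # Called from the scheduled task: claim one job at a time, updating the
    # status along the way so nothing is processed twice.
    while True:
        row = db.execute("SELECT id, params FROM sync_jobs "
                         "WHERE status = 'queued' LIMIT 1").fetchone()
        if row is None:
            break
        job_id, params = row
        db.execute("UPDATE sync_jobs SET status = 'running' WHERE id = ?", (job_id,))
        db.commit()
        do_sync(params)
        db.execute("UPDATE sync_jobs SET status = 'done' WHERE id = ?", (job_id,))
        db.commit()

if __name__ == "__main__":
    db = sqlite3.connect("queue.db")
    db.execute("CREATE TABLE IF NOT EXISTS sync_jobs "
               "(id INTEGER PRIMARY KEY, params TEXT, status TEXT)")
    enqueue(db, ["1001", "1002"])
    run_pending(db)
```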
There are pros and cons either way; it's really up to you how you'd like to proceed. Ultimately you're looking at two main things here:
Separate the logic itself from the scheduling/invocation of the logic.
Utilize a scheduling system to schedule tasks. (If your hosting provider doesn't have one, find one that does.)
I am planning to develop a fairly small SaaS service. Every business client will have an associated database (same schema among clients' databases, different data). In addition, they will have a unique domain pointing to the web app, and here I see these 2 options:
1. The domains will point to a unique web app, which will change the connection string to the proper client's database depending on the domain. (That is, I will need to deploy one web app only.)
2. The domains will point to their own web app, which is really the same web app replicated for every client but with the proper connection string to the client's database. (That is, I will need to deploy many web apps.)
This is for an ASP.NET 4.0 MVC 3.0 web app that will run on IIS 7.0. It will be fairly small, but I do require it to be scalable. Should I go with 1 or 2?
This MSDN article is a great resource that goes into detail about the advantages of three patterns:
Separated DB. Each app instance has its own DB instance. Easier, but can be difficult to administer from a database infrastructure standpoint.
Separated schema. Each app instance shares a DB but is partitioned via schemas. Requires a little more coding work, and mitigates some of the challenges of totally separate databases, but still has difficulties if you need individual site backup/restore and things like that.
Shared schema. Your app is responsible for partitioning the data based on the app instance. This requires the most work, but is most flexible in terms of management of the data.
In terms of how your app handles it, the DB design will probably determine that. I have in the past done both the separated DB and shared schema approaches. In the separated DB approach, I usually separate the app instances as well. In the shared schema approach, it's the same app, with logic to modify what data is available based on login and/or hostname.
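As a minimal illustration of that shared-schema logic (sketched in Python for brevity, although the question is about ASP.NET; all names are made up): resolve the tenant from the request's hostname, then scope every query to it.

```python
# `db` below is any DB-API style connection (e.g. sqlite3); the hostnames and
# tenant ids are hypothetical.
TENANTS = {"acme.example.com": 1, "globex.example.com": 2}

def tenant_id_for_host(host):
    # One app instance, many domains: the request's Host header picks the tenant.
    return TENANTS[host.lower()]

def get_orders(db, host):
    # Centralize the tenant filter in one data-access layer; forgetting the
    # tenant_id predicate is the classic shared-schema bug.
    tid = tenant_id_for_host(host)
    return db.execute(
        "SELECT id, total FROM orders WHERE tenant_id = ?", (tid,)
    ).fetchall()
```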
I'm not sure this is the answer you're looking for, but there is a third option:
Using a multi-tenant database design. A single database which supports all clients. Your tables would contain composite primary keys.
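For illustration, such a table might look like this (column names are made up); the tenant id is part of every key, so each row is scoped to a client:

```sql
CREATE TABLE orders (
    tenant_id INT NOT NULL,
    order_id  INT NOT NULL,
    total     DECIMAL(10, 2),
    PRIMARY KEY (tenant_id, order_id)
);
```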
Scale out when you need. If your service is small, I wouldn't see any benefit to multiple databases except for assured data security - meaning, you'll only bring back query results for the correct client. The costs will be much higher running multiple databases if you're planning on hosting with a cloud service.
If Salesforce can host their SaaS using a multi-tenant design, I would at least consider this a viable option for a small service.
I am designing a multi-tenant system and am considering sharding by tenant at the application layer instead of the database layer.
Hypothetically, the way this should work is that for each incoming request, a router process uses a global collection of tenants, containing their primary attributes, to determine the tenant for the request as well as its virtual shard id. This virtual shard id is further mapped to an actual shard.
The actual shard contains both the application code and the whole of that tenant's data. These shards would be LNMP (Linux, Nginx, MySQL/MongoDB, PHP) servers.
The router process should act as a proxy. It should be able to run some code to determine the target shard for an incoming request, based on the collection stored in some local db or files. To scale this better, I am considering making the shards themselves act as routers too, so that they can run a reverse proxy that forwards the request to the appropriate shard. Maybe the nginx instance running on a shard can also act as that reverse proxy. But how will it execute the application logic needed to match up the request with the appropriate shard?
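Concretely, the lookup I have in mind is something like the following (a minimal sketch; all names are made up). Perhaps nginx could run it via a small embedded-script hook or a generated map file:

```python
# tenant attribute -> virtual shard id -> physical shard address
TENANT_TO_VSHARD = {"acme": 7, "globex": 12}        # from the global tenant collection
VSHARD_TO_SHARD = {7: "shard-a.internal:8080",      # virtual -> actual mapping,
                   12: "shard-b.internal:8080"}     # updated when tenants move

def upstream_for(host_header):
    # e.g. "acme.example.com" -> "acme"; any primary attribute would do.
    tenant = host_header.split(".")[0].lower()
    return VSHARD_TO_SHARD[TENANT_TO_VSHARD[tenant]]

print(upstream_for("acme.example.com"))  # -> shard-a.internal:8080
```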
I would appreciate any ideas and suggestions for this router implementation.
Thanks
Another option would be to use a product such as dbShards. dbShards is the only sharding product that shards at the application level. This way you can use any RDBMS (Postgres, MySQL, etc.) and still be able to shard your database without having to put some kind of proxy in between. A lot of other sharding products rely on a proxy to point transactions to the correct shard, but dbShards knows where to go without having to "ask" anyone else.
Great product. dbshards
Unless you expect your tenants to generate approximately equal data volume, sharding by tenant will not be very efficient.
As to application-level sharding in general, let me share my own experience:
Version 1 of our high-volume SaaS product sharded at the application level. You will find that resharding as you grow is a major headache if you shard a SQL-type solution at the application level, unless you write significant tooling to automate the process.
We switched to MongoDB (after considering multiple alternatives including Cassandra) in no small part because of all of the built-in support for resharding / rebalancing as data grows.
If your application does not need the relational capabilities of MySQL, and you expect more than modest data growth, I would suggest concentrating your efforts on MongoDB (since you have already identified it as a possible data platform). Allow MongoDB to handle the data sharding.
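For reference, enabling that in MongoDB amounts to a couple of admin commands; a sketch via pymongo (the host, database, and key names are illustrative, and the compound key helps with the uneven-tenant issue mentioned above):

```python
from pymongo import MongoClient

# Connect through a mongos router in a sharded cluster (host is illustrative).
client = MongoClient("mongodb://mongos.internal:27017")

# Shard the collection on a compound key: tenant first for locality, plus a
# second field so one very large tenant can still be split across chunks.
client.admin.command("enableSharding", "saas")
client.admin.command("shardCollection", "saas.orders",
                     key={"tenant_id": 1, "_id": 1})
```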