Difference between run_as_user vs default_impersonation in airflow - airflow

I need some clarification on the configuration of run_as_user vs default_impersonation in airflow. When is one needed vs the other? If I don't specify both, what will happen?

Related

Apache Airflow - Multiple deployment environments

When handling multiple environments (such as Dev/Staging/Prod), having separate (preferably identical) Airflow instances for each environment would be the best-case scenario.
I'm using the GCP managed Airflow (GCP Cloud Composer), which is not very cheap to run, and having multiple instances would increase our monthly bill significantly.
So, I'd like to know if anyone has recommendations on using a single Airflow instance to handle multiple environments.
One approach I was considering was to have separate top-level folders within my dags folder corresponding to each environment (i.e. dags/dev, dags/prod etc.)
and copy my DAG scripts to the relevant folder through the CI/CD pipeline.
So, within my source code repository if my dag looks like:
airflow_dags/my_dag_A.py
During the CI stage, I could have a build step that creates 2 separate versions of this file:
airflow_dags/DEV/my_dag_A.py
airflow_dags/PROD/my_dag_A.py
I would follow a strict naming convention for my DAGs, Airflow Variables etc. to reflect the environment name, so that the above build step can automatically rename them accordingly.
I wanted to check whether this is an approach others have used. Or are there any better/alternative suggestions?
Please let me know if any additional clarifications are needed.
Thank you in advance for your support. Highly appreciated.
I think a shared environment can be a good approach because it's cost effective.
However, if you have a Composer cluster per environment, it's simpler to manage and it allows a better separation.
If you stay on a shared environment, I think you are heading in the right direction with a folder per environment in the Composer DAG bucket.
If you use Airflow variables, you also have to handle the environment there, in addition to the DAGs.
You can then manage the access to each folder in the bucket.
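The folder-per-environment layout discussed above could be produced by a CI step roughly like the following. This is only a sketch under assumptions: the environment names, the `__ENV__` placeholder token, and all function names are hypothetical and not part of Airflow or Cloud Composer.

```python
from pathlib import PurePosixPath

# Hypothetical environment names and placeholder token; neither comes from Airflow.
ENVIRONMENTS = ("DEV", "PROD")
ENV_PLACEHOLDER = "__ENV__"


def env_target_path(source_path: str, env: str) -> str:
    """Return the per-environment path, e.g. airflow_dags/DEV/my_dag_A.py."""
    src = PurePosixPath(source_path)
    return str(src.parent / env / src.name)


def render_for_env(dag_source: str, env: str) -> str:
    """Substitute the placeholder so dag_ids and Variable names carry the environment."""
    return dag_source.replace(ENV_PLACEHOLDER, env.lower())


def build_env_copies(source_path: str, dag_source: str) -> dict:
    """Map each target path to the rendered DAG source for that environment."""
    return {
        env_target_path(source_path, env): render_for_env(dag_source, env)
        for env in ENVIRONMENTS
    }
```

The CI job would then write each entry of the returned mapping into the DAG bucket. Keeping the environment name in a single placeholder means the strict naming convention is enforced by the build step rather than by hand.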
In my team, we chose another approach.
Cloud Composer 2 uses GKE in Autopilot mode, which is more cost effective than the previous version.
It's also easier to manage the environment size of the cluster and to tune the different parameters (workers, CPU, webserver...).
In our case, we created a cluster per environment, but with a different configuration per environment (managed by Terraform):
For the dev and uat envs, we use small sizing and the Small environment size
For the prod env, we use larger sizing and the Medium environment size
It's not perfect, but it gives us a compromise between cost and separation.

Airflow DAG Versioning

Is DAG versioning a thing? I can't find much on the subject with a few Google searches. I would like to look at the DAGs screen in Airflow and be sure of what DAG code is in the wild.
The simplest solution would be to include a version number as part of the dag_id, but I would appreciate knowing if anyone has a better, alternative solution. Tags would work too and might look good in the UI. They are designed for filtering though, so I'm not sure if there would be undesirable side-effects.
As the author of the DAG Versioning AIP, I can say that this work has been deferred post 2.0 mainly to support end-to-end DAG Versioning.
Originally, we (Airflow Core Committers) were planning to have a Webserver-only DAG Versioning i.e. to improve the visibility behaviour but not execution:
The scope of this AIP to make sure that the visibility behavior of
Airflow is correct, without changing the execution behaviour which
will continue to be based on the most recent version of the DAG.
This means it overcomes the issue of not being able to go back to an old version of the DAG: you could view the shape of the DAG from a few months back and see the correct representation instead of "always-latest".
Currently, Airflow suffers from the issue where if you add/remove a task, it gets added/removed in all the previous DagRuns in the Webserver.
However, what we have decided is that we will accomplish Remote DAG Fetcher + DAG Versioning and enable versioning of DAG on the worker side, so a user will be able to run a DAG with the previous version too.
Currently, we don't have a date, but we are mostly planning to do it around the end of 2021.
The Airflow project has a draft feature open to support DAG versions. The answer currently is that Airflow does not support versions.
The first use case in the link describes a key limitation: log files from previous runs can only surface nodes from the current DAG.
As mentioned above, Airflow doesn't yet have its own functionality for versioning workflows. However, you can manage that on your own by keeping DAGs in their own git repository and fetching its state into the Airflow repository as submodules. More on that:
https://www.youtube.com/watch?v=a-4yRne3ba4&lc=UgwiIO-ECVFSZPz1hOt4AaABAg
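The workaround raised in the question, putting the version into the dag_id or into tags, can be sketched as below. The helper names and the `version:` tag prefix are assumptions made for illustration; they are not an Airflow API.

```python
import re


def versioned_dag_id(base_id: str, version: str) -> str:
    """Append a version suffix to a dag_id.

    Characters other than alphanumerics, dashes, dots and underscores are
    replaced, since those are the characters a dag_id may contain.
    """
    safe_version = re.sub(r"[^A-Za-z0-9_.-]", "_", version)
    return f"{base_id}_v{safe_version}"


def version_tags(version: str) -> list:
    """Build a tag list that could be passed as the DAG's tags argument."""
    return [f"version:{version}"]
```

Note that a changed dag_id is seen by Airflow as a different DAG, so run history does not carry over between versions. Tags leave the dag_id and its history intact but only help with filtering in the UI.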

Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments? eg. Dev, QA1, QA2, and Production (if so please guide) or do I need to have a separate install for each? What would be the best design considering maintenance of all environments.
You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch connections or Airflow variables according to the environment.
At the end of the day DAGs are written in Python so you're really flexible.
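As a rough illustration of switching connections or variables by environment, a DAG file could derive environment-scoped names from a single setting. The `DEPLOY_ENV` variable name and the `scoped` helper are hypothetical, chosen only for this sketch.

```python
import os
from typing import Optional

# Hypothetical environment variable name; use whatever your deployment sets.
ENV_VAR = "DEPLOY_ENV"


def current_env(default: str = "dev") -> str:
    """Read the active environment name, falling back to a default."""
    return os.environ.get(ENV_VAR, default).lower()


def scoped(name: str, env: Optional[str] = None) -> str:
    """Prefix a connection id or Variable key with the environment name."""
    return f"{env or current_env()}_{name}"
```

Inside a DAG this might be used as `Variable.get(scoped("bucket"))` or by passing `scoped("warehouse")` as an operator's connection id, assuming connections and variables were created with matching prefixed names.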

How run multiple meteor servers on different ports

How can Meteor run on multiple ports? For example, if one Meteor app runs on port 3000, I need another Meteor app to run on the same terminal. Please help me.
You can use the --port parameter:
`meteor run --port 3030`
To learn more about command line parameters, run meteor help <command>, e.g. meteor help run.
I see you've tagged your question meteor-up. If you're actually using mup, check out the env parameter in the config file.
I think the OP was referring to the exceptions caused by locks on the Mongo database. I have only been on this platform for the last week and am learning as quickly as I can. But when I tried running my application from the same project directory as two different users on two different ports, I got an exception about MongoDB:
Error: EBUSY, unlink 'D:\test\.meteor\local\db\mongod.lock'
The root of the issue isn't running on different ports - it is the shared files between the two instances - Specifically the database.
I don't think any of your answers actually helped him out. And .. neither can I yet.
I see two options -
First -
I am going to experiment with links to see if I can get the two users to use a different folder for the .meteor\local tree, so both of us can work on the same code at the same time without impacting each other when testing.
But I doubt that is what the OP was referring to either (different users, same app)...
Second - I could try to inject into run-mongo.js some notion of the URL / port number I am running on, so that mongod.lock (and the db, of course) are named something like mongod.lock-3000
I don't like the 2nd option because then I am on my own version of the standard scripts.
No. Meteor mainly uses the default port 3000, or whatever port is specified at start, and the next port (+1) for Mongo.
That is, the next application can be run two ports away, for example on 3002, or two ports below the previous one, on 2998.
Checking can be very simple (Mac, Linux):
ps | grep meteor

Integration Testing best practices

Our team has hundreds of integration tests that hit a database and verify results. I've got two base classes for all the integration tests, one for retrieve-only tests and one for create/update/delete tests. The retrieve-only base class regenerates the database during the TestFixtureSetup so it only executes once per test class. The CUD base class regenerates the database before each test. Each repository class has its own corresponding test class.
As you can imagine, this whole thing takes quite some time (approaching 7-8 minutes to run and growing quickly). Having this run as part of our CI (CruiseControl.Net) is not a problem, but running locally takes a long time and really prohibits running them before committing code.
My question is are there any best practices to help speed up the execution of these types of integration tests?
I'm unable to execute them in-memory (a la sqlite) because we use some database-specific functionality (computed columns, etc.) that isn't supported in sqlite.
Also, the whole team has to be able to execute them, so running them on a local instance of SQL Server Express or something could be error prone unless the connection strings are all the same for those instances.
How are you accomplishing this in your shop and what works well?
Thanks!
Keep your fast (unit) and slow (integration) tests separate, so that you can run them separately. Use whatever method for grouping/categorizing the tests is provided by your testing framework. If the testing framework does not support grouping the tests, move the integration tests into a separate module that has only integration tests.
The fast tests should take only a few seconds to run in full and should have high code coverage. These kinds of tests allow the developers to refactor ruthlessly, because they can make a small change, run all the tests, and be very confident that the change did not break anything.
The slow tests can take many minutes to run and they make sure that the individual components work together correctly. When the developers make changes that might break something which is tested by the integration tests but not the unit tests, they should run those integration tests before committing. Otherwise, the slow tests are run by the CI server.
In NUnit you can decorate your test classes (or methods) with an attribute, e.g.:
using NUnit.Framework;
[TestFixture]
[Category("Integration")]
public class SomeTestFixture
{
    // ...
}
[TestFixture]
[Category("Unit")]
public class SomeOtherTestFixture
{
    // ...
}
You can then stipulate in the build process on the server that all categories get run and just require that your developers run a subset of the available test categories. What categories they are required to run would depend on things you will understand better than I will. But the gist is that they are able to test at the unit level and the server handles the integration tests.
I'm a Java developer but have dealt with a similar problem. I found that running a local database instance works well because of the speed (no data to send over the network) and because this way you don't have contention on your integration test database.
The general approach we use to solving this problem is to set up the build scripts to read the database connection strings from a configuration file, and then set up one file per environment. For example, one file for WORKSTATION, another for CI. Then you set up the build scripts to read the config file based on the specified environment. So builds running on a developer workstation run using the WORKSTATION configuration, and builds running in the CI environment use the CI settings.
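The per-environment configuration idea can be sketched as follows. For the sake of a self-contained example, both environments live in one inline INI string rather than one file per environment, and the server names, database name, and `connection_string` key are invented placeholders.

```python
import configparser

# Placeholder settings; in practice each section would be its own file
# (e.g. one for WORKSTATION, one for CI) selected by the build script.
SETTINGS = """
[WORKSTATION]
connection_string = Server=localhost;Database=AppTests;Integrated Security=true

[CI]
connection_string = Server=ci-db;Database=AppTests;Integrated Security=true
"""


def connection_string(environment: str) -> str:
    """Look up the connection string for the named environment."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(SETTINGS)
    if environment not in parser:
        raise KeyError(f"unknown environment: {environment}")
    return parser[environment]["connection_string"]
```

The build script would pass the environment name (for example from a command-line argument) so that the same test code runs against the workstation database locally and the CI database on the build server.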
It also helps tremendously if the entire database schema can be created from a single script, so each developer can quickly set up a local database for testing. You can even extend this concept to the next level and add the database setup script to the build process, so the entire database setup can be scripted to keep up with changes in the database schema.
We have an SQL Server Express instance with the same DB definition running for every dev machine as part of the dev environment. With Windows authentication the connection strings are stable - no username/password in the string.
What we would really like to do, but haven't yet, is see if we can get our system to run on SQL Server Compact Edition, which is like SQLite with SQL Server's engine. Then we could run them in-memory, and possibly in parallel as well (with multiple processes).
Have you done any measurements (using timers or similar) to determine where the tests spend most of their time?
If you already know that the database recreation is why they're time consuming, a different approach would be to regenerate the database once and use transactions to preserve the state between tests. Each CUD-type test starts a transaction in setup and performs a rollback in teardown. This can significantly reduce the time spent on database setup for each test, since a transaction rollback is cheaper than a full database recreation.
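The rollback-per-test pattern described above looks roughly like this. The sketch uses Python's `unittest` and an in-memory SQLite database purely to stay self-contained; it is not the asker's NUnit/SQL Server setup, and the `customers` table and class names are made up.

```python
import sqlite3
import unittest


def create_database() -> sqlite3.Connection:
    """Build the schema and seed data once (stands in for the slow regeneration)."""
    # isolation_level=None means autocommit, so BEGIN/ROLLBACK below are explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers (name) VALUES ('seed')")
    return conn


class CudTestBase(unittest.TestCase):
    """Base class for create/update/delete tests: one database, one transaction per test."""

    @classmethod
    def setUpClass(cls):
        cls.conn = create_database()

    @classmethod
    def tearDownClass(cls):
        cls.conn.close()

    def setUp(self):
        self.conn.execute("BEGIN")

    def tearDown(self):
        # Undo whatever the test changed instead of rebuilding the database.
        self.conn.execute("ROLLBACK")

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


class CustomerRepositoryTests(CudTestBase):
    def test_insert(self):
        self.conn.execute("INSERT INTO customers (name) VALUES ('a')")
        self.assertEqual(self.count(), 2)

    def test_delete(self):
        self.conn.execute("DELETE FROM customers")
        self.assertEqual(self.count(), 0)
```

Each test sees only the seed row regardless of execution order, because the previous test's changes were rolled back. The pattern assumes the code under test runs on the same connection and does not commit on its own.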

Resources