If flyway upon run will end up performing any migrations, our deployment system would like to know, as it then needs to do a different set of operations than if there are no database migrations (If there are migrations, then all servers in a cluster will be taken down, and then one server will start up and perform migrations, and then the rest of the servers will be taken up. If there are no migrations, a rolling upgrade can be performed, which is desirable.)
So, can you set up flyway as normal, and then ask it whether it will in the current state perform any migrations? I guess the concept of "dry-run" would apply here: If the dry-run shows any changes, then "yes", else "no".
Got it, so answering: Evidently Flyway.info().pending() will give the set of operations that will be run. If it is non-empty, then "yes, there are changes!".
Related
I need to persist my domain objects into two different databases. This use case is purely write-only. I don't need to read back from the databases.
Following Domain Driven Design, I typically create a repository for each aggregate root.
I see two alternatives. I can create one single repository for my AG, and implement it so that it persists the domain object into the two databases.
The second alternative is to create two repositories, one each for each database.
From a domain driven design perspective, which alternative is correct?
My requirement is that it must persist the AR in both databases - all or nothing. So if the first one goes through and the second fails, I would need to remove the AG from the first one.
If you had a transaction manager that were to span across those two databases, you would use that manager to automatically roll back all of the transactions if one of them fails. A transaction manager like that would necessarily add overhead to your writes, as it would have to ensure that all transactions succeeded, and while doing so, maintain a lock on the tables being written to.
If you consider what the transaction manager is doing, it is effectively writing to one database and ensuring that write is successful, writing to the next, and then committing those transactions. You could implement the same type of process using a two-phase commit process. Unfortunately, this can be complicated because the process of keeping two databases in sync is inherently complex.
You would use a process manager or saga to manage the process of ensuring that the databases are consistent:
Write to the first database and leave the record in a PENDING status (not visible to user reads).
Make a request to second database to write the record in a PENDING status.
Make a request to the first database to leave the record in a VALID status (visible to user reads).
Make a request to the second database to leave the record in a VALID status.
The issue with this approach is that the process can fail at any point. In this case, you would need to account for those failures. For example,
You can have a process that comes through and finds records in PENDING status that are older than X minutes and continues pushing them through the workflow.
You can can have a process that cleans up any PENDING records after X minutes and purges them from the database.
Ideally, you are using something like a queue based workflow that allows you to fire and forget these commands and a saga or process manager to identify and react to failures.
The second alternative is to create two repositories, one each for each database.
Based on the above, hopefully you can understand why this is the correct option.
If you don't need to write why don't build some sort of commands log?
The log acts as a queue, you write the operation in it, and two processes pulls new command from it and each one update a database, if you can accept that in worst case scenario the two dbs can have different version of the data, with the guarantees that eventually they will be consistent it seems to me much easier than does transactions spanning two different dbs.
I'm not sure how much DDD is your use case, as if you don't need to read back you don't have any state to manage, so no need for entities/aggregates
I had a question related to Snowflake. Actually in my current role, I am planning to migrate data from ADLS (Azure data lake) to Snowflake.
I am right now looking for 2 options
Creating Snowpipe to load updated data
Create Airflow job for same.
I am still trying to understand which will be the best way and what is the pro and cons of choosing each.
It depends on what you are trying to as part of this migration. If it is a plain vanilla(no transformation, no complex validations) as-is migration of data from ADLS to Snowflake, then you may be good with SnowPipe(but please also check if your scenario is good for Snowpipe or Bulk Copy- https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#recommended-load-file-size).
If you have many steps before you move the data to snowflake and there are chances that you may need to change your workflow in future, it is better to use Airflow which will give you more flexibility. In one of my migrations, I used Airflow and in the other one CONTROL-M
You'll be able to load higher volumes of data with lower latency if you use Snowpipe instead of Airflow. It'll also be easier to manage Snowpipe in my opinion.
Airflow is a batch scheduler and using it to schedule anything that runs more frequently than 5 minutes becomes painful to manage. Also, you'll have to manage the scaling yourself with Airflow. Snowpipe is a serverless option that can scale up and down based on the volumes sees and you're going to see your data land within 2 minutes.
The only thing that should restrict your usage of Snowpipe is cost. Although, you may find that Snowpipe ends up being cheaper in the long run if you consider that you'll need someone to manage your Airflow pipelines too.
There are a few considerations. Snowpipe can only run a single copy command, which has some limitations itself, and snowpipe imposes further limitations as per Usage Notes. The main pain is that it does not support PURGE = TRUE | FALSE (i.e. automatic purging while loading) saying:
Note that you can manually remove files from an internal (i.e.
Snowflake) stage (after they’ve been loaded) using the REMOVE command.
Regrettably the snowflake docs are famously vague as they use an ambiguous colloquial writing style. While it said you 'can' remove the files manually yourself in reality any user using snowpipe as advertised for "continuous fast ingestion" must remove the files to not suffer performance/cost impacts of the copy command having to ignore a very large number of files that have been previously loaded. The docs around the cost and performance of "table directories" which are implicit to stages talk about 1m files being a lot of files. By way of an official example the default pipe flush time on snowflake kafka connector snowpipe is 120s so assuming data ingests continually, and you make one file per flush, you will hit 1m files in 2 years. Yet using snowpipe is supposed to imply low latency. If you were to lower the flush to 30s you may hit the 1m file mark in about half a year.
If you want a fully automated process with no manual intervention this could mean that after you have pushed files into a stage and invoked the pipe you need logic have to poll the API to learn which files were eventually loaded. Your logic can then remove the loaded files. The official snowpipe Java example code has some logic that pushes files then polls the API to check when the files are eventually loaded. The snowflake kafka connector also polls to check which files the pipe has eventually asynchronously completed. Alternatively, you might write an airflow job to ls #the_stage and look for files last_modified that is in the past greater than some safe threshold to then rm #the_stage/path/file.gz the older files.
The next limitation is that a copy command is a "copy into your_table" command that can only target a single table. You can however do advanced transformations using SQL in the copy command.
Another thing to consider is that neither latency nor throughput is guaranteed with snowpipe. The documentation very clearly says you should measure the latency yourself. It would be a completely "free lunch" if snowpipe that is running on shared infrastructure to reduce your costs were to run instantly and as fast if you were paying for hot warehouses. It is reasonable to assume a higher tail latency when using shared "on-demand" infrastructure (i.e. a low percentage of invocations that have a high delay).
You have no control over the size of the warehouse used by snowpipe. This will affect the performance of any sql transforms used in the copy command. In contrast if you run on Airflow you have to assign a warehouse to run the copy command and you can assign as big a warehouse as you need to run your transforms.
A final consideration is that to use snowpipe you need to make a Snowflake API call. That is significantly more complex code to write than making a regular database connection to load data into a stage. For example, the regular Snowflake JDBC database connection has advanced methods to make it efficient to stream data into stages without having to write oAuth code to call the snowflake API.
Be very clear that if you carefully read the snowpipe documentation you will see that snowpipe is simply a restricted copy into table command running on shared infrastructure that is eventually run at some point; yet you yourself can run a full copy command as part of a more complex SQL script on a warehouse that you can size and suspend. If you can live with the restrictions of snowpipe, can figure out how to remove the files in the stage yourself, and you can live with the fact that tail latency and throughput is likely to be higher than paying for a dedicated warehouse, then it could be a good fit.
I am using NHibernate and ASP.NET/MVC. I use one session per request to handle database operations. For integration testing I am looking for a way to have each test run in an isolated mode that will not change the database and interfere with other tests running in parallel. Something like a transaction that can be rolled back at the end of the test. The main challenge is that each test can make multiple requests. If one request changes data the next request must be able to see these changes etc.
I tried binding the session to the auth cookie to create child sessions for the following requests of a test. But that does not work well, as neither sessions nor transactions are threadsafe in NHibernate. (it results in trying to open multiple DataReaders on the same connection)
I also checked if TransactionScope could be a way, but could not figure out how to use it from multple threads/requests.
What could be a good way to make this happen?
I typically do this by operating on different data.
for example, say I have an integration test which checks a basket total for an e-commerce website.
I would go and create a new user, activate it, add some items to a basket, calculate the total, delete all created data and assert on whatever I need.
so the flow is : create the data you need, operate on it, check it, delete it. This way all the tests can run in parallel and don't interfere with each other, plus the data is always cleaned up.
I'd like to use pglogical to replicate a set of tables, but I want to make all of my changes downstream from the master - to avoid risk, I don't want to make any modifications to the master database. I'd also like to start using pglogical now so we get familiar with the technology and can include it across all of our databases on our next release.
I don't need constant updates, so I came up with a plan, a cron job that:
Turns off streaming replication to a standby
Makes this standby a logically-replicating master (just for logical replication, no writes)
Stop postgresql
Copy off data dir
Make config changes
Start postgresql
Create pglogical extension
Catches up logical replication
Makes this database a streaming standby without logical replication again
Stop postgresql
Replace data dir with previous copy
Config changes
Start postgresql
My question - does this approach even make sense? Is there some easy way to accomplish this that I'm totally missing?
We have multiple products using the same database, each product has its own deployment system which will ultimately trigger off Flyway as part of the procedure to bring the Database up to date.
What if two projects deploy at the same time and run Flyway at exactly the same time? Will Flyway attempt to apply version 1,2,3 twice or will it automatically handle this situation.
This could cause headaches in some scenarios (eg, add 3 rows to a table twice).
Given, this could be rare, but it could happen and I'd like to know if Flyway considers this out of the box?
A solution we discussed in the office would be to acquire a lock on the versioning table. The second instance would have to wait for the lock to be freed before it applied its version(s), forcing it to wait for the previous instance to finish and therefore not applying versions twice.
Thanks,
Chris
Flyway
The answer is in Flyway's videos...
See the talk by Axel Fontaine at about 30 minutes.
Essentially, Flyway acquires a lock on its versioning table so all other instances have to wait.