How to avoid code duplication for non-data structures (views, stored procedures etc) - flyway

My project contains a lot of objects like views and stored procedures which are changed quite frequently. Right now I have to create a new SQL script on every update containing the complete source code of the changed object, even though I've actually changed only a few rows. This leads to massive code duplication, and I also find it difficult to review these changes.
I'd like to have only one current version of the SQL script for every object like a view or procedure, and recreate these objects every time I redeploy the database. As a result I could change the existing source file (like in Java or C programming) instead of creating a new update every time I need to alter a view or procedure.
Is there a possibility to execute some scripts every time I migrate the database with Flyway?

I'm not sure why that got so many downvotes; it's a perfectly understandable and valid question. Perhaps it's because it closely resembles this open question:
Migrating Stored Procedures with Flyway
We are actually starting to push against this issue now. We've been using flyway for development and testing (and love it). But we've come to a point where we're starting to have to use procs/triggers/views (p/t/v's) and the fundamental disconnect between how we did it before, and how we must use flyway, is starting to be a strain.
Before, for a given database object (let's say it's a procedure), there'd be one source file. And if you needed to change the proc 'n' times, there would be 'n' versions of the same file in your VCS. Diff tools work great, IDE's all understand this, merges detect when two developers working in separate branches make changes to the proc, etc, etc. You know, old school.
But with flyway, any one proc with 'n' changes is now scattered across 'n' files. Instead of "one object in one file with 'n' versions", you have "one object in 'n' files with one change each". I now need to do a text search in my IDE for any instance of "proc_name" if I want to know the history of changes to the proc. The VCS knows nothing about it. Devs can each make a migration in their own branches that succeeds when each is deployed, but leaves the proc missing one of the updates.
I'm not saying any of this to complain about flyway, and I fully realize it's not a simple area. I'd almost say it's unsolvable (by flyway).
We're scheming how to handle this problem, and I'd be very interested to know how others have handled it.

Repeatable migrations are supported as of Flyway 4.0.
Just add SQL files starting with "R__" (and no version information) to your migration folder:
R__Blue_cars.sql
You have to ensure that the script can be migrated repeatedly, i.e. that it is idempotent.
This is usually done by "CREATE OR REPLACE" clauses in your DDL statements.
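A minimal sketch of what such a repeatable migration might contain (the view definition and column names are hypothetical):

```sql
-- R__Blue_cars.sql: re-applied by Flyway whenever the file's checksum changes
CREATE OR REPLACE VIEW blue_cars AS
SELECT id, model, license_plate
FROM cars
WHERE colour = 'blue';
```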
https://flywaydb.org/documentation/migration/repeatable

Related

Version control of big data tables (iceberg)

I'm building Iceberg tables on top of a data lake. These tables are used by reporting tools. I'm trying to figure out the best way to version-control and deploy changes to these tables in a CI/CD process. For example, I would like to add a column to an Iceberg table. To do that I have to write an ALTER TABLE statement, save it to the git repository and deploy it via the CI/CD pipeline. The tables are accessible via the AWS Glue Catalog.
I couldn't find much info about this on Google, so if anyone could share some knowledge, it would be much appreciated.
Cheers.
Agree with @Fokko Driesprong; this is a supplement only.
Sometimes table changes are considered part of a task version change. That is, the table change statements (ALTER TABLE) are bound to task upgrades.
Tasks are sometimes deployed automatically, so we often execute the table change statement first and then deploy the new task. If the change is disruptive, we need to stop the old task first and then deploy the new one.
Corresponding to the upgrade, we also keep a rollback script which, of course, contains the reverse table change statement.
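As a rough illustration (assuming Spark SQL against the Glue catalog; the catalog, table and column names are made up), the upgrade and its rollback can be as simple as:

```sql
-- upgrade: add the new column before deploying the new task version
ALTER TABLE glue_catalog.reporting.orders ADD COLUMNS (discount_code string);

-- rollback: remove the column again if the old task version has to be restored
ALTER TABLE glue_catalog.reporting.orders DROP COLUMN discount_code;
```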
Thanks for asking this question. I don't think there is a definitive way of doing this. In practice I see most people bundling this as part of the job that writes to the Iceberg table. This way you can make sure that new columns are populated right away with the new version of the job. If you don't make any breaking changes (such as deleting a column), then the downstream jobs won't break. Hope this helps!

Duplicate column issue in migration scripts in a .NET microservice module. I have to resolve a duplicate column in my migration which has already been executed

In .NET Core microservices, another team works on open-source modules, and I extend their modules in my project. I had already added a column to an entity, and then the same column was added by the open-source team. Now a duplicate column error is showing.
I cannot alter the open-source migration files, and my column is already in production.
Please suggest how to resolve this issue.
I think we are dealing with multiple smells here.
Each microservice should have its own data realm.
By the sound of it, you extended a table of an object which was generated by the open-source service.
But this does not help you now; the question is whether you can merge the data which is in this column.
If so, you might get away with a simple copy/update.
What you need to do is the following (a SQL sketch follows the list):
Create a new column with a different name
Copy your existing data into the new column
Drop your column
Execute your migration
Copy your data back into the column
Test your application very carefully to confirm that the logic you built on top of the generated data still works as intended
Drop your backup column
Depending on the amount of data this will lead to downtime, so plan ahead and have a rollback strategy ready in case something goes wrong.
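A rough SQL sketch of those steps (table and column names are hypothetical, SQL Server flavoured; adapt to your database and migration tooling):

```sql
-- 1. create a backup column with a different name
ALTER TABLE Customers ADD PhoneNumber_backup NVARCHAR(50) NULL;

-- 2. copy the existing data into the backup column
UPDATE Customers SET PhoneNumber_backup = PhoneNumber;

-- 3. drop your own column so the open-source migration can add it again
ALTER TABLE Customers DROP COLUMN PhoneNumber;

-- 4. (run the open-source migration here; it recreates PhoneNumber)

-- 5. copy the data back into the recreated column
UPDATE Customers SET PhoneNumber = PhoneNumber_backup;

-- 6. only after careful testing, drop the backup column
ALTER TABLE Customers DROP COLUMN PhoneNumber_backup;
```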
Personal opinion on preventing those smells
Every time I needed an open-source project in my own projects in the past, I wrote a wrapper around it; this has multiple benefits.
For one, if the API of the project changes, you only have to update it in exactly one place, which improves maintainability.
Because it has a wrapper, it automatically gets its own database, and if I need to extend an entity which I get from one of those open-source projects, I usually do it via foreign keys in a separate table, which then gets linked via a view.
Yes, this costs some performance, but in the end it was worth it every time.
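For example (purely illustrative table and column names), extending an open-source users table without touching it might look like this:

```sql
-- extension table owned by the wrapper, linked to the original entity by a foreign key
CREATE TABLE users_extension (
    user_id      INT PRIMARY KEY REFERENCES users (id),
    loyalty_tier VARCHAR(20)
);

-- view that exposes the original entity together with the extension columns
CREATE VIEW users_extended AS
SELECT u.id, u.name, u.email, e.loyalty_tier
FROM users u
LEFT JOIN users_extension e ON e.user_id = u.id;
```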

Is there any way to execute repeatable flyway scripts first?

We have been using Flyway for years to maintain our DB scripts, and it does a wonderful job.
However there is one single situation where I am not really happy - possibly someone out there has a solution:
In order to reduce the number of scripts required (and also to keep an overview of "where" our procedures are defined), I'd like to implement our functions/procedures in one script. Every time a procedure changes (or a new one is developed) this script would be updated - repeatable scripts sound perfect for this purpose, but unfortunately they are not.
The drawback is that a new procedure cannot be accessed by non-repeatable scripts: repeatable scripts are executed last, so the procedure does not exist yet when the non-repeatable script runs.
I hoped I can control this by specifying different locations (e.g. loc_first containing the repeatables I want to be executed first, loc_normal for the standard scripts and the repeatables to be executed last).
Unfortunately the order of locations has no impact on execution order ;-(
What's the proper way to deal with this situation? Right now I need to specify the corresponding procedures in non-repeatable scripts, but that's exactly what I'd like to avoid ....
I found a workaround on my own: I'm using Flyway directly with Maven (the same would work if you use the API, of course). Each stage of my Maven script has its own profile (specifying the URL etc.).
Now I create two profiles for every stage - so I have e.g. dev and devProcs.
The difference between these two maven profiles is, that the "[stage]Procs" profile operates on a different location (where only the repeatable scripts maintaining procedures are kept). Then I need to execute flyway twice - first with [stage]Procs then with [stage].
To me this looks a bit messy, but at least I can maintain my procedures in a repeatable script this way.
According to the Flyway docs, repeatable migrations ALWAYS execute after versioned migrations.
But I guess you can use Flyway callbacks. It looks like the beforeMigrate.sql callback is exactly what you need.
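A minimal sketch of such a callback (the procedure itself is made up; the file just has to be named beforeMigrate.sql and sit in one of the configured locations, PostgreSQL syntax assumed):

```sql
-- beforeMigrate.sql: executed by Flyway before every migration run,
-- so the procedure already exists when versioned scripts reference it
CREATE OR REPLACE PROCEDURE refresh_report_totals()
LANGUAGE SQL
AS $$
    UPDATE report_totals SET calculated_at = now();
$$;
```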

Doctrine schema update or Doctrine migrations

What are the practical advantages of Doctrine Migrations over just running a schema update?
Safety?
The orm:schema-tool:update command (doctrine:schema:update in Symfony) warns
This operation should not be executed in a production environment.
but why is this? Sure, it can delete data but so can a migration.
Flexibility?
I thought I could tailor my migrations to add stuff like column defaults but this often doesn't work as Doctrine will notice the discrepancy between the schema and the code on the next diff and stomp over your changes.
When you are using the schema-tool, no history of database modification is kept, and in a production/staging environment this is a big downside.
Let's assume you have a complicated database structure in a live project. And in the next changeset you have to alter the database somehow. For example, your users' contact phones need to be stored in a different format, not a VARCHAR, but three SMALLINT columns for country code, area code and the phone number.
Well, it's not so hard to figure out a query that would fetch the current data, split it into three values and insert them back. That's when migrations come into play: you can create your new fields, then do the transforms and finally drop the field that was holding the data before.
And even more! You can even describe the backwards process (the down migration), when you need to undo the changes introduced in your migration. Let's assume that someone somewhere relied heavily on the format of the VARCHAR field, and now that you've changed the structure, his piece of code is not working as expected. So, you run migration:down, and everything gets reverted. In this specific case you'd just bring back the old VARCHAR column and concatenate the values back, and then drop the fields.
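In raw SQL terms (a Doctrine migration can issue statements like these via addSql; table and column names, sizes and substring positions are made up, PostgreSQL syntax assumed, with the subscriber number widened to INTEGER so it fits), the up and down paths could look roughly like this:

```sql
-- up: add the new columns, transform the existing data, drop the old column
ALTER TABLE users
    ADD COLUMN country_code SMALLINT,
    ADD COLUMN area_code SMALLINT,
    ADD COLUMN phone_number INTEGER;
UPDATE users
SET country_code = CAST(substr(phone, 1, 2) AS SMALLINT),
    area_code    = CAST(substr(phone, 3, 3) AS SMALLINT),
    phone_number = CAST(substr(phone, 6)    AS INTEGER);
ALTER TABLE users DROP COLUMN phone;

-- down: bring back the VARCHAR column, concatenate the values, drop the new fields
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
UPDATE users SET phone = CONCAT(country_code, area_code, phone_number);
ALTER TABLE users
    DROP COLUMN country_code,
    DROP COLUMN area_code,
    DROP COLUMN phone_number;
```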
Doctrine's migration tool basically does most of the work for you. When you diff your schema, it generates all the necessary up's and down's, so only thing you'll have to do is handle the data that could be damaged when the migration is applied.
Also, migrations are something that gives other developers on your team knowledge of when it's time to update their schemas. With just the schema-tool, your teammates would have to run doctrine:schema:update each and every time they pull, because they wouldn't know if the schema has really changed.
When using migrations, you always see that there are some updates in the migrations folder, which means you need to update your schema.
I think that you indeed nailed it on safety. With migrations you can go back to another state of the table (just like you can in Git version control). With the schema update command you can only UPDATE the tables; there is no log kept for going back in case of a failure with data already saved in those tables. I don't know exactly, but doesn't a migration also save the data of the corresponding table that's being updated? That would be essential in my opinion, otherwise there is no big reason to use them.
So yes, I personally think that the main reason for using migrations in a production environment is safety and maybe a bit of flexibility. Safety would be the winner here I think :)
Hope this helps.
edit: Here is another answer with references to the Symfony docs: Is it safe to use doctrine2 migrations in production environment with symfony2 and php
You also can't perform large updates with a plain Doctrine migration. For example, try updating an index on a database with 30 million users: it will take a lot of time, during which your app will not be accessible.

I'm working with Drupal and I want to diff between two databases, to see which tables have been updated in a given session. how do I do this?

I'm working with Drupal on a project, trying to find a way to speed up tests (we're using Cucumber and Selenium), and I'm trying to see which tables have been changed in a given series of steps, so I can just reset those tables from a dump between each test case.
Right now Simpletest, the Drupal testing framework, works by installing and setting up the tables for every module needed for a test, which makes for slow tests, and I'm emulating a similar approach by loading a DB dump for each test.
Given that a site you're integration testing has a 'known good' state to start from, I think it would be faster to just revert back to that point each time, instead of waiting twenty seconds or so to drop the database and then pipe the dump file back in between test runs.
However, when I try diffing between two dump files (i.e. before.I.create.a.node.sql and after.I.create.a.node.sql), the output is an unreadable load of serialised PHP that I can't make sense of.
Are there any tools I can use to help work out which tables I need to drop and rebuild between test cases, so I don't incur the 20 second hit on each test, short of reading the schema and code of every module I'm working with?
I'm following the ideas outlined here for getting Cucumber to work with PHP, and yes, I have seen this question here on a similar subject.
Thanks!
Drupal does store a lot of serialized PHP in the database. But the main part of it is kept in the cache tables, like cache, cache_field, cache_menu, etc., and you can safely truncate these before dumping the database.
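For example (the exact set of cache tables depends on which modules you have enabled), you might run something like this before taking the dump:

```sql
-- empty the Drupal cache tables so their serialized PHP doesn't bloat the dump
TRUNCATE TABLE cache;
TRUNCATE TABLE cache_field;
TRUNCATE TABLE cache_menu;
```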
If you have any Simpletest tables you could drop those. They are all temporary and are used only for running your Simpletest test suite.
That should reduce the dump size a lot. If it's not enough I can recommend reading up on the tables in the book Pro Drupal Development, or you could skim through the .install files to read the module's schema definitions. Though most will probably be real data you'd want to revert between tests.
Because of the relational nature of the database, be sure to either know exactly what you're doing or dump/revert all the remaining tables together.
