Version control of big data tables (iceberg) - bigdata

I'm building Iceberg tables on top of a data lake. These tables are used by reporting tools. I'm trying to figure out the best way to version-control and deploy changes to these tables in a CI/CD process. For example, I would like to add a column to an Iceberg table. To do that, I have to write an ALTER TABLE statement, save it to the git repository, and deploy it via the CI/CD pipeline. The tables are accessible via the AWS Glue Catalog.
I couldn't find much information about this on Google, so if anyone could share some knowledge, it would be much appreciated.
Cheers.

I agree with Fokko Driesprong; this is only a supplement.
Sometimes table changes are treated as part of a task's version change. That is, the table change statements (ALTER TABLE) are bound to task upgrades.
Tasks are sometimes deployed automatically, so the pipeline usually executes the table change statement first and then deploys the new task. If the change is disruptive, we need to stop the old task first and then deploy the new one.
For each upgrade we also keep a rollback script, which contains the corresponding reverse table change statement.

Thanks for asking this question. I don't think there is a definitive way of doing this. In practice I see most people bundling this as part of the job that writes to the Iceberg table. This way you can make sure that new columns are populated right away by the new version of the job. If you don't make any breaking changes (such as deleting a column), the downstream jobs won't break. Hope this helps!
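A minimal sketch of how such versioned ALTER TABLE statements could be kept in the repository and applied in order from a CI/CD step or at job start-up. Everything named here is hypothetical: the `MIGRATIONS` list, the table `glue.db.reports`, its columns, and the `execute` callable, which in a real pipeline might be something like `spark.sql`. Where the already applied versions are stored is left to the caller.

```python
from typing import List, Tuple

# Hypothetical, ordered schema changes kept in the git repository.
MIGRATIONS: List[Tuple[int, str]] = [
    (1, "ALTER TABLE glue.db.reports ADD COLUMNS (region string)"),
    (2, "ALTER TABLE glue.db.reports ADD COLUMNS (updated_at timestamp)"),
]


def pending(migrations, applied_versions):
    """Return the (version, sql) pairs not yet applied, lowest version first."""
    done = set(applied_versions)
    return [(v, sql) for v, sql in sorted(migrations) if v not in done]


def apply_pending(migrations, applied_versions, execute):
    """Run each pending statement through `execute`; return the new versions."""
    newly_applied = []
    for version, sql in pending(migrations, applied_versions):
        execute(sql)  # e.g. spark.sql in a real job (assumption)
        newly_applied.append(version)
    return newly_applied
```

Calling `apply_pending(MIGRATIONS, [1], print)` would print only the second statement, since version 1 is treated as already applied.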

Related

Duplicate column issue in migration scripts in a .NET microservice module: how to resolve a duplicate column in a migration that has already been executed

In .NET Core microservices, another team works on open source modules, and I extend their modules in my project. I had already added a column to an entity, and then the open source team added the same column. Now a duplicate column error is showing.
I cannot alter the open source migration files, and my column is already in production.
Please suggest how to resolve this issue.
I think we are dealing with multiple smells here.
Each microservice should have its own data realm.
By the sound of it, you extended a table of an object that was generated by the open source service.
But this does not help you; the question is whether you can merge the data that is in this column.
If so, you might get away with a simple copy/update.
What you need to do is:
1. Create a new column with a different name
2. Copy your existing data into the new column
3. Drop your column
4. Execute your migration
5. Copy your data back into the column
6. Test your application very carefully, to check that the logic you implemented for the data you generated still works as intended
7. Drop your backup column
Depending on the amount of data this will lead to downtime, so plan ahead and have a rollback strategy ready in case something goes wrong.
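The steps above can be sketched as a small helper that produces the SQL statements in order. This is only an illustration: the function name, the `_backup` suffix, and the example table and column are hypothetical, and the generic SQL may need adjusting for your database and for how your migration tool tracks its history.

```python
from typing import List


def backup_column_plan(table: str, column: str, col_type: str) -> List[str]:
    """Return the ordered statements for the backup-column procedure."""
    backup = f"{column}_backup"  # hypothetical naming convention
    return [
        f"ALTER TABLE {table} ADD {backup} {col_type} NULL",
        f"UPDATE {table} SET {backup} = {column}",
        f"ALTER TABLE {table} DROP COLUMN {column}",
        "-- run the upstream migration here; it recreates the column",
        f"UPDATE {table} SET {column} = {backup}",
        "-- test the application before the final step",
        f"ALTER TABLE {table} DROP COLUMN {backup}",
    ]


if __name__ == "__main__":
    for statement in backup_column_plan("Orders", "Status", "nvarchar(50)"):
        print(statement)
```

Generating the plan as text first makes it easy to review the statements and to prepare the matching rollback script before anything touches production.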
Personal opinion on preventing those smells
Every time I needed an open source project in my own projects in the past, I wrote a wrapper around it. This has multiple benefits.
For one, if the API of the project changes you only have to update it in exactly one place, which improves maintainability.
Because it has a wrapper, it automatically gets its own database, and if I need to extend an entity that I get from one of those open source projects, I usually do it via foreign keys in a separate table, which then gets linked via a view.
Yes, this costs some performance, but in the end it was worth it every time.

Is there any way to execute repeatable flyway scripts first?

We have been using Flyway for years to maintain our DB scripts, and it does a wonderful job.
However, there is one situation where I am not really happy, and possibly someone out there has a solution:
In order to reduce the number of scripts required (and also to keep an overview of "where" our procedures are defined) I'd like to implement our functions/procedures in one script. Every time a procedure changes (or a new one is developed) this script would be updated. Repeatable scripts sound perfect for this purpose, but unfortunately they are not.
The drawback is that a new procedure cannot be accessed by non-repeatable scripts: repeatable scripts are executed last, so the procedure does not exist yet when the non-repeatable script executes.
I hoped I could control this by specifying different locations (e.g. loc_first containing the repeatables I want to be executed first, loc_normal for the standard scripts and the repeatables to be executed last).
Unfortunately the order of locations has no impact on execution order ;-(
What's the proper way to deal with this situation? Right now I need to define the corresponding procedures in non-repeatable scripts, but that's exactly what I'd like to avoid ....
I found a workaround on my own: I'm using Flyway directly with Maven (the same would of course work if you use the API). Each stage of my Maven script has its own profile (specifying the URL etc.).
Now I create two profiles for every stage, so I have e.g. dev and devProcs.
The difference between these two Maven profiles is that the "[stage]Procs" profile operates on a different location (where only the repeatable scripts maintaining procedures are kept). Then I need to execute Flyway twice: first with [stage]Procs, then with [stage].
To me this looks a bit messy, but at least I can maintain my procedures in a repeatable script this way.
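The two-pass idea can be sketched outside Maven as well. The snippet below only builds the two command lines, using the Flyway command-line tool instead of the Maven profiles described above; the JDBC URL and the directory names are hypothetical. Depending on the Flyway version and its validation settings, the two passes may need additional configuration to share one database, which is not covered here.

```python
from typing import List


def two_pass_commands(url: str, procs_dir: str, main_dir: str) -> List[List[str]]:
    """Build one `flyway migrate` command per location, procedures first."""

    def migrate(directory: str) -> List[str]:
        return [
            "flyway",
            f"-url={url}",
            f"-locations=filesystem:{directory}",
            "migrate",
        ]

    # Pass 1: repeatable scripts that define the procedures.
    # Pass 2: the regular versioned (and remaining repeatable) scripts.
    return [migrate(procs_dir), migrate(main_dir)]


if __name__ == "__main__":
    # Hypothetical URL and directories, printed rather than executed.
    for command in two_pass_commands("jdbc:h2:mem:dev", "sql/procs", "sql/main"):
        print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` so that the second pass only starts if the first one succeeded.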
According to the Flyway docs, repeatable migrations ALWAYS execute after versioned migrations.
However, I guess you can use Flyway callbacks. It looks like the beforeMigrate.sql callback is exactly what you need.

Can we compare the contents of two folders in spotfire?

I have two environments: development and production. Let's say I have a folder in production which contains all my metadata, such as ILs, joins, DS, analyses, scripts etc. In development I have the same folder, but with new enhancements.
Now I want to compare the two to see what changes have been made, so that I can understand their impact.
So, could you please tell me how to compare these two folders from the development and production environments?
For the requirements posted here, you can create an information link on top of the LIB_ITEMS table to fetch the details of library items from the Spotfire database. A different set of activities is performed at this link, but the approach can be used for your requirements as well.

Doctrine schema update or Doctrine migrations

What are the practical advantages of Doctrine Migrations over just running a schema update?
Safety?
The orm:schema-tool:update command (doctrine:schema:update in Symfony) warns
This operation should not be executed in a production environment.
but why is this? Sure, it can delete data but so can a migration.
Flexibility?
I thought I could tailor my migrations to add stuff like column defaults but this often doesn't work as Doctrine will notice the discrepancy between the schema and the code on the next diff and stomp over your changes.
When you are using the schema-tool, no history of database modification is kept, and in a production/staging environment this is a big downside.
Let's assume you have a complicated database structure in a live project. And in the next changeset you have to alter the database somehow. For example, your users' contact phones need to be stored in a different format, not a VARCHAR, but three SMALLINT columns for country code, area code and the phone number.
Well, it's not so hard to work out a query that would fetch the current data, separate it into three values and insert them back. That's where migrations come into play: you can create your new fields, then do the transforms and finally drop the field that was holding the data before.
And even more! You can also describe the backwards process (the down migration) for when you need to undo the changes introduced in your migration. Let's assume that someone somewhere relied heavily on the format of the VARCHAR field, and now that you've changed the structure, their piece of code is not working as expected. So you run migration:down, and everything gets reverted. In this specific case you'd just bring back the old VARCHAR column, concatenate the values back into it, and then drop the new fields.
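To make the up/down idea concrete, here is a small sketch using Python's built-in sqlite3 module instead of Doctrine. The `users` table, the column names and the `country-area-number` phone format are assumptions made for the example. Dropping columns is left out because support for it varies between SQLite versions, so the old column is set to NULL as a stand-in.

```python
import sqlite3


def up(conn: sqlite3.Connection) -> None:
    """Add the three new columns and fill them from the old phone column."""
    for column in ("country_code", "area_code", "number"):
        conn.execute(f"ALTER TABLE users ADD COLUMN {column} INTEGER")
    rows = conn.execute(
        "SELECT id, phone FROM users WHERE phone IS NOT NULL"
    ).fetchall()
    for user_id, phone in rows:
        # Assumed format: "country-area-number", e.g. "1-555-1234567".
        country, area, number = (int(part) for part in phone.split("-"))
        conn.execute(
            "UPDATE users SET country_code = ?, area_code = ?, number = ?, "
            "phone = NULL WHERE id = ?",  # NULL stands in for dropping the column
            (country, area, number, user_id),
        )
    conn.commit()


def down(conn: sqlite3.Connection) -> None:
    """Rebuild the old phone value by concatenating the three parts."""
    conn.execute(
        "UPDATE users SET phone = "
        "country_code || '-' || area_code || '-' || number"
    )
    conn.commit()
```

The point of the sketch is that the data transform lives next to the schema change in both directions, which a plain schema update cannot express.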
Doctrine's migration tool basically does most of the work for you. When you diff your schema, it generates all the necessary ups and downs, so the only thing you'll have to do is handle the data that could be damaged when the migration is applied.
Also, migrations tell other developers on your team when it's time to update their schemas. With just the schema-tool, your teammates would have to run doctrine:schema:update each and every time they pull, because they wouldn't know whether the schema has really changed.
When using migrations, you always see that there are some updates in the migrations folder, which means you need to update your schema.
I think that you indeed nailed it on safety. With migrations you can go back to another state of the table (just like you can in Git version control). With the schema update command you can only UPDATE the tables. There is no log kept for going back in case of a failure with data already saved in those tables. I don't know exactly, but doesn't a migration also save the data of the corresponding table that's being updated? That would be essential in my opinion, otherwise there is no big reason to use them.
So yes, I personally think that the main reason for using migrations in a production environment is safety and maybe a bit of flexibility. Safety would be the winner here I think :)
Hope this helps.
edit: Here is another answer with references to the Symfony docs: Is it safe to use doctrine2 migrations in production environment with symfony2 and php
You also can't perform large updates with a plain Doctrine migration. For example, try updating an index on a database with 30 million users: it will take a long time, during which your app will not be accessible.

How to avoid code duplication for non-data structures (views, stored procedures etc)

My project contains a lot of objects like views and stored procedures which are changed quite frequently. Currently I have to create a new SQL script on every update, containing the complete source code of the changed objects even though I've actually changed only a few rows. This leads to massive code duplication, and I also find it difficult to review these changes.
I'd like to have only one current version of the SQL script for every object such as a view or procedure, and recreate these objects every time I redeploy the database. As a result I could change the existing source file (as in Java or C programming) instead of creating a new update every time I need to alter a view or procedure.
Is there a possibility to execute some scripts every time I migrate the database with Flyway?
I'm not sure why that got so many downvotes, it's a perfectly understandable and valid question. Perhaps it's because it closely resembles this open question:
Migrating Stored Procedures with Flyway
We are actually starting to push against this issue now. We've been using flyway for development and testing (and love it). But we've come to a point where we're starting to have to use procs/triggers/views (p/t/v's) and the fundamental disconnect between how we did it before, and how we must use flyway, is starting to be a strain.
Before, for a given database object (let's say it's a procedure), there'd be one source file. And if you needed to change the proc 'n' times, there would be 'n' versions of the same file in your VCS. Diff tools work great, IDE's all understand this, merges detect when two developers working in separate branches make changes to the proc, etc, etc. You know, old school.
But with flyway, any one proc with 'n' changes is now scattered across 'n' files. Instead of "one object in one file with 'n' versions", you have "one object in 'n' files with one change each". I now need to do a text search in my IDE for any instance of "proc_name" if I want to know the history of changes to the proc. The VCS knows nothing about it. Devs can each make a migration in their own branches that succeeds when each is deployed, but leaves the proc with a missing update.
I'm not saying any of this to complain about flyway, and I fully realize it's not a simple area. I'd almost say it's unsolvable (by flyway).
We're scheming how to handle this problem, and I'd be very interested to know how others have handled it.
Repeatable migrations are now supported as of Flyway 4.0.
Just add SQL files starting with "R", without any version information, to your migration folder:
R__Blue_cars.sql
You have to ensure that the script can be applied repeatedly.
This is usually done with "CREATE OR REPLACE" clauses in your DDL statements.
https://flywaydb.org/documentation/migration/repeatable
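Repeatable migrations are re-applied whenever their checksum changes. The sketch below is a simplified model of that decision, not Flyway's actual implementation: it compares a CRC32 of each script's current text with the checksum recorded the last time the script was applied. The function name and the dictionaries are hypothetical.

```python
import zlib
from typing import Dict, List


def repeatable_to_run(
    scripts: Dict[str, str], applied_checksums: Dict[str, int]
) -> List[str]:
    """Return the names of repeatable scripts that are new or have changed."""
    to_run = []
    for name in sorted(scripts):
        checksum = zlib.crc32(scripts[name].encode("utf-8"))
        # Re-run when the script was never applied or its content changed.
        if applied_checksums.get(name) != checksum:
            to_run.append(name)
    return to_run
```

This is why a view or procedure can live in a single file: editing the file changes its checksum, so the script is picked up again on the next migrate.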
