Change a large DynamoDB table to use LSIs instead of GSIs - amazon-dynamodb

I have a live DynamoDB table with about 28 million records in it.
The table has a number of GSIs that I'd like to change to LSIs; however, LSIs can only be created when the table is created.
I need to create a new table and migrate the data with minimum downtime. I was thinking I'd do the following:
1. Create the new table with the correct indexes.
2. Update the code to write records to both the old and the new table. When this starts, note the timestamp of the first record.
3. Write a simple process to sync existing data for anything with a create date prior to that first timestamp. I'd have to add a lock field to the new table to prevent race conditions when an existing record is updated.
4. When it's all synced, swap to using the new table.
I think that will work, but it's fairly complicated and feels prone to error. Has anyone found a better way to do this?
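One possible way to avoid the lock field in the sync step is to make the backfill write conditional, so it never overwrites an item that the live dual-write has already put into the new table. The sketch below is only an illustration under assumptions: the table name newTable, the key attribute pk, and the helper names are hypothetical, items use the low-level attribute-value format, and put_item is injected so the logic can be tried without AWS. With boto3 you would pass client.put_item and client.exceptions.ConditionalCheckFailedException.

```python
def build_backfill_put(table_name, item, key_attr):
    """Build PutItem kwargs that only succeed if the key is not yet present."""
    return {
        "TableName": table_name,
        "Item": item,
        # The condition is evaluated against the item with the same primary key,
        # so checking the partition key attribute is enough to detect existence.
        "ConditionExpression": "attribute_not_exists(#k)",
        "ExpressionAttributeNames": {"#k": key_attr},
    }


def backfill(items, put_item, conflict_error, table_name="newTable", key_attr="pk"):
    """Copy old items into the new table, skipping keys that already exist there.

    put_item and conflict_error are injected (hypothetical wiring): with boto3
    they would be client.put_item and
    client.exceptions.ConditionalCheckFailedException.
    """
    copied, skipped = 0, 0
    for item in items:
        try:
            put_item(**build_backfill_put(table_name, item, key_attr))
            copied += 1
        except conflict_error:
            # The dual-write path already stored a newer copy; leave it alone.
            skipped += 1
    return copied, skipped
```

This only covers the "don't clobber newer data" part of the plan; deletes that happen on the old table during the sync would still need to be mirrored by the dual-write code.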

Here is an approach:
(Let's refer to the table with GSIs as oldTable and the new table with LSIs as newTable).
1. Create newTable with the required LSIs.
2. Create a DynamoDB trigger on oldTable so that every new record written to oldTable is also inserted into newTable. (This logic needs to live in an AWS Lambda function.)
3. Make your application point to newTable.
4. Migrate all the records from oldTable to newTable.
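The trigger logic could look roughly like the sketch below. It assumes the stream on oldTable is configured with a view type that includes the new image, and that newTable uses the same key attributes as oldTable; the function name and the injected put_item/delete_item callables are hypothetical, chosen so the logic can be exercised without AWS.

```python
def apply_stream_event(event, put_item, delete_item, table_name="newTable"):
    """Replay DynamoDB Streams records from oldTable against newTable.

    In a real Lambda this would be called from the handler, for example:
        client = boto3.client("dynamodb")
        apply_stream_event(event, client.put_item, client.delete_item)
    """
    applied = 0
    for record in event.get("Records", []):
        change = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is only present when the stream view type includes new images.
            put_item(TableName=table_name, Item=change["NewImage"])
            applied += 1
        elif record["eventName"] == "REMOVE":
            # Keys holds the primary key of the deleted item.
            delete_item(TableName=table_name, Key=change["Keys"])
            applied += 1
    return applied
```

Because a put simply overwrites the item with the latest image, replaying the same record twice is harmless, which matters since Lambda may retry a batch.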

Related

Copy DynamoDB table while modifying key attribute

I have a DynamoDB table with hundreds of thousands of items that I need to duplicate, with one catch: the key needs to be modified. The current key is a combination of 2 fields, e.g. attr1:attr2. I need the new table's key to consist only of attr1.
I know copying the table with Data Pipeline is pretty straightforward, but how do I create the new key for my use case?
Note: the data size is between 500K and 1M items.
Use Elastic MapReduce (EMR) to manipulate the data. This article explains how to handle DynamoDB data with EMR. Create a UDF that parses and manipulates the key, and use it in a query that selects every column:
SELECT UDF(id), all, other, columns FROM your_table
The result is then saved to another DynamoDB table.
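The key manipulation itself is small. As a rough sketch, assuming the key is stored as a single string attribute named id in the form attr1:attr2 (the attribute name, separator, and function name are assumptions for illustration), the transformation the UDF performs could look like this in Python. It also reports collisions, because several old keys can collapse onto the same attr1.

```python
def rekey_items(items, key="id", sep=":"):
    """Keep only the part of the key before sep.

    Returns (rekeyed, collisions): the first item seen for each new key is
    kept, and later items that map to the same new key are reported instead
    of silently overwriting it.
    """
    rekeyed = {}
    collisions = []
    for item in items:
        new_id = item[key].split(sep, 1)[0]
        new_item = dict(item)
        new_item[key] = new_id
        if new_id in rekeyed:
            collisions.append(new_item)
        else:
            rekeyed[new_id] = new_item
    return list(rekeyed.values()), collisions
```

How to resolve collisions (keep the first, keep the latest, merge) depends on the data, so it is left to the caller here.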

Table listener using spring

Is it possible in Spring to listen for table inserts or updates even if the table is inserted into or updated outside the Spring application process?
One suggestion: you need to do some extra work, depending on your RDBMS.
- Create a trigger for each table you want to listen to. Each trigger should record every new change on its table somewhere else.
- Create a table (named datalogger) to store the data coming from the triggers above. For example, when table1 has a new record inserted, the RDBMS executes the trigger on table1 (named trigger1), and trigger1 inserts a new record into datalogger stating that the action is "inserted", the table name is "table1", and the inserted record id is "new_recordId". The same process needs to be applied for updates.
- In your application, create a job that queries the datalogger table periodically (every 1 ms, 10 ms, ... as your requirements dictate) to check whether there is a new record in that table. If there is, read it and work out what happened in your database based on it (identified by time, auto-increment id, ...).

Use ConditionExpression to limit insert when ID doesn't exist in other table

Simple thing. Table A has a HashKey id and an additional hash index on column ex_id, which is a kind of foreign key into table B.
When inserting new data into table A, I would like to raise an exception whenever the value in column ex_id doesn't have a corresponding entry in table B.
I thought that ConditionExpression is the way to go, but I can't make it work - probably missing something obvious. I tried to use contains()...
Any ideas?
As far as I know, this is not possible on the DynamoDB side because there are no relationships between tables.
What you can do is add a check at the application level that throws an exception before inserting the value into table A. (Query table B for that id; if it is found, insert, otherwise throw an exception.)
DynamoDB does not natively support any kind of foreign key; everything works on a per-table, per-key basis. DynamoDB's approach is to handle such logic at the client level. For example, see the dynamodb transactions client. This library allows you to perform transactions across tables that either all succeed or all roll back.
For your case, I would first make a getItem request to table B (using a consistent read) and, if the item exists, write to table A.
Then I would enable streams on table A and write a Lambda function to check whether any violating data gets written to the table.
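The check-then-write step could be sketched as follows. This assumes table B's hash key attribute is called id, that the tables are literally named A and B, and that items use the low-level attribute-value format; the exception class and function name are made up for the example. The client argument is expected to behave like a boto3 DynamoDB client (get_item and put_item with keyword arguments).

```python
class MissingReferenceError(Exception):
    """Raised when ex_id has no matching item in table B."""


def put_with_reference_check(client, item, table_a="A", table_b="B"):
    """Write item to table A only if item['ex_id'] exists as a key in table B."""
    ref = client.get_item(
        TableName=table_b,
        Key={"id": item["ex_id"]},
        ConsistentRead=True,  # avoid missing a row that was written moments ago
    )
    # GetItem omits the "Item" field from the response when nothing matches.
    if "Item" not in ref:
        raise MissingReferenceError("no item in table B for ex_id %r" % (item["ex_id"],))
    client.put_item(TableName=table_a, Item=item)
```

The read and the write are two separate calls, so an entry in table B can still disappear in between, which is why the stream-based check on table A is suggested as a second line of defence.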

Change the schema of a DynamoDB table: what is the best/recommended way?

What is the Amazon-recommended way of changing the schema of a large table in a production DynamoDB?
Imagine a hypothetical case where we have a table Person, with primary hash key SSN. This table may contain 10 million items.
Now the news comes that due to the critical volume of identity thefts, the government of this hypothetical country has introduced another personal identification: Unique Personal Identifier, or UPI.
We have to add a UPI column and change the schema of the Person table so that the primary hash key is now UPI. For some time we want to support both the current system, which uses SSN, and the new system, which uses UPI, so both columns need to co-exist in the Person table.
What is the Amazon-recommended way to do this schema change?
There are a couple of approaches, but first you must understand that you cannot change the schema of an existing table. To get a different schema, you have to create a new table. You may be able to reuse your existing table, but the result would be the same as if you created a different table.
Lazy migration to the same table, without Streams. Every time you modify an entry in the Person table, create a new item in the Person table using UPI and not SSN as the value for the hash key, and delete the old item keyed at SSN. This assumes that UPI draws from a different range of values than SSN. If SSN looks like XXX-XX-XXXX, then as long as UPI has a different number of digits than SSN, then you will never have an overlap.
Lazy migration to the same table, using Streams. When Streams becomes generally available, you will be able to turn on a stream for your Person table. Create a stream with the NEW_AND_OLD_IMAGES stream view type and subscribe a Lambda function to it; whenever the function detects a change that adds a UPI to an existing person in the Person table, it removes the person keyed at SSN and adds a person with the same attributes keyed at UPI. This approach has race conditions that can be mitigated by adding an atomic counter as a version attribute to the item and conditioning the DeleteItem call on that version attribute.
Preemptive (scripted) migration to a different table, using Streams. Run a script that scans your table and adds a unique UPI to each Person item in the Person table. Create a stream on the Person table with the NEW_AND_OLD_IMAGES stream view type and subscribe a Lambda function to that stream. Whenever the function detects that a Person with a UPI was changed or that a Person had a UPI added, it writes that Person to a new Person_UPI table. Mutations on the base table usually take hundreds of milliseconds to appear in a stream as stream records, so you can do a hot failover to the new Person_UPI table in your application: reject requests for a few seconds, point your application to the Person_UPI table during that time, and then re-enable requests.
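For the race-condition mitigation mentioned in the Streams-based lazy migration, the conditional delete could be built along these lines. This is a sketch only: the attribute name version and the function name are assumptions, and the returned dictionary is meant to be passed as keyword arguments to a low-level DeleteItem call (for example boto3's client.delete_item).

```python
def build_versioned_delete(table_name, key, expected_version, version_attr="version"):
    """Build DeleteItem kwargs that only succeed if the version is unchanged.

    If another writer bumped the version counter after the stream record was
    produced, the condition fails and the SSN-keyed item is left in place.
    """
    return {
        "TableName": table_name,
        "Key": key,
        "ConditionExpression": "#v = :expected",
        # A name placeholder avoids clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#v": version_attr},
        # Numbers are sent as strings in the low-level attribute-value format.
        "ExpressionAttributeValues": {":expected": {"N": str(expected_version)}},
    }
```

For example, build_versioned_delete("Person", {"SSN": {"S": "123-45-6789"}}, 7) produces a request that deletes the SSN-keyed item only while its version attribute is still 7.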
DynamoDB streams enable us to migrate tables without any downtime. I've done this to great effect, and the steps I've followed are:
1. Create a new table (let us call this NewTable) with the desired key structure, LSIs, and GSIs.
2. Enable DynamoDB Streams on the original table.
3. Associate a Lambda with the stream, which pushes each record into NewTable. (This Lambda should trim off the migration flag set in Step 5.)
4. [Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has the attributes Primary Key and Migrated (see Step 5).
5. Scan the GSI created in the previous step (or the entire table) using the following filter:
FilterExpression = "attribute_not_exists(Migrated)"
Update each item in the table with a migration flag (i.e. "Migrated": { "S": "0" }), which sends it to DynamoDB Streams. Use the UpdateItem API to ensure no data loss occurs.
NOTE: You may want to increase write capacity units on the table during the updates.
6. The Lambda will pick up all items, trim off the Migrated flag, and push them into NewTable.
7. Once all items have been migrated, repoint the code to the new table.
8. Remove the original table and the Lambda function once you are happy all is good.
Following these steps should ensure you have no data loss and no downtime.
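As a rough illustration of the flag handling in these steps (not the code from the blog post), the two small pieces could look like this. The key attribute name pk and both function names are assumptions; the first helper builds keyword arguments for a low-level UpdateItem call, and the second is what the Lambda would apply to each NewImage before writing it to NewTable.

```python
MIGRATION_FLAG = "Migrated"


def build_touch_update(table_name, key, key_attr="pk"):
    """Build UpdateItem kwargs that set the flag so the item enters the stream.

    The condition skips items that were deleted since the scan (UpdateItem
    would otherwise recreate them) and items that are already flagged.
    """
    return {
        "TableName": table_name,
        "Key": key,
        "UpdateExpression": "SET #m = :zero",
        "ConditionExpression": "attribute_exists(#k) AND attribute_not_exists(#m)",
        "ExpressionAttributeNames": {"#k": key_attr, "#m": MIGRATION_FLAG},
        "ExpressionAttributeValues": {":zero": {"S": "0"}},
    }


def strip_flag(new_image):
    """Return a copy of the stream image without the migration flag."""
    return {name: value for name, value in new_image.items() if name != MIGRATION_FLAG}
```

Using UpdateItem rather than rewriting the whole item means concurrent changes to other attributes are not overwritten by the migration script.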
I've documented this on my blog, with code to assist:
https://www.abhayachauhan.com/2018/01/dynamodb-changing-table-schema/
I'm using a variant of Alexander's third approach. Again, you create a new table that will be updated as the old table is updated. The difference is that you use code in the existing service to write to both tables while you're transitioning instead of using a lambda function. You may have custom persistence code that you don't want to reproduce in a temporary lambda function and it's likely that you'll have to write the service code for this new table anyway. Depending on your architecture, you may even be able to switch to the new table without downtime.
However, the nice part about using a lambda function is that any load introduced by additional writes to the new table would be on the lambda, not the service.
If the changes involve changing the partition key, you can add a new GSI (global secondary index). Moreover, you can always add new columns/attributes to DynamoDB without needing to migrate tables.

Teradata: Is there a way to generate DDL from a view or select statement?

I am using a global application user account to access database A. This user account does not have permissions to modify database A's schema (ie, create tables, modify tables, etc). This user also has access to database B, but only views. I need to run SQL to feed data from a view in database B into a table in database A.
In a perfect world, I would be able to use this SQL:
create table database_a.mytable as (select * from database_b.myview) with no data
However, the user can't create tables in database A. If I could get the DDL of the select statement then I could log in under my personal account (which doesn't have any access to database B) and run the DDL in database A to create the table.
The only other option is to manually write the SQL, but I don't want to do that, especially since this view I am wanting to copy has many columns of varying data types and sizes.
Edit: I may be getting closer. I just experimented with this:
show (select * from database_b.myview)
However, it generated the DDL of every single table that is used in the view itself, as well as the definition for the view. This doesn't really help me since I just want the schema of the select statement itself. In other words, I need what would be generated if I were to use the create table as statement mentioned above.
Edit for Rob: Perhaps "DDL" was the wrong term to use. Using show view db.myview just shows the definition of the view, not the schema it represents. In my above example of create table as, I show how you can create a table that mimics the schema of a result set returned by a select. It generates a DDL on the back end for creating a table and then executes that DDL to actually create the table. You can then say show table db.newtable and see the new table's DDL. I want to get that DDL directly from a select statement so that I can copy it, log out of the app account, log in to my personal account, and then execute the DDL to create the table.
This is only to spare me the headache of typing out the DDL by hand, to save time and reduce typing errors, especially since the source view has so many columns. That said, I think hitting up the DBA or writing some snazzy stored procedure to do dynamic stuff would be a bit over the top for my needs. I think there has to be a way to get the DDL for creating a table schema directly from a select statement.
Generate DDL Statements for objects:
SHOW TABLE {DatabaseB}.{Table1};
SHOW VIEW {DatabaseB}.{View1};
Breakdown of columns in a view:
HELP VIEW {DatabaseB}.{View1};
However, without the ability to create the object in the target database DatabaseA, you don't have much leverage. Obviously, if the object already existed, INSERT INTO SELECT ... FROM DatabaseB.Table1 or MERGE INTO would be options that you have already explored.
Alternative Solution
Would it be possible to have a stored procedure created that dynamically creates the table based on the view name that is provided? The global application account would simply need the privilege to execute the procedure. Generally, the user creating the stored procedure would need the permissions to perform the actions contained within the stored procedure. (You have some additional flexibility with this in Teradata 13.10.)
There are some caveats with this approach. You are attempting to materialize views that could reference anywhere from hundreds to billions of records. These aren't simple 1:1 views that are put on top of the target tables. Trying to determine the required space in the target database to materialize the view will be difficult. Performance can and will vary depending on the complexity of the view and the data volumes. This will not be a fast-path or data block optimized operation.
As a DBA, I would be concerned with this approach being taken on by a global application account without fully understanding the intent. I trust you have an open line of communication with the DBA(s) involved for supporting this system. I'm sure there are reasons for your madness that can't be disclosed here.
Possible Solution - VOLATILE TABLE
Unless the implicit privilege for CREATE TABLE has been revoked from the global application account this solution should work.
Volatile tables do not require perm space. Their table definitions persist for the duration of the session, and any data inserted into them relies on the spool space of the user who instantiated them.
CREATE VOLATILE TABLE {Global Application UserID}.{TableA_Copy} AS
(
SELECT *
FROM {DatabaseB}.{TableA}
)
WITH NO DATA
NO PRIMARY INDEX
ON COMMIT PRESERVE ROWS;
SHOW TABLE {Global Application UserID}.{TableA_Copy};
I opted to use a Teradata 13.10 feature called NO PRIMARY INDEX. By default, CREATE TABLE AS will take the first column of the SELECT statement and make it the PRIMARY INDEX of the table. This could lead to skewing and perm space issues in your testing depending on the data demographics. You can specify an explicit PRIMARY INDEX on your own as you understand the underlying data. (See the DDL manuals for details on the syntax if you're uncertain.)
The use of ON COMMIT PRESERVE ROWS for the intent of this example is probably extraneous. But in reality if you popped any data into that table for testing this clause would be beneficial in Teradata mode as the data would otherwise be lost immediately after the CREATE TABLE or any other data manipulation was performed against the volatile table.
