DynamoDB Single Table Schema Design with Adjacency Lists - amazon-dynamodb

I am trying to understand how to properly design a DynamoDB schema. I've read a few articles and watched some YouTube videos but, to be honest, I don't yet feel quite comfortable.
This is what I am trying to design properly:
two entities, "location" (id & name) and "vehicle" (id & name)
a location can have 0-n vehicles
a vehicle can be in 0-1 locations
Access patterns:
get a list of all available locations (id & name)
get a list of all available vehicles and their current location (id, name, location-id, location-name)
get a list of all vehicles in a given location (id, name)
I've read about adjacency lists, and because there will be n-m relations I've decided to give them a try.
This is what I've come up with:
# | PK (GSI1-SK) | SK (GSI1-PK) | DATA
==|======================|====================|==============
1 | LOCATION#locationId1 | A | locationName1
2 | LOCATION#locationId2 | A | locationName2
3 | LOCATION#locationId1 | VEHICLE#vehicleId1 |
4 | LOCATION#locationId1 | VEHICLE#vehicleId2 |
5 | LOCATION#locationId2 | VEHICLE#vehicleId3 |
6 | VEHICLE#vehicleId1 | A | vehicleName1
7 | VEHICLE#vehicleId2 | A | vehicleName2
8 | VEHICLE#vehicleId3 | A | vehicleName3
#1-2 & #6-8 are my entity records, the ones that carry additional data for the entity itself (e.g. its name).
#3-5 is an example of how I would model a relationship. I've added an inverted GSI so that I can query in both directions.
Back to my access patterns:
get a list of all available locations (id & name)
query GSI1 for SK=A and PK begins with LOCATION#
get a list of all available vehicles and their current location (id, name, location-id, location-name)
query GSI1 for SK=A and PK begins with VEHICLE#
for each result item, query GSI1 for SK=VEHICLE#vehicleId and PK begins with LOCATION#
for each result item, query table for PK=LOCATION#locationId and SK=A
... this doesn't seem right
get a list of all vehicles in a given location (id, name)
query table for PK=LOCATION#locationId and SK begins with VEHICLE#
for each result item, query table for PK=VEHICLE#vehicleId and SK=A
... this doesn't seem right
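For reference, a minimal boto3 sketch of the first pattern ("query GSI1 for SK=A and PK begins with LOCATION#"); the table name is a placeholder:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

# GSI1 is the inverted index: its partition key is the table's SK, its sort key is the table's PK
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("SK").eq("A") & Key("PK").begins_with("LOCATION#"),
)
locations = resp["Items"]  # e.g. [{"PK": "LOCATION#locationId1", "SK": "A", "DATA": "locationName1"}, ...]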
Adjacency lists look like a nice and clean way to design complex relationships, but either I am doing something wrong (probably) or they come with a lot of queries that are necessary to look things up.
Any advice is appreciated.

I modelled this in DynamoDB Workbench:
Main Index (PK -> SK)
GSI1 (PK1 -> SK)
In order to:
"get a list of all available locations (id & name)"
select * from GSI1 where PK1="ALL#LOCATION"
"get a list of all available vehicles and their current location (id, name, location-id, location-name)"
select * from MAIN-INDEX where PK="ALL#VEHICLE"
"get a list of all vehicles in a given location (id, name)"
select * from GSI1 where PK1="LOC#ID"
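For illustration, those three queries as a minimal boto3 sketch (table, index and attribute names are placeholders matching the model above); note that each access pattern is a single request with no follow-up lookups:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

# all locations, from the GSI1 partition "ALL#LOCATION"
locations = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("PK1").eq("ALL#LOCATION"),
)["Items"]

# all vehicles with their denormalized location attributes, from the main index
vehicles = table.query(
    KeyConditionExpression=Key("PK").eq("ALL#VEHICLE"),
)["Items"]

# all vehicles in a given location, from the GSI1 partition "LOC#<locationId>"
in_location = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("PK1").eq("LOC#locationId1"),
)["Items"]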
Several things to note here:
It's important to distribute the traffic across all partition keys. I'm using "ALL#" partition keys in this design. Ideally you shard that somehow; there are several tricks, like using dates or timestamps truncated to the beginning of the day, or randomly spreading the items across a fixed number of "ALL#" records and then querying one of them at random, if your use case allows it. If you have millions of locations this is probably OK. That's how you make these decisions: think about the traffic and the behaviour of the data.
In order to use both indexes I put the "ALL#LOCATION" and the "ALL#VEHICLE" partition keys in different indexes.
Notice that vehicle 4 doesn't have a PK1. See what happens to GSI1. This is what's called a sparse index.
I denormalized the vehicle-location relationship. Assuming that the location ID and the location name are immutable, it's OK to do this; the problem is when the attributes you denormalize are mutable, so avoid that if possible.

Related

DynamoDB Global Secondary Index "Batch" Retrieval

I've seen older posts around this but am hoping to bring this topic up again. I have a table in DynamoDB that has a UUID for the primary key, and I created a global secondary index (GSI) for a more business-friendly key. For example:
| account_id | email | first_name | last_name |
|------------ |---------------- |----------- |---------- |
| 4f9cb231... | linda#gmail.com | Linda | James |
| a0302e59... | bruce#gmail.com | Bruce | Thomas |
| 3e0c1dde... | harry#gmail.com | Harry | Styles |
If account_id is my primary key and email is my GSI, how do I query the table to get accounts with email in ('linda#gmail.com', 'harry#gmail.com')? I looked at the IN conditional expression but it doesn't appear to work with a GSI. I'm using the Go SDK v2 library but will take any guidance. Thanks.
Short answer, you can't.
DDB is designed to return a single item, via GetItem(), or a set of related items, via Query(). Related meaning that you're using a composite primary key (hash key & sort key) and the related items all have the same hash key (aka partition key).
Another way to think of it, you can't Query() a DDB Table/index. You can only Query() a specific partition in a table or index.
Scan() is the only operation that works across partitions in one shot. But scanning is very inefficient and costly since it reads the entire table every time.
You'll need to issue a separate request for every email you want returned; since email is a GSI key rather than the table's primary key, that means a Query() against the index per email (GetItem() only works against the table's primary key).
If you do have the primary keys in hand, DDB also offers BatchGetItem(), which will allow you to send multiple GetItem() requests, up to 100, in a single call. It saves a little bit of network time and automatically runs the requests in parallel, but otherwise it is little different from what your application could do itself directly with GetItem(). Make no mistake, BatchGetItem() is making individual GetItem() requests behind the scenes. In fact, the requests in a BatchGetItem() don't even have to be against the same tables/indexes. The cost for each request in a batch will be the same as if you'd used GetItem() directly.
One difference to make note of: BatchGetItem() can only return 16 MB of data. So if your DDB items are large, you may not get as many items returned as you requested.
For example, if you ask to retrieve 100 items, but each individual item is 300 KB in size, the system returns 52 items (so as not to exceed the 16 MB limit). It also returns an appropriate UnprocessedKeys value so you can get the next page of results. If desired, your application can include its own logic to assemble the pages of results into one dataset.
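For illustration, a minimal boto3 sketch of a BatchGetItem() call; the table name "accounts" is a placeholder, and because BatchGetItem() works on the table's primary key (not a GSI), you pass the account_ids, not the emails:

import boto3

client = boto3.client("dynamodb")

resp = client.batch_get_item(
    RequestItems={
        "accounts": {  # placeholder table name
            "Keys": [
                {"account_id": {"S": "4f9cb231..."}},
                {"account_id": {"S": "3e0c1dde..."}},
            ]
        }
    }
)
items = resp["Responses"]["accounts"]
unprocessed = resp.get("UnprocessedKeys", {})  # re-issue these keys if non-empty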
Because you have a GSI with a PK of email (from what I understand), you can use a PartiQL statement to get your batch of emails back. The API is called ExecuteStatement and it uses a SQL-like syntax:
SELECT * FROM "mytable"."myindex" WHERE email IN ['email#email.com','email1#email.com']
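For illustration, a minimal boto3 sketch of that PartiQL call via ExecuteStatement; the table name "accounts" and index name "email-index" are placeholders:

import boto3

client = boto3.client("dynamodb")

# PartiQL lets you hit the GSI with an IN list in a single request
resp = client.execute_statement(
    Statement="SELECT * FROM \"accounts\".\"email-index\" "
              "WHERE email IN ['linda#gmail.com', 'harry#gmail.com']"
)
items = resp["Items"]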

Slow query on table | WHERE x | ORDER by timestamp | DISTINCT a,b,c,d | TAKE 20 when table large

We are experiencing a sudden performance drop with a query structured like this:
table(tablename)
| where MeasurementName in ('ActiveJobId')
and MachineId == machineId
and SourceTimestamp <= from
and isnotnull( Value)
| order by SourceTimestamp desc
| distinct SourceTimestamp, MeasurementName, tostring(Value), SourceTimestampUtc
| take rows
tablename, machineId, from and rows are all query parameters. rows is typically 20. The Value column is of type dynamic.
The table contains 240 million entries, with about 64,000 matching the WHERE criteria. The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
The query runs smoothly on the Staging database system, but started to degrade in performance on the Dev system, possibly because of the increased data volume.
If we remove the distinct clause, or move it behind the take clause, the query completes very fast (<1 s). The data contains about 5-10% duplicate entries.
To our understanding the query should be performed like this:
Prepare a filter for the source table, start at a specific datetime range
Order desc: walk backwards
Walk down the table and stop once you have 20 distinct rows
From the time it sometimes takes it looks almost as if ADX walks down the whole table, performs a distinct, and then only takes the topmost 20 rows.
The problem persists if we swap | order and | distinct around.
The problem disappears if we move | distinct to the end of the query, but then we often receive 1-2 items fewer than required.
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
This part of the description doesn't match the filter in your query: and SourceTimestamp <= from - did you mean to use >= instead of <= ?
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
If you can't eliminate the duplicates upstream, you can consider setting up a materialized view that performs the deduplication, then query the view directly instead of the raw data. Also see Handle duplicate data.

Determining a partition key in Dynamo DB for GSI

I am new to DynamoDB and I am finding it hard to think of how I should decide my partition key. I am using a condensed version of my use case:
I have an attribute which is a boolean value => B
For a given ID, I need to return all the data for it. The ID is stored in either the X or the Y attribute. For the given ID, if B is true I need to read attribute X, otherwise Y.
While inserting into the table I know the value of B, and hence I can fill in either X or Y depending on it.
However, while fetching, I am just given an ID, and I need to figure out whether it exists in column X or column Y (I won't be getting the value of B in the input).
In an RDBMS I could run a query like select * from tab where (B == true && X == ID) || (B == false && Y == ID).
I think creating a GSI in DynamoDB will be the way to go about solving this. However, I am not able to figure out the best way to approach it. Could I get suggestions?
Not sure if I got your use case correctly, but why not just swap the target columns based on the value of B while inserting a row?
Consider the following input:
+-----+------+--------+
| X | Y | B |
+-----+------+--------+
| ID1 | ID2 | true |
+-----+------+--------+
| ID3 | ID4 | true |
+-----+------+--------+
| ID5 | ID6 | false |
+-----+------+--------+
| ID7 | ID8 | false |
+-----+------+--------+
What if you store the values like this:
+-----------+-------------------------+
| id | opposite id |
|(hash key) | or whatever you call it |
+-----------+-------------------------+
| ID1 | ID2 |
+-----------+-------------------------+
| ID3 | ID4 |
+-----------+-------------------------+
| ID6 | ID5 |
+-----------+-------------------------+
| ID8 | ID7 |
+-----------+-------------------------+
This way, when fetching an item by an IDXXX value, you only need to query the single id column.
UPD: Note that if your use case allows having multiple records with the same id, you would need another field to serve as a range key. This holds true whether or not you swap columns as shown above.
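For illustration, a minimal boto3 sketch of the lookup under this layout; the table name "id-mapping" and the attribute name "opposite_id" are placeholders:

import boto3

table = boto3.resource("dynamodb").Table("id-mapping")  # placeholder table name

# Whichever of X/Y the caller holds was written into the single "id" hash key,
# so one GetItem resolves the row without knowing B.
resp = table.get_item(Key={"id": "ID6"})
item = resp.get("Item")  # e.g. {"id": "ID6", "opposite_id": "ID5"}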
As per the AWS DynamoDB blog post Choosing the Right DynamoDB Partition Key:
Choosing the right DynamoDB partition key is an important step in the design and building of scalable and reliable applications on top of DynamoDB.
What is a partition key?
DynamoDB supports two types of primary keys:
Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.
Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
Why do I need a partition key?
DynamoDB stores data as groups of attributes, known as items. Items
are similar to rows or records in other database systems. DynamoDB
stores and retrieves each item based on the primary key value which
must be unique. Items are distributed across 10 GB storage units,
called partitions (physical storage internal to DynamoDB). Each table
has one or more partitions, as shown in Figure 2. For more
information, see the Understand Partition Behavior in the DynamoDB
Developer Guide.
DynamoDB uses the partition key’s value as an input to an internal
hash function. The output from the hash function determines the
partition in which the item will be stored. Each item’s location is
determined by the hash value of its partition key.
All items with the same partition key are stored together, and for
composite partition keys, are ordered by the sort key value. DynamoDB
will split partitions by sort key if the collection size grows bigger
than 10 GB.
Recommendations for partition keys
Use high-cardinality attributes. These are attributes that have distinct values for each item, like e-mail id, employee_no, customerid, sessionid, orderid, and so on.
Use composite attributes. Try to combine more than one attribute to
form a unique key, if that meets your access pattern. For example,
consider an orders table with customerid+productid+countrycode as the
partition key and order_date as the sort key.
Cache the popular items when there is a high volume of read traffic.
The cache acts as a low-pass filter, preventing reads of unusually
popular items from swamping partitions. For example, consider a table
that has deals information for products. Some deals are expected to be
more popular than others during major sale events like Black Friday or
Cyber Monday.
Add random numbers/digits from a predetermined range for write-heavy use cases. If you expect a large volume of writes for a partition key, use an additional prefix or suffix (a fixed number from a predetermined range, say 1-10) and add it to the partition key. For example, consider a table of invoice transactions. A single invoice can contain thousands of transactions per client.
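For illustration, a small Python sketch of the last two recommendations (composite attributes and a write-sharding suffix); the key layouts here are made-up examples:

import random

SHARDS = 10  # fixed, predetermined range of suffixes

def order_partition_key(customerid: str, productid: str, countrycode: str) -> str:
    # Composite-attribute key, as in the orders-table example above
    return f"{customerid}#{productid}#{countrycode}"

def invoice_partition_key(invoice_id: str) -> str:
    # Spread writes for a hot invoice across SHARDS partition key values,
    # e.g. "INV-1001#7"; reads must then query all SHARDS suffixes and merge.
    return f"{invoice_id}#{random.randint(1, SHARDS)}"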
Read more: Choosing the Right DynamoDB Partition Key

Indicating a "canonical" record in a one-to-many table

Imagine we have a table of countries, and a table of cities. A country can of course have many cities, but a city can only be in one country, thus a one-to-many relationship makes intuitive sense:
countries
| id | name |
| 1 | Lorwick |
| 2 | Belmead |
cities
| id | country | name |
| 1 | 1 | Marblecrest |
| 2 | 1 | Westacre |
| 3 | 2 | Belcoast |
| 4 | 1 | Rosemarsh |
| 5 | 2 | Vertston |
But in addition to our one-to-many relationship, we want to describe the one-to-one relationship of national capitals. If it matters, assume that capitals may change fairly regularly, and for that matter cities appear and vanish at will, and that cities may switch countries. Point is, this data is unstable.
I see a couple of options:
Add an int column capital to countries which cannot be null. Pro: always exactly one city; Con: not associated with the city, nothing enforcing the city is in the country, or that it even exists.
Add a boolean column capital to cities, which if true indicates the city is the capital of the associated country. Pro: directly associated with the city in question, no duplicate columns indicating hierarchy; Con: pretty sure this is poor normalization as there's nothing stopping there being zero, or more than one, "capital" for a given country.
Create an additional table capitals with columns country and city and a unique constraint on both columns (or at least on city). Pro: feels cleaner, easy joining on either countries or cities; Con: still doesn't ensure city is in country, or that either exist.
What is the most normalized and/or best way to represent this relationship? Is there any way to ensure each country has exactly one capital which does in fact exist and resides inside that country? I imagine it's not possible, in which case, how can I best minimize issues for my client code?
I'm currently using SQLite, but I'm interested in generalized answers, regardless of the underlying database.
I did a little digging and found Indicating primary/default record in database but I don't think this really answers my question.
PS: It's not that bad if there's no capital (there may be no cities!), but it would be bad if there were multiple.
I think the requirement "each country has exactly one capital" conflicts with the requirement "cities appear and vanish at will". If a city can vanish, it follows that a capital city can vanish, too.
You can enforce the constraint "each country has [zero or] one capital which does in fact exist and resides inside that country" with a foreign key constraint on a table of capitals.
create table capitals (
    country_id integer primary key,
    city_id integer not null,
    foreign key (country_id, city_id) references cities (country_id, city_id)
);
In that table, the primary key constraint guarantees that there can be no more than one capital per country. The foreign key constraint guarantees that the capital you choose exists in the country you choose. In the referenced table (the "cities" table), you also need a unique constraint on {city_id, country_id}; since {city_id} is unique in the "cities" table, {city_id, country_id} will necessarily be unique in that table too, so that's not a problem.
The declarative "way" to guarantee a one-to-one relationship between countries and capitals (not a one-to-zero-or-one relationship) is to use an assertion. But I don't know of any current SQL dbms that supports CREATE ASSERTION. That forces us rely on one or more of these:
triggers and possibly deferred constraints,
application code, or
administrative procedures.
(Initially, you'd have to enter a row in the three tables "countries", "cities", and "capitals" in a single transaction in order to satisfy all the constraints. I think you'll need deferred constraints for that, but I haven't had coffee yet today.)
For clarity and simplicity, I'd add the boolean IsCapital column to the cities table. Then add a trigger that, when IsCapital is set to true on a record, sets IsCapital = false on all other cities that share the updated record's country. This will handle most of your concerns. Ensuring there is exactly one capital per country isn't really possible; you can ensure there is 0 or 1, but since the cities table has an FK constraint to countries, there will always be a point in time where a newly inserted country has no cities that can be set as the capital.
FWIW, I think logic should be left to the app, referential integrity to the database.
To make sure there is exactly one capital per country and the capital is not a city from a different country, do this:
Note how we use the identifying relationship to migrate the COUNTRY_ID to CITY's PK, so it can be migrated back to the COUNTRY. This is what guarantees that a capital must actually belong to the country it is the capital of.
The circular reference here prevents the insertion of new data, which is resolved using deferred foreign keys if the DBMS supports them. Otherwise, you can just leave COUNTRY.CAPITAL_NO NULL-able (and enforce its eventual non-NULL-ness at the application level).1
1 This assumes the DBMS has MATCH SIMPLE foreign keys (i.e. FK is ignored if any of its components are NULL). If the DBMS supports only MATCH PARTIAL or FULL (such as MS Access), you are out of luck, and would have to emulate the FK through non-declarative means (triggers or application code).

What's the best way to retrieve this data?

The architecture for this scenario is as follows:
I have a table of items and several tables of forms. Rather than having the forms own the items, the items own the forms. This is because one item can be on several forms (although only one of each type, but not necessarily on any). The forms and items are all tied together by a common OrderId. This can be represented like so:
OrderItems | Form A    | Form B etc...
-----------|-----------|--------------
ItemId     | FormAId   |
OrderId    | OrderId   |
FormAId    | SomeField |
FormBId    | OtherVar  |
FormCId    | etc...    |
This works just fine for these forms. However, there is another form, (say, FormX) which cannot have an OrderId because it consists of items from multiple orders. OrderItems does contain a column for FormXId as well, but I'm confused about the best way to get a list of the "FormX"s related to a single OrderId. I'm using MySQL and was thinking maybe a stored proc was the best way to go on this, but I've never used a stored proc on MySQL and don't really know the best way to go about it. My other (kludgy) option was to hit the DB twice, first to get all the items that are for the given OrderId that also have a FormXId, and then get all their FormXIds and do a dynamic SELECT statement where I do something like (pseudocode)
SELECT whatever FROM sometable WHERE FormXId=x OR FormXId=y....
Obviously this is less than ideal, but I can't really think of any other way... anything better I could do either programmatically or architecturally? My back-end code is ASP.NET.
Thanks so much!
UPDATE
In response to the request for more info:
Sample input:
OrderId = 1000
Sample output
FormXs:
-----------------
FormXId | FieldA | FieldB | etc
-------------------------------
1003 | value | value | ...
1020 | ... .. ..
1234 | .. . .. . . ...
You see, the problem is that FormX doesn't have one single OrderId but is rather a collection of OrderIds. Sometimes multiple items from the same order are on FormX, sometimes it's just one, and most orders don't have any items on FormX. But when someone pulls up their order, I need all the FormXs their items belong on to show up so they can be modified/viewed.
I was thinking of maybe creating a stored proc that does what I said above, run one query to pull down all the related OrderIds and then another to return the appropriate FormXs. But there has to be a better way...
I understand you need to get a list of the "FormX"s related to a single OrderId. You say that OrderItems contains a column for FormXId.
You can issue the following query:
select FormX.*
from OrderItems
join FormX on OrderItems.FormXId = FormX.FormXId
where OrderItems.OrderId = @orderId
You need to pass @orderId to your query, and you will get a record set with the FormX records related to this order.
You can either package this query up as a stored procedure with an @orderId parameter, or you can use dynamic SQL and substitute @orderId with the real order number you are executing the query for.
