Is there a better way to get only the latest entry for each partition key + sort key prefix? - amazon-dynamodb

We currently have a table that has both a partition key and sort key that make up the primary key.
They're both strings.
Example:
p_id: A#2021-04-21 (region code + date)
s_id: XYZ#2#1634925978 (id + code + timestamp)
A use case of ours is to get all items for a given partition (regioncode+date), but ONLY the latest for a given id and code.
So for example if we had:
A#2021-04-21 , XYZ#2.0#10000 , <other attributes>
A#2021-04-21 , XYZ#2.0#20000 , ...
A#2021-04-21 , QRS#2.0#10000 , ...
We'd only want to get
A#2021-04-21 , XYZ#2.0#20000 , ...
A#2021-04-21 , QRS#2.0#10000 , ...
To do this currently, I'm just doing:
response = self.table.query(
    KeyConditionExpression=Key(self.table_key_name).eq(f"{region_id}#{date_key}")
)
I then pull the items out and manually build a map keyed on each sort key prefix up to the epoch milliseconds / timestamp. Then, for each key, I set the value only if the timestamp is newer than whatever was previously there.
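For reference, that client-side reduction looks roughly like this (a minimal sketch; it assumes the sort key attribute is named s_id as in the example above and that the timestamp is always the last '#'-separated segment):
latest = {}
for item in response["Items"]:
    # Split e.g. "XYZ#2.0#20000" into prefix "XYZ#2.0" and timestamp "20000"
    prefix, _, ts = item["s_id"].rpartition("#")
    if prefix not in latest or int(ts) > latest[prefix][0]:
        latest[prefix] = (int(ts), item)
newest_per_prefix = [item for _, item in latest.values()]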
Is there a way to do this faster and make more use of the query itself? I've debated adding the pieces of the ID as attributes and maybe using some kind of filtering, but I don't see anything that would let me do the equivalent of a "group by" here. Do I have no choice but to create some kind of index?
Any ideas? Help would be much appreciated!

DDB doesn't support aggregations (MIN/MAX/COUNT/SUM) like an RDBMS does...
One solution is to use a "trigger", DDB Streams + Lambda, to aggregate the needed data for you. See Using Global Secondary Indexes for Materialized Aggregation Queries.
You might also want to consider looking at various ways to implement versioning of your DDB data.

If you want to get the latest item, then your sort key should end in an ISO 8601 standard format date that is set when the item is added. You can then do a Query, and because ISO 8601 dates sort 'alphabetically' and sort keys are... well, automatically sorted, telling it to return the response in descending order means the first item returned is automatically the last item added (and in the default ascending order, the first item returned is automatically... the first item!).
You will need to do something like SK: SOME_QUALIFIER#YYYY-MM-DDTHH:MM:SSZ, and then do your query with SK begins with "SOME_QUALIFIER#". So you will have to think about how you want to organize this, but it is entirely possible, taking advantage of the fact that the sort key is automatically sorted.
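For instance (a sketch using boto3; the table handle, key names, and qualifier are illustrative):
from boto3.dynamodb.conditions import Key

# ISO 8601 dates sort lexicographically, so reversing the sort order
# puts the most recently added item first.
response = table.query(
    KeyConditionExpression=Key("p_id").eq(partition_value)
    & Key("s_id").begins_with("SOME_QUALIFIER#"),
    ScanIndexForward=False,  # descending by sort key
    Limit=1,                 # just the latest item
)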
Alternatively, if you are only going to be doing this once in a while (i.e. for a generated report or something), it's OK to put your last-updated date (or last-created, whichever is more important) in its own attribute (and with composite-type keys you often should anyways!!!) and then create an index with that as your sort key and something else (report type or whatever) as your PK. Then you can query that PK and get the latest item there.
MIN/MAX and many other SQL-style calls can be mimicked by making clever use of the sort key.

Related

Querying on Global Secondary indexes with a usage of contains operator

I've been reading the DynamoDB docs and was unable to work out whether it makes sense to query a Global Secondary Index using the 'contains' operator.
My problem is as follows: my DynamoDB document has a list of embedded objects, and every object has a 'code' field which is unique:
{
  "entities": [
    {"code": "entity1Code", "name": "entity1Name"},
    {"code": "entity2Code", "name": "entity2Name"}
  ]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all the entity codes present in the current document, separated by commas. So the example above would look like:
{
  "entities": [
    {"code": "entity1Code", "name": "entity1Name"},
    {"code": "entity2Code", "name": "entity2Name"}
  ],
  "entitiesGlobalSecondaryIndex": "entityCode1,entityCode2"
}
And then I would like to apply a filter expression on entitiesGlobalSecondaryIndex, something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a global secondary index make no sense this way, such that DynamoDB will simply check the condition against every document, similar to a scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition key. For a query to use any sort of operator (begins_with, >, <, BETWEEN, etc.) you must have a range attribute, aka your sort key - and contains itself is only available in filter expressions, not key conditions.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are a replication of the table - there is a slight potential for the data in a GSI to lag behind that of the master copy. If the query you're running against this GSI isn't very frequent, then you're probably safe from that.
However, if you are trying to do this across the entire table at once, then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if there aren't too many per document, you could maybe use a sparse index - if you have an entity with code "AAAA" then you also have an attribute named AAAA (or AAAAflag or something). It does not exist unless the entities list contains that code. If you build a GSI on this AAAAflag attribute, it will only contain documents that have that entity code, and ignore any document where the attribute does not exist. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned, and if you don't have too many codes.
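As a rough sketch of how an item would pick up such a flag (boto3 assumed; the table handle, key, and attribute names are illustrative):
# Set the flag attribute only when the code is present; items without it
# never appear in a GSI keyed on AAAAflag, which is what makes it sparse.
item = {"pk": doc_id, "entities": entities}
if any(e["code"] == "AAAA" for e in entities):
    item["AAAAflag"] = doc_id  # the value becomes the sparse GSI's key
table.put_item(Item=item)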
Filter expressions, by the way, are different from all of the above. Filter expressions are run on the data that would be returned, after it has already been read out of the table. This is useful if you have a multi-access-pattern setup but don't want a particular call to get all the documents associated with a particular PK - in the interest of keeping the data your code works with concise. A query with a filter expression still reads everything the query matches, but only presents what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
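A sketch of that per-PK filter (boto3 assumed; since contains matches scalar list elements or substrings, this presumes a hypothetical codes attribute holding plain strings rather than the list of maps):
from boto3.dynamodb.conditions import Attr, Key

response = table.query(
    KeyConditionExpression=Key("pk").eq(some_pk),
    # The filter runs after the read: every item under this PK is still
    # consumed from the table; only the response is trimmed.
    FilterExpression=Attr("codes").contains("entity1Code"),
)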
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not: if your entities attribute is a map type, you might very well be able to filter against the entity code - and maybe even with entities.code.contains(value) if it were an SK - but I do not know if this is possible or not.

DynamoDB query 1 field greater than

I have a games table.
To keep it simple, I will add only two fields for the question.
gameId:
deadlineToPlay:
I want to query for all games with deadlineToPlay greater than today.
How would I set up the index for this? I thought I could create an index with just deadlineToPlay, but if I understand correctly, when querying on a hash key it has to be an exact value; you can't use >.
I would also not like to use a scan, due to costs.
A way to work around this would be to create or use an existing field which will have a constant value (for example, a field hasDeadline with the value true).
Now you can create the table key like this: hasDeadline as HASH key and deadlineToPlay as SORT key (if the table is already created, you can define this key in a new GSI).
This way you will be able to query by hasDeadline = true and deadlineToPlay > today.
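A sketch of that query (boto3 assumed; the GSI name is hypothetical, and since key attributes must be string, number, or binary, the constant flag is stored as the string "true"):
from boto3.dynamodb.conditions import Key

response = table.query(
    IndexName="hasDeadline-deadlineToPlay-index",  # hypothetical GSI name
    KeyConditionExpression=Key("hasDeadline").eq("true")
    & Key("deadlineToPlay").gt(today),  # e.g. today's date as an ISO string
)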

How to implement a certain query in dynamodb?

I want to run the query "Find me the item with the smallest 'id' which is larger than some number".
Is this possible in DynamoDB?
And how would I do it?
Thanks in advance.
As you probably know, a DynamoDB table can have 2 types of keys: hash keys, or hash+range keys
When you run a query, you need to specify the hash key of the item that you are looking for. If your table has a key of type hash+range, you will automatically get the results back sorted by the range attribute. Your Query request can also optionally add a KeyCondition on the range attribute so that you can require that it be larger than some number. So, yes, what you are looking for is possible, assuming that you design your table appropriately.
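For example (a sketch with boto3; key names are illustrative): results come back in ascending range-key order by default, so the first item returned is the smallest id above the threshold.
from boto3.dynamodb.conditions import Key

response = table.query(
    KeyConditionExpression=Key("hash_key").eq(some_hash)
    & Key("id").gt(some_number),
    Limit=1,  # ascending order is the default, so this is the smallest match
)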
For more info, check out the following links:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters, but this is all occurring on an iOS device, using Objective-C.
More Info
The query and subsequent cache are based off of user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reluctance to simply recreate the cache on insertions (and edits, actually) is to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
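Locating where the returned Name sits in the cached rows is then a simple binary search (a sketch in Python for brevity; cached_names is a hypothetical sorted list of the cached values):
import bisect

# Index at which the new record slots into the cached, sorted window
insert_index = bisect.bisect_left(cached_names, returned_name)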
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, maintainable solution. I would recommend it unless you have a very good reason.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regard to the sorting. You would probably key each page on the sorted field, and delete any page whose key is greater than the inserted/deleted value. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it is evident what I'm trying to do.
#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

struct Page {
    std::string key;              // value of the sorted field for this page
    std::vector<Person> persons;

    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

bool sql_insert(const Person &person); // the actual INSERT, defined elsewhere

void insert(const Person &person) {
    if (sql_insert(person)) {
        // Find the first cached page whose key is >= the inserted value...
        pages_t::iterator drop_cache_start = pages.lower_bound(Page{person.name, {}});
        // ...then drop this page and everything after it
    }
}
You'd have to do some wrangling to get different datatypes of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data, though. If you only fill pages from the database, then you'll have fewer data-consistency worries.
This may be a bit off topic: you aren't re-implementing views, are you? That doesn't cache per se, but it isn't clear whether caching is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query's results is also the count of all its preceding records. What I needed to do was 'convert' all the ORDER clauses in the query into a series of WHERE clauses that would return only the preceding records, and then take a count of those records. It's trickier than it sounds (or maybe not... it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed an order column in the order parameters that was based off of a column with unique values. So, whenever a user sorts on a column, I append to the statement another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE parameters grouped together by parentheses is a tie breaker. If, in fact, the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a bunch of order parameters.
There's still one problem with the above solution. Comparisons against NULL values never match, and yet when sorting, SQLite sorts NULL values first. Therefore, in order to deal with NULL values appropriately, you've got to add another layer of complication. First, all equality operations, =, must be replaced by IS. Second, all < operations must be nested with an OR IS NULL to include NULL values appropriately under the < operator. This turns the above into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the rowid using the above WHERE clause.
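Put together, the counting step looks something like this (a sketch; the table name and some_conditions filter are illustrative, and all six '?' parameters are bound from the newly inserted record's values):
SELECT COUNT(rowid)
FROM mytab
WHERE some_conditions AND (
    ("LastName" < ? OR "LastName" IS NULL) OR
    ("LastName" IS ? AND "FirstName" > ?) OR
    ("LastName" IS ? AND "FirstName" IS ? AND
     ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
)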
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.

Should I use an auto-generated Primary Key if I'm just doing a lookup table?

I have a table which has two varchar(max) columns:
Column 1        Column 2
-----------     -----------
URLRewritten    OriginalURL
It's part of my URL rewriting for an ASP.NET WebForms site.
When a URL comes in, I check to see if it's in the table; if it is, I use the OriginalURL.
My question is: if all I'm doing is querying the table for URLs, and no other table in the database will ever link to this table, does it need a dedicated primary key field, like an auto-number? Will that make queries faster?
And also, how can I make the queries run faster?
Edit: I do have a unique constraint on URLRewritten.
Edit: the ways I'm using this table:
Querying when a new request comes in: search on URLRewritten to find the OriginalURL.
When needing to display a link on the site: query on the OriginalURL to find the URLRewritten URL I should use.
When adding a new URL to the table: make sure it doesn't already exist.
Those are all the queries I do... at the moment.
Both columns together would be unique.
Do you need a primary key? Yes. Always. However, it looks like in your case OriginalURL could be your primary key (I'm assuming that there wouldn't be more than one value for URLRewritten for a given value in OriginalURL).
This is what's known as a "natural key" (where a component of the data itself is, by its nature, unique). These can be convenient, though I have found that they're generally more trouble than they're worth under most circumstances, so yes, I would recommend some sort of opaque key (meaning a key that has no relation to the data in the row, other than to identify a single row). Whether or not you want an autonumber is up to you. It's certainly convenient, though identity columns come with their own set of advantages and disadvantages.
For now I suppose I would advise creating two things:
A primary key on your table of an identity column
A unique constraint on OriginalURL to enforce data integrity.
I'd put one in there anyway... it'll make updating a lot easier, or duplicating an existing rule...
i.e. this is easier
UPDATE Rules SET OriginalURL = 'http://www.domain.com' WHERE ID = 1
--OR
INSERT INTO Rules SELECT OriginalUrl, NewUrl FROM Rules WHERE ID = 1
Than this:
UPDATE Rules SET OriginalURL = 'http://www.domain.com' WHERE OriginalURL = 'http://old.domain.com'
--OR
INSERT INTO Rules SELECT OriginalUrl, NewUrl FROM Rules WHERE OriginalURL = 'http://old.domain.com'
In terms of performance, if you're going to be searching by OriginalURL, you should add an index to that column.
I would use the OriginalURL as your primary key, as I would assume it is unique. Assuming you are using SQL Server, you could create an index on URLRewritten with OriginalURL as an "included column" to speed up the performance of the query.
An identity column can help when you search for recent events:
select top 100 * from table order by idcolumn desc
We'd have to know what kind of queries you are running, before we can search for a way to make them faster.
As you are doing your query on the URLRewritten column, I don't think adding an auto-generated primary key would help you.
Have you got an index on your URLRewritten column? If not, create one: that should give a big increase in the speed of your queries (perhaps just make URLRewritten your primary key?).
Yes, there should be a primary key, because you can put an index on that primary key for fast access.
I don't think adding an auto-generated primary key will make your query faster. However, there are a few things to consider:
I would not be so sure that nothing will ever link to this table :(.
I've seen a lot of people asking how to, e.g., remove duplicates from a table like that -- with a primary key it is much easier.
To make this query faster, we need to know more about this table and the ways it's used...
In my opinion, every table must have an auto-generated primary key (i.e. an identity column in MSSQL).
I don't believe in unique natural keys.
