Given two DynamoDB tables: Books and Words, how can I create an index that associates the two? Specifically, I'd like to query to get all Books that contain a certain Word, and query to get all Words that appear in a specific Book.
The objective is to avoid scanning an entire table for these queries.
Based on your question I can't tell if you only care about unique words or if you want every word including duplicates. I'll assume unique words.
This can be done with a single table and a Global Secondary Index.
Create a table called BookWords with a Hash key of bookId and a Sort key of word. If you Query this table with a bookId you will get all of the unique words in that book.
Create a Global Secondary Index with a Hash key of word and a Sort key of bookId. If you Query this index with a word you will get all of the bookIds of books that contain that word.
Depending of your use case, you will probably want to normalize the words. For example, is "Word" the same as "word"?
If you want all words, not just unique words, you can use a similar approach with a few small changes. Let me know
Related
I've been reading a DynamoDB docs and was unable to understand if it does make sense to query on Global Secondary Index with a usage of 'contains' operator.
My problem is as follows: my dynamoDB document has a list of embedded objects, every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes that are present in current db document separated by a comma. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply filter expression on entitiesGlobalSecondaryIndex something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient or using global secondary index does not make sense in this way and DynamoDB will simply check the condition against every document which is similar so scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition Key. In order for a query to use any sort of operators (contains, begins with, > < ect...) you must have a range attributes- aka your Sort Key.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replication of the table - there is a slight potential for the data ina GSI to lag behind that of the master copy. If the query you're doing against this GSI isn't very often, then you're probably safe from that.
However. If you are trying to do this to the entire table at once then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if they aren't too many per document, you maybe could use a Sparse Index - if you have an entity with code "AAAA" then you also have an attribute named AAAA (or AAAAflag or something.) It is always null/does not exist Unless the entities contains that code. If you do a GSI on this AAAflag attribute, it will only contain documents that contain that entity code, and ignore all where this attribute does not exist on a given document. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned and if you don't have too many codes.
Filter expressions by the way are different than all of the above. Filter expressions are run on tbe data that would be returned, after it is already read out of the table. This is useful I'd you have a multi access pattern setup, but don't want a particular call to get all the documents associated with a particular PK - in the interests of keeping the data your code is working with concise. The query with a filter expression still retrieves everything from that query, but only presents what makes it past the filter.
If are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a Filter expressions would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not, if your entities attribute is a map type you might very well be able to filter against entities code - and maybe even with entities.code.contains(value) if it was an SK - but I do not know if this is possible or not
I am new to idexes and DB optimization. I know there is simple index for one
CREATE index ON table(col)
possibly B-Tree will be created and search capabilities will be improved.
But what is happen for 2 columns index ? And why is the order of defnition important?
CREATE index ON table(col1, col2)
Yes, B-Tree index will be created in most of the database if you didn't specify other type of index. Composite index is useful when the combined selectivity of the composite columns happed on the queries.
The order of the columns on the composite index is important as searching by giving exact values for all the fields included in the index leads to minimal search time but search uses only the first field to retrieve all matched recaords if we provide the values partially with first field.
I found following example for your understanding:
In the phone book example with an composite index created on the columns (city, last_name, first_name), if we search by giving exact values for all the three fields, search time is minimal—but if we provide the values for city and first_name only, the search uses only the city field to retrieve all matched records. Then a sequential lookup checks the matching with first_name. So, to improve the performance, one must ensure that the index is created on the order of search columns.
I'm new to DynamoDB - I already have an application where the data gets inserted, but I'm getting stuck on extracting the data.
Requirement:
There must be a unique table per customer
Insert documents into the table (each doc has a unique ID and a timestamp)
Get X number of documents based on timestamp (ordered ascending)
Delete individual documents based on unique ID
So far I have created a table with composite key (S:id, N:timestamp). However when I come to query it, I realise that since my id is unique, because I can't do a wildcard search on ID I won't be able to extract a range of items...
So, how should I design my table to satisfy this scenario?
Edit: Here's what I'm thinking:
Primary index will be composite: (s:customer_id, n:timestamp) where customer ID will be the same within a table. This will enable me to extact data based on time range.
Secondary index will be hash (s: unique_doc_id) whereby I will be able to delete items using this index.
Does this sound like the correct solution? Thank you in advance.
You can satisfy the requirements like this:
Your primary key will be h:customer_id and r:unique_id. This makes sure all the elements in the table have different keys.
You will also have an attribute for timestamp and will have a Local Secondary Index on it.
You will use the LSI to do requirement 3 and batchWrite API call to do batch delete for requirement 4.
This solution doesn't require (1) - all the customers can stay in the same table (Heads up - There is a limit-before-contact-us of 256 tables per account)
I am faced with a database (sqlite specifically) query that I am not sure how to approach.
I'm looking for all tuples that have 1-n word matches between their 'name' attribute and a constant. Sorted in descending order.
For example it is a database containing food items. If the constant is "Maranatha Natural Almond Butter 26oz Lightly Roasted" I would like any tuple in the database that contains atleast one of the words in that constant to be returned. For example "Almond Butter Natural" would come before "Maranatha Natural" which would come before "Almond", etc.
Essentially as long as there is one intersecting word between the tuples attribute and the constant it qualifies a match.
Matching words is what SQLite's full-text search extension is designed for. Please read that page to see how SQLite must be compiled and what is possible, but I'll add some remarks:
Simple matching is done with a query like:
SELECT * FROM foods_fts_tab WHERE name MATCH 'Maranatha Natural Almond etc.'
This will just return all records where at least one word matches.
You can weight the matches with information returned by the auxiliary functions.
For example,
SELECT ... ORDER BY length(offsets(foods_fts_tab)) DESC
will sort by the number of matching words.
You are asking for the number of word matches, but real search engines also use other information to compute relevancy scores. See the matchinfo() function in section 4.3 and the example in appendix A.
What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the word over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Notice: This is not my homework; I go too school but they teach us only the BASICs in informatics)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value: one (open addressing) is to use a pre-determined pattern to look onward from the location in the array with the hash for somewhere that is free, the other (chaining) is to store a linked list hanging off each entry in the array (so you do a linear lookup over what is hopefully a short list). The cases of production code where I've read the source code have all used chaining with dynamic rebuilding of the hash table when the load factor is excessive.
Good hash functions are one way functions that allow you to create a distributed value from any given input. Therefore, you will get somewhat unique values for each input value. They are also repeatable, such that any input will always generate the same output.
An example of a good hash function is SHA1 or SHA256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through 4 different columns, using 4 different indexes, to find my record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.