Assume I have thousands of products and each product has a one-to-many relationship to labels (for simplicity's sake, let's call them tags, i.e. one Product has many Tags). Assume the tags are semi-volatile, meaning a tag's label could get updated a few times a year, but not often. Is the best practice to relate the tags to the product via the sort key, and then, if a tag label changes for whatever reason, catch the update via DynamoDB Streams, trigger a Lambda (or something similar), and run thousands of DynamoDB updates against the Products that have an access pattern on that Tag? Or does it make more sense to store only the Tag IDs in the Product (i.e. a traditional SQL-style lookup table) and use BatchGetItem to fetch the Tags when I fetch the Product, which would save thousands of updates but does require multiple GETs for a given product?
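For illustration, the read path of the second option (storing only the Tag IDs on the Product and hydrating them with BatchGetItem) might look roughly like this with boto3; the table names 'Products' and 'Tags', the key names and the 'tagIds' attribute are assumptions made for this sketch, not part of the question:

import boto3

dynamodb = boto3.resource("dynamodb")
products = dynamodb.Table("Products")

def get_product_with_tags(product_id):
    # Fetch the product, which stores only a list of tag IDs.
    product = products.get_item(Key={"productId": product_id}).get("Item", {})
    tag_ids = product.get("tagIds", [])
    if not tag_ids:
        return {**product, "tags": []}

    # One BatchGetItem hydrates up to 100 tags in a single round trip
    # (UnprocessedKeys retry handling omitted for brevity).
    response = dynamodb.batch_get_item(
        RequestItems={"Tags": {"Keys": [{"tagId": t} for t in tag_ids]}}
    )
    return {**product, "tags": response["Responses"].get("Tags", [])}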
Related
I would like to be able to filter the results of a Query operation before the limit is taken into consideration. Is there any suggestion for getting correct pagination on filtered results?
I would like to implement a DynamoDB Scan or Query with the following logic:
Scanning -> Filtering (boolean true or false) -> Limiting (for pagination)
However, I have only been able to implement a Scan or Query with this logic:
Scanning -> Limiting (for pagination) -> Filtering (boolean true or false)
Note: I have already tried a Global Secondary Index, but it didn't work in my case because I have 5 different attributes to filter and limit on.
Unfortunately, DynamoDB cannot do this: once you run a Query against one of your indexes, it will read every single item that satisfies your partition and sort key.
Let's look at your example: you have a boolean and an index over that field. Say 50% of the items are false and 50% are true. Once you search by that index, you will read through 50% of all items in the table (so it's almost like a Scan). If you set a limit, it will read only that number of items and then stop. You cannot combine limit with skip/page/offset as you would in other databases.
There is some level of pagination (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html), but it does not allow you to jump to, say, page 10; it only allows you to go through the pages one by one. Also, I am not sure how it is priced: AWS may internally go through all the items before preparing the results for you, so you might pay for reading 50% of the whole table even if you stop iterating before you reach the end.
There is also the limitation that an index can have a maximum of two key fields (partition and sort).
EXAMPLE
You wrote that you have 5 parameters you want to query on. The workaround used to address these limitations is to create and manage extra fields that contain combinations of the parameters you want to query. Let's say you have a table of users with gender, age, name, surname and position. Let's say it's a huge database, so you have to think about the amount of data you can load. Then, if you want to use DynamoDB, you have to think upfront about all the queries you want to run.
You most likely want to search by name and surname, so you create an index with surname as the partition key and name as the sort key (that way you can search by surname alone or by both surname and name). This works for a lot of names, but you find that some name combinations are too common and you need to filter by position as well. In that case you create a new field (column) called, say, name-surname, and whenever you create or update an item, your app has to keep this field in sync so it contains both values, e.g. will-smith. Then you can make another index that has name-surname as the partition key and position as the sort key, and use it for such searches.
However, you then find that for some name-surname-position combinations you get too many results, you don't want to handle that at the application level, and you want to limit the results by age as well. So you create an index with name-surname-position as the partition key and age as the sort key. At this point you may also realize that the old name-surname field and its index can be removed, as they serve no purpose anymore (name and surname are handled by another index, and for searching by name-surname-position you can use this new index).
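As an illustration, here is a rough boto3 sketch of how the application could keep such a composite attribute in sync and query the corresponding index; the table name 'users', the attribute name 'name_surname_position' and the index name 'name-surname-position-index' are assumed for the example:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")

def put_user(user):
    # The application is responsible for keeping the composite field in sync
    # on every insert/update.
    user["name_surname_position"] = (
        f"{user['name'].lower()}-{user['surname'].lower()}-{user['position'].lower()}"
    )
    users.put_item(Item=user)

def find_users(name, surname, position, max_age):
    # Query the GSI: exact match on the composite partition key,
    # range condition on the 'age' sort key.
    return users.query(
        IndexName="name-surname-position-index",
        KeyConditionExpression=(
            Key("name_surname_position").eq(f"{name}-{surname}-{position}")
            & Key("age").lte(max_age)
        ),
    )["Items"]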
Do you sometimes want to query by gender as well? It's probably better to handle that at the application level (or with an extra filter in the DB query) rather than creating a new index that must be maintained and paid for. There are only two common values (OK, let's say more exist, but 99% of people will have just male or female), so it's probably cheaper to hide a few results at the application level when someone wants to see only one gender, but load all of them. For an extra index you would have to pay on every single insert, while this filter would be used only from time to time. Also, when someone is already searching by name, surname and position, you don't expect that many results anyway, so whether you get 20 results (all genders) or just 10 (male only) does not make much difference.
This was just an example of how you can think about and work with DynamoDB. How exactly you use it depends on your business logic.
Very important note: DynamoDB is a very simple database that can only do very simple queries. It has a little more functionality than Redis but a lot less than traditional databases. A perfectly valid outcome of thinking through your business model and use cases is that maybe you should NOT use DynamoDB at all, because it simply cannot satisfy your needs and queries.
Some basic reasoning can look like this:
Is persistent key-value storage enough? Use DynamoDB
Is persistent key-value storage, where one item can have multiple keys and I can search and filter by at most 2 fields, enough? Use DynamoDB
Do I need persistent storage where I can search a single table/collection by many different keys with lots of options? Use MongoDB
Do I need to search across multiple tables, do complex joins, or use transactions? Use a traditional SQL database
I am new to NoSQL data modelling, so please excuse me if my question is trivial. One piece of advice I found for DynamoDB is to always supply the partition key while querying; otherwise, it will scan the whole table. But there could be cases where we need to list our items, for instance an e-commerce website where we need to list our products on a listing page (with pagination).
How should we perform this listing while avoiding a Scan, or at least using it efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.)
And that's it. As you can see, you should always prefer GetItem (or BatchGetItem) to Query, and Query to Scan.
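For reference, the three operations look roughly like this with boto3; the table, key and attribute names are assumed for the sketch:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Products")  # table/key names assumed

# GetItem: direct lookup of one item by its full primary key.
item = table.get_item(Key={"category": "phones", "productName": "Galaxy S10"}).get("Item")

# Query: all items sharing a partition key, optionally narrowed via the sort key.
category_page = table.query(KeyConditionExpression=Key("category").eq("phones"))

# Scan: reads the whole table; avoid on large tables.
everything = table.scan(Limit=100)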
You could use queries if you add a sort key to your data. For example, you can use category as the hash key and product name as the sort key, so that the page showing items for a particular category could query by that category and product name. But that design is fragile, as you may need other keys for other pages; for example, you may need a vendor + price query if the user is looking for particular mobile phones. Indexes can help here, but they come with their own trade-offs and limitations.
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's effectively like filtering the data yourself in the application rather than on the database side.
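To make that concrete, here is a rough boto3 sketch of the category/name design above with a price filter added; attribute names are assumed. The FilterExpression only trims what is returned, not what is read, so the consumed capacity stays the same:

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Products")  # table name assumed

response = table.query(
    KeyConditionExpression=Key("category").eq("mobile-phones")
    & Key("productName").begins_with("Galaxy"),
    # Applied after the items are read: you pay for every item the query
    # touches, not only for the items that pass the filter.
    FilterExpression=Attr("price").between(200, 400),
    ReturnConsumedCapacity="TOTAL",
)
matching = response["Items"]
print(response["ConsumedCapacity"])  # unchanged by the filter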
I would say that DynamoDB is simply not intended for many kinds of workloads, and it's probably not suited for your case either. Think of it as a rich key-value (key-to-object) store, not a "classic" RDBMS, where indexes come at a lower cost and with fewer limitations and provide developers with rich querying capabilities.
There is a good article describing potential issues with DynamoDB; take a look. It contains an excellent decision tree that guides you through deciding whether DynamoDB fits your case (the original author of the tree is Forrest Brazeal).
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB use cases and issues.
P.S. There is nothing criminal about doing scans (I actually run one on a schedule once per day in one of my projects), but that's an exceptional case, and I regret the decision to use DynamoDB there. It's not efficient in terms of speed, money, support and "dirtiness". I had to increase the capacity before the job and reduce it afterwards, but that's another story…
We have a table with 100M rows in Google Cloud Datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now, when we have a new user who says he/she likes documents {a1, a2, ..., an}, we want to tell whether all these documents ak (k in 1 to n) are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all of this document content and use it to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to check the existence of every key in a set of 1M, I would still need to issue 1000 requests.
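That batching could look roughly like this with the Python client for Datastore; the 'Document' kind name is an assumption for the sketch:

from google.cloud import datastore

client = datastore.Client()

def missing_doc_ids(doc_ids, chunk_size=1000):
    # Return the subset of doc_ids that have no corresponding 'Document' entity.
    missing = []
    for i in range(0, len(doc_ids), chunk_size):
        keys = [client.key("Document", doc_id) for doc_id in doc_ids[i:i + chunk_size]]
        not_found = []
        client.get_multi(keys, missing=not_found)  # key-only entities for absent keys
        missing.extend(entity.key.name for entity in not_found)
    return missing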
An alternative is to use a custom middle layer to provide a quick lookup (for example, using a Bloom filter or something similar) to save the round trips of multiple requests. Assuming we never delete keys, every time we insert a key we also add it through the middle layer, and the Bloom filter keeps track of which keys we have (with a tolerable false-positive rate). Since this is a custom layer, we could expose it as a microservice without a request-size limit, say one that can answer a request asking about the existence of 1M keys. However, this definitely increases our design/implementation complexity.
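A minimal, standard-library-only sketch of such a Bloom-filter layer; the sizing here is arbitrary and would need tuning for the actual key count and an acceptable false-positive rate:

import hashlib

class BloomFilter:
    # No deletions, tolerable false positives, no false negatives.

    def __init__(self, size_bits=8 * 1024 * 1024 * 8, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely not stored"; True means "probably stored".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

Keys the filter rejects are definitely new and can be crawled immediately; keys it accepts may still need a Datastore lookup to confirm.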
Are there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking the problem down into a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine whether the entity exists, i.e. whether the webpage has already been considered for crawling. If it hasn't, a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue; see "How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?". To avoid it, you can store the URL as a property of the webpage entity instead. You'll then have to query for the entity by the url property to determine whether the webpage has already been considered for crawling. This is only eventually consistent, meaning it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query results. That's not a big deal; it can be addressed by a bit of logic in the crawling job to prevent and/or remove duplicate documents.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
if it does, just create the like mapping entity
if it doesn't and you used the above-mentioned unique key identifiers:
create the document entity and launch its crawling job
create the like mapping entity
otherwise:
launch the crawling job which creates the document entity taking care of deduplication
launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups; you only do one at a time, which is OK, since a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.
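A minimal sketch of the unique-key variant of this flow, using the Python client for Datastore; the kind names 'Document' and 'Like', the URL-hash key scheme and the enqueue_crawl helper are assumptions for the example:

import hashlib
from google.cloud import datastore

client = datastore.Client()

def doc_key(url):
    # Derive a short, unique key name from the URL to avoid very long key names.
    return client.key("Document", hashlib.sha256(url.encode()).hexdigest())

def record_like(user_id, url):
    key = doc_key(url)
    if client.get(key) is None:            # strongly consistent lookup by key
        doc = datastore.Entity(key=key)
        doc.update({"url": url, "crawled": False})
        client.put(doc)
        enqueue_crawl(url)                 # hypothetical hook that launches the crawling job

    # The like mapping is a small, separate entity (partial key, id auto-allocated).
    like = datastore.Entity(key=client.key("Like"))
    like.update({"userId": user_id, "documentKey": key.name})
    client.put(like)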
We have an application that allows users to "follow" other users. When a user follows another, we register this data as a document within DocumentDB, like this:
{
"followerId": "userUUID",
"artistId": "artistUserUUID"
}
We now want to get a list of artists, ordered by the count of followers they have. So I am looking for a way to ask the DB, based on these documents, to give me back an array of artistUserUUIDs, ordered by the number of followers each has registered (as expressed in documents like the example given above).
Alternatively, we are also open to adding an array property to the artistUser's own document, though even in this scenario I am still unsure how to do an ORDER BY based on the count of a document's property (this property being an array of follower IDs).
I guess a workaround would be to add a stored procedure or trigger that updates a counter property within the artistUser document, but I'd like to validate whether there is a way to implement this counting feature natively, without such a trick.
Unless you denormalize the follower count into the artist user documents (as you suggest), you'll have to fetch every follower to accomplish your goal. Fetching every follower document may or may not be prohibitive, depending on how many there are. If you fetch them only into a stored procedure rather than into your actual client, it's conceptually no less efficient than an SQL GROUP BY clause. Design your stored procedure to do the counting and return only the table of artists and counts. A robust implementation would incrementally update your output table in pages and be able to restart where it left off after a stored procedure timeout. Look at my countDocuments example stored procedure in documentdb-mock, as well as my "Pattern for writing stored procedures" in the documentation for documentdb-utils, for how I typically accomplish this.
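For illustration only, the aggregation itself is just a group-by count over the follower documents; a small Python sketch of that logic (per the answer above, in practice you would run the equivalent inside the stored procedure, page by page, so the follower documents never have to leave the database):

from collections import Counter

def artists_by_follower_count(follow_docs):
    # follow_docs: iterable of documents shaped like
    # {"followerId": "userUUID", "artistId": "artistUserUUID"}.
    counts = Counter(doc["artistId"] for doc in follow_docs)
    # Most-followed artists first, as (artistId, follower_count) pairs.
    return counts.most_common()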
Consider a set of data called Library, which contains a set of Books and each book contains a set of Pages.
Let's say you are using Riak to store this data, and you need to be able to access the data in two possible ways:
- Query for a particular page (with a unique id)
- Query for all pages in a particular book (with a unique name)
Additionally, you need to be able to easily update and delete pages of a particular Book.
What would be the best way to accomplish this in Riak?
Obviously Riak Search would do the trick, but it may be inefficient for what I am trying to do. I am wondering whether it makes sense to set up buckets where each bucket is a Book (which would mean potentially millions of "Book" buckets). Maybe that is a bad idea...
Can this be accomplished with secondary indexes?
I am trying to keep this simple...
I am new to Riak and I am trying to find the best way to accomplish something that is probably relatively simple. I would appreciate any help from the Stack Overflow community. Thanks!
A common way to model master-detail relationships in Riak is to have the master record contain a list of detail record IDs, possibly together with some information about the detail record that may be useful when deciding which detail records to retrieve.
In your example, you could have two buckets called 'books' and 'pages'. The master record in the 'books' bucket would contain metadata and information about the book as a whole, together with a list of the pages included in the book. Each entry in that list would contain the ID of the 'pages' record holding the page data, as well as the corresponding page number. If you wanted to be able to query by chapter, for example, you could also add information about which chapter(s) a given page belongs to.
The 'pages' bucket would contain the text of the page and possibly links to images and other media data that are included on that page. This data could be stored in yet another bucket.
In order to get a specific page or a range of pages, one would first retrieve the master record from the 'books' bucket and then, based on the contents of that record, the appropriate pages. Even though this requires several GET operations, they are all direct lookups based on keys, which is the most efficient and scalable way to retrieve data from Riak, so it will perform and scale well.
This approach also makes it simple to change the order of pages and/or chapters, as only the master record needs to be updated. Adding, deleting or modifying pages would, however, require both the master record and one or more detail records to be updated, added or deleted.
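For illustration, the read path described above could look roughly like this with the official Python Riak client; the bucket names and the shape of the master record (a 'pages' list of key/number pairs) are assumptions for the sketch:

import riak

client = riak.RiakClient()
books = client.bucket("books")
pages = client.bucket("pages")

def get_page_range(book_key, first, last):
    # 1. Fetch the master record, which lists its pages in order, e.g.
    #    {"title": "Moby Dick", "pages": [{"key": "moby-dick:p1", "number": 1}]}
    book = books.get(book_key).data

    # 2. Direct key lookups for just the pages we need.
    wanted = [p["key"] for p in book["pages"] if first <= p["number"] <= last]
    return [pages.get(key).data for key in wanted]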
You can most certainly also solve this problem by adding secondary indexes to the objects and querying based on those. Secondary index queries in Riak do, however, have to process a covering set (generally ring size / n_val) of partitions in order to fulfil the request, and therefore put a bit more load on the system and generally result in higher latencies than retrieving a single object containing keys through a direct key lookup (which only needs to involve the partitions where the object is actually stored).
Although maintaining a separate object containing indexes adds a bit of extra work when inserting or deleting pages/entries, this approach will generally result in more efficient reads, as only direct key lookups are required. If your application is read-heavy, it probably makes sense to use this approach, while secondary indexes could be more efficient for a write-heavy application, as inserts and modifications become cheaper at the expense of more expensive reads. You can, however, always add secondary indexes just in case, to keep your options open.
In cases like this I would usually recommend running some benchmarks to test the candidate solutions and check which one best matches your particular performance and scaling requirements.
The most efficient way will be to store the whole book as one object, and duplicate its pages as separate objects.
Pros:
you will be able to select any object by its key (a key-value lookup is the cheapest operation in Riak)
the latency of any query is predictable
this is the natural way of storing data in Riak
Cons:
If you need to update any page, you must update the whole book and then the page. As Riak doesn't have atomic operations, you must think about how to recover from failure situations (e.g. the book was updated, but the page was not).
Riak is about availability and predictable latency, so if you use something like 2i (secondary indexes) to collect results, query time becomes unpredictable and will grow with the number of pages.