Look up the existence of a large number of keys (up to 1M) in datastore - google-cloud-datastore

We have a table with 100M rows in google cloud datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user and he/she says he/she likes documents {a1, a2, ..., an}, we want to tell if all these documents ak {k in 1 to n} are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve the content of all these documents and use it to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to get the existence of every key in a set of 1M, I still need to issue 1000 requests.
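To make the arithmetic concrete, here is a minimal sketch of the batching step. It only splits the keys into chunks. The limit of 1000 keys per lookup is taken from the question, and the helper name `chunked` is made up for illustration.

```python
MAX_KEYS_PER_LOOKUP = 1000  # per-request limit quoted in the question


def chunked(keys, size=MAX_KEYS_PER_LOOKUP):
    """Yield consecutive slices of `keys`, each holding at most `size` items."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


# 1M keys split into batches of 1K gives 1000 lookup requests.
batches = list(chunked(list(range(1_000_000))))
```

Each batch would then be sent as one lookup request, possibly several in parallel.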
An alternative is to use a customized middle layer to provide a quick look up (for example, can use bloom filter or something similar) to save the time between multiple requests. Assuming we never delete keys, every time we insert a key, we add it through the middle layer. The bloom-filter keeps track of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
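A rough sketch of what such a Bloom filter layer could look like, assuming string keys. The class, the sizing parameters and the double-hashing scheme over SHA-256 are illustrative choices, not part of any Datastore API.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive all bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def definitely_missing(bloom, keys):
    """Keys the filter has never seen, so they are certainly not stored yet."""
    return [k for k in keys if not bloom.might_contain(k)]
```

Keys reported by `definitely_missing` can be crawled right away. A key that `might_contain` accepts may still be a false positive, so it would need a real lookup if that matters.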
Are there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits

I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue, see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it you can have the URL stored as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. This is just eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query result. Not a big deal, it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
- if it does, just create the like mapping entity
- if it doesn't and you used the above-mentioned unique key identifiers:
  - create the document entity and launch its crawling job
  - create the like mapping entity
- otherwise:
  - launch the crawling job which creates the document entity, taking care of deduplication
  - launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups, you only do one at a time, which is OK: a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.
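The unique-key variant of this flow can be sketched with plain in-memory structures standing in for the datastore. Hashing the URL with SHA-256 to get a fixed-length key name is an assumption of this sketch (one way to sidestep the key length concern), and all names are hypothetical.

```python
import hashlib

documents = {}    # stand-in for document entities, keyed by key name
likes = set()     # stand-in for the small (user, document) mapping entities
crawl_queue = []  # stand-in for launched crawling jobs


def document_key(url):
    # Fixed-length key name derived from the URL (assumption: SHA-256 hex digest).
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def like(user_id, url):
    key = document_key(url)
    if key not in documents:                 # single key lookup
        documents[key] = {"url": url, "crawled": False}
        crawl_queue.append(key)              # launch the crawling job
    likes.add((user_id, key))                # create the like mapping entity


like("user-1", "https://example.com/a")
like("user-2", "https://example.com/a")
```

Two users liking the same URL produce one document entity, one crawling job and two mapping entities.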

Related

DynamoDB Modeling Multiple Query Elements

Background: I have a relational db background and have never built anything for DynamoDB that wasn't just used for fast writes with very few reads. I am trying to learn DynamoDB patterns by migrating one of my help desk apps from MySQL to DynamoDB.
The application is a fairly simple one from a data storage perspective. A user submits a request and that request generates 1 or more tickets.
Setup: I have screens where people see initial requests and that request's tickets and search views that allow support to query on a bunch of attributes of a ticket (last name of user, status of ticket, use case of ticket, phone number of user, dept of user). This design in a SQL db is pretty straightforward but in Dynamo, I'm really being thrown for a loop on how to structure primary/sort keys and secondary indexes (if necessary).
I created one collection for requests and one collection for tickets. The individual requests have an array of ticket ids that belong to it. The ticket item has an attribute that stores the request id so that I can search that way. But what I am hung up on, is how do I incorporate searching on a ticket/request's attributes without having to do a full scan?
I read about composite keys and perhaps creating a composite sort key similar to: ## so that I can search on each of those fields directly without having to know the primary key (ticket id).
Question: How do you design dynamo collections/tables that require querying a lot of different attribute values without relying on a primary key?
This is typically something that DynamoDB is not good at, which is not to say it cannot be done. The strength and speed of DynamoDB come from having well-known access patterns and designing your schema for these patterns. In general, if you don't know what your users will search for, or there are many different possible queries, it's better to look at something like RDS or a native SQL DB. That being said, a possible direction to solve this could be to create multiple lists, one for each of the fields, and duplicate the data. This could all be done in the same table.
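One possible reading of the "multiple lists in the same table" idea, sketched in memory rather than against DynamoDB itself: every ticket is written once under its own key and once more per searchable field. The attribute names `pk` and `sk` and the key prefixes are hypothetical.

```python
def build_items(ticket, search_fields):
    """Return the ticket item plus one duplicated item per searchable field."""
    ticket_key = f"TICKET#{ticket['ticket_id']}"
    items = [{"pk": ticket_key, "sk": "DETAILS", **ticket}]
    for field in search_fields:
        # Partition key encodes field name and value, e.g. "STATUS#open".
        items.append({"pk": f"{field.upper()}#{ticket[field]}", "sk": ticket_key, **ticket})
    return items


def query(table, pk):
    """In-memory stand-in for a query on the partition key."""
    return [item for item in table if item["pk"] == pk]


ticket = {"ticket_id": "42", "status": "open", "last_name": "Smith"}
table = build_items(ticket, ["status", "last_name"])
open_tickets = query(table, "STATUS#open")
```

The price is write amplification: every change to a ticket has to rewrite all of its copies.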

Is it always safe to use eventId as the Firestore document id?

This article here recommends using the eventId as the document id to prevent multiple creations of a document due to background process retries. Is it guaranteed that there will never be a collision?
The mentioned article shows how to avoid duplicate items created by retries of an unsuccessful function. In short, it says that if you use the add method (reference) and the function is retried (having failed after the Firestore write), you may end up with 2 identical documents in Firestore with different automatically generated IDs.
As a solution, the author proposes using the eventId as the document ID and writing to it using set (reference).
This approach guarantees that retries of the same function invocation will not create duplicate items.
Back to the question... I think you are afraid that 2 different invocations will have the same event_id and the document can be overwritten. I think this is possible, but in my opinion it is out of scope for the article, as it answers a different question and uses as simple a use case as possible to help understand the approach.
Let's imagine we have two different functions invoked by the same event, writing different content to the same collection. The result will be unpredictable, I think. However, in such a situation you can use the same mechanism, slightly upgraded, e.g. like this: <function_name>_<event_id>. Using the example from the article, it will be a small change like:
...
return db.collection('contents').doc('<function_name>_'+eventId).set(content).then
...
So in my understanding, if you are afraid of collisions you should add additional elements to the created document references, like in the example above.
From my point of view, the ability to use an event_id as a Firestore document id depends on your context and requirements.
For example, from the "business" point of view, is the message/event really a unique business-related thing (so that you really would like to avoid duplication of messages)? Or is there some other business entity which is to be unique, while there can be more than one message (with different event_ids) about that business entity?
On top of that, to the best of my knowledge, it may be a good practice to generate the Firestore document ids randomly (as a hash, a GUID, etc.). In that case, search/retrieval from Firestore should work "faster". So, I don't know if the event_id is "random" enough in your context. Maybe it is OK, maybe not...
In my personal experience I try to generate a document id as a hex digest of a hash of a string (possibly a composed string) which is supposed to be unique in the business context. For example, say the event/message is a google.storage.object.finalize event. In that case, I would use some metadata about the underlying object/file. Depending on the business context and requirements, it can be (or not be) a bucket name, object name, size, md5 or crc32c etc., or a combination of those elements... The chosen elements are concatenated into a string, then a hash is calculated, and a hex digest of that hash becomes a document id in the Firestore collection.
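A minimal sketch of that last step, assuming SHA-256 as the hash and a separator character between the elements. Both are choices made for this example, and the function name is hypothetical.

```python
import hashlib


def document_id(*parts):
    """Hex digest of the chosen metadata elements joined into one string."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Example elements: bucket name, object name, md5 (all values made up).
doc_id = document_id("my-bucket", "reports/2020.csv", "9a0364b9e99bb480dd25e1f0284c8555")
```

The same elements always give the same id, so a retried invocation writes to the same document instead of creating a new one.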

Fetching parent and child item in single query in DynamoDB

I have the following one-to-many relationship:
Account 1--* User
The Account contains global account-level information, which is mutable.
The User contains user-level information, which is also mutable.
When the user signs-in, they need both Account and User information. (I only know the UserId at this point).
I ideally want to design the schema such that a single query is necessary. However, I cannot determine how to do this without duplicating the Account into each User and thus requiring some background Lambda job to propagate changes to Account attributes across all User objects -- which, for the record, seems like more resource usage (and code to maintain) than simply normalizing the data and having 2 queries on each sign-in: fetch user, then fetch account (using an FK inside the user object that identifies the account).
Is it possible to design a schema that allows one query to fetch both and doesn't require a non-transactional background job to propagate updates? (Transactional batch updates are out of the question, since there's >25 users.) And if not, is the 2-query idea the best / an acceptable method?
I'll focus on one angle in your question - the 2-query idea. In many cases it is indeed an acceptable method, better than the alternatives. In fact, in many NoSQL uses, every user-visible request results in significantly more than two database requests. It is often stated that this is the reason why NoSQL systems care about low tail latencies (i.e., even 99th percentile latencies should be low).
You didn't say why you wanted to avoid the 2-query solution. The 2-query implementation you presented has two downsides:
- It is more costly: you need to do two queries instead of one, costing (when the reads are shorter than 4 KB) double the price of a single read.
- Latency doubles if you need to do the first query, and only then can do the second query.
There may be tricks you can use to solve both problems, depending on more details of your use case:
- For the latency: You didn't say what a "user id" is in your application. If it is some sort of unique numeric identifier, maybe it can be set up such that the account id can be determined from the user id directly, without a table lookup (e.g., the first bits of the user id are the account id). If this is the case, you can start both lookups at the same time, and not double the latency. The cost will still be double, but not the latency.
- For the cost: If there is a large number of users per account (you said there are more than 25 - I don't know if it's much more or not), it may be useful to cache the Account data, so that not every user lookup will need to read the Account data again - it might often be cached. If Account information rarely changes and consistency of it is not a big deal (I don't know if it is...), you can also get by with doing an "eventual consistency" read for the Account information - which costs half of the regular "consistent" read.
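The latency trick could look roughly like this, assuming numeric ids with the account id packed into the high bits. The 32-bit split and the two fetch callables are placeholders for whatever id layout and database client you actually use.

```python
from concurrent.futures import ThreadPoolExecutor

LOCAL_BITS = 32  # assumption: low 32 bits identify the user within the account


def make_user_id(account_id, local_id):
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id does not fit in LOCAL_BITS")
    return (account_id << LOCAL_BITS) | local_id


def account_id_of(user_id):
    # No table lookup needed: the account id is the high bits of the user id.
    return user_id >> LOCAL_BITS


def fetch_both(user_id, fetch_user, fetch_account):
    """Start both reads at once instead of waiting for the user read first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        user_future = pool.submit(fetch_user, user_id)
        account_future = pool.submit(fetch_account, account_id_of(user_id))
        return user_future.result(), account_future.result()
```

This still issues two reads, so the cost stays the same. Only the waiting time overlaps.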
I think the following scheme will be useful.
You will store both account and user records in the same table
You want to get both account metadata and linked users in a single query
PK: account SK: recordId
=== Account record ===
account: 123512321 recordId: METADATA attributes: name, environment, ownerId...
=== User record ===
account: 123512321 recordId: USERID#34543543 attributes: name, email, phone...
With this denormalization of the data, you can retrieve both account metadata and related users in a single query. You can also change the account metadata without a need to apply any change to related users.
BONUS: you can also link other types of assets to the account record
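To illustrate how one query returns both record types, here is an in-memory stand-in for the table above. The `query` helper only mimics a partition-key query with an optional sort-key prefix, and the sample names are invented.

```python
TABLE = [
    {"account": "123512321", "recordId": "METADATA", "name": "Example account"},
    {"account": "123512321", "recordId": "USERID#34543543", "name": "Example user"},
    {"account": "999", "recordId": "METADATA", "name": "Other account"},
]


def query(table, account, sk_prefix=""):
    """Items of one partition, ordered by sort key, optionally filtered by prefix."""
    matches = [
        item for item in table
        if item["account"] == account and item["recordId"].startswith(sk_prefix)
    ]
    return sorted(matches, key=lambda item: item["recordId"])


everything = query(TABLE, "123512321")            # account metadata + users
users_only = query(TABLE, "123512321", "USERID#")  # just the user records
```

Note that this query needs the account id as input. If only the user id is known at sign-in, the account id still has to come from somewhere.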

Riak solution for querying data by books or unique pages

Consider a set of data called Library, which contains a set of Books and each book contains a set of Pages.
Let's say you are using Riak to store this data, and you need to be able to access the data in two possible ways:
- Query for a particular page (with a unique id)
- Query for all pages in a particular book (with a unique name)
Additionally, you need to be able to easily update and delete pages of a particular Book.
What would be the best way to accomplish this in Riak?
Obviously Riak Search will do the trick, but maybe is inefficient for what I am trying to do. I am wondering if it makes sense to set up buckets where each bucket can be a Book (which would make for potentially millions of "Book" buckets). Maybe that is a bad idea...
Can this be accomplished with secondary indexes?
I am trying to keep this simple...
I am new to Riak and I am trying to find the best way to accomplish something that is probably relatively simple. I would appreciate any help from the Stack Overflow community. Thanks!
A common way to model master-detail relationships in Riak is to have the master record contain a list of detail record IDs, possibly together with some information about the detail record that may be useful when deciding which detail records to retrieve.
In your example, you could have two buckets called 'books' and 'pages'. The master record in the 'books' bucket will contain metadata and information about the book as a whole together with a list of pages that are included in the book. Each page would contain the ID of the 'pages' record holding the page data as well as the corresponding page number. If you e.g. wanted to be able to query by chapter, you could also add information about which chapters a certain page belongs to.
The 'pages' bucket would contain the text of the page and possibly links to images and other media data that are included on that page. This data could be stored in yet another bucket.
In order to get a specific page or a range of pages, one would first retrieve the master record from the 'books' bucket and then, based on the contents of the record, the appropriate pages. Even though this requires several GET operations, they are all direct lookups based on keys, which is the most efficient and scalable way to retrieve data from Riak, so it will perform and scale well.
This approach also makes it simple to change the order of pages and/or chapters as only the master record needs to be updated. Adding, deleting or modifying pages would however require both the master record as well as one or more detail records to be updated, added or deleted.
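The master-detail layout can be sketched with two dictionaries standing in for the 'books' and 'pages' buckets. This is not Riak client code, and the function names and record fields are made up for illustration.

```python
books = {}  # 'books' bucket: book name -> master record
pages = {}  # 'pages' bucket: page id -> page data


def add_page(book, page_id, number, text):
    pages[page_id] = {"text": text}                       # write the detail record
    master = books.setdefault(book, {"pages": []})
    master["pages"].append({"id": page_id, "number": number})  # update the master


def delete_page(book, page_id):
    master = books[book]
    master["pages"] = [p for p in master["pages"] if p["id"] != page_id]
    pages.pop(page_id, None)


def get_page(page_id):
    return pages[page_id]                                 # one direct key lookup


def get_book_pages(book):
    master = books[book]                                  # first GET: master record
    ordered = sorted(master["pages"], key=lambda p: p["number"])
    return [pages[p["id"]] for p in ordered]              # then one GET per page


add_page("moby-dick", "p-2", 2, "second page")
add_page("moby-dick", "p-1", 1, "first page")
```

In a real cluster the two writes in `add_page` and `delete_page` are not atomic, so a failure between them needs handling.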
You can most certainly also solve this problem by adding secondary indexes to the objects and querying based on these. Secondary index queries in Riak do, however, have to include processing on a covering set (generally ring size / n_val) of partitions in order to fulfil the request, and therefore put a bit more load on the system and generally result in higher latencies than retrieving a single object containing keys through a direct key lookup (which only needs to involve the partitions where the object is actually stored).
Although maintaining a separate object containing indexes adds a bit of extra work when inserting or deleting pages/entries, this approach will generally result in more efficient reads, as only direct key lookups are required. If your application is heavy on reads, it probably makes sense to use this approach, while secondary indexes could be more efficient for a write heavy application as inserts and modifications are made cheaper at the expense of more expensive reads. You can however always add secondary indexes just in case in order to keep your options open.
In cases like this I would usually recommend performing some benchmarks to test the solutions and check which solution best matches your particular performance and scaling requirements.
The most efficient way will be to store the whole book as one object, and duplicate its pages as separate objects.
Pros:
- you will be able to select any object by its key (the cheapest operation in Riak is a KV query)
- every query will have predictable latency
- this is the natural way of storing data in Riak
Cons:
- If you need to update any page you must update the whole book, and then the page. As Riak doesn't have atomic ops, you must think about how to recover from any failure situation (like this: the book was updated, but the page was not).
Riak is about availability and predictable latency, so if you use something like 2i to collect results, query time will become unpredictable and will grow with the number of pages.

Bulk Collection Manipulation through a REST (RESTful) API

I'd like some advice on designing a REST API which will allow clients to add/remove large numbers of objects to a collection efficiently.
Via the API, clients need to be able to add items to the collection and remove items from it, as well as manipulating existing items. In many cases the client will want to make bulk updates to the collection, e.g. adding 1000 items and deleting 500 different items. It feels like the client should be able to do this in a single transaction with the server, rather than requiring 1000 separate POST requests and 500 DELETEs.
Does anyone have any info on the best practices or conventions for achieving this?
My current thinking is that one should be able to PUT an object representing the change to the collection URI, but this seems at odds with the HTTP 1.1 RFC, which seems to suggest that the data sent in a PUT request should be interpreted independently from the data already present at the URI. This implies that the client would have to send a complete description of the new state of the collection in one go, which may well be very much larger than the change, or even be more than the client would know when they make the request.
Obviously, I'd be happy to deviate from the RFC if necessary but would prefer to do this in a conventional way if such a convention exists.
You might want to think of the change task as a resource in itself. So you're really PUT-ing a single object, which is a Bulk Data Update object. Maybe it's got a name, owner, and big blob of CSV, XML, etc. that needs to be parsed and executed. In the case of CSV you might want to also identify what type of objects are represented in the CSV data.
List jobs, add a job, view the status of a job, update a job (probably in order to start/stop it), delete a job (stopping it if it's running) etc. Those operations map easily onto a REST API design.
Once you have this in place, you can easily add different data types that your bulk data updater can handle, maybe even mixed together in the same task. There's no need to have this same API duplicated all over your app for each type of thing you want to import, in other words.
This also lends itself very easily to a background-task implementation. In that case you probably want to add fields to the individual task objects that allow the API client to specify how they want to be notified (a URL they want you to GET when it's done, or send them an e-mail, etc.).
Yes, PUT creates/overwrites, but does not partially update.
If you need partial update semantics, use PATCH. See http://greenbytes.de/tech/webdav/draft-dusseault-http-patch-14.html.
You should use AtomPub. It is specifically designed for managing collections via HTTP. There might even be an implementation for your language of choice.
For the POSTs, at least, it seems like you should be able to POST to a list URL and have the body of the request contain a list of new resources instead of a single new resource.
As far as I understand it, REST means REpresentational State Transfer, so you should transfer the state from client to server.
If that means too much data going back and forth, perhaps you need to change your representation. A collectionChange structure would work, with a series of deletions (by id) and additions (with embedded full xml Representations), POSTed to a handling interface URL. The interface implementation can choose its own method for deletions and additions server-side.
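As a rough illustration, a collectionChange body and the server-side step that applies it might look like this. The field names `delete`, `add` and `id` are invented for the example, and a real handler would also validate the input.

```python
def apply_change(collection, change):
    """Return a new collection with the deletions and additions applied."""
    updated = dict(collection)
    for item_id in change.get("delete", []):
        updated.pop(item_id, None)        # deleting an unknown id is a no-op
    for item in change.get("add", []):
        updated[item["id"]] = item        # additions carry the full representation
    return updated


collection = {"1": {"id": "1", "name": "one"}, "2": {"id": "2", "name": "two"}}
change = {"delete": ["1"], "add": [{"id": "3", "name": "three"}]}
result = apply_change(collection, change)
```

The client sends only the change, and the original collection is left untouched until the whole change has been applied.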
The purest version would probably be to define the items by URL, and the collection contain a series of URLs. The new collection can be PUT after changes by the client, followed by a series of PUTs of the items being added, and perhaps a series of deletions if you want to actually remove the items from the server rather than just remove them from that list.
You could introduce a meta-representation of existing collection elements that don't need their entire state transferred, so in some abstract code your update could look like this:
{existing elements 1-100}
{new element foo with values "bar", "baz"}
{existing element 105}
{new element foobar with values "bar", "foo"}
{existing elements 110-200}
Adding (and modifying) elements is done by defining their values, deleting elements is done by not mentioning them in the new collection, and reordering elements is done by specifying the new order (if order is stored at all).
This way you can easily represent the entire new collection without having to re-transmit the entire content. Using an If-Unmodified-Since header makes sure that your idea of the content indeed matches the server's idea (so that you don't accidentally remove elements that you simply didn't know about when the request was submitted).
The best way is:
1. Pass only an array of the IDs of the objects to delete from the front-end application to the Web API.
2. Then you have two options:
2.1 Web API way: find all collections/entities using the ID array and delete them in the API, but you need to take care of dependent entities, such as data in foreign-key-related tables, too.
2.2 Database way: pass the IDs to your database side, find all records in the foreign key tables and primary key tables, and delete them in that order, i.e. foreign key table records first, then primary key table records.
