Should an entity be written to ensure auto-allocated IDs are not reused? - google-cloud-datastore

I have a User entity group and a Transaction entity under it. I auto-allocate IDs for the transactions. I want to create a unique key for integration with the payment service. Since the transaction is not a root entity, the auto-allocated IDs are not guaranteed to be unique across the kind, and hence I can't use them as unique keys. What I am currently doing, following the suggestion at
Google Cloud Datastore unique autogenerated ids
is to have a dummy root kind, allocate IDs for it, and store that ID on the transaction entity as a separate field. However, since it is a dummy, I am currently not writing the dummy entity itself to the datastore.
I have read other posts
When using allocateIds(), how long do unused IDs remain allocated?
and
Are deleted entity IDs once again available to App Engine if auto-generated for an Entity?
but am still not sure. Do I have to insert this dummy entity with just the key? If not, how are all the allocated IDs for this dummy kind tracked, and what happens to the corresponding storage usage?

An entity's key ID, together with its kind, ancestor path and (possibly) namespace, defines a unique entity key, which is meaningful even if the entity doesn't actually exist: it's possible to have child entities in an entity group attached to an ancestor which doesn't exist. From Ancestor paths (emphasis mine):
When you create an entity, you can optionally designate another entity
as its parent; the new entity is a child of the parent entity (note
that unlike in a file system, the parent entity need not actually
exist).
So whether or not your dummy entities actually exist should not matter: their key IDs, pre-allocated using allocateIds(), should never expire. From Assigning identifiers:
Datastore mode's automatic ID generator will keep track of IDs that
have been allocated with these methods and will avoid reusing them for
another entity, so you can safely use such IDs without conflict. You
can not manually choose which values are returned by the
allocateIds() method. The values returned by allocateIds() are
assigned by Datastore mode.
Personal considerations supporting this opinion:
the datastore has no limit on the number of entities of the same kind, ancestor and namespace, so it should support a virtually unlimited number of unique IDs. IMHO this means there should be no need to even consider reusing them, which is probably why there is no mention of any deadline or expiry time for allocated IDs.
if IDs of deleted entities were ever reused, it would raise a significant problem for restoring datastore entities from backups: the restore could overwrite a newer entity that received a reused ID with the older entity that previously used the same ID.
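For reference, a minimal sketch of that approach with the Python google-cloud-datastore client; the DummyRoot kind and the Transaction fields shown here are hypothetical:

from google.cloud import datastore

client = datastore.Client()

# Reserve an ID under a hypothetical root kind; no entity is ever written for it.
incomplete_key = client.key('DummyRoot')
allocated = client.allocate_ids(incomplete_key, 1)
payment_ref_id = allocated[0].id  # unique, will not be handed out again

# Store the reserved ID as an ordinary property on the real Transaction entity.
txn_key = client.key('User', 'alice', 'Transaction')  # partial child key
txn = datastore.Entity(key=txn_key)
txn.update({'payment_ref_id': payment_ref_id, 'amount': 42})
client.put(txn)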

Related

Bulk writes when one object requires the Key of another

Trying to figure out a good solution to this. Using Python and the NDB library.
I want to create an entity, and that entity is also tied to another entity. Both are created at the same time. An example would be creating a Message for a large number of users. We have an Inbox table/kind, and a Message table.
So once we gather the Keys of all the users we want, what I'm doing is creating the Inbox entity, saving it, then taking the Key it returns, attaching it to the Message, and saving the Message. For a large number of users this seems pretty expensive: two sequential writes per user. Normally I would just create the objects themselves and use ndb.put_multi() to batch the writes, but since there is no Key until an entity is saved, I can't do that.
Hope that made sense. Ideas?
Take a look at the allocate_ids API. You can pass a parent key and get IDs assigned; the allocate_ids call guarantees that an ID is never reused within that parent key context, and it is a small, fast operation. Once you have allocated the IDs you can build the keys yourself and do the put_multi, referencing the allocated IDs from the other entities that need them. As I understand it, the Message entity itself is not referenced by anything, so you only need to allocate IDs for Inbox (presumably only if the user doesn't already have one) and then put_multi both the Inbox and Message entities.
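A rough sketch of that flow with the legacy GAE ndb library; the Inbox and Message models and their properties are assumptions for illustration only:

from google.appengine.ext import ndb

class Inbox(ndb.Model):
    owner = ndb.KeyProperty(kind='User')

class Message(ndb.Model):
    inbox = ndb.KeyProperty(kind='Inbox')
    body = ndb.TextProperty()

def fan_out_message(user_keys, body):
    # Reserve one Inbox ID per user up front; allocate_ids returns an
    # inclusive (first, last) range of integer IDs that won't be reused.
    first, last = Inbox.allocate_ids(size=len(user_keys))
    inbox_keys = [ndb.Key(Inbox, n) for n in range(first, last + 1)]

    inboxes = [Inbox(key=k, owner=u) for k, u in zip(inbox_keys, user_keys)]
    messages = [Message(inbox=k, body=body) for k in inbox_keys]

    # One batched write instead of two sequential puts per user.
    ndb.put_multi(inboxes + messages)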

How to generate and guarantee unique values in firestore collection?

Let's say we have an order collection in Firestore where each order needs a unique, readable, random order number of, say, 8 digits:
{
orderNumber: '19456734'
}
So for every incoming order we want to generate this unique number. What is the recommended approach in Firestore to make sure no other document is using it?
Note: one solution would be querying existing docs before saving, but this does not work in a concurrent scenario where multiple orders arrive at the same time.
The easiest way to guarantee that some value is unique in a collection is to use that value as the key/ID of the documents in that collection. Since keys/IDs are by definition unique within their collection, this implicitly enforces your requirement.
The only built-in way to generate unique IDs is calling the add() method, which generates a random, statistically unique ID for the new document. If you don't want to use such auto-generated IDs to identify your orders, you'll have to roll your own mechanism.
The two most common approaches:
Generate a random number and, in a transaction, check that it isn't already taken, so that no two instances can claim the same ID (see the sketch after this list).
Keep a global counter (typically in a document at a well-known location) of the latest ID you've handed out, and then read-increment-write that in a transaction to get the ID for any new document. This is typically what other databases do for their built-in auto-ID fields.
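A minimal sketch of the first approach with the Python Firestore client; the orders collection name, the 8-digit generator and the retry policy are assumptions, not a prescribed implementation:

import random
from google.cloud import firestore

db = firestore.Client()

@firestore.transactional
def create_order(transaction, order_number, order_data):
    ref = db.collection('orders').document(order_number)
    # The read happens inside the transaction, so two concurrent attempts to
    # claim the same number cannot both succeed.
    if ref.get(transaction=transaction).exists:
        raise ValueError('order number already taken: %s' % order_number)
    transaction.create(ref, order_data)

def place_order(order_data, max_attempts=5):
    for _ in range(max_attempts):
        number = str(random.randint(10_000_000, 99_999_999))
        try:
            create_order(db.transaction(), number, order_data)
            return number
        except ValueError:
            continue  # collision, try another random number
    raise RuntimeError('could not find a free order number')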

Ancestor index or global index?

I have an entity that represents a relationship between two entity groups, but the entity belongs to one of the groups. However, my queries for this data will mostly be made from the other entity group. To support those queries I see two choices: a) create a global (non-ancestor) index that has the other entity group's key as a prefix, or b) move the entity into the other entity group and use an ancestor index.
I saw a presentation which mentioned that ancestor indexes map internally to a separate table per entity group, while there is a single table for a global index. That makes me feel that ancestor indexes are better than global indexes that include the ancestor key as a prefix, for this specific use case where I will always be querying in the context of some ancestor key.
Looking for guidance on this in terms of performance, storage characteristics, transaction latency and any other architectural considerations to make the final call.
From what I was able to find, I would say it depends on the type of work you'll be doing. I looked at these docs and they suggest you avoid writing to an entity group more than once per second; also, indexing a property can result in increased latency. They also state that if you need strong consistency for your queries, you should use an ancestor query. The docs contain plenty of advice on how to avoid latency and other issues, which should help you make the call.
I ended up using a third option, which is to denormalize another entity into the other entity group and run ancestor queries on it. This allows me to efficiently query data for either of the entity groups. Since I was already using transactions, denormalizing doesn't cause any inconsistencies, and everything seems to work well.
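To make the two original options concrete, this is roughly how the two query shapes look with the Python google-cloud-datastore client; the Relation kind and the other_group property are hypothetical names:

from google.cloud import datastore

client = datastore.Client()
other_group_key = client.key('OtherGroup', 123)

# Option a) global index: the relationship entity stores the other group's
# key as an indexed property, and a non-ancestor query filters on it.
q_global = client.query(kind='Relation')
q_global.add_filter('other_group', '=', other_group_key)

# Option b) ancestor index: the relationship entity lives inside the other
# entity group, and a (strongly consistent) ancestor query scopes to it.
q_ancestor = client.query(kind='Relation', ancestor=other_group_key)

results_a = list(q_global.fetch())
results_b = list(q_ancestor.fetch())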

Look up the existence of a large number of keys (up to 1M) in datastore

We have a table with 100M rows in google cloud datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when a new user says they like documents {a1, a2, ..., an}, we want to tell whether each document ak (k in 1..n) has already been crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we start crawling them immediately. Yes, the ultimate goal is to retrieve all these documents' content and use it to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to check the existence of every key in a set of 1M, I still need to issue 1000 requests.
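For concreteness, a batched existence check along these lines might look like this with the Python google-cloud-datastore client (the chunking simply mirrors the 1K per-request limit):

from google.cloud import datastore

client = datastore.Client()

def existing_keys(keys, chunk_size=1000):
    # Return the subset of keys whose entities already exist in the datastore.
    found = set()
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        entities = client.get_multi(chunk)  # missing keys are simply omitted
        found.update(entity.key for entity in entities)
    return found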
An alternative is to use a customized middle layer to provide a quick lookup (for example, using a Bloom filter or something similar) to save the round trips of multiple requests. Assuming we never delete keys, every time we insert a key we also add it to the middle layer. The Bloom filter keeps track of which keys we have (with a tolerable false-positive rate). Since this is a custom layer, we could provide a micro-service without such a limit, say one that can answer a single request asking about the existence of 1M keys. However, this definitely increases our design/implementation complexity.
Is there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue, see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it you can have the URL stored as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. This is just eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query result. Not a big deal, it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
if it does just create the like mapping entity
if it doesn't and you used the above-mentioned unique key identifiers:
create the document entity and launch its crawling job
create the like mapping entity
otherwise:
launch the crawling job which creates the document entity taking care of deduplication
launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups; you only do one at a time, which is OK: a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.
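A condensed sketch of that per-document flow with the Python google-cloud-datastore client; the kind names, the SHA-256 key derivation and the enqueue_crawl() helper are all hypothetical:

import hashlib
from google.cloud import datastore

client = datastore.Client()

def doc_key(url):
    # Derive a short, unique key name from the URL to sidestep key-length limits.
    return client.key('Document', hashlib.sha256(url.encode()).hexdigest())

def like(user_key, url):
    key = doc_key(url)
    if client.get(key) is None:
        # Document not seen yet: create it and schedule its crawling job.
        doc = datastore.Entity(key=key)
        doc.update({'url': url, 'crawled': False})
        client.put(doc)
        enqueue_crawl(url)  # hypothetical task-queue helper

    # The "like" is a small mapping entity of its own, keyed on (user, document).
    like_key = client.key('Like', '%s|%s' % (user_key.id_or_name, key.name))
    like_entity = datastore.Entity(key=like_key)
    like_entity.update({'user': user_key, 'document': key})
    client.put(like_entity)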

Symfony2 Doctrine merge

I am studying https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/working-with-associations.html but I cannot figure out what cascade merge does. I have seen elsewhere that
$new_object = $em->merge($object);
basically creates a new managed object based on $object. Is that correct?
$em->merge() is used to take an entity which has been taken out of the context of the entity manager and 'reattach' it.
If the Entity was never managed, merge is equivalent to persist.
If the Entity was detached, or serialized (put in a cache perhaps) then merge more or less looks up the id of the entity in the data store and then starts tracking any changes to the entity from that point on.
Cascading a merge extends this behavior to associated entities of the one you are merging. This means that changes are cascaded to the associations and not just the entity being merged.
I know this is an old question, but I think it is worth mentioning that $em->merge() is deprecated and will be removed soon. Check here
Merge operation is deprecated and will be removed in Persistence 2.0.
Merging should be part of the business domain of an application rather
than a generic operation of ObjectManager.
Also, please read this doc (v3) on how they expect entities to be stored:
https://www.doctrine-project.org/projects/doctrine-orm/en/latest/cookbook/entities-in-session.html#entities-in-the-session
It is a good idea to avoid storing entities in serialized formats such
as $_SESSION: instead, store the entity identifiers or raw data.
