Objectify: is there a way to know whether an entity is loaded from the Objectify session or directly from the datastore?

I have two questions:
1) Suppose I have already loaded 50 entities from the datastore using a set of filters, so they are present in the Objectify session. If, a while later, I try to load the same entities with a different set of filters, will they be fetched from the Objectify session or from the datastore?
2) I have 50 entities already loaded and available in the Objectify session. Now I load entities with a set of filters that matches 55 entities; 50 of them are the ones I have already loaded and the other 5 are new. Will all 55 entities be fetched from the datastore, or will the 50 come from the session and only the remaining 5 from the datastore?

Objectify always prefers to give you objects from the session. The answer to 1 is that you will get objects from the session. The answer to 2 is that you will get (as many as possible) objects from the session.
Keep in mind that queries (i.e. anything other than get-by-key operations) always reach out to the datastore to execute. Depending on a variety of factors, Objectify might issue a keys-only query and then perform a batch get-by-key for any "missing" entities, or it might issue a full query and throw out any extra data that is already present in the session.
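For illustration, here is a minimal sketch of the difference, using the Objectify API; the Car entity and its color field are hypothetical and only serve as an example:

import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

import java.util.List;

@Entity
class Car {                     // hypothetical entity, used only for illustration
    @Id Long id;
    String color;
}

class SessionCacheDemo {
    void demo() {
        // Get-by-key: the second load is answered from the Objectify session cache
        // and returns the very same Java instance as the first.
        Car first = ofy().load().type(Car.class).id(42L).now();
        Car second = ofy().load().type(Car.class).id(42L).now();
        assert first == second;

        // A query always executes against the datastore, but matching entities that
        // are already in the session are handed back as the cached instances.
        List<Car> reds = ofy().load().type(Car.class).filter("color", "red").list();
    }
}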

Related

Should an entity be written to ensure auto-allocated IDs are not reused?

I have a User entity group and a Transaction entity under it. I auto-allocate IDs for the transactions. I want to create a unique key for integration with the payment service. Since the transaction is not a root entity, the auto-allocated IDs are not guaranteed to be unique, so I can't use them as the unique keys. What I am currently doing, following the suggestion at
Google Cloud Datastore unique autogenerated ids
is to have a dummy root entity, allocate IDs for it, and store that ID with the transaction entity as a separate field. However, since it is a dummy, I am currently not writing the dummy entity itself to the datastore.
I have read other posts
When using allocateIds(), how long do unused IDs remain allocated?
and
Are deleted entity IDs once again available to App Engine if auto-generated for an Entity?
but I am still not sure. Do I have to insert this dummy entity with just its key? If not, how are all the allocated IDs for this dummy entity tracked, and what happens to the corresponding storage usage?
An entity key ID, together with the kind and ancestor (and possibly namespace as well), defines a unique entity key, which is meaningful even if the entity doesn't actually exist: it's possible to have child entities in an entity group ancestry attached to an ancestor which doesn't exist. From Ancestor paths (emphasis mine):
When you create an entity, you can optionally designate another entity
as its parent; the new entity is a child of the parent entity (note
that unlike in a file system, the parent entity need not actually
exist).
So whether your dummy entities actually exist should not matter: their key IDs, pre-allocated using allocateIds(), should never expire. From Assigning identifiers:
Datastore mode's automatic ID generator will keep track of IDs that
have been allocated with these methods and will avoid reusing them for
another entity, so you can safely use such IDs without conflict. You
can not manually choose which values are returned by the
allocateIds() method. The values returned by allocateIds() are
assigned by Datastore mode.
Personal considerations supporting this opinion:
the datastore has no limit on the number of entities of the same kind, ancestor and namespace, so it should support a virtually unlimited number of unique IDs. IMHO this means there should be no need to even consider re-using them, which is probably why there is no mention of any deadline or expiry time for allocated IDs.
if IDs of deleted entities were ever reused, it would raise a significant problem for restoring datastore entities from backups: newer entities with re-used IDs could be overwritten by the entities that used the same IDs previously
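As a rough sketch of the pattern described in the question, assuming the Cloud Datastore Java client and hypothetical kind and property names: the dummy root kind is used only as an ID allocator and is never written.

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.IncompleteKey;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.KeyFactory;
import com.google.cloud.datastore.PathElement;

public class UniqueTransactionRef {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        // Allocate a globally unique numeric ID against a dummy root kind.
        // The dummy entity itself is never written; the allocator still guarantees
        // this ID will not be handed out again for that kind.
        IncompleteKey dummyRoot = datastore.newKeyFactory().setKind("UniqueIdAllocator").newKey();
        long paymentRef = datastore.allocateId(dummyRoot).getId();

        // The Transaction keeps its own (ancestor-scoped) auto-allocated key;
        // the allocated root ID is stored as a separate property for the payment service.
        KeyFactory txFactory = datastore.newKeyFactory()
                .addAncestors(PathElement.of("User", "alice"))
                .setKind("Transaction");
        Key txKey = datastore.allocateId(txFactory.newKey());
        Entity tx = Entity.newBuilder(txKey).set("paymentRef", paymentRef).build();
        datastore.put(tx);
    }
}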

Look up the existence of a large number of keys (up to 1M) in the datastore

We have a table with 100M rows in Google Cloud Datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user and he/she says he/she likes document {a1, a2, ..., an}, we want to tell if all these document ak {k in 1 to n} are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these document content and use them to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1,000 keys [1]. However, to check the existence of every key in a set of 1M, I would still need to issue 1,000 requests.
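For concreteness, a minimal sketch of that batched approach, assuming the Cloud Datastore Java client: fetch() returns results in request order with null for keys that don't exist, so chunking the key set into batches yields an existence map (a keys-only query per batch would be cheaper in read cost, but this is the simplest form).

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchExistenceCheck {
    private static final int BATCH_SIZE = 1000;  // per-request key limit from the datastore docs [1]

    // Returns, for each input key, whether an entity with that key currently exists.
    public static Map<Key, Boolean> exists(Datastore datastore, List<Key> keys) {
        Map<Key, Boolean> result = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i += BATCH_SIZE) {
            List<Key> batch = keys.subList(i, Math.min(i + BATCH_SIZE, keys.size()));
            List<Entity> fetched = datastore.fetch(batch.toArray(new Key[0]));
            for (int j = 0; j < batch.size(); j++) {
                result.put(batch.get(j), fetched.get(j) != null);
            }
        }
        return result;
    }
}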
An alternative is to use a customized middle layer to provide a quick look up (for example, can use bloom filter or something similar) to save the time between multiple requests. Assuming we never delete keys, every time we insert a key, we add it through the middle layer. The bloom-filter keeps track of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
Are there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking the problem down into a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue; see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it, you can store the URL as a property of the webpage entity instead. You'll then have to query for the entity by the url property to determine whether the webpage has already been considered for crawling. This is only eventually consistent, meaning it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query results. Not a big deal: it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
if it does just create the like mapping entity
if it doesn't and you used the above-mentioned unique key identifiers:
create the document entity and launch its crawling job
create the like mapping entity
otherwise:
launch the crawling job which creates the document entity taking care of deduplication
launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups; you only do one at a time, which is OK: a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.
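A minimal sketch of this per-document flow, assuming the Cloud Datastore Java client; the kind and property names (Document, Like, url) are hypothetical, and the key name is derived from a SHA-256 hash of the URL purely as one way of keeping it short and unique.

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class LikeFlow {

    // Derive a short, unique key name from the webpage URL (here: SHA-256 hex digest).
    static String keyNameFor(String url) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    static void userLikes(Datastore datastore, String userId, String url) throws Exception {
        Key docKey = datastore.newKeyFactory().setKind("Document").newKey(keyNameFor(url));

        // Strongly consistent get-by-key: does the document entity already exist?
        if (datastore.get(docKey) == null) {
            // Create the document entity; the crawling job would be enqueued elsewhere.
            datastore.put(Entity.newBuilder(docKey).set("url", url).build());
        }

        // Small mapping entity recording that this user likes this document.
        Key likeKey = datastore.newKeyFactory().setKind("Like")
                .newKey(userId + ":" + docKey.getName());
        datastore.put(Entity.newBuilder(likeKey).set("user", userId).set("doc", docKey).build());
    }
}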

Is there a way to set a Doctrine Event Listener that limits every entity query to a specific field id?

Setting Doctrine Event Listeners for Persisting and Updating entities is an extremely useful feature. But that appears to be limited to UPDATE or INSERT statements. What about SELECT queries?
Let's say I have a hosted CMS with a shared database where every record has a siteId denoting that the record belongs to a certain site. When I search for records (entities) to display on a given site, I need to limit the results to records for that specific siteId. I could manually limit every single query in my code, but that could easily be forgotten (which would be a security issue). So I am considering making this automatic for every single query (though I may need to override it in some instances).
So can you manipulate the DQL or QueryBuilder in some sort of preSelect Event Listener so that it makes sure that we always limit by the siteId?
And if so, ideally we could overwrite this in system administrative interfaces where we want to find all records regardless of siteId.
Is this possible? Is this a bad idea? Does this smell like Rosemary?
Take a look at Doctrine Filters.
Doctrine2 - is there a pre-selection hook?
http://docs.doctrine-project.org/en/latest/reference/filters.html
Doctrine 2.2 features a filter system that allows the developer to add
SQL to the conditional clauses of queries, regardless the place where
the SQL is generated (e.g. from a DQL query, or by loading associated
entities).
The filter functionality works on SQL level. Whether a SQL query is
generated in a Persister, during lazy loading, in extra lazy
collections or from DQL. Each time the system iterates over all the
enabled filters, adding a new SQL part as a filter returns.
By adding SQL to the conditional clauses of queries, the filter system
filters out rows belonging to the entities at the level of the SQL
result set. This means that the filtered entities are never hydrated
(which can be expensive).

Evict objects from the Objectify cache

This question is especially for the Objectify team.
I'm persisting my objects through this code pattern:
// Convert the POJO to a low-level Entity, add an extra raw property, then save it
// with the low-level datastore API (bypassing Objectify's usual save path).
Entity filled = ofy().save().toEntity(myPojo);
filled.setUnindexedProperty("myStuff", "computedSpecialValue");
datastore.put(filled);
Reading back my objects, I noticed they are fetched from the cache, since Objectify was not notified that it should evict the updated entity from its cache.
I like the Objectify cache feature since it saves me the time to grab data from memcache and reconstruct the objects on each read, so I want my objects to be cached, but I also want to be able to evict them.
This discussion says there was no solution in mid 2013, https://groups.google.com/forum/#!msg/objectify-appengine/n3FJjnYVVsk/6Xp99zReOKQJ
If it's still the case, I'd expect an API like
ofy().save().entity(myPojo).evict();
and by the way, I imagine the API would be more consistent if
Entity filled = ofy().save().toEntity(myPojo);
was replaced by
Entity filled = ofy().save().entity(myPojo).toEntity();
Naturally, there's a costly workaround to the issue:
save the entity twice (once manually, then again through Objectify)
While there is no formal API for evicting cache entries, it's not hard to do:
// "key" is the Objectify Key<?> of the entity whose cached copy should be evicted.
MemcacheServiceFactory
        .getMemcacheService(ObjectifyFactory.MEMCACHE_NAMESPACE)
        .delete(key.toWebSafeString());
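If the stale instance is coming from the per-request session cache rather than memcache, a minimal sketch of emptying it with Objectify's clear():

import static com.googlecode.objectify.ObjectifyService.ofy;

// Empty the per-request session cache so subsequent loads in this request
// re-read from memcache/datastore instead of returning the stale instance.
ofy().clear();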

Symfony2 Doctrine merge

I am studying https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/working-with-associations.html but I cannot figure out what cascade merge does. I have seen elsewhere that
$new_object = $em->merge($object);
basically creates a new managed object based on $object. Is that correct?
$em->merge() is used to take an Entity which has been taken out of the context of the entity manager and 'reattach' it.
If the Entity was never managed, merge is equivalent to persist.
If the Entity was detached, or serialized (put in a cache perhaps) then merge more or less looks up the id of the entity in the data store and then starts tracking any changes to the entity from that point on.
Cascading a merge extends this behavior to associated entities of the one you are merging. This means that changes are cascaded to the associations and not just the entity being merged.
I know this is an old question, but I think it is worth mentioning that $em->merge() is deprecated and will be removed; see the deprecation notice:
Merge operation is deprecated and will be removed in Persistence 2.0.
Merging should be part of the business domain of an application rather
than a generic operation of ObjectManager.
Also, please read this v3 doc on how they expect entities to be stored:
https://www.doctrine-project.org/projects/doctrine-orm/en/latest/cookbook/entities-in-session.html#entities-in-the-session
It is a good idea to avoid storing entities in serialized formats such
as $_SESSION: instead, store the entity identifiers or raw data.
