Ancestor index or global index? - google-cloud-datastore

I have an entity that represents a relationship between two entity groups, but the entity belongs to only one of the groups. However, my queries for this data will mostly be in the context of the other entity group. To support these queries I see two choices: (a) create a global index that has the other entity group's key as a prefix, or (b) move the entity into the other entity group and create an ancestor index.
I saw a presentation that mentioned ancestor indexes map internally to a separate table per entity group, while there is a single table for a global index. That makes me feel that ancestor indexes are better than global indexes that include the ancestor key as a prefix, for this specific use case where I will always be querying in the context of some ancestor key.
Looking for guidance on this in terms of performance, storage characteristics, transaction latency and any other architectural considerations to make the final call.

From what I was able to find, I would say it depends on the type of work you'll be doing. I looked at these docs, and they suggest you avoid writing to an entity group more than once per second. Indexing a property can also result in increased latency. They also state that if you need strong consistency for your queries, you should use an ancestor query. The docs offer plenty of advice on how to avoid latency and other issues; they should help you make the call.

I ended up using a third option, which is to denormalize another entity into the other entity group and run ancestor queries on it. This lets me efficiently query data for either of the entity groups. Since I was already using transactions, denormalizing doesn't introduce any inconsistencies, and everything seems to work well.
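For illustration, a minimal Objectify sketch of that third option; all kind and field names here are assumptions, not taken from the actual setup:

import com.googlecode.objectify.Key;
import com.googlecode.objectify.VoidWork;
import com.googlecode.objectify.annotation.*;
import static com.googlecode.objectify.ObjectifyService.ofy;

// Hypothetical kinds: Link lives in entity group A, and LinkMirror is the
// denormalized copy living in entity group B, so each side supports
// strongly consistent ancestor queries.
@Entity
public class Link {
    @Id Long id;
    @Parent Key<GroupA> groupA;  // owning entity group
    Key<GroupB> groupB;          // reference to the other group
}

@Entity
public class LinkMirror {
    @Id Long id;
    @Parent Key<GroupB> groupB;  // denormalized into the other group
    Key<GroupA> groupA;
}

// Save both copies in one (cross-group) transaction so they never diverge.
ofy().transact(new VoidWork() {
    public void vrun() {
        ofy().save().entities(link, mirror);
    }
});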

Related

How are parent-child aggregate relationships managed in Axon Framework

The Axon documentation describes how to create a child aggregate from a parent, but not how to retrieve it or delete it (e.g. for cascading deletes).
Does the parent aggregate typically (explicitly) or automatically (internally) keep a list of references to the child aggregates?
Would such references be a collection of aggregate IDs, or, to be more object oriented, a collection of actual instance references to the child aggregates?
Another way to pose this question: What is different about child aggregates vs entities in multi-entity aggregates, and what is different about child aggregates vs totally independent aggregates?
I want a cascading delete (containment) model between parent and child, but I want separate concurrent access to the child objects in a very large collection, hence aggregate member entities are not suitable.
Also note a similar question in the forum: the OP, Jakob, describes a model at the end that includes his own table managing references for cascading... do I need that?
If you require the Entities to be separate Aggregates, then you will have to maintain a reference table from parent to child.
The support Axon provides to create child Aggregates from a parent Aggregate is to ensure the framework uses a single transaction to publish multiple events. By no means does Axon Framework automatically store the relationships for you.
Instead, all of this should be known within the event stream of the Aggregates. With that in mind, combined with Event Sourcing, you can source any form of data within the Aggregates.
To circle back to your cascading delete scenario: I've actually had direct contact with Jakob about the matter. In his case (and potentially yours) we ended up with an `aggregateId-to-childAggregateIds` model dedicated to keeping the references. Upon a delete from a parent Aggregate (on any level), this model is consulted, ensuring the right set of children is deleted too. Note that all of this is custom code.
Furthermore, this aggregateId-to-childAggregateIds model can be regarded as part of your Command Model (granted that you're aiming to apply CQRS). As such, it's purely used to drive decision-making. Where the decision-making, in this case, is deciding on the right children to send delete commands to.
So, to summarize:
Axon does not keep parent-child relations for you, other than in the contents of the events you publish.
I'd opt for the `aggregateId-to-childAggregateIds` model never storing entire Aggregate instances. You simply don't need all that data to decide whom to delete; the child's Aggregate identifier should suffice.
Axon's child Aggregate creation support is purely there to use a single transaction towards the event store to publish the parent's change and the creation of a child, while still benefitting from separate instances for increased concurrency. Axon's Aggregate Member support would mark the children as entities under the parent Aggregate Root instead of their own Aggregate instances.
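As a rough sketch of what such a reference model could look like (the event, command, and class names are hypothetical; only @EventHandler and CommandGateway are actual Axon API):

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.eventhandling.EventHandler;

// Hypothetical command-side projection: tracks parent -> child references
// from the published events and drives the cascading delete.
public class ChildReferenceModel {

    private final CommandGateway commandGateway;
    private final Map<String, Set<String>> childrenByParent = new ConcurrentHashMap<>();

    public ChildReferenceModel(CommandGateway commandGateway) {
        this.commandGateway = commandGateway;
    }

    @EventHandler
    public void on(ChildCreatedEvent event) {
        // Store only the identifier; the full child Aggregate is not needed
        // to decide whom to delete later.
        childrenByParent
                .computeIfAbsent(event.getParentId(), id -> new HashSet<>())
                .add(event.getChildId());
    }

    @EventHandler
    public void on(ParentDeletedEvent event) {
        // Cascade: send an explicit delete command to every known child.
        childrenByParent.getOrDefault(event.getParentId(), Collections.emptySet())
                .forEach(childId -> commandGateway.send(new DeleteChildCommand(childId)));
    }
}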

What are some solutions for avoiding the select(n+1) issue in Doctrine2 for aggregate count fields?

I have two entities, User and Place, and have a many-to-many relationship between them in order to facilitate the concept of a user favoriting a place.
As part of the feature, I would like for the Place entity to contain a field which gives me the total number of users who have favorited it. I do not need the user entities themselves.
Reading the documentation, I've found several solutions, but I don't particularly like any of them.
Aggregate Field
Using this solution, I would simply define an integer field on the Place entity that is updated as favorites are added and removed. Thus, the value is not calculated on the fly and instead is retrieved as any other field would be.
I dislike this approach as there are issues with concurrency and it makes the concept unnecessarily complex to manage in code.
Eager Loading
Using this method, I would eagerly load the relationship as a bidirectional association so that the Place would load each User entity that has favorited it as part of the initial querying process. To get the count, I simply ask the collection for its count().
This results in fewer queries, but the amount of data retrieved is too much and does not scale well over time.
Extra Lazy Loading
This is what I am currently using. It is similar to the Eager Loading solution in that I ensure the relationship is bidirectional and simply ask the collection for its count(), but using the extra lazy fetch mode (fetch="EXTRA_LAZY"), Doctrine is intelligent enough to issue only a COUNT() query rather than retrieve the entire list of users associated with the Place entity.
The drawback here is that if I am loading N Place entities, I need N+1 queries as each Place will issue a separate COUNT() query.
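For reference, this is the kind of mapping that enables that behavior; a sketch assuming annotation mappings and an owning-side property called favorites (both assumptions):

use Doctrine\ORM\Mapping as ORM;

/** @ORM\Entity */
class Place
{
    /**
     * Bidirectional many-to-many; EXTRA_LAZY makes $this->users->count()
     * issue a COUNT query instead of hydrating the whole collection.
     * @ORM\ManyToMany(targetEntity="User", mappedBy="favorites", fetch="EXTRA_LAZY")
     */
    private $users;
}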
Ideal Solution
My ideal solution would be to find a way to tell Doctrine to load the collection with a first query, load all counts for the IDs in the collection with a second query, and then populate the fields on their respective entities.
I have not found a way to do this easily.
Does anyone have any examples of this or are there other solutions for solving this problem that I may be overlooking?
If you know how to write your query in SQL, you can do it in Doctrine.
See this answer for reference:
Symfony2, Doctrine2 query - select with IF and COUNT
You can run a similar DQL query (note the GROUP BY, which the aggregation needs):
SELECT p AS place, COUNT(u) AS cnt FROM YourBundle:Place p
LEFT JOIN p.users u
GROUP BY p.id
Note that each element of the results array is itself an array of the form:
array(
'place' => your hydrated object (in your case the place),
'cnt' => your aggregated field (in your case the number of users),
)
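If you'd rather keep the first query untouched and load the counts separately (the "ideal solution" above), a second aggregate query over the already-loaded IDs should also work. A sketch, where :placeIds is an assumed parameter holding the IDs collected from the first query:

SELECT p.id AS placeId, COUNT(u) AS cnt
FROM YourBundle:Place p
LEFT JOIN p.users u
WHERE p.id IN (:placeIds)
GROUP BY p.id

You can then match each placeId back to its hydrated Place in memory and populate the count field, for a total of two queries no matter how many places you load.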

Does supplying an ancestor improve filter queries performance in objectify / google cloud datastore?

I was wondering which of these two is superior in terms of performance (and perhaps cost?):
1) ofy().load().type(A.class).ancestor(parentKey).filter("foo", bar).list()
2) ofy().load().type(A.class).filter("foo", bar).list()
Logically, I would assume (1) will perform better, as it restricts the domain on which to perform the filtering (though obviously there's an index either way). However, I was wondering if I may be missing some consideration or factor.
Thank you.
Including an ancestor should not significantly affect query performance (and as you note, each option requires a different index).
The more important consideration when deciding to use ancestor queries is whether you require strongly consistent query results. If you do, you'll need to group your entities into entity groups and use an ancestor query (#1 in your example).
The trade-off of grouping entities into entity groups is that each entity group only supports 1 write per second.
You can find a more detailed discussion of structuring data for strong consistency here: https://developers.google.com/datastore/docs/concepts/structuring_for_strong_consistency
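For completeness, a minimal sketch of what option (1) needs on the Objectify side; the parent kind and field names are assumptions:

import java.util.List;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.*;
import static com.googlecode.objectify.ObjectifyService.ofy;

@Entity
public class A {
    @Id Long id;
    @Parent Key<ParentKind> parent;  // places A in its parent's entity group
    @Index String foo;               // must be indexed for filter("foo", bar)
}

// Strongly consistent, ancestor-scoped query. Note that combining ancestor()
// with a property filter requires a composite index (with ancestor=true)
// declared in datastore-indexes.xml.
List<A> results = ofy().load()
        .type(A.class)
        .ancestor(parentKey)
        .filter("foo", bar)
        .list();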

Neo4j Design: Property vs "Node & Relationship"

I have a node type that has a string property that will very often have the same value, e.g. millions of nodes with only 5 possible values for that string. I will be doing searches by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. Even more beneficial if you do not have to look into properties for a traversal.
Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
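For example, with 2.0-style syntax (the Task label and state property are placeholder names):

// Schema index so label + property lookups are fast
CREATE INDEX ON :Task(state)

// Query by label and indexed property
MATCH (t:Task { state: 'SUBMITTED' })
RETURN t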
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall the Node + Relationship approach looks more appealing, in that only a single relationship reference needs to be maintained each time rather than a property string, and you don't need to scan an additional index that has to be maintained on the property (memory and performance would intuitively favor this approach).
Another advantage is that it easily supports a node being linked to multiple "special nodes". If you foresee a situation where your model should allow this, it is better than having to use a property array (and searching using "IN").
In practice I found that the problem then becomes how you access these special nodes each time. Either you maintain some sort of constants reference holding the node IDs of these special nodes, so you can jump right to them in your START statement (this is what we do), or you search against a property of the special node each time (name, perhaps) and then traverse down its relationships. This doesn't make for the prettiest of Cypher queries.
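To make the two options concrete, a Cypher sketch of each (the Task label, the SUBMITTED value, and the IN_STATE relationship type are made-up names):

// (a) state as a property, filtered with WHERE
MATCH (t:Task)
WHERE t.state = 'SUBMITTED'
RETURN t

// (b) state as a dedicated node, reached through a typed relationship,
// so no property ever needs to be read during the traversal
MATCH (t:Task)-[:IN_STATE]->(:State { name: 'SUBMITTED' })
RETURN t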

How is representing all information in Nodes vs Attributes affect storage, computations?

While using graph databases (in my case Neo4j), we can represent the same information in many ways: making each entity a Node and connecting all entities through relationships, or just adding the entities to the attribute list of a Node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the database from different nodes down to a depth of 4 and examining the information through connected nodes or attributes (depending on which approach is used).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of Storage, computations
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, one without many dense nodes of over, say, 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index these values, instead of having to resort to filtering on property values or indexing the property with an expensive index lookup.
The first one is much better, since you're querying on entities such as Stanford, and that entity is related to many person nodes. In my opinion, modeling as nodes is more intuitive and easier to query on. "Find all persons who went to Stanford" would not be very easy in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important, drives a lot of queries, or represents a timeline, you could even model the year as a node attached to Stanford.
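As an illustration of the node-based model, the "friends of John who went to Stanford" query could look like this in Cypher; the Person and School labels and the FRIEND and WENT_TO relationship types are assumed names:

MATCH (:Person { name: 'John' })-[:FRIEND]-(friend:Person)-[:WENT_TO]->(:School { name: 'Stanford' })
RETURN friend

A year filter would then just trim these results with an extra WHERE clause on whichever relationship or node carries it.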
