So I have a question about designing a Datastore database; I'm using Objectify and trying to get optimal performance.
So I need to create two entities, List and Listings, with a relationship. There will be 500,000 listings in all and 50,000 per list.
Looking at this https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Multi-Value_Relationship
I see there are three methods to store a relationship:
one-to-one, many-to-one, and multi-value relationship.
The multi-value relationship looks like it would work great but appears to have a limit of 5,000 entries per entity (the List?).
So I assume I should use the many-to-one method, but I question the performance of this, as I would have to query every listing and filter.
Can I have good performance doing what I'm attempting with datastore?
Any help at all would be great!
A multi-value relationship would not perform well in that case, because each value implies a new row in the field's index, which means longer write times. It also has a limit on the number of entries. It's useful when you only have a few values to store.
There is another type of relationship: the entity group.
The criteria for choosing between the methods also depend on the kinds of queries you run and how frequently you update entities.
Based on the information you provide, I recommend the many-to-one relationship.
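To make that concrete, here is a minimal Objectify 4-style sketch of the many-to-one approach. Entity and field names are just placeholders based on your question (the List entity is renamed so it doesn't clash with java.util.List, and entities would still need to be registered with ObjectifyService.register at startup):

import java.util.List;

import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;

import static com.googlecode.objectify.ObjectifyService.ofy;

// The "List" entity from the question.
@Entity
class ListingList {
    @Id Long id;
    String name;
}

@Entity
class Listing {
    @Id Long id;
    @Index Key<ListingList> list;   // many-to-one: each listing points at its list
    String title;
}

class ListingQueries {
    // A single indexed query returns the listings of one list; there is no
    // need to load all 500,000 listings and filter them in memory.
    static List<Listing> listingsOf(long listId) {
        return ofy().load()
                .type(Listing.class)
                .filter("list", Key.create(ListingList.class, listId))
                .list();
    }
}

The write cost stays constant per listing (one indexed key property), whereas a multi-value collection of 50,000 keys would have to live on the List entity itself, which is exactly where the per-entity limits bite.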
I have an entity that represents a relationship between two entity groups, but the entity belongs to only one of the groups. However, my queries for this data will mostly be in the context of the other entity group. To support those queries, I see two choices: a) create a global index that has the other entity group's key as a prefix, or b) move the entity into the other entity group and create an ancestor index.
I saw a presentation which mentioned that ancestor indexes map internally to a separate table per entity group, while there is a single table for the global index. That makes me feel that ancestors are better than a global index that includes the ancestor key as a prefix, for this specific use case where I will always be querying in the context of some ancestor key.
Looking for guidance on this in terms of performance, storage characteristics, transaction latency and any other architectural considerations to make the final call.
From what I was able to find, I would say it depends on the type of work you'll be doing. I looked at these docs and they suggest you avoid writing to an entity group more than once per second. Also, indexing a property can result in increased latency. They also state that if you need strong consistency for your queries, you should use an ancestor query. Those docs contain a lot of advice on how to avoid latency and other issues; they should help you make the call.
I ended up using a third option, which is to denormalize another entity into the other entity group and run ancestor queries on it. This lets me efficiently query data for either of the entity groups. Since I was already using transactions, denormalizing doesn't cause any inconsistencies, and everything seems to work well.
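A rough Objectify sketch of what that looks like, with hypothetical Owner/Item names standing in for the real entity groups (the key piece is the @Parent key, which places the denormalized copy in the other entity group so ancestor queries over it are strongly consistent):

import java.util.List;

import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Parent;

import static com.googlecode.objectify.ObjectifyService.ofy;

// Hypothetical entity groups; names are placeholders.
@Entity
class Owner {
    @Id Long id;
}

@Entity
class Item {
    @Id Long id;
    String title;
}

// Denormalized copy of the Item data, parented under Owner so that ancestor
// queries against the Owner entity group can serve the reads.
@Entity
class ItemByOwner {
    @Parent Key<Owner> owner;   // ancestor: puts this entity in the owner's group
    @Id Long id;
    Key<Item> source;           // back-reference to the original entity
    String title;               // whatever denormalized fields the queries need
}

class ItemByOwnerQueries {
    static List<ItemByOwner> itemsOf(long ownerId) {
        // Ancestor query: scoped to one entity group, strongly consistent.
        return ofy().load()
                .type(ItemByOwner.class)
                .ancestor(Key.create(Owner.class, ownerId))
                .list();
    }
}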
I'm having a hard time modeling a certain scenario without resorting to more than one request.
Think of a People table where each Person can be related to another Person n times, and each relationship has a description.
Consider the following model:
As you can see, I have two People, and person_0001 is a child of person_0002.
Now, in this case, if I want to get all the relationships person_0001 has, it's easy; I just query:
GET WHERE PK = "person_0001" AND SK.BEGINS_WITH("rel")
But, since it is bidirectional, how can I get the relationships person_0002 has?
I could use a GSI that inverts the keys, so with one request against the index I can query the relationship from the other side.
But the real problem comes when I need to update or delete: how can I delete/update all the relationships person_0002 has with only one request? I can only read from GSIs.
This is a difficulty I have in general: what do I do when I need to do a delete/update/write through a GSI?
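For reference, the multi-request pattern I'm describing (read the keys through the inverted index, then write against the base table, since GSIs are read-only) would look roughly like this with the AWS SDK for Java v2. The table name, index name, and key layout below are assumptions matching the model above (PK = person_0001, SK = rel#person_0002, GSI swaps PK and SK):

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class DeleteRelationships {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // 1) Read through the inverted GSI (partition key = SK, sort key = PK)
        //    to find every item that points at person_0002.
        QueryResponse relationships = ddb.query(QueryRequest.builder()
                .tableName("People")            // assumed table name
                .indexName("InvertedIndex")     // assumed GSI name
                .keyConditionExpression("SK = :sk")
                .expressionAttributeValues(Map.of(
                        ":sk", AttributeValue.builder().s("rel#person_0002").build()))
                .build());

        // 2) GSIs are read-only, so each delete targets the base table using
        //    the key attributes returned by the index query. These could be
        //    grouped into BatchWriteItem calls (25 at a time) to cut round
        //    trips, but it is still more than one request overall.
        for (Map<String, AttributeValue> item : relationships.items()) {
            ddb.deleteItem(DeleteItemRequest.builder()
                    .tableName("People")
                    .key(Map.of("PK", item.get("PK"), "SK", item.get("SK")))
                    .build());
        }
    }
}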
The official recommendation from the team is, to my knowledge, to put all data types into a single collection and distinguish them with something like a type=someType field on the documents.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How do we organize things so that items which belong together end up in the same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment in the same collection, how do we ensure that a user, his blog posts, and all the corresponding comments end up in the same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or do you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries, but then you need some dashboard to show aggregates over all the data. Is that something you just accept as inevitable and try to minimize, or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
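As a concrete illustration with the blog example from the question, every document type can carry the same generic partition key property whose value the application sets at write time, so a user, their posts, and the comments on those posts all land in the same logical partition. The class and property names below are only an assumption to show the shape:

// Every document type carries the same generic routing property; the
// application decides its value when the document is written.
class CosmosDocument {
    public String id;
    public String partitionKey;   // the collection is partitioned on this path
    public String type;           // "user" | "blogPost" | "blogPostComment"
}

class User extends CosmosDocument {
    public String name;
    // partitionKey = this user's id, e.g. "user-123"
}

class BlogPost extends CosmosDocument {
    public String authorId;
    public String title;
    // partitionKey = authorId, i.e. the same "user-123" as the author
}

class BlogPostComment extends CosmosDocument {
    public String postId;
    public String authorUserId;   // denormalized: the post author's id
    public String text;
    // partitionKey = the post author's id, still "user-123", so the user,
    // their posts, and all comments on those posts colocate
}

The cost of this layout is a small piece of denormalization: whoever writes a comment has to know the post author's id so it can stamp the right partition key on the comment.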
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a tenant, which will group many documents (which can be of different types, as you have mentioned), so the partition key in this instance should be the TenantId. The partitioning is more about the data relating to a group than about the type of data. If the data is related to a User, then you could use the UserId; however, many users may comment on the same posts, so it doesn't seem like a good candidate for a partition key unless there is some denormalization of the user info so it doesn't have to relate back to the other users directly, if that makes sense.
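If tenancy matters as well, the composite idea mentioned above just means the application builds the routing value from both parts at write time; a tiny hypothetical helper:

// Hypothetical helper: compose the synthetic partition key from the tenant
// and the grouping id the document belongs to.
final class PartitionKeys {
    private PartitionKeys() {}

    static String forUserData(String tenantId, String userId) {
        // All of one user's documents within a tenant share this value,
        // while different tenants still spread across partitions.
        return tenantId + "_" + userId;
    }
}

So a post by user-123 in tenant-42 would be written with partitionKey = PartitionKeys.forUserData("tenant-42", "user-123").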
I have two entities, User and Place, and have a many-to-many relationship between them in order to facilitate the concept of a user favoriting a place.
As part of the feature, I would like for the Place entity to contain a field which gives me the total number of users who have favorited it. I do not need the user entities themselves.
Reading the documentation, I've found several solutions, but I don't particularly like any of them.
Aggregate Field
Using this solution, I would simply define an integer field on the Place entity that is updated as favorites are added and removed. Thus, the value is not calculated on the fly and instead is retrieved as any other field would be.
I dislike this approach as there are issues with concurrency and it makes the concept unnecessarily complex to manage in code.
Eager Loading
Using this method, I would eagerly load the relationship as a bidirectional association so that the Place would load each User entity that has favorited it as part of the initial querying process. To get the count, I simply ask the collection for its count().
This results in fewer queries, but the amount of data retrieved is too much and does not scale well over time.
Extra Lazy Loading
This is what I am currently using. It is similar to the Eager Loading solution in that I ensure the relationship is bidirectional and simply ask the collection for its count(), but with the extra-lazy fetch mode Doctrine is intelligent enough to issue only a COUNT() query rather than retrieve the entire list of users associated with the Place entity.
The drawback here is that if I am loading N Place entities, I need N+1 queries as each Place will issue a separate COUNT() query.
Ideal Solution
My ideal solution would be to find a way to tell Doctrine to perform the first query to load the collection and then a second query to load all counts for the IDs within the collection and then populate the fields in their respective entities.
I have not found a way to do this easily.
Does anyone have any examples of this or are there other solutions for solving this problem that I may be overlooking?
If you know how to write your query in SQL, you can do it in Doctrine.
see this answer for reference:
Symfony2, Doctrine2 query - select with IF and COUNT
You can run a similar DQL:
SELECT p place, COUNT(u) cnt FROM YourBundle:Place p
LEFT JOIN p.users u
GROUP BY p
Note that the result array's elements are in the following form (each element is itself an array):
array(
'place' => your hydrated object (in your case the place),
'cnt' => your aggregated field (in your case the number of users),
)
While using graph databases (Neo4j in my case), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to a node's attribute list.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the database from different nodes down to a depth of 4 and examining the information through connected nodes or attributes (depending on which approach is used).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over, say, 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index these values, instead of having to resort to filtering on property values or indexing the property with an expensive index lookup.
The first one is much better, since you're querying on entities such as Stanford, and that entity is related to many person nodes. In my opinion, modeling as nodes is more intuitive and easier to query on. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important, drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
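As a rough illustration of why the node-based model is easier to query, the "friends of John who went to Stanford" question becomes a single Cypher pattern. The label, relationship, and property names below are assumptions, shown here through the Neo4j Java driver:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

import static org.neo4j.driver.Values.parameters;

public class FriendsAtStanford {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Stanford is a node, so the traversal starts at John and ends at
            // it; with Stanford as a mere attribute we could only filter.
            Result result = session.run(
                    "MATCH (john:Person {name: $name})-[:FRIEND]-(friend:Person)" +
                    "-[:WENT_TO]->(:University {name: $school}) " +
                    "RETURN friend.name AS friendName",
                    parameters("name", "John", "school", "Stanford"));

            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("friendName").asString());
            }
        }
    }
}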