In my design I have to store a lot of properties (say 20) in the same datastore table.
But most of the entities will usually populate only around 5 of those properties.
Is this design a resource consuming idea? Will the unused properties consume any memory or performance?
Thanks,
Karthick.
If I understand your question correctly, you are envisioning a system where you have: A Kind in your Datastore where the Entities for that Kind can have differing subsets of a common property-key space W. Entity 1's property set might be {W[0], W[1]}, and Entity 2's property set might be {W[1], W[2], W[5]}. You want to know whether this polymorphism (or "schemalessness") will cost you space, and whether each Entity, as in some naive MySQL implementations
The short answer is no - due to the schemaless nature of Datastore, having polymorphic entities in a kind (the entities have all different names and combinations of properties) will not consume extra space. The only way to have these "unused" properties consume extra space is if you actually did set them on the entity but set them to "null". If you are using the low-level API, you are manually adding the properties to the entity before saving it. Think of these as properties on a JSON object. If they aren't there, they aren't there.
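To make the JSON analogy concrete, here is a minimal sketch in plain Python. It does not use the Datastore API; the property names, the `ALL_PROPERTIES` list and the `stored_size` helper are hypothetical, and the JSON byte count is only a rough stand-in for storage cost, not Datastore's actual encoding.

```python
import json

# Hypothetical entities of one kind; each stores only the properties it sets.
entity_a = {"name": "Ann", "email": "ann@example.com"}
entity_b = {"name": "Bob", "phone": "555-0100", "city": "Pune"}

# Same data as entity_a, but with the "unused" properties explicitly set to None.
ALL_PROPERTIES = ["name", "email", "phone", "city", "zip"]
entity_a_padded = {key: entity_a.get(key) for key in ALL_PROPERTIES}


def stored_size(entity):
    """Rough stand-in for storage cost: bytes of the JSON encoding."""
    return len(json.dumps(entity).encode("utf-8"))


sparse_size = stored_size(entity_a)
padded_size = stored_size(entity_a_padded)
print(sparse_size, padded_size)
```

The sparse entity simply has no entry for `phone`, `city` or `zip`, while the padded one pays for every explicit null, which mirrors the point above about only set properties taking space.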
In MySQL, having a table with many NULL-able columns can be a bad idea, depending on the engine, indexes, etc., but take a look at this talk if you want to learn more about how the Datastore actually stores its data using BigTable. It's a different storage implementation underneath, and so there are different best practices and possibilities.
Related
Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
E.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
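As a rough illustration of "data sliced already in the format your application requires", here is a small in-memory sketch in plain Python. It is not the DynamoDB API; the attribute names `PK` and `SK`, the `USER#`/`PUBLISHED#`/`DRAFT#` key formats and the `query` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical items; different entity types share one "table".
items = [
    {"PK": "USER#1", "SK": "PROFILE", "name": "Ann"},
    {"PK": "USER#1", "SK": "DRAFT#2019-03-01", "title": "Notes"},
    {"PK": "USER#1", "SK": "PUBLISHED#2019-01-15", "title": "Intro"},
    {"PK": "USER#1", "SK": "PUBLISHED#2019-02-20", "title": "Follow-up"},
    {"PK": "USER#2", "SK": "PUBLISHED#2019-02-01", "title": "Other"},
]

# Group by partition key and keep each partition sorted by sort key.
partitions = defaultdict(list)
for item in items:
    partitions[item["PK"]].append(item)
for rows in partitions.values():
    rows.sort(key=lambda row: row["SK"])


def query(pk, sk_prefix=""):
    """Return items in one partition whose sort key starts with sk_prefix."""
    return [row for row in partitions.get(pk, []) if row["SK"].startswith(sk_prefix)]


everything_for_user = query("USER#1")      # profile and documents in one call
published = query("USER#1", "PUBLISHED#")  # published documents by date
drafts = query("USER#1", "DRAFT#")         # draft documents by date
```

Because the state and the date are both folded into the sort key, one key layout serves the "published by date" and "drafts by date" patterns from the question, at the price of having to know those patterns up front.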
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that's not called out – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
In my opinion, the difference comes down to the cost of storage and provisioned throughput.
Apart from that, I'm not sure there is much to gain, given the new limit of 20 GSIs per table.
Official recommendation from the team is, to my knowledge, to put all datatypes into single collection that have something like type=someType field on documents to distinguish types.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How to organize things so that things that should go together end up in same partition?
For example, let's say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment, in same collection, how do we ensure that user, his blogposts and all the corresponding comments end up in same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
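A minimal sketch of such a composite key, applied to the user/blog post/comment example from the question. This is plain Python rather than the Cosmos SDK; the `partition_key` and `make_document` helpers, the `tenantId:ownerId` format and the field names are hypothetical choices for illustration only.

```python
def partition_key(tenant_id, blog_owner_id):
    """Build a hypothetical composite partition key value."""
    return f"{tenant_id}:{blog_owner_id}"


def make_document(doc_type, doc_id, tenant_id, blog_owner_id, **fields):
    """Create a document that carries a generic partitionKey property."""
    doc = {
        "id": doc_id,
        "type": doc_type,
        "partitionKey": partition_key(tenant_id, blog_owner_id),
    }
    doc.update(fields)
    return doc


user = make_document("user", "u1", "t1", "u1", name="Ann")
post = make_document("blogPost", "p1", "t1", "u1", title="Hello")
# A comment written by user u2 still lives in the partition of blog owner u1.
comment = make_document("blogPostComment", "c1", "t1", "u1", authorId="u2", text="Nice")

print(user["partitionKey"], post["partitionKey"], comment["partitionKey"])
```

The application decides the key value at write time, so all three document types land in the same logical partition even though they share no other field.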
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross-partition though. The system is still quick. If possible, limiting the number of partitions that need to be touched for a given query is ideal, but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a Tenant, which will group many documents (which can be of different types, as you have mentioned). So the partition key in this instance should be the TenantId. The partitioning is more about the data relating to a group than the type of data. If the data is related to a User then you could use the UserId. However, many users may comment on the same posts, so it doesn't seem like a good candidate for a partition key unless the user info is de-normalized so that it doesn't have to relate back to the other users directly, if that makes sense.
I have a Core Data model with something like 20 entities. I want all entities to have common attributes. For example, all of them have a creation date attribute.
I therefore introduced a common entity containing all the common attributes, and all the other entities inherit from this common entity.
This is fine and works well, but then, all entities end up in one single SQLite table (which is rather logical).
I was wondering if there was any clear drawback to this ?
For example, when going in real life with 1000+ objects of each entity, would the (single) table become so huge that terrible performance problems could happen ?
This question has been asked before:
Core Data entity inheritance --> limitations?
Core data performances: when all entities inherit from the same parent entity
Core Data inheritance vs no inheritance
Also keep in mind that when you want to check the SQLite file for debugging purposes, separate tables are easier to examine.
I would use a common NSManagedObject subclass instead of a parent entity.
Don't worry about this. From Core Data documentation:
https://developer.apple.com/library/tvos/documentation/Cocoa/Conceptual/CoreData/Performance.html
... The SQLite store can scale to terabyte-sized databases with billions of rows, tables, and columns. Unless your entities themselves have very large attributes or large numbers of properties, 10,000 objects is considered a fairly small size for a data set.
What is way more important is that if you are doing any heavy operations, like fetching a lot of objects or parsing objects based on some JSON from a web service, you do not do this on the main thread. This is not very hard to do: look into parent/child managed object contexts and how they can be used with a private or main queue concurrency type. Many good blog posts about this subject exist all over the interwebs.
I've been working on a project with one base entity for around 20 subentities and easily overall 50k instances for over 2 years now. We've never had performance problems with selects, inserts or updates.
The keys to using Core Data inheritance with large data sets are
optimized fetch requests (tune predicate, exclude irrelevant properties, prefetch relationships, omit subentities, set fetchLimit, use dictionary result type or count-requests if sufficient etc.)
batch saves (meaning not saving the MOC after every insert etc.)
setting up proper indices (they can speed up selects a looot)
structuring your UI appropriately so you won't have to load and display many thousand objects in one viewController
We do not even use parent/child managedObjectContexts or private queues (which introduce a lot of extra complexity on their own) when importing JSON, as our data model and mapping code is so highly optimized, that the UI doesn't even flicker or hang considerably when importing a few thousand objects.
One of my datastore entities is growing to have too many fields, so it could become a future performance bottleneck.
As of now the entity consists of 100 fields. If I need to fetch 100 entities, each having 100 fields, that would definitely be a performance hit (considering the underlying data serialization and deserialization while fetching the data from the datastore).
So is it a good idea to convert the entire entity to a blob and store it with a key value and later logically parse the data back into a required object format?
Any valuable suggestions please?
Unless you have done some profiling and find that serialization is a real bottleneck, I wouldn't worry about how many fields you have. Object assembly and disassembly in Java is fast. In the unlikely event that you actually are hitting limits (say, thousands of entities with thousands of fields) you can write a custom Objectify translator that eliminates all the reflection overhead.
This sounds like premature optimization.
I'm not so sure that converting the entity to a blob will increase the performance much, since you will still need to deserialise the blob into an entity later on in your application code.
If you never need all the fields of the object, then one method of increasing performance is to use projection queries. (See https://developers.google.com/appengine/docs/java/datastore/projectionqueries)
Projection queries basically allow you to return back only the properties you require. This works because it uses the information stored in the indexes, hence it never needs to deserialise the entity. This means that you must have an index defined for any projection query you use.
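The idea of answering a query from index data alone can be pictured with a small in-memory sketch. This is plain Python, not the App Engine projection query API; the `entities` data, the `index_name_age` structure and the `projection_query` function are hypothetical and only mimic the principle.

```python
# Hypothetical full entities, each with one large property we do not need.
entities = {
    "k1": {"name": "Ann", "age": 30, "bio": "x" * 1000},
    "k2": {"name": "Bob", "age": 25, "bio": "y" * 1000},
}

# Hypothetical index on (name, age): one small entry per entity.
index_name_age = sorted((e["name"], e["age"], key) for key, e in entities.items())


def projection_query():
    """Answer from index entries alone, without touching the full entities."""
    return [{"name": name, "age": age} for name, age, _key in index_name_age]


rows = projection_query()
print(rows)
```

The large `bio` property is never read, which is why a projection only works for properties that are present in an index.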
I have a node type that has a string property that will very often have the same value, e.g. millions of nodes with only 5 options for that string value. I will be doing searches by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. Even more beneficial if you do not have to look into properties for a traversal.
Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall the Node + Relationship approach looks more appealing in that only a single relationship reference needs to be maintained each time rather than a property string and you don't need to scan an extra additional index which has to be maintained on the property (memory and performance would intuitively be in favor of this approach).
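The difference between the two approaches can be pictured with a plain in-memory sketch. This is Python, not Cypher, and it says nothing about Neo4j's internal storage; the `nodes` data and both lookup functions are hypothetical.

```python
from collections import defaultdict

# Option a): the state is a property on every node; a search scans all nodes
# (or needs an extra index on the property).
nodes = {
    1: {"state": "STARTED"},
    2: {"state": "COMPLETED"},
    3: {"state": "STARTED"},
}


def find_by_property(state):
    return sorted(node_id for node_id, props in nodes.items() if props["state"] == state)


# Option b): one "special" node per state; the state node keeps the set of
# nodes related to it, so a search starts there and follows the references.
members_of_state = defaultdict(set)
for node_id, props in nodes.items():
    members_of_state[props["state"]].add(node_id)


def find_by_state_node(state):
    return sorted(members_of_state.get(state, set()))


print(find_by_property("STARTED"), find_by_state_node("STARTED"))
```

Both functions return the same nodes; the second one starts from the state and never inspects nodes in other states, which is the intuition behind preferring the Node + Relationship approach.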
Another advantage is that it easily supports the ability of a node being linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using "in").
In practice I found that the problem then became: how do you access these special nodes each time? Either you maintain some sort of constants reference holding the node IDs of these special nodes, so you can jump right into them in your START statement (this is what we do), or you need to do a search against a property of the special node each time (name, perhaps) and then traverse down its relationships. This doesn't make for the prettiest of Cypher queries.