I am planning on building my solution, using massive amounts of properties.
Is there a limit how many properties can an entity have?
Which way is better, to divide the data into more kinds or fewer kinds and many properties?
You can find Datastore limits here.
https://cloud.google.com/datastore/docs/concepts/limits
Here it is said, that single entity can have up to 20000 of indexed properties. If you would like to not index them, then probably you can get even higher number of properties in an entity, but you will be limited by 1Mb limit of overall entity size.
Splitting on several kinds or using one highly depends on you task.
Related
If each of my database's an overview has only two types (state: pending, appended), is it efficient to designate these two types as partition keys? Or is it effective to index this state value?
It would be more effective to use a sparse index. In your case, you might add an attribute called isPending. You can add this attribute to items that are pending, and remove it once they are appended. If you create a GSI with tid as the hash key and isPending as the sort key, then only items that are pending will be in the GSI.
It will depend on how would you search for these records!
For example, if you will always search by record ID, it never minds. But if you will search every time by the set of records pending, or appended, you should think in use partitions.
You also could research in this Best practice guide from AWS: https://docs.aws.amazon.com/en_us/amazondynamodb/latest/developerguide/best-practices.html
Updating:
In this section of best practice guide, it recommends the following:
Keep related data together. Research on routing-table optimization
20 years ago found that "locality of reference" was the single most
important factor in speeding up response time: keeping related data
together in one place. This is equally true in NoSQL systems today,
where keeping related data in close proximity has a major impact on
cost and performance. Instead of distributing related data items
across multiple tables, you should keep related items in your NoSQL
system as close together as possible.
As a general rule, you should maintain as few tables as possible in a
DynamoDB application. As emphasized earlier, most well designed
applications require only one table, unless there is a specific reason
for using multiple tables.
Exceptions are cases where high-volume time series data are involved,
or datasets that have very different access patterns—but these are
exceptions. A single table with inverted indexes can usually enable
simple queries to create and retrieve the complex hierarchical data
structures required by your application.
Use sort order. Related items can be grouped together and queried
efficiently if their key design causes them to sort together. This is
an important NoSQL design strategy.
Distribute queries. It is also important that a high volume of
queries not be focused on one part of the database, where they can
exceed I/O capacity. Instead, you should design data keys to
distribute traffic evenly across partitions as much as possible,
avoiding "hot spots."
Use global secondary indexes. By creating specific global secondary
indexes, you can enable different queries than your main table can
support, and that are still fast and relatively inexpensive.
I hope I could help you!
Official recommendation from the team is, to my knowledge, to put all datatypes into single collection that have something like type=someType field on documents to distinguish types.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How to organize things so that things that should go together end up in same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment, in same collection, how do we ensure that user, his blogposts and all the corresponding comments end up in same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a Tenant which will group many documents (which can be of different types as you have mentioned) So the Partition Key in this instance should be the TenantId. The partitioning is more about the data relating to a group than the type of data. If the data is related to a User then you could use the UserId, however many users may comment on the same posts so it doesn't seem like a good candidate for a partition key unless there is some de-normalization of the user info so it doest have to relate back to the other users directly.. if that makes sense?
Say I have:
My data stored in documetDB's collection for all of my tenants. (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT by each tenant. I use some other scheme.
Because of this data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximum performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions, they come with quite a cost (basically multiply index and parsing costs with number of partitions - defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there's stil limits on each partition (10K RU and 10GB) - I have written about it here http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced and one idea is to make the partition key be the tenantID + to split the large tenants up. We haven't had to do that yet though so we haven't worked out all of those details to know if that would actually be possible and performant, but we think it would work.
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general. https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/
In my design I have to store a lot of properties(say a 20 properties) in a same datastore table.
But usually most of the entities will occupy a minimum of only 5 properties.
Is this design a resource consuming idea? Will the unused properties consume any memory or performance?
Thanks,
Karthick.
If I understand your question correctly, you are envisioning a system where you have: A Kind in your Datastore where the Entities for that Kind can have differing subsets of a common property-key space W. Entity 1's property set might be {W[0], W[1]}, and Entity 2's property set might be {W[1], W[2], W[5]}. You want to know whether this polymorphism (or "schemalessness") will cost you space, and whether each Entity, as in some naive MySQL implementations
The short answer is no - due to the schemaless nature of Datastore, having polymorphic entities in a kind (the entities have all different names and combinations of properties) will not consume extra space. The only way to have these "unused" properties consume extra space is if you actually did set them on the entity but set them to "null". If you are using the low-level API, you are manually adding the properties to the entity before saving it. Think of these as properties on a JSON object. If they aren't there, they aren't there.
In MySQL, having a table with many NULL-able columns can be a bad idea, depending on the engine, indexes, etc... but take a look at this talk if you want to learn more about how the Datastore actually stores it's data using BigTable. It's a different storage implementation underneath, and so there are different best practices or possibilities.
What is the best way to add huge amount of documents into riak? Let's say there are millions of product records, which change very often (prices, ...) and we want to update all of them very frequently. Is there a better way than replace keys one by one in Riak? Something as bulk set of 1000 documents at once...
There are unfortunately not any bulk operations available in Riak, so this has to be done by updating each object individually. If your updates however arrive in bulks, it may be worthwhile revisiting your data model. If you can de-normalise your products, perhaps by storing a range of products in a single object, it might be possible to reduce the number of updates that need to be performed by grouping them, thereby reducing the load on the cluster.
When modelling data in Riak you usually need to look at access and query patterns in addition to the structure of the data, and make sure that the model supports all types of queries and latency requirements. This quite often means de-normalising your model by either grouping or duplicating data in order to ensure that updates and queries can be performed as efficiently as possible, ideally through direct K/V access.