Tenant gets too big, move it into different container? - azure-cosmosdb

Let's assume I've started a SaaS platform with Azure Cosmos DB as my backend. I set up a container and used the tenant ID (a GUID) as the partition key. Things work well until we get two large clients. The overall system, and the two large clients in particular, would benefit a lot if we could move those two clients into a different container and use more fine-grained partition keys for them. This puts a bit of a burden on us at the application level, but it's doable.
How do we move everything under one partition key into a different container with a more fine-grained partition key? Is this something we can do "on-the-fly"? Do we need to take that tenant offline and use some sort of tool to migrate all the data? Is there a best practice?

So there's no built-in way of doing that, but there is a path forward: the change feed. Basically, you can use the change feed to replay all the data from the beginning of the container up to the latest change. You would need to implement a filter so you only act on changes for that tenant's partition key, and you would also need to implement a way to distribute that tenant's data across several finer-grained partitions. There are some limitations to the change feed, though.
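To make that concrete, here's a minimal sketch of the idea using the Python SDK (azure-cosmos). The container names, the tenantId/userId properties and the synthetic target key are assumptions for illustration, not anything Cosmos prescribes:

from azure.cosmos import CosmosClient

client = CosmosClient(url="https://myaccount.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("saas-db")
source = db.get_container_client("tenants")      # partition key: /tenantId
target = db.get_container_client("big-tenants")  # partition key: /partitionKey

BIG_TENANT = "11111111-2222-3333-4444-555555555555"  # hypothetical tenant GUID

# Read the change feed from the beginning so every existing document is seen.
# A live migration would also keep polling to catch writes made during the copy.
for doc in source.query_items_change_feed(is_start_from_beginning=True):
    if doc.get("tenantId") != BIG_TENANT:
        continue  # the feed isn't filtered server-side, so filter client-side
    # Fan the tenant's data out across finer-grained partitions, e.g. per user
    # (this assumes the documents carry a userId).
    doc["partitionKey"] = f"{doc['tenantId']}_{doc['userId']}"
    target.upsert_item(doc)

Once the new container has caught up, you flip the application's routing for that tenant over to it.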

Related

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
E.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
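As a hedged illustration of that single-call shape, here's what the question's published/draft pattern could look like with one overloaded GSI via boto3; the table name, index name and key attributes (GSI1PK/GSI1SK) are assumptions, not anything from the linked docs:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")

# Assumed item layout: every document item carries GSI1PK = "USER#<ownerId>"
# and GSI1SK = "PUBLISHED#<date>" or "DRAFT#<date>". One overloaded index then
# serves both "published by date" and "draft by date" for a given user:
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("USER#42")
    & Key("GSI1SK").begins_with("PUBLISHED#"),
)
published_docs = resp["Items"]  # already sorted by date via the sort key

Swapping begins_with("PUBLISHED#") for begins_with("DRAFT#") covers the second access pattern without a second index.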
I don't think it has any performance benefits, at least none that are called out anywhere – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My main reasons would be the cost of storage and provisioned throughput for the extra tables.
Apart from that I'm not sure it matters much, now that the limit has been raised to 20 GSIs per table.

CosmosDB/DocumentDB partitioning with multiple types in same collection

The official recommendation from the team is, to my knowledge, to put all data types into a single collection and give the documents something like a type=someType field to distinguish the types.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How to organize things so that things that should go together end up in same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment in the same collection, how do we ensure that a user, their blog posts and all the corresponding comments end up in the same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency, as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data to the benefit of whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
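As a small sketch of that idea, assuming the User/BlogPost/BlogPostComment types from the question and a container defined on /partitionKey, the application simply writes the same routing value onto related documents:

user = {"id": "user-1", "type": "user",
        "partitionKey": "user-1"}
post = {"id": "post-9", "type": "blogPost", "authorId": "user-1",
        "partitionKey": "user-1"}    # co-located with its author
comment = {"id": "comment-3", "type": "blogPostComment", "postId": "post-9",
           "partitionKey": "user-1"}  # co-located with the post and its author

# A single in-partition query now returns the user, their posts and the
# comments in one go:
#   SELECT * FROM c WHERE c.partitionKey = 'user-1'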
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a tenant which will group many documents (which can be of different types, as you mentioned), so the partition key in this instance should be the TenantId. Partitioning is more about the data relating to a group than the type of data. If the data is related to a user then you could use the UserId; however, many users may comment on the same posts, so UserId doesn't seem like a good candidate for a partition key unless the user info is de-normalized so a comment doesn't have to relate back to the other users directly, if that makes sense.

Best approach to having multiple users in one app

This is a mobile app which can have different kinds of users. I'm using Realm only for the offline storage. Say I have two users A and B, and I have a List class. This class won't ever be shared, so there is different data for each user. How would I go about designing the schema, considering versioning and migration?
A. Add a primary key for the List and assign it differently to user A and B.
B. Use two different realms
There is no one good way of defining your Realm schema and the solution to choose completely depends on the exact scenario.
If you want your users' data to be completely independent of each other, and you will never need a single query to retrieve both users' data or to access some common data, then using a separate Realm instance for each user seems like a good approach. It provides complete separation between your users' data.
However, if your users might have some shared data, or if you might end up computing some statistics across all of your users even though their data is independent, using a single Realm instance is the way to go. In this case you should just create a one-to-many relationship between each of your users and whatever objects you want to store in your lists, like this:
import RealmSwift

class User: Object {
    // Each user owns an independent list of Stuff objects.
    let stuff = List<Stuff>()
}

How to strike a performance balance with documentDB collection for multiple tenants?

Say I have:
My data is stored in a single DocumentDB collection for all of my tenants (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT by each tenant. I use some other scheme.
Because of this data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximize performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions; they come with quite a cost (basically, multiply the index and parsing costs by the number of partitions, which defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there are still limits on each partition (10K RU and 10GB) – I have written about it here http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced, and one idea is to make the partition key the tenantID plus some secondary value, to split the large tenants up. We haven't had to do that yet, so we haven't worked out all the details to know whether it would actually be possible and performant, but we think it would work.
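For instance, here is a hypothetical way such a composite key could be built; the tenant list, bucket count and hash choice are illustrative assumptions:

import hashlib

LARGE_TENANTS = {"tenant-a", "tenant-b"}  # tenants known to be oversized
SUB_PARTITIONS = 16

def partition_key(tenant_id: str, item_id: str) -> str:
    # Small tenants keep a single logical partition per tenant.
    if tenant_id not in LARGE_TENANTS:
        return tenant_id
    # Large tenants are fanned out across a fixed number of sub-partitions.
    bucket = int(hashlib.sha1(item_id.encode()).hexdigest(), 16) % SUB_PARTITIONS
    return f"{tenant_id}-{bucket}"

Reads for a small tenant still hit one partition, while a query across a large tenant fans out to at most SUB_PARTITIONS of them.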
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general. https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/
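In the Python SDK (azure-cosmos), for example, the equivalent of that FeedOptions flag is the enable_cross_partition_query parameter; the account, container and tenantId property below are placeholders:

from azure.cosmos import CosmosClient

container = (
    CosmosClient(url="https://myaccount.documents.azure.com:443/", credential="<key>")
    .get_database_client("saas-db")
    .get_container_client("data")
)

items = container.query_items(
    query="SELECT * FROM c WHERE c.tenantId = @t",
    parameters=[{"name": "@t", "value": "tenant-a"}],
    enable_cross_partition_query=True,  # fan the query out to all partitions
)
for item in items:
    print(item["id"])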

Rackspace CDN container organization

I'm developing a web platform that may reach some millions of users, where I need to store users' images and docs.
I'm using Rackspace and now I need to define the files logic into cloud files service.
Rackspace allows you to create up to 500,000 containers per account (reference: page 17, paragraph 4.2.2), and in addition they suggest limiting each container to 500,000 objects (reference: Best practice - Limit the Number of Objects in Your Container). What is the best practice for managing users' files?
One container per user doesn't seem to be a good solution because of the 500,000-container limit.
Rackspace suggests using virtual containers. I'm a bit undecided on how to use them.
Thanks in advance.
If you will only be interacting with the files via API calls, having 200,000 objects per container is fine (in my experience; I haven't had the need for anything larger).
If you want to use the web interface for any tasks at all, you need far, far fewer than that. The web interface does not break contents up by folder, so if you have 30,000 objects it will just paginate them and show them to you in alphabetical order. This is OK for containers with up to a few hundred objects, but beyond that the web interface is unusable.
If you have some number of millions of users, you can use some part of the user ID as a shard key to decide what bucket to use. See http://docs.mongodb.org/manual/core/sharding-internals/#sharding-internals-shard-keys for information about choosing a shard key. It's written for Mongo users, but is applicable here. The takeaway is pick some attribute that will distribute your users somewhat evenly so you don't have one bucket that exceeds the max number of files you want to have per bucket.
One way is to use user IDs, which we can randomly assign, and shard based on the first digit. For this example, we'll use the UIDs 1234, 2234, 1123 and 2134. Say you want to break files up by the first digit of the UID: you'd save the files for users 1234 and 1123 in the container "files_group_1" and the files for 2234 and 2134 in the "files_group_2" container.
Before picking a shard key, make sure you think about how many files users might store. If, for example, a user may store hundreds (or thousands) of files, then you will want to shard by a more unique key than the first digit of a UID.
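For instance, here's a hedged sketch of sharding on a hash of the full user ID instead of its first digit, so files spread evenly over a fixed set of containers (the container-name pattern and count are assumptions):

import hashlib

NUM_CONTAINERS = 256

def container_for(user_id: str) -> str:
    # Hash the whole ID so the distribution doesn't depend on how IDs are assigned.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_CONTAINERS
    return f"files_group_{bucket}"

print(container_for("1234"))  # always maps a given user to the same container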
Hope that helped.
