How to replicate a Cosmos DB container - azure-cosmosdb

I have a very large Cosmos DB container with data as old as several years.
I want to replicate the container. As I make changes to my container, I would like the replica to be kept up to date.
How can I achieve this?

Using the change feed is the best way to maintain a duplicate of your container.
Note that you need to use soft deletes (a tombstone flag) with the change feed, because hard deletes are not surfaced in it.
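For illustration, here is a minimal sketch of that pattern using the azure-cosmos Python SDK. The container names "source" and "replica", the partition key property "pk", and the "isDeleted" tombstone field are all assumptions; a production setup would persist a continuation token between runs and more likely use the change feed processor or an Azure Functions Cosmos DB trigger.

```python
from azure.cosmos import CosmosClient, exceptions

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("mydb")
source = db.get_container_client("source")     # the original container
replica = db.get_container_client("replica")   # the copy to keep up to date

# Read the change feed from the beginning; on later runs you would persist a
# continuation token instead of re-reading the whole container.
for doc in source.query_items_change_feed(is_start_from_beginning=True):
    if doc.get("isDeleted"):  # soft-delete tombstone (assumed field name)
        try:
            replica.delete_item(item=doc["id"], partition_key=doc["pk"])
        except exceptions.CosmosResourceNotFoundError:
            pass  # the item never reached the replica; nothing to remove
    else:
        replica.upsert_item(doc)  # inserts and updates both appear in the feed
```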

Related

Is it sensible to create unused generic local secondary indexes on my DynamoDB tables in case I need them later?

I currently have a need to add a local secondary index to a DynamoDB table but I see that they can't be added after the table is created. It's fine for me to re-create the table now while my project is in development, but it would be painful to do that later if I need another index when the project is publicly deployed.
That's made me wonder whether it would be sensible to re-create the table with the maximum number of secondary indexes allowed even though I don't need them now. The indexes would have generically-named attributes that I am not currently using as their sort keys. That way, if I ever need another local secondary index on this table in the future, I could just bring one of the unused ones into service.
I don't think it would be a waste of storage or a performance problem, because I understand that the indexes are only written to when an item is written that includes the attribute they index on.
I Googled to see if this idea was a common practice, but haven't found anyone talking about it. Is there some reason why this wouldn't be a good idea?
Don’t do that. If a table has any LSIs it follows different rules and cannot grow an item collection beyond 10 GB or isolate hot items within an item collection. Why incur these downsides if you don’t need to? Plus later you can always create a GSI instead of an LSI.
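For reference, adding a GSI to an existing table is a single UpdateTable call, so there is no need to reserve index slots up front. Here is a rough boto3 sketch; the table name "MyTable", the new key attribute "status", the index name, and the throughput numbers are placeholders.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="MyTable",
    AttributeDefinitions=[
        {"AttributeName": "status", "AttributeType": "S"},  # new GSI partition key
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "status-index",
                "KeySchema": [{"AttributeName": "status", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                # Omit ProvisionedThroughput if the table uses on-demand capacity.
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)
```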

Is there a way to clear contents of a CosmosDb container via DTUI command line

I use DTUI from the commandline to load documents into CosmosDB from various sources. It would be handy if I could clear the contents of the collection prior to loading. Is there a straightforward way of doing this?
There is no way to clear the contents of a container using the Data Migration Tool. In general you have two simple options:
(i) Set a very low TTL (for example, 1 second) on that particular container; it might take some time depending on the number and size of the documents in the container.
(ii) Simply delete and recreate the container in the portal.
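If you are scripting the load anyway, option (ii) is easy to automate. Below is a minimal sketch with the azure-cosmos Python SDK, where the database/container names and the "/pk" partition key path are placeholders. Note that deleting and recreating the container also resets any custom indexing policy or dedicated throughput, so capture those settings first if you need them.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("mydb")

db.delete_container("mycontainer")                # drops the container and all documents
db.create_container(                              # recreate it empty before loading
    id="mycontainer",
    partition_key=PartitionKey(path="/pk"),
)
```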

Most efficient way to change synthetic partition key values

I have a collection with thousands of documents all of which have a synthetic partition key property like:
partitionKey: ‘some-document-related-value’
Now I need to change the values of partitionKey. Of course, this requires recreating the documents, but I am wondering what the most efficient/straightforward way to do it is.
Should I use an Azure Function with a CosmosDBTrigger (set to start the feed from the beginning)?
A change feed processor?
Some other way?
I'm looking for the quickest solution that's still reliable.
Yes, the change feed is a common way to migrate data from one container to another. Another simple option may be to use the Data Migration Tool, where you build your new partition key in the SELECT statement.
Hopefully this is helpful.
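As a rough illustration of the copy step, here is a sketch using the azure-cosmos Python SDK that reads every document with a cross-partition query, rewrites partitionKey, and upserts into a new container. The container names and the compute_new_key() helper (with its customerId/region fields) are placeholders, and for very large containers the change feed or the Data Migration Tool mentioned above will scale better.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("mydb")
old = db.get_container_client("old-container")
new = db.get_container_client("new-container")

def compute_new_key(doc):
    # Placeholder: derive the new synthetic partition key however you need to.
    return f"{doc['customerId']}-{doc['region']}"

# Cross-partition query over every document in the source container.
for doc in old.query_items("SELECT * FROM c", enable_cross_partition_query=True):
    doc["partitionKey"] = compute_new_key(doc)
    new.upsert_item(doc)  # recreates the document under its new partition key
```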

DynamoDB: Is it worth indexing a table for a one-time migration effort?

We are migrating a ton of different tables with different attributes to another table using a script to do conversions into the new DynamoDB table formats.
Details aside, we need to add the "migrated" attribute to every item in the old tables. In order to do this, we are aware that we need to do a scan & update every item in the table with the new attribute. However, if the script we're running that adds this attribute dies midway, we will need to restart the script and filter out anything that doesn't have this new attribute (and only add the new attribute to the items missing it).
One thought that came up was that we could add a global secondary index onto the table with the primaryKey + the migrated flag so that we could just use that to identify what needs to get migrated faster.
However, for a one-time migration effort (that might be run a few times in the case of failures), I'm not sure if it's worth the cost of creating the index. The table has hundreds of millions of items in it, and it's hard for me to justify creating a huge index just to speed up the scan. Thoughts?
To use a GSI effectively you would ideally make it a sparse index that contains only unmigrated items. You would control this by setting an "unmigrated" attribute on every item and then removing it after migrating the item, but this will 4x your writes (you write to both the table and the index, once when you add the unmigrated flag and once when you remove it).
I recommend that in the script that scans the table, you periodically save the LastEvaluatedKey so you can resume where it left off if the script fails. To speed up the scan, you can perform a segmented scan in parallel.
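Here is a rough boto3 sketch of that resumable segmented scan; the table name "OldTable", the single "pk" key attribute, and the save_checkpoint() helper are placeholders you would swap for your own schema and checkpoint storage.

```python
import boto3

table = boto3.resource("dynamodb").Table("OldTable")

def save_checkpoint(segment, key):
    # Stand-in: persist the key somewhere durable (S3, a small table, a file).
    print(f"segment {segment} checkpoint: {key}")

def migrate_segment(segment, total_segments, start_key=None):
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key  # resume after a failure
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            if "migrated" in item:
                continue  # already handled on a previous run
            table.update_item(
                Key={"pk": item["pk"]},  # assumed key schema
                UpdateExpression="SET #m = :m",
                ExpressionAttributeNames={"#m": "migrated"},
                ExpressionAttributeValues={":m": True},
            )
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        save_checkpoint(segment, last_key)
        kwargs["ExclusiveStartKey"] = last_key
```

Running migrate_segment(i, N) for i in range(N) across N workers gives you the parallel scan, and each segment checkpoints independently.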

Retrieving Random Single Items in Dynamo

We are trying to get our heads wrapped around a design question, which is not really easy in any DB. We have 100,000 random items (could be a lot more) with truly random keys (we'll use UUIDs), and we want to hand them out one at a time. Order is not important. We are thinking that we'll create a Dynamo table of the items, then delete them out of that table as they are assigned. We can do a conditional delete to make sure that we have not already given the item away. But when trying to find an item in the first place, if we do a scan or a query with a limit of 1, will it always hit the same first available record? I'm wondering what the ramifications are. Dynamo will shard on the UUID. We are worried about everyone trying to hit the same record all the time. The first one would of course get deleted, then they could all hit the second one, etc.
We could set up a memcached/Redis instance in ElastiCache and keep a list of the available UUIDs in there. We could do a random select of items from this using Redis SPOP, which gets a random item and deletes it. We might have a problem where the two could get out of sync, but for the most part this would work.
Any thoughts on how to do this without the cache would be great. If Dynamo does scans starting at different points, that would be dandy.
I have the same situation as you: a set of millions of UUIDs as keys in DynamoDB, and I need to randomly select some of them in an API call. For performance and ease of implementation, I used Redis as you suggested:
Add the UUIDs to a set in Redis.
When the call comes in, SPOP a UUID from the set.
With that UUID, delete the item in DynamoDB.
The performance of the Scan operation is bad; avoid it as much as you can.
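For what it's worth, a minimal sketch of that flow with redis-py and boto3 looks roughly like the following; the Redis set name "available-items", the table name "Items", and the "uuid" key attribute are assumptions.

```python
import boto3
import redis

r = redis.Redis(host="localhost", port=6379)
table = boto3.resource("dynamodb").Table("Items")

def load_available_ids(uuids):
    # Seed Redis with every assignable UUID (e.g. from a one-time scan/export).
    r.sadd("available-items", *uuids)

def hand_out_one():
    item_id = r.spop("available-items")  # random member, removed atomically
    if item_id is None:
        return None                      # nothing left to hand out
    item_id = item_id.decode()
    # Conditional delete so the same item is never handed out twice even if
    # Redis and DynamoDB drift out of sync; raises ConditionalCheckFailedException
    # if the item was already deleted.
    table.delete_item(
        Key={"uuid": item_id},
        ConditionExpression="attribute_exists(#u)",
        ExpressionAttributeNames={"#u": "uuid"},  # "uuid" is a reserved word
    )
    return item_id
```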
