Cosmos DB: Container.DeleteItemAsync<T>(id, new PartitionKey(partitionKey)) (and UpsertItemAsync) has a large RequestCharge

For adding documents to Cosmos DB I normally use the
Container.UpsertItemAsync(doc, new PartitionKey(partitionKey)) method.
Replacing a document takes twice the request charge of inserting it.
I have re-written the method to use:
Container.DeleteItemAsync<T>(doc.Id, new PartitionKey(partitionKey));
Container.CreateItemAsync<T>(doc, new PartitionKey(partitionKey));
Inserting my huge document costs 626 RU.
Updating the same document costs 626 RU for the delete plus 626 RU for the create.
Switching it to
Container.ReplaceItemAsync(doc, doc.Id, new PartitionKey(partitionKey));
costs 1250 RUs, which is in line with the delete/create approach.
But why does a delete (or update) cost that many RUs, and how can I reduce it?

You can reduce it by limiting the number of properties that are indexed. The work needed to keep your index up to date is reflected in the RUs that are used, which is why a delete seems so expensive. Excluding the properties that are never used to filter or order query results can therefore significantly reduce the RUs used.
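If most of a document's properties are never filtered or sorted on, you can exclude them from indexing when the container is created. A minimal C# sketch, assuming hypothetical names (a database variable, a "docs" container partitioned on /userId, and /category as the only other queried path):
// Hypothetical policy: index only /userId and /category, exclude everything else.
ContainerProperties properties = new ContainerProperties(id: "docs", partitionKeyPath: "/userId");
properties.IndexingPolicy.IncludedPaths.Clear();
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/userId/?" });
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/category/?" });
properties.IndexingPolicy.ExcludedPaths.Clear();
properties.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });
// For an existing container, the same policy can be applied with container.ReplaceContainerAsync(properties).
ContainerResponse created = await database.CreateContainerIfNotExistsAsync(properties);
Fewer indexed paths means less index maintenance on every create, replace and delete, which is where much of that request charge goes.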
You can also check whether it's feasible to replace ReplaceItemAsync with the PatchItemAsync method. Especially if you only need to update a few properties, it can significantly reduce the RUs required to make a small update.
var patch = new List<PatchOperation>()
{
    PatchOperation.Add("/example", "It works :)!"),
};
var response = await container.PatchItemAsync<MyItem>(myId, myPk, patch);

Related

DynamoDBMapper how to get all items without pagination

I have about 780K items stored in DynamoDB.
I'm calling the DynamoDBMapper.query(...) method to get all of them.
The result is correct, because I do get all of the items, but it takes about 3 minutes.
From the log, I see that DynamoDBMapper.query(...) fetches the items page by page; each page is an individual query call to DDB that takes about 0.7 s.
All items came back in 292 pages, so the total duration is about 0.7 * 292 ≈ 200 s, which is unacceptable.
My code is basically like below:
// set up the query condition; after filtering, the item count is about 780K
DynamoDBQueryExpression<VendorAsinItem> expression = buildFilterExpression(filters);
List<VendorAsinItem> results = new ArrayList<>();
try {
    log.info("yrena:Start query");
    DynamoDBMapperConfig config = getTableNameConfig();
    results = getDynamoDBMapper().query( // get DynamoDBMapper instance and call query method
            VendorAsinItem.class,
            expression,
            config);
} catch (Exception e) {
    log.error("yrena:Error ", e);
}
log.info("yrena:End query. Size:" + results.size());
So how can I get all items at once, without pagination?
My final goal is to reduce the query duration.
EDIT: Just re-read the title of the question and realized that perhaps I didn't address it head on: there is no way to retrieve 780,000 items without some pagination, because of the hard limit of 1 MB per page.
Long form answer
780,000 items retrieved, in 3 minutes, using 292 pages: that's about 1.62 pages per second.
Take a moment and let that sink in...
Dynamo can return 1 MB of data per page, so you're presumably transferring roughly 1.6 MB of data per second (enough to saturate a 10 Mbit pipe).
Without further details about (a) the actual size of the items retrieved, (b) the bandwidth of your internet connection, (c) the number of items that get filtered out of the query results and (d) the provisioned read capacity on the table, I would start looking at:
what the network bandwidth is between your client and Dynamo/AWS -- if you are not maxing that out, move on to the next item;
how much read capacity is provisioned on the table (if you see any throttling on the requests, you may be able to increase the RCUs on the table to get a speed improvement at a monetary expense);
the efficiency of your query:
if you are applying filters, know that those are applied after the query results are read, so the query consumes RCUs for items that get filtered out, which also makes the query inefficient;
think about whether there are ways you can optimize your queries to access less data.
Finally, 780,000 items is A LOT for a query -- what percentage of the items in the database is that?
Could you create a secondary index that would essentially contain most, or all, of that data, which you could then simply scan instead of querying?
Unlike a query, a scan can be parallelized, so if your network bandwidth, memory and local compute are large enough, and you're willing to provision enough capacity on the database, you could read 780,000 items significantly faster than with a query.
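If you do go the scan route, the Java SDK exposes this directly through DynamoDBMapper.parallelScan. A rough sketch, where the segment count of 8 is just an assumption to tune against your client's resources and the table's provisioned throughput:
int totalSegments = 8; // assumption: tune to your bandwidth, memory and provisioned capacity
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
// optionally mirror the original filter with scanExpression.withFilterExpression(...)
PaginatedParallelScanList<VendorAsinItem> items =
        getDynamoDBMapper().parallelScan(VendorAsinItem.class, scanExpression, totalSegments);
for (VendorAsinItem item : items) {
    // the list pages lazily across all segments; iterate (or call loadAllResults()) to pull everything
}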

How does Cosmos DB Continuation Token work?

At first sight, it's clear what the continuation token does in Cosmos DB: attaching it to the next query gives you the next set of results. But what does "next set of results" mean exactly?
Does it mean:
1. the next set of results as if the original query had been executed completely, without paging, at the time of the very first query (skipping the appropriate number of documents)?
2. the next set of results as if the original query had been executed now (skipping the appropriate number of documents)?
3. something completely different?
Option 1 would seem preferable, but it seems unlikely given that the server would need to store unlimited amounts of state. Option 2 is also problematic, as it may result in inconsistencies: for example, the same document may be served multiple times across pages if the underlying data has changed between the page queries.
Cosmos DB query executions are stateless at the server side. The continuation token is used to recreate the state of the index and track progress of the execution.
"Next set of results" means, the query is executed again on from a "bookmark" from the previous execution. This bookmark is provided by the continuation token.
Documents created during continuations
They may or may not be returned depending on the position of insert and query being executed.
Example:
SELECT * FROM c ORDER BY c.someValue ASC
Let us assume the bookmark was at someValue = 10; given the continuation token, the query engine resumes processing from that point.
If you were to insert a new document with someValue = 5 in between query executions, it will not show up in the next set of results.
If the new document is inserted in a "page" that lies beyond the bookmark, it will show up in the next set of results.
Documents updated during continuations
The same logic as above applies to updates as well (see "Chances of duplicates" below).
Documents deleted during continuations
They will not show up in the next set of results.
Chances of duplicates
In case of the below query,
SELECT * FROM c ORDER BY c.remainingInventory ASC
If the remainingInventory was updated after the first set of results and it now satisfies the ORDER BY criteria for the second page, the document will show up again.
Cosmos DB doesn’t provide snapshot isolation across query pages.
However, as per the product team this is an incredibly uncommon scenario because queries over continuations are very quick and in most cases all query results are returned on the first page.
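To make the mechanics concrete, here is a minimal C# sketch (SDK v3, illustrative names) of reading one page, keeping its continuation token, and later resuming from that bookmark:
QueryDefinition query = new QueryDefinition("SELECT * FROM c ORDER BY c.someValue ASC");
QueryRequestOptions options = new QueryRequestOptions { MaxItemCount = 10 };

FeedIterator<dynamic> feed = container.GetItemQueryIterator<dynamic>(query, requestOptions: options);
FeedResponse<dynamic> firstPage = await feed.ReadNextAsync();
string bookmark = firstPage.ContinuationToken;   // the "bookmark" described above

// later, possibly from a different process: resume from the bookmark
FeedIterator<dynamic> resumed = container.GetItemQueryIterator<dynamic>(
    query, continuationToken: bookmark, requestOptions: options);
while (resumed.HasMoreResults)
{
    FeedResponse<dynamic> page = await resumed.ReadNextAsync();
    // only documents positioned after the bookmark at execution time appear here
}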
Based on preliminary experiments, the answer seems to be option #2, or more precisely:
Documents created after serving the first page are observable on subsequent pages
Documents updated after serving the first page are observable on subsequent pages
Documents deleted after serving the first page are omitted on subsequent pages
Documents are never served twice
The first statement above contradicts information from MSFT (cf. Kalyan's answer). It would be great to get a more qualified answer from the Cosmos DB Team specifying precisely the semantics of retrieving pages. This may not be very important for displaying data in the UI, but may be essential for data processing in the backend, given that there doesn't seem to be any way of disabling paging when performing a query (cf. Are transactional queries possible in Cosmos DB?).
Experimental method
I used Sacha Bruttin's Cosmos DB Explorer to query a collection with 5 documents, because this tool allows playing around with the page size and other request options.
The page size was set to 1, and Cross Partition Queries were enabled. Different queries were tried, e.g. SELECT * FROM c or SELECT * FROM c ORDER BY c.name.
After retrieving page 1, new documents were inserted, and some existing documents (including documents that should appear on subsequent pages) were updated and deleted. Then all subsequent pages were retrieved in sequence.
(A quick look at the source code of the tool confirmed that ResponseContinuationTokenLimitInKb is not set.)

Maximum number of fields for a Firestore document?

Right now I have a products collection where I store my products as documents like the following:
documentID:
title: STRING,
price: NUMBER,
images: ARRAY OF OBJECTS,
userImages: ARRAY OF OBJECTS,
thumbnail: STRING,
category: STRING
NOTE: My web app has approximately 1000 products.
I'm thinking about doing full-text search on the client side while also saving on database reads, so I'm considering duplicating my data on Firestore and saving a partial copy of all of my products into a single document that I can send to the client and use for client-side full-text search.
I would create the allProducts collection, with a single document with 1000 fields. Is this possible?
allProducts: collection
Contains a single document with the following fields:
Every field would contain a MAP (object) with product details.
document_1_ID: { // Same ID as the 'products' collection
title: STRING,
price: NUMBER,
category: STRING,
thumbnail
},
document_2_ID: {
title: STRING,
price: NUMBER,
category: STRING,
thumbnail
},
// AND SO ON...
NOTE: I would still keep the products collection intact.
QUESTION
Is it possible to have a single document with 1000 fields? What is the limit?
I'm looking into this because, since I'm performing client-side full-text search, every user will need access to my whole database of products. And I don't want every user to read every single document that I have, because I imagine the costs of that would not scale very well.
NOTE 2: I know that the maximum size for a document is 1 MB.
According to this document, in addition to the 1MB limit per document, there is a limit of index entries per document, which is 40,000. Because each field appears in 2 indexes (ascending and descending), the maximum number of fields is 20,000.
I made a Node.js program to test it and I can confirm that I can create 20,000 fields but I cannot create 20,001.
If you try to set more than 20,000 fields, you will get the exception:
INVALID_ARGUMENT: too many index entries for entity
// Setting 20001 here throws "INVALID_ARGUMENT: too many index entries for entity"
const indexPage = Array.from(Array(20000).keys()).reduce((acc, cur) => {
  acc[`index-${cur}`] = cur;
  return acc;
}, {});
await db.doc(`test/doc`).set(indexPage);
I would create the allProducts collection, with a single document with 1000 fields. Is this possible?
There isn't quite a fixed limitation for that. However, the documentation recommends having fewer than 100 fields per document:
Limit the number of fields per document: 100
So the problem isn't the fact that you duplicate data; the problem is that documents have other limitations you should care about. You're also limited in how much data you can put into a document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When it comes to storing text, that is quite a lot, but as your documents get bigger, be careful about this limitation.
If you are storing a large amount of data in your documents and those documents should be updated by lots of admins, there is another limitation you need to take care of: you are limited to 1 write per second on every document. So if you have a situation in which many admins try to write/update products in that same document all at once, you might start to see some of these writes fail. So be careful about this limitation too.
And the last limitation is on index entries per document. If you decide to go past the first limitation, please note that the maximum is 40,000. Because each field has two associated indexes (ascending and descending), the maximum number of fields is 20,000.
Is it possible to have a single document with 1000 fields?
It is possible, up to the 40,000 index-entry limit (roughly 20,000 fields in practice), but in your case with no benefit. I say that because every time you perform a query (get the document), only a single document will be returned. So there is no way to implement a search algorithm over a single document and expect to get Product objects in return.
And I don't want every user to read every single document that I have, because I imagine that the costs of that would not scale very well.
Downloading an entire collection to search for fields client-side isn't practical at all and is also very costly. That's the reason why the official documentation recommends a third-party search service like Algolia.
For Android, please see my answer in the following post:
Is it possible to use Algolia query in FirestoreRecyclerOptions?
Firebase has a limit of 20k fields per document.
https://www.youtube.com/watch?v=o7d5Zeic63s
According to the documentation, there is no stated limit placed on the number of fields in a document. However, a document can only have up to 40,000 index entries, which will grow as documents contain more fields that are indexed by default.

How can I query for all new and updated documents since last query?

I need to query a collection and return all documents that are new or updated since the last query. The collection is partitioned by userId. I am looking for a value that I can use (or create and use) that would help facilitate this query. I considered using _ts:
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value]
The problem with _ts is that it is not granular enough and the query could miss updates made in the same second by another client.
In SQL Server I could accomplish this using an IDENTITY column in another table. Let's call that table version. In a transaction I would create a new row in the version table, then do the updates to the other table (including updating its version column with the new value). To query for new and updated rows I would use a query like this:
SELECT * FROM table WHERE userId=[some-user-id] and version > [some-value]
How could I do something like this in Cosmos DB? The Change Feed seems like the right option, but without the ability to query the Change Feed, I'm not sure how I would go about this.
In case it matters, the (web/mobile) clients connect to data in Cosmos DB via a web api. I have control of the entire stack - from client to back-end.
Per the statements in this link:
Today, you see all operations in the change feed. The functionality where you can control the change feed for specific operations, such as updates only and not inserts, is not yet available. You can add a "soft marker" on the item for updates and filter based on that when processing items in the change feed. Currently the change feed doesn't log deletes. Similar to the previous example, you can add a soft marker on the items that are being deleted; for example, you can add an attribute in the item called "deleted", set it to "true", and set a TTL on the item so that it is automatically deleted. You can read the change feed for historic items, for example, items that were added five years ago. If the item is not deleted, you can read the change feed as far back as the origin of your container.
So the change feed alone does not cover your requirements.
My idea:
Use an Azure Function with the Cosmos DB trigger to collect all the operations on your specific Cosmos collection. Follow this document to configure the function's Cosmos DB input, then follow this document to configure the output as Azure Queue Storage.
Get the ids of the changed items and send them to queue storage as messages. When you want to query the changed items, just read the messages from the queue at whatever interval suits you, and clear the queue once they are consumed. No items will be missed.
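A hedged C# sketch of that idea, using the Cosmos DB trigger and a queue output binding (database, container, connection and queue names are all placeholders):
[FunctionName("TrackChangedItems")]
public static void Run(
    [CosmosDBTrigger(
        databaseName: "mydb",                       // placeholder
        collectionName: "items",                    // placeholder
        ConnectionStringSetting = "CosmosConnection",
        LeaseCollectionName = "leases",
        CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> changes,
    [Queue("changed-item-ids")] ICollector<string> queueOutput,
    ILogger log)
{
    // every insert/update in the monitored container arrives here via the change feed
    foreach (Document doc in changes)
    {
        queueOutput.Add(doc.Id);
    }
    log.LogInformation($"Queued {changes.Count} changed item id(s)");
}
The consumer then drains the queue on its own schedule to learn which ids have changed since it last looked.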
With your approach, you can get the added/updated documents and save a reference value (the _ts and id fields) somewhere (like a blob):
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value] and id !='guid' order by _ts desc
This is similar to the approach we use to read data from Event Hubs and store checkpointing information (epoch number, sequence number and offset value) in a blob; at any given time only one function can take a lease on that blob.
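A rough C# sketch of that polling pattern (userId, lastTs, lastId and where you persist the checkpoint are all up to you; here they are just local variables):
// lastTs / lastId come from wherever the previous checkpoint was persisted (e.g. a blob)
QueryDefinition query = new QueryDefinition(
    "SELECT * FROM c WHERE c.userId = @userId AND c._ts > @ts AND c.id != @lastId ORDER BY c._ts ASC")
    .WithParameter("@userId", userId)
    .WithParameter("@ts", lastTs)
    .WithParameter("@lastId", lastId);

FeedIterator<dynamic> feed = container.GetItemQueryIterator<dynamic>(query);
while (feed.HasMoreResults)
{
    foreach (var doc in await feed.ReadNextAsync())
    {
        // ... hand the new/updated document to the client ...
        lastTs = (long)doc._ts;     // ascending order, so the last row holds the newest checkpoint
        lastId = (string)doc.id;
    }
}
// persist lastTs / lastId again for the next poll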
If you go with the change feed, you can create a listener (a Function or a job) that listens for all adds/updates on the collection and stores those values in another collection; while saving the data you can add an identity/version field to every document. This approach may increase your Cosmos DB bill.
This is what the consistency levels are for: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Choose strong consistency and your queries will always return the latest write.
Strong: Strong consistency offers a linearizability guarantee. The reads are guaranteed to return the most recent committed version of an item. A client never sees an uncommitted or partial write. Users are always guaranteed to read the latest committed write.
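For reference, opting into strong consistency on the client looks like the sketch below (endpoint and key are placeholders). Note that this only has an effect if the Cosmos account itself is configured for Strong, since a client can relax, but never strengthen, the account's default consistency level.
CosmosClient client = new CosmosClient(
    accountEndpoint: endpoint,   // placeholder
    authKeyOrResourceToken: key, // placeholder
    clientOptions: new CosmosClientOptions { ConsistencyLevel = ConsistencyLevel.Strong });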

Firebase Realtime DB: not able to filter data with DB-only rules

Firebase Realtime Database.
I am trying to limit the number of items returned from a query only by changing the DB rules on Firebase.
Is this possible? I don't want to change the app-side code.
What is the rule if I have to fetch the top 100 using limitToFirst?
Firebase's server-side security rules merely determine whether a certain operation is allowed. They don't filter data by themselves.
If you want to retrieve the first 100 items, put a limitToFirst(100) in your query.
If you only ever want the first 100 items to be retrieved (as in: want other read operations to be rejected), have a look at the documentation on securing queries, which contains this example:
You can also use query-based rules to limit how much data a client downloads through read operations.
For example, the following rule limits read access to only the first 1000 results of a query, as ordered by priority:
messages: {
  ".read": "query.orderByKey &&
            query.limitToFirst <= 1000"
}
Example queries:
db.ref("messages").on("value", cb) // Would fail with PermissionDenied
db.ref("messages").limitToFirst(1000)
.on("value", cb) // Would succeed (default order by key)
