Getting actual billable data size in ADX - azure-data-explorer

Looking at the output of the following command, which column is representative of storage cost?
.show table <tablename> extents
So basically, of the following columns, which one is the size that's actually stored in storage?
OriginalSize
ExtentSize
CompressedSize
IndexSize

It's ExtentSize. The actual storage costs are expected to be somewhat higher, based on the recoverability setting (if it's enabled, the data is kept for 14 additional days) and on the extent metadata, which is stored on Azure Storage as well.
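If you want a single total rather than one row per extent, a minimal sketch (assuming the Python azure-kusto-data package, Azure CLI authentication, and that the .show output can be piped into a summarize; the cluster URL, database, and table names are placeholders) could sum ExtentSize across all extents:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URL; any supported authentication method works here.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

# Pipe the per-extent output into summarize to get table-level totals.
command = ".show table MyTable extents | summarize BillableBytes = sum(ExtentSize), OriginalBytes = sum(OriginalSize)"
response = client.execute_mgmt("MyDatabase", command)

for row in response.primary_results[0]:
    print(row["BillableBytes"], row["OriginalBytes"])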

Related

How to check or see size of cosmos DB document or is there any limit on loading of Azure DWH from cosmos DB

When I try to insert data from Cosmos DB into Azure DWH, it inserts fine for most of the databases, but for some it gives strange errors.
Later we found out that this is due to the size of the Cosmos DB documents.
For example, one of our Cosmos DBs is 75 GB in size.
If we try to insert all the data in the initial load, we get a null pointer error. But if we limit the rows, say to the first 3000, and then increment the record count by 3000 each time, it is able to insert, although it takes a significant amount of time.
Also, this is our ACC data; we are not sure about our PRD data. For some of the DBs we now need to set 50000 rows per load, and for some we have set 3000 (as in the example above).
So is loading the data iteratively the only solution, or is there another way?
Also, how can we determine the increment to load in each iteration for new DBs that get added?
P.S. I also tried increasing DWUs and IR cores to maximum but no luck.
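Purely to illustrate the iterative pattern described above, a rough sketch (assuming the azure-cosmos Python SDK; the account, database, and container names, the 3000-row page size, and the DWH load step are placeholders) might page through the source container like this:

from azure.cosmos import CosmosClient

# Placeholder connection details.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Read the source in fixed-size pages instead of one huge initial load.
pages = container.query_items(
    query="SELECT * FROM c",
    enable_cross_partition_query=True,
    max_item_count=3000,  # batch size, analogous to the 3000-row increments above
).by_page()

for page in pages:
    batch = list(page)
    # load_batch_into_dwh(batch)  # hypothetical DWH load step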

what is the maximum size of a document in Realtime database in firestore

I am working on a more complicated database where I want to store lots of data. The issue with Firestore is the 1 MB limit per document. I am splitting my data into different documents, but according to my calculations the size will still be bigger than the limit. I cannot find the limit for the Realtime Database, and I want to be sure before switching to it; a single document of mine could hit 6-9 MB in some cases when scaling up. At first I wanted to go with MongoDB, but I wanted to try the Google Cloud services. Any idea if the size limit is the same for both the Realtime Database and Firestore?
Documents are a Firestore concept (each has a 1 MB maximum size), while the Realtime Database is essentially one large JSON tree. You can find the limits of the Realtime Database in the documentation.
Maximum depth of child nodes: 32. Each path in your data tree must be less than 32 levels deep.
Length of a key: 768 bytes. Keys are UTF-8 encoded and can't contain newlines or any of the following characters: . $ # [ ] / or any ASCII control characters (0x00 - 0x1F and 0x7F).
Maximum size of a string: 10 MB. Data is UTF-8 encoded.
There isn't a limit on the number of child nodes you can have, but keep the maximum depth in mind. It might also be best if you could share a sample of what currently takes over 6 MB in Firestore, and maybe restructure the database.
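As a rough pre-write sanity check against the 1 MB document limit, a sketch like the following (the JSON serialization is only an approximation of Firestore's actual storage accounting; the payload is hypothetical) can flag documents that need splitting:

import json

FIRESTORE_DOC_LIMIT_BYTES = 1_048_576  # 1 MiB per-document limit

def rough_payload_size(data: dict) -> int:
    # Approximation only: serialized JSON size, not Firestore's exact storage-size formula.
    return len(json.dumps(data).encode("utf-8"))

payload = {"readings": list(range(100_000))}  # hypothetical oversized document
if rough_payload_size(payload) > FIRESTORE_DOC_LIMIT_BYTES:
    print("Payload likely exceeds the 1 MB Firestore document limit; split or restructure it.")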

How do you synchronize related collections in Cosmos Db?

My application needs to support lookups for invoices by invoice id and by customer. For that reason I created two collections in which I store the (exact) same invoice documents:
InvoicesById, with partition key /InvoiceId
InvoicesByCustomerId, with partition key /CustomerId
Apparently you should use partition keys when doing queries, and since there are two query patterns I need two collections. I guess there may be more in the future.
Updates are primarily done to the InvoicesById collection, but then I need to replicate the change to InvoicesByCustomer (and others) as well.
Are there any best practice or sane approaches how to keep collections in sync?
I'm thinking change feeds and whatnot. I want to avoid writing this sync code and risking inconsistencies due to missing transactions between collections (etc.). Or maybe I'm missing something crucial here.
Change feed will do the trick, though I would suggest taking a step back before brute-forcing the problem.
You can find a detailed article describing the split issue here: Azure Cosmos DB. Partitioning.
Based on the Microsoft recommendation, for maintainable data growth you should select the partition key with the highest cardinality (in your case I assume that will be InvoiceId), for this main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
You don't need to create a separate container with CustomerId as the partition key, as it won't give you the desired, and most importantly maintainable, performance in the future, and it might result in physical partition data skew when too many invoices are linked to the same customer.
To get optimal and scalable query performance you most probably need InvoiceId as the partition key and an indexing policy that covers CustomerId (and others in the future).
There will be a slight RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when the data you're querying is distributed across a number of physical partitions (PPs), but it will be negligible compared to the issues that occur when the data starts growing beyond 50, 100, 150 GB.
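As an illustration of that trade-off, with /InvoiceId as the partition key an invoice lookup is a cheap single-partition point read, while a customer lookup becomes a cross-partition query; a sketch with the azure-cosmos Python SDK (connection details, ids, and the container name are placeholders) might look like this:

from azure.cosmos import CosmosClient

# Placeholder connection details; the container uses /InvoiceId as its partition key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("Invoices")

# Point read by invoice id: single logical partition, lowest RU cost.
invoice = container.read_item(item="INV-1001", partition_key="INV-1001")

# Lookup by customer id: cross-partition query, a few extra RUs per request but still scalable.
invoices_for_customer = list(container.query_items(
    query="SELECT * FROM c WHERE c.CustomerId = @cid",
    parameters=[{"name": "@cid", "value": "CUST-42"}],
    enable_cross_partition_query=True,
))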
Why might CustomerId not be the best partition key for data sets that are expected to grow beyond 50 GB?
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs because the 50 GB size limit is exceeded, the maximum throughput for the existing PPs, as well as the two newly created PPs, will be lower than it was before the split.
So imagine the following scenario (consider each step to happen a day apart):
Day 1: You create a container with 10k RUs provisioned and CustomerId as the partition key (which generates one underlying physical partition, PP1). Maximum throughput per PP is 10k/1 = 10k RUs.
Day 2: Gradually adding data to the container, you end up with three big customers with C1 [10 GB], C2 [20 GB] and C3 [10 GB] of invoices.
Day 3: When another customer with C4 [15 GB] of data is onboarded to the system, Cosmos DB has to split the PP1 data into two newly created partitions, PP2 (30 GB) and PP3 (25 GB). Maximum throughput per PP is 10k/2 = 5k RUs.
Day 4: Two more customers, C5 [10 GB] and C6 [15 GB], are added to the system and both end up in PP2, which leads to another split into PP4 (20 GB) and PP5 (35 GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs.
IMPORTANT: As a result, on Day 2 the C1 data could be queried with up to 10k RUs, but on Day 4 with at most 3.333k RUs, which directly impacts the execution time of your query.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (as of 12.03.21).
What you are doing is a good solution. Different queries require different partition keys on different Cosmos DB containers holding the same data.
How to sync the two containers: use triggers on the first container.
https://devblogs.microsoft.com/premier-developer/synchronizing-azure-cosmos-db-collections-for-blazing-fast-queries/
Cassandra has a feature called materialized views for this exact problem, abstracting away the synchronization. Maybe some day the same feature will be included in Cosmos DB.
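Until then, a minimal sketch of the change feed approach (azure-cosmos Python SDK; connection details and container names are placeholders, and a production version would run continuously, e.g. as an Azure Function, and persist its continuation state) could look like this:

from azure.cosmos import CosmosClient

# Placeholder connection details.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("<database>")
source = database.get_container_client("InvoicesById")           # partition key /InvoiceId
replica = database.get_container_client("InvoicesByCustomerId")  # partition key /CustomerId

# Read changes from the source container and upsert each changed document into the replica.
for change in source.query_items_change_feed(is_start_from_beginning=True):
    replica.upsert_item(change)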

Monitor stored data size in Firestore

Firestore costs are based on document operations and on size of stored data.
In the Firebase console we can easily track the number of document operations, but I can't find anywhere to track the size of stored data.
The only thing I have found, in the Google Cloud Console (under App Engine > Quotas), is a metric corresponding to the gigabytes of data stored for the current day, but not the total amount of stored data.
Is there a way to monitor the total size of stored data (ideally with indexes included)?
It seems that the only available option at this moment is to calculate the storage size for Cloud Firestore in Native mode manually.
I have submitted a feature request asking to implement a solution that would display the size. I'd recommend you to star that request to be notified once there is an update in the thread.
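For the manual calculation, a rough sketch (google-cloud-firestore Python client; the byte counts below only approximate the storage-size rules in the Firestore documentation, the collection name is a placeholder, and index storage is ignored) might be:

from google.cloud import firestore

def rough_field_size(value) -> int:
    # Approximation of Firestore's documented storage-size rules; not exact.
    if isinstance(value, str):
        return len(value.encode("utf-8")) + 1
    if isinstance(value, bool) or value is None:
        return 1
    if isinstance(value, (int, float)):
        return 8
    if isinstance(value, dict):
        return sum(len(k.encode("utf-8")) + 1 + rough_field_size(v) for k, v in value.items())
    if isinstance(value, list):
        return sum(rough_field_size(v) for v in value)
    return 8  # fallback for timestamps, references, etc.

db = firestore.Client()
total = 0
for doc in db.collection("<collection>").stream():  # placeholder collection name
    total += rough_field_size(doc.to_dict()) + len(doc.id.encode("utf-8")) + 32  # rough per-document overhead
print(f"Approximate stored size: {total / 1024 / 1024:.1f} MB")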

How to find whether the size of a logical partition in a Cosmos container is reaching the 10GB limit and set an alert

I would like to know if there is a way to find out the number of partitions in a Cosmos container, and also if there is a way to find the size per partition. Based on this, I would like to set a threshold and be alerted when a partition is reaching the 10GB limit.
You can use Azure Monitor with Log Analytics and write a query to output partition key metrics such as size. Example query below.
Doc on Azure Monitor for Cosmos DB:
https://learn.microsoft.com/en-us/azure/cosmos-db/monitor-cosmos-db
Example Log Analytics query for this type of data
AzureDiagnostics
| where ResourceProvider=="MICROSOFT.DOCUMENTDB" and Category=="PartitionKeyStatistics"
| project SubscriptionId, regionName_s, databaseName_s, collectionname_s, partitionkey_s, sizeKb_s, ResourceId
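To turn that into an alertable threshold check, a hedged sketch (assuming the azure-monitor-query and azure-identity Python packages and the AzureDiagnostics schema shown above; the workspace id and the 8 GB warning threshold are placeholders) could be:

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Same query as above, extended with a size threshold below the 10 GB limit.
QUERY = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "PartitionKeyStatistics"
| extend sizeGb = todouble(sizeKb_s) / 1024 / 1024
| where sizeGb > 8.0
| project TimeGenerated, databaseName_s, collectionname_s, partitionkey_s, sizeGb
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))
for table in result.tables:
    for row in table.rows:
        print(row)  # partitions approaching the limit; wire this into your alerting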
Hope this helps.
