I'm running a simple ADF pipeline that copies data from Data Lake into Cosmos DB (SQL API).
With the database throughput set to Autopilot 4,000 RU/s, the run took ~11 min and I see 207 throttled requests. With the database throughput set to Autopilot 20,000 RU/s, the run took ~7 min and I see 744 throttled requests. Why is that? Thank you!
Change the indexing policy on the container from Consistent to None before the ADF copy activity runs, then change it back to Consistent when the copy is done.
Azure Cosmos DB supports two indexing modes:
Consistent: The index is updated synchronously as you create, update or delete items. This means that the consistency of your read queries will be the consistency configured for the account.
None: Indexing is disabled on the container. This is commonly used when a container is used as a pure key-value store without the need for secondary indexes. It can also be used to improve the performance of bulk operations. After the bulk operations are complete, the index mode can be set to Consistent and then monitored using the IndexTransformationProgress until complete.
How to modify the indexing policy:
Modifying the indexing policy
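For reference, here is a minimal sketch of switching the indexing mode with the azure-cosmos Python SDK; the account URL, database and container names, and partition key path are assumptions and must match your existing container:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("appdb")       # assumed database name
container = database.get_container_client("items")   # assumed container name
pk = PartitionKey(path="/pk")                         # must match the existing partition key path

# Before the bulk copy: turn indexing off (indexingMode "none" requires automatic=False)
database.replace_container(
    container,
    partition_key=pk,
    indexing_policy={"indexingMode": "none", "automatic": False},
)

# ... run the ADF copy activity ...

# After the copy: restore consistent indexing; Cosmos DB rebuilds the index in the background
database.replace_container(
    container,
    partition_key=pk,
    indexing_policy={"indexingMode": "consistent", "automatic": True},
)
```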
Related
I can't find a clear explanation of the difference between Cloudflare Workers KV and DynamoDB in AWS. Can anyone explain it in simpler terms?
Although there are some similarities (i.e. both DynamoDB and Workers KV are offered as managed services), I would say they are more different than they are alike.
Workers KV is always eventually consistent, whereas DynamoDB can offer strongly consistent reads for read-after-write operations.
DynamoDB has additional capabilities such as local and global secondary indexes allowing you to have different access patterns for the same underlying data.
Workers KV is heavily optimized for reads with infrequent writes, whereas DynamoDB doesn't have the same limitation (though DynamoDB also does better at reading data than writing in terms of throughput).
DynamoDB also has other features such as stream processing which allows you to do out of band processing in reaction to changes to the data stored in the database.
I'm not sure about the security model for Workers KV but DynamoDB allows you to configure strict access policies to the tables.
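To illustrate the consistency point above, here is a minimal boto3 sketch; the table name and key attribute are hypothetical. By default DynamoDB reads are eventually consistent, and ConsistentRead=True opts into strong consistency:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # hypothetical table with key attribute "order_id"

# Default read: eventually consistent (half the RCU cost, may not reflect a very recent write)
eventual = table.get_item(Key={"order_id": "o-123"})

# Strongly consistent read: reflects all successful writes made before the read
strong = table.get_item(Key={"order_id": "o-123"}, ConsistentRead=True)
```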
Our DynamoDB table is configured with on-demand capacity, but we're still seeing read/write throttling during high-traffic hours.
Any clues?
On-demand does not mean "unlimited throughput". There is a limit; according to the docs:
If you recently switched an existing table to on-demand capacity mode for the first time, or if you created a new table with on-demand capacity mode enabled, the table has the following previous peak settings, even though the table has not served traffic previously using on-demand capacity mode:
Newly created table with on-demand capacity mode: The previous peak is 2,000 write request units or 6,000 read request units. You can drive up to double the previous peak immediately, which enables newly created on-demand tables to serve up to 4,000 write request units or 12,000 read request units, or any linear combination of the two.
Existing table switched to on-demand capacity mode: The previous peak is half the previous write capacity units and read capacity units provisioned for the table or the settings for a newly created table with on-demand capacity mode, whichever is higher.
I've also found an interesting article with some experiments and numbers: Understanding the scaling behaviour of DynamoDB OnDemand tables.
You could try to analyze how the partition keys of your write requests are distributed across the individual writes. If too many write requests target the same partition key, they can overwhelm that partition and cause throttling.
Consider mixing up the partition keys within BatchWriteItem calls, and also over time, so you don't hit the same partitions too frequently; a rough sketch follows.
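A minimal boto3 sketch of that idea, with a hypothetical table name and item shape; batch_writer takes care of splitting the work into 25-item BatchWriteItem calls and retrying unprocessed items:

```python
import random
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Events")  # hypothetical table name

def write_items(items):
    # Shuffle so consecutive batches don't all land on the same partition key
    shuffled = random.sample(items, len(items))
    with table.batch_writer() as batch:
        for item in shuffled:
            batch.put_item(Item=item)
```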
Is there a way to automatically move expired documents to blob storage via change feed?
I've Googled but found no way to automatically move expired documents to Blob Storage via the change feed. Is it possible?
There is no built-in functionality for something like that, and the change feed is of no use in this case.
The change feed processor (which is what the Azure Functions trigger uses too) won't notify you about deleted documents, so you can't listen for them.
Your best bet is to write a custom application that archives documents on a schedule and then deletes the archived documents.
As the Cosmos DB TTL documentation states: when you configure TTL, the system automatically deletes the expired items based on the TTL value, unlike a delete operation that is explicitly issued by the client application.
So expiry is controlled by the Cosmos DB service, not the client side. You could follow and vote up this feedback item to push the feature forward.
To come back to this question, one approach I've found that works is to use the built-in TTL (let Cosmos DB expire the documents) together with a backup script that queries for documents nearing their TTL, with a safety window in case of latency; e.g., I keep the window at up to 24 hours.
The main reason is that issuing deletes yourself not only consumes RUs, but quite a lot of them. Even with slimmed-down indexes you can still see massive RU usage, whereas letting Cosmos DB expire the documents itself incurs no extra RU cost.
A pattern that helps is to have the backup script enable the container-level TTL when it starts (after an initial backup, so no data is lost the moment TTL turns on), and to wrap the work in a try/except/finally, with the finally block turning the TTL off again so it can't remove data if the script is down for longer than the safety window; a rough sketch of this pattern follows. I'm not yet sure of the performance hit on large containers when the index has to be updated, but logically the approach seems effective.
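Here is a rough sketch of that pattern with the azure-cosmos Python SDK. The database and container names, partition key path, TTL value, and the blob-upload step are all assumptions; in practice you would also re-send the container's existing indexing policy when calling replace_container so it isn't reset to defaults:

```python
import time
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("appdb")        # assumed database name
container = database.get_container_client("events")   # assumed container name
pk = PartitionKey(path="/pk")                          # must match the existing partition key path

CONTAINER_TTL = 30 * 24 * 60 * 60   # assumed container-level TTL: 30 days
SAFETY_WINDOW = 24 * 60 * 60        # back up anything expiring within the next 24 hours

def backup_expiring_items():
    # With a container-level TTL, an item expires at roughly _ts + CONTAINER_TTL seconds
    cutoff = int(time.time()) + SAFETY_WINDOW - CONTAINER_TTL
    items = container.query_items(
        query="SELECT * FROM c WHERE c._ts <= @cutoff",
        parameters=[{"name": "@cutoff", "value": cutoff}],
        enable_cross_partition_query=True,
    )
    for item in items:
        pass  # upload the item to Blob Storage here (not shown)

def run_with_ttl_enabled():
    backup_expiring_items()  # initial backup so nothing is lost the moment TTL turns on
    try:
        database.replace_container(container, partition_key=pk, default_ttl=CONTAINER_TTL)
        # ... scheduled backups keep running while TTL is active ...
    finally:
        # Turn TTL off again so nothing expires if this job is down for longer than the window
        # (omitting default_ttl disables TTL on the container)
        database.replace_container(container, partition_key=pk, default_ttl=None)
```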
Hi, I am writing a web application that connects to 700 databases and executes a basic SELECT query against each one.
For example:
There is a button to retrieve the manager of each branch.
There are 700 branches of a company, and each branch's details are stored in a separate database.
The SELECT query retrieves one record from each database and returns the manager of that branch.
So executing this code takes a long time.
I cannot make the user wait that long (about 30 minutes).
Due to memory constraints I cannot use multithreading.
Note: this web application uses Spring MVC on Tomcat 7.
Any workaround possible?
With that many databases to query, the only viable solution I can see is caching. If real time is not a concern (note that 30 minutes of execution pushes you out of real time anyway), you might explore the following options, all of which require centralizing the data into a single logical or physical database:
Clustering: put the database servers into a huge cluster that is configured for performance and therefore uses caching internally. Depending on licence costs, this solution might be impractical or simply too expensive.
Push data to a central database: all of the 700 database servers would push the data you need to a central database that your application will use. You can use database servers' replication features (such as in MSSQL or PostgreSQL) or scheduled data transfers. This method requires administrative access to the database servers to either configure replication or drop scripts to run on a scheduled basis.
Pull data into a central database: have a centralized host fetch the required data into a local database whose tables are updated through scheduled data transfers. This is the simplest method; its drawback is that real-time querying is impossible.
It is key to transfer only the data you need, so make your SELECT statements as narrow as possible to limit execution time.
The central database could live on your web application server or on a separate machine if your resource constraints are tight. I've found that PostgreSQL, with little effort, has excellent compatibility with MSSQL. Without further information it's difficult to be more specific; a rough sketch of the pull option follows.
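As an illustration of the pull option (not Spring-specific), here is a Python sketch of a scheduled job that copies one row per branch into a central PostgreSQL table. The connection strings, table names, and column names are assumptions:

```python
import psycopg2  # central reporting database (assuming PostgreSQL, as mentioned above)
import pyodbc    # branch databases (assuming MSSQL)

BRANCH_DSNS = []  # 700 branch connection strings, loaded from configuration

def refresh_branch_managers():
    central = psycopg2.connect("dbname=central user=report")  # assumed central DB
    cur = central.cursor()
    cur.execute("TRUNCATE branch_managers")
    for dsn in BRANCH_DSNS:
        branch = pyodbc.connect(dsn, timeout=5)
        try:
            row = branch.cursor().execute(
                "SELECT branch_id, manager_name FROM branch_info"  # assumed schema
            ).fetchone()
        finally:
            branch.close()
        if row:
            cur.execute(
                "INSERT INTO branch_managers (branch_id, manager_name) VALUES (%s, %s)",
                (row.branch_id, row.manager_name),
            )
    central.commit()
    central.close()
```

The web application can then answer the button click with a single SELECT against branch_managers instead of 700 remote queries.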
Background - will be using .NET 4.0, Azure SDK 1.7, Azure Table Storage
Problem
How do I most efficiently (i.e., with the fastest processing time) read N entities from a set of Azure tables, where N is large (thousands to millions), each entity is very small (<200 bytes), and I know the PartitionKey and RowKey of every entity up front, i.e. [(P1,R1),(P2,R2),...,(PN,RN)]?
What is the most efficient way to 'batch' process such a request? Naturally, underneath there will be a need to async/parallelize the fetches without the threads blocking on I/O or synchronization locks; ideally the CPU of the server making the calls to Azure Table storage should reach >80% utilization, as this processing should be CPU-bound rather than IO- or memory-bound.
Since you are asking for "fastest" processing time to read from Azure Storage, here are some general tips that made my performance improve (top ones are the most important):
Ensure the Azure Storage account was created after July 2012. This is the Gen2 of Azure Storage and it includes storage on SSD drives.
In your case, table storage has increased scalability targets for partitions for Gen2 of Azure Storage: http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
10 Gbps network vs 1 Gbps networks
Single partition can process 20,000 entities/second
Change the .NET default connection limit (I think this might be addressed in the new SDK, but I'm not sure): http://social.msdn.microsoft.com/Forums/en-US/windowsazuredata/thread/d84ba34b-b0e0-4961-a167-bbe7618beb83
You can "warm" Azure Storage, the more transactions it sees the more of the controller/drive cache it will use. This might be expensive to constantly hit your storage in this way
You can use MULTIPLE Azure Storage accounts. This can distribute your load very efficiently (sharding): http://tk.azurewebsites.net/2012/08/26/hacking-azure-for-more-disk-performance/
You have several ways to architect/design in Table Storage. You have the partition key and the row key, but you also have the table itself; remember this is NoSQL, so you can have 100 tables with the same structure serving different data. That can be a performance boost in itself, and you can also store these tables in different Azure Storage accounts. RowKey -> PartitionKey -> Table -> multiple storage accounts can all be thought of as "indexes" for faster access.
I don't know your data, but since you will be searching on PartitionKey (I assume), maybe instead of storing 1,000,000 really small records per PartitionKey you could store them in a zip file, fetch it quickly, unzip it, and then query it in parallel with LINQ once it's on the local server. Playing with caching will always help since you have a lot of small objects; you could probably fit entire partitions in memory. Another option might be to store a partition key with column data that is binary or comma-separated, etc.
You say you are on the Azure 1.7 SDK... I had problems with it and with the StorageClient 2.0 library, so I used the 1.8 SDK with the StorageClient 2.0 library. Something of note (not necessarily performance-related), since they may have improved the efficiency of the libraries over the last 2+ years.
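The question targets the old .NET SDK, but purely to illustrate the parallel point-read pattern, here is a sketch with the current azure-data-tables Python SDK; the connection string, table name, key list, and concurrency cap are assumptions:

```python
import asyncio
from azure.data.tables.aio import TableClient

CONNECTION_STRING = "<storage account connection string>"  # assumption
KEYS = [("P1", "R1"), ("P2", "R2")]                         # known (PartitionKey, RowKey) pairs

async def fetch_all(keys, concurrency=100):
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests to avoid socket exhaustion
    async with TableClient.from_connection_string(CONNECTION_STRING, table_name="mytable") as table:
        async def point_read(pk, rk):
            async with sem:
                # Point reads by (PartitionKey, RowKey) are the cheapest Table storage operation
                return await table.get_entity(partition_key=pk, row_key=rk)
        return await asyncio.gather(*(point_read(pk, rk) for pk, rk in keys))

entities = asyncio.run(fetch_all(KEYS))
```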