How to check the size of a Cosmos DB document, and is there any limit on loading Azure DWH from Cosmos DB? - azure-cosmosdb

When I try to insert data from Cosmos DB into Azure DWH, it works well for most of the databases, but for some it gives strange issues.
We later found out that this is due to the size of the Cosmos DB documents.
For example, one of our Cosmos DBs is about 75 GB in size.
If we try to insert all the data in the initial load, we get a Null Pointer error. But if we limit the rows, say to the first 3000, and then increment the record count by 3000 on each pass, it is able to insert, although it takes a significant amount of time.
Also, this is our ACC data; we are not sure about our PRD data. For some of the DBs we need to set the batch to 50,000 rows per load, and for some we have set it to 3000 (as in the example above).
So is loading the data iteratively the only solution, or is there another way?
Also, how can we determine the increment value to load in each iteration for new DBs as they are added?
P.S. I also tried increasing the DWUs and IR cores to the maximum, but no luck.
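As a rough illustration of the paged approach described above, here is a minimal sketch using the Python azure-cosmos SDK; the account URL, key, database/container names, page size, and the 1.5 MB warning threshold are placeholder assumptions, not values from the question.

import json
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

PAGE_SIZE = 3000  # tune per database, as described above

pages = container.query_items(
    query="SELECT * FROM c",
    enable_cross_partition_query=True,
    max_item_count=PAGE_SIZE,   # the server returns at most this many items per page
).by_page()                     # iterate page by page instead of loading everything at once

for page in pages:
    batch = list(page)
    # Approximate the stored size of each document as UTF-8 JSON; Cosmos DB caps a
    # single document at 2 MB, so anything near that is worth flagging.
    oversized = [d["id"] for d in batch
                 if len(json.dumps(d).encode("utf-8")) > 1_500_000]
    if oversized:
        print("Documents approaching the 2 MB limit:", oversized)
    # ...stage `batch` into the DWH here, then move on to the next page...

A page size in this range mirrors the 3000-row increments mentioned above; raising max_item_count until a load starts failing again is one pragmatic way to find the increment for a new database.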

Related

Increase azure data explorer query limit inside stored procedure

In an Azure Data Explorer database I have defined a procedure which should move aggregated data from one table into a table in another database.
While running the procedure, the following limit is hit (E_RUNAWAY_QUERY):
Aggregation over string column exceeded the memory budget of 8GB during evaluation
Currently I cannot decrease the size of the data being transferred, so I wanted to work around this by increasing the limit with set truncationmaxsize=..., but when I try to include this statement in the stored procedure, the procedure fails during altering.
Is it possible to use limit-increasing commands (like set truncationmaxsize=1048576) inside stored procedures, and if so, how?
You can override those default memory consumption limits, as described under Limit on memory per iterator and Limit on memory per node.
The best approach would be to split the result into multiple ingestions using .set, .append, or .set-or-append; you can refer to Query best practices and Ingest from query.
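A minimal sketch of that splitting approach, assuming the data can be partitioned by a time window; it uses the Python azure-kusto-data client, and the cluster URL, database, table names, and the summarize query are hypothetical placeholders rather than anything taken from the question.

from datetime import datetime, timedelta
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://<cluster>.kusto.windows.net")   # placeholder cluster URL
client = KustoClient(kcsb)

start = datetime(2023, 1, 1)   # placeholder range to split over
window = timedelta(days=1)

# One ingestion per day, so no single aggregation has to hold the whole result.
# .set-or-append creates TargetTable on the first call and appends on later ones.
for day in range(30):
    lo, hi = start + day * window, start + (day + 1) * window
    command = (
        ".set-or-append TargetTable <| SourceTable"
        f" | where Timestamp >= datetime({lo:%Y-%m-%d}) and Timestamp < datetime({hi:%Y-%m-%d})"
        " | summarize Count = count() by Key"
    )
    client.execute_mgmt("<database>", command)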

Cosmos DB - Slow COUNT

I am working on an existing Cosmos DB where the number of physical partitions is less than 100. Each contains around 30,000,000 documents. There is an indexing policy in place on "/*".
I'm just trying to get a total count from SQL API like so:
SELECT VALUE COUNT(1) FROM mycollection c
I have set EnableCrossPartitionQuery to true, and MaxDegreeOfParallelism to 100 (so as to at least cover the number of physical partitions AKA key ranges). The database is scaled to 50,000 RU. The query is running for HOURS. This does not make sense to me. An equivalent relational database would answer this question almost immediately. This is ridiculous.
What, if anything, can I change here? Am I doing something wrong?
Microsoft support ended up applying an update to the underlying instance. In this case, the update was in the development pipeline to be rolled out gradually. This instance got it earlier as a result of the support case. The update related to using indexes to service this type of query.

Delete a large number of vertices from Cosmos DB using Gremlin queries

I have around 40,000 vertices with the label Test.
I am trying to delete all the vertices, but I always get a "query too large" exception.
I tried deleting them through the Azure Cosmos DB Data Explorer using the following query:
g.V().hasLabel('Test').drop()
This deletes around 200 vertices, but that's not enough for me.
I also tried deleting them through code:
await gremlinClient.SubmitAsync<dynamic>("g.V().hasLabel('Test').drop()");
The code simply does not work and I get the same exception without deleting any vertices.
How can I delete a large number of vertices efficiently?
The error indicates that you have low throughput (RU/s); when the query exceeds that limit, it gives the "too large" exception.
One way you can remove the data is by applying a limit on the drop:
g.V().hasLabel('Test').limit(2000).drop()
Adjust the limit according to your throughput so that the query can execute without throwing the exception.
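A rough sketch of that batched approach as a loop in Python with gremlinpython (the serializer must be GraphSON 2.0 for the Cosmos DB Gremlin API); the endpoint, key, and names are placeholders, and the batch size of 2000 comes from the answer above.

from gremlin_python.driver import client, serializer

g_client = client.Client(
    "wss://<account>.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/<database>/colls/<graph>",
    password="<key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

BATCH = 2000  # lower this if the "query too large" / rate-limit errors persist

while True:
    remaining = g_client.submit("g.V().hasLabel('Test').count()").all().result()[0]
    if remaining == 0:
        break
    # Drop at most BATCH vertices per request so a single traversal stays within limits.
    g_client.submit(f"g.V().hasLabel('Test').limit({BATCH}).drop()").all().result()

g_client.close()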

Mongolite - distinct too big, 16mb cap

I'm trying to query my database with ratingsChoices = m$distinct({'answers'}), but I get Warning: Error in : distinct too big, 16mb cap.
Is there a way around this error in mongolite? I've seen similar problems with PyMongo, etc.
Is there a way around this error in mongolite?
The problem here is that the distinct database command is invoked when you call m$distinct. See MongoDB Database commands for more information.
The distinct command returns a single document, and the maximum BSON document size is 16 megabytes. So if you have lots of distinct values and/or large fields, such that the server's response would exceed the 16 MB maximum, you'll get the above error message.
An alternative is to use the MongoDB Aggregation Pipeline instead of the distinct command, which fortunately mongolite has support for: mongolite aggregate.
Aggregation pipeline results are returned via a cursor, which can be iterated over. This means you can fetch more results than the 16 MB maximum limit.
For example (using MongoDB v3.6 and mongolite v2017-12-21):
uniqueName <- m$aggregate('[{"$group":{"_id":"$answers"}}]')
print(uniqueName)
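Since the question mentions seeing the same problem with PyMongo, here is the equivalent cursor-based aggregation as a minimal pymongo sketch; the connection string, database, and collection names are placeholders.

from pymongo import MongoClient

coll = MongoClient("mongodb://<host>:27017")["<database>"]["<collection>"]

# $group emits one small document per distinct value, and the results come back
# through a cursor, so no single 16 MB response document has to be built.
cursor = coll.aggregate([{"$group": {"_id": "$answers"}}], allowDiskUse=True)
distinct_answers = [doc["_id"] for doc in cursor]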

How can I improve performance while altering a large mysql table?

I have 600 million records in a table and I am not able to add a column to this table, because every time I try, the operation times out.
Suppose in your MySQL database you have a giant table with 600 million rows. Any schema operation on it, such as adding a unique key, altering a column, or even adding one more column, is a very cumbersome process which can take hours and sometimes ends in a server timeout. To overcome that, you have to come up with a good migration plan; I am jotting one down below.
1) Suppose there is a table Orig_X to which I have to add a new column colNew with a default value of 0.
2) A dummy table Dummy_X is created, which is a replica of Orig_X except for the new column colNew.
3) Data is inserted from Orig_X into Dummy_X with the following settings (a scripted sketch of steps 3-6 follows this list).
4) Autocommit is set to zero, so that data is not committed after each insert statement, which would hinder performance.
5) Binary logging is set to zero, so that no data is written to the binary logs during the copy.
6) After the data has been inserted, both features are set back to one.
SET AUTOCOMMIT = 0;
SET sql_log_bin = 0;
INSERT INTO Dummy_X (col1, col2, col3, colNew)
SELECT col1, col2, col3, 0 FROM Orig_X;
SET sql_log_bin = 1;
SET AUTOCOMMIT = 1;
7) Now the primary key can be created, with the newly added column as part of it.
8) All the unique keys can now be created.
9) We can check the status of the server by issuing the following command
SHOW MASTER STATUS
10) It’s also helpful to issue FLUSH LOGS so MySQL will clear the old logs.
11) To boost performance for repeated queries of a similar type, such as the insert statement above, the query cache should be enabled:
SHOW VARIABLES LIKE 'have_query_cache';
query_cache_type = 1
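As referenced in step 3, here is a scripted sketch of steps 3-6 using mysql-connector-python; the connection details are placeholders, and the table and column names mirror the Orig_X/Dummy_X example above.

import mysql.connector

conn = mysql.connector.connect(host="<host>", user="<user>",
                               password="<password>", database="<database>")
cur = conn.cursor()

# Steps 4-5: disable autocommit and binary logging for the duration of the copy.
cur.execute("SET AUTOCOMMIT = 0")
cur.execute("SET sql_log_bin = 0")

# Step 3: copy the data, supplying the default for the new column explicitly.
cur.execute("""
    INSERT INTO Dummy_X (col1, col2, col3, colNew)
    SELECT col1, col2, col3, 0 FROM Orig_X
""")
conn.commit()

# Step 6: turn both settings back on once the copy has been committed.
cur.execute("SET sql_log_bin = 1")
cur.execute("SET AUTOCOMMIT = 1")
cur.close()
conn.close()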
Those were the steps of the migration strategy for a large table; below I am writing some steps to improve the performance of the database and its queries.
1) Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes, as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.
2) If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them after all the data is loaded, InnoDB is able to apply a pre-sort and bulk-load process, which is both faster and typically results in more compact indexes.
3) More memory can actually help with performance optimization. If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more memory (assuming you have sized innodb_buffer_pool_size correctly for your server).
4) Normally your database table gets re-indexed after every insert. That's some heavy lifting for your database, but when your queries are wrapped inside a transaction, the table does not get re-indexed until the entire bulk is processed, saving a lot of work.
5) Most MySQL servers have query caching enabled. It's one of the most effective methods of improving performance that is quietly handled by the database engine. When the same query is executed multiple times, the result is fetched from the cache, which is quite fast.
6) Using the EXPLAIN keyword can give you insight on what MySQL is doing to execute your query. This can help you spot the bottlenecks and other problems with your query or table structures. The results of an EXPLAIN query will show you which indexes are being utilized, how the table is being scanned and sorted etc...
7) If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
8) In every table have an id column that is the PRIMARY KEY, AUTO_INCREMENT and one of the flavors of INT. Also preferably UNSIGNED, since the value cannot be negative.
9) Even if you have a users table with a unique username field, do not make that your primary key. VARCHAR fields as primary keys are slower, and you will have a better structure in your code by referring to all users by their ids internally.
10) Normally when you perform a query from a script, it waits for the execution of that query to finish before it can continue. You can change that by using unbuffered queries. This saves a considerable amount of memory with SQL queries that produce large result sets, and you can start working on the result set immediately after the first row has been retrieved, as you don't have to wait until the complete SQL query has been performed (a minimal sketch follows this list).
11) With database engines, disk is perhaps the most significant bottleneck. Keeping things smaller and more compact is usually helpful in terms of performance, to reduce the amount of disk transfer.
12) The two main storage engines in MySQL are MyISAM and InnoDB, and each has its own pros and cons. MyISAM is good for read-heavy applications, but it doesn't scale very well when there are a lot of writes: even if you are updating one field of one row, the whole table gets locked, and no other process can even read from it until that query is finished. MyISAM is, however, very fast at calculating SELECT COUNT(*) types of queries. InnoDB tends to be a more complicated storage engine and can be slower than MyISAM for most small applications, but it supports row-level locking, which scales better. It also supports more advanced features such as transactions.
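As referenced in item 10, here is a minimal sketch of an unbuffered (server-side) cursor in Python using pymysql; the connection details, the query, and the process() handler are hypothetical.

import pymysql
import pymysql.cursors

conn = pymysql.connect(host="<host>", user="<user>", password="<password>",
                       database="<database>",
                       cursorclass=pymysql.cursors.SSCursor)  # unbuffered, server-side cursor

with conn.cursor() as cur:
    cur.execute("SELECT col1, col2, col3 FROM Orig_X")
    # Rows stream from the server one at a time instead of being buffered in memory,
    # so processing can start as soon as the first row arrives.
    for row in cur:
        process(row)  # hypothetical row handler

conn.close()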
