Mongolite - distinct too big, 16mb cap - R

I'm trying to query my database with ratingsChoices = m$distinct({'answers'}), but I get: Warning: Error in : distinct too big, 16mb cap
Is there a way around this error in mongolite? I've seen some similar problems with PyMongo, etc.

Is there a way around this error in mongolite?
The problem occurs because the distinct database command is invoked when you call m$distinct. See MongoDB Database commands for more information.
The distinct command returns its results in a single document, and the maximum BSON document size is 16 megabytes. So, if the distinct values are numerous or large enough that the server's response exceeds 16 MB, you get the error message above.
An alternative is to use the MongoDB Aggregation Pipeline instead of the distinct command, which mongolite fortunately supports: mongolite aggregate.
Aggregation pipeline results are returned via a cursor that can be iterated, so the total result can exceed the 16 MB single-document limit.
For example (using MongoDB v3.6 and mongolite v2017-12-21):
uniqueName <- m$aggregate('[{"$group":{"_id":"$answers"}}]')
print(uniqueName)
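If the aggregation itself grows large on the server, it can also help to let it spill to disk. A minimal sketch, with hypothetical connection details (the collection, database, and URL below are placeholders to adjust to your setup):
library(mongolite)
# Placeholder connection details; adjust to your deployment.
m <- mongo(collection = "ratings", db = "mydb", url = "mongodb://localhost")
# $group on the field emulates distinct; allowDiskUse lets the server use
# temporary files if the aggregation stages exceed the in-memory budget.
uniqueName <- m$aggregate(
  '[{"$group":{"_id":"$answers"}}]',
  options = '{"allowDiskUse": true}'
)
head(uniqueName)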

Related

How to check or see size of cosmos DB document or is there any limit on loading of Azure DWH from cosmos DB

When I try to insert data from Cosmos DB into Azure DWH, it works well for most of the databases, but for some it gives strange issues.
We later found out that this is due to the size of the Cosmos DB documents; for example, one of our Cosmos databases is about 75 GB.
If we try to insert all the data in the initial load, it gives a Null Pointer error. But if we limit the rows to, say, the first 3000 and then increment the record count by 3000 each time, it is able to insert, although it takes a significant amount of time.
Also, this is our ACC data; we are not sure about our PRD data. For some of the DBs we now need to set 50000 rows per load, and for some we have set 3000 (as in the example above).
So is loading the data iteratively the only solution, or is there another way?
Also, how can we determine the increment value to load in each iteration for new DBs that are added?
P.S. I also tried increasing the DWUs and IR cores to the maximum, but no luck.

Increase Azure Data Explorer query limit inside stored procedure

In an Azure Data Explorer database I have defined a stored procedure that should move aggregated data from one table into a table in another database.
While running the procedure, the following limit is hit (E_RUNAWAY_QUERY):
Aggregation over string column exceeded the memory budget of 8GB during evaluation
Currently I cannot decrease the size of the data transferred, and I wanted to work around this by increasing the limits with set truncationmaxsize=..., but when I try to include this statement in the stored procedure, it fails during altering.
Is it possible to use limit-increasing commands (like set truncationmaxsize=1048576) inside stored procedures, and if so, how?
You can override the default memory consumption limits as described under Limit on memory per iterator and Limit on memory per node.
The better approach, though, is to split the result into multiple pieces using the ingest-from-query commands (.set, .append, .set-or-append); see the query best practices and Ingest from query documentation.
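For illustration, a minimal sketch of the split approach, where one large aggregation becomes several smaller, time-bounded appends (the table names, time window, and aggregation below are hypothetical):
// Append one time slice of the aggregation to the destination table;
// .set-or-append creates the table on the first run if it does not exist yet.
.set-or-append async AggregatedEvents <|
    SourceEvents
    | where Timestamp between (datetime(2024-01-01) .. datetime(2024-01-02))
    | summarize EventCount = count() by DeviceId, bin(Timestamp, 1h)
Repeating this per time slice keeps each individual command well under the memory budget.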

Kusto: How to query large tables as chunks to export data?

How can I structure a Kusto query such that I can query a large table (and download it) while avoiding the memory issues like: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/querylimits#limit-on-result-set-size-result-truncation
set notruncation; only works insofar as the Kusto cluster does not run out of memory, which in my case it does.
I did not find the answers here helpful: How can I query a large result set in Kusto explorer?
What I have tried:
Using the .export command, which fails for me, and it is unclear why. Perhaps you need to be a cluster admin to run such a command? https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-export/export-data-to-storage
Cycling through row numbers, but when run n times you do not get the right answer because the results are not the same, like so:
let start = 3000000;
let end = 4000000;
table
| serialize rn = row_number()
| where rn between(start..end)
| project col_interest;
"set notruncation" is not primarily for preventing an Out-Of-Memory error, but to avoid transferring too much data over-the-wire for an un-suspected client that perhaps ran a query without a filter.
".export" into a co-located (same datacenter) storage account, using a simple format like "TSV" (without compression) has yielded the best results in my experience (billions of records/Terabytes of data in extremely fast periods of time compared to using the same client you would use for normal queries).
What was the error when using ".export"? The syntax is pretty simple, test with a few rows first:
.export to tsv (
h#"https://SANAME.blob.core.windows.net/CONTAINER/PATH;SAKEY"
) with (
includeHeaders="all"
)
<| simple QUERY | limit 5
You don't want to overload the cluster by simultaneously running an inefficient query (like a serialization of a large table, per your example) and trying to move the result in a single dump over the wire to your client.
Try optimizing the query first using the Kusto Explorer client's "Query analyzer" until the CPU and/or memory usage is as low as possible (ideally a 100% cache hit rate; you can also scale up the cluster temporarily to fit the dataset in memory).
You can also run the query in batches (try time filters first, since this is a time-series engine) and save each batch into an "output" table (using ".set-or-append"). That way you split the load: the cluster first processes the dataset, and then you export the full "output" table to external storage.
If for some reason you absolutely must use the same client to run the query and consume the (large) result, try using database cursors instead of serializing the whole table. It is the same idea, but pre-calculated, so you can use a "limit XX", where XX is the largest batch you can move over the wire to your client, and run the same query over and over, moving the cursor, until the whole dataset has been transferred:
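A very rough sketch of that last idea, reusing the question's table and column names (the cursor value, watermark date, and chunk size are placeholders, and this assumes the table's IngestionTime policy is enabled so ingestion_time() can serve as a stable paging key):
// Pin the result set to the data that existed when the export started
// (capture the value once beforehand with: print Snapshot = cursor_current()).
table
| where cursor_before_or_at("<cursor value captured on the first run>")
// Page by a watermark saved from the last row of the previous batch.
| where ingestion_time() > datetime(2024-01-01)
| order by ingestion_time() asc
| take 1000000
| project col_interest, ingestion_time()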

Azure Cosmos DB aggregation and indexes

I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.
My collection schema is below, and I have 80,000 documents in this collection.
{
"_id" : ObjectId("5aca8ea670ed86102488d39d"),
"UserID" : "5ac161d742092040783a4ee1",
"ReferenceID" : 87396,
"ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
"ElapsedTime" : 1694,
"CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}
If I run the command below to count all documents in the collection, I get the result very quickly:
db.Tests.count()
But when I run the same command for a specific user, I get a "Request rate is large" message.
db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()
In the Cosmos DB documentation I found this scenario, and the suggestion is to increase RUs. Currently I have 400 RU/s; when I increase it to 10,000 RU/s I can run the command with no errors, but it takes 5 seconds.
I have already tried creating an index explicitly, but it seems Cosmos DB doesn't use the index for the count.
I don't think it is reasonable to have to pay for 10,000 RU/s for a simple count in a collection of roughly 100,000 documents, and even then it takes about 5 seconds.
Count-by-filter queries ARE using indexes if they are available.
If you try a count-by-filter on a non-indexed column, the query will not time out but fail. Try it. You should get an error along the lines of:
{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}
So definitely add a suitable index on UserID.
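For example, in the MongoDB API shell the question is already using, the index could be added along these lines (a minimal sketch; depending on the account's indexing policy the field may already be indexed automatically):
db.Tests.createIndex({ UserID: 1 })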
If you don't have index coverage and don't get the above error, then you probably have the enableScanInQuery flag set. This is almost always a bad idea, and a full scan will not scale, meaning it will consume increasingly large amounts of RU as your dataset grows. So make sure it is off, and index instead.
When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation with indexLookupTimeInMs and indexUtilizationRatio fields. Example output:
"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"
It also gives you some insight into where the effort has gone if you feel the RU charge is too large.
If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution and adjust the index accordingly.
Another wild guess to consider: check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out it fetches all matching documents to the client side before counting, that would explain the unexpectedly high RU cost, especially if a large number of documents, or large documents, are involved. Check the API documentation.
I also suggest executing the same query directly in the Azure Portal to compare the RU cost and verify whether the issue is client-related or not.
I think it just doesn't work.
The index seems to be used when selecting the documents to be counted, but the count itself is then done by reading each document, effectively consuming a lot of RU.
This query is cheap and fast:
db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})
but this one is slow and expensive:
db.Tests.count({ ReferenceID: { '$gt': 10 }})
even though this query is fast:
db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."
Pretty disappointing, to be honest, especially since this limitation hasn't been addressed for almost two years. Note: I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.
BTW: I noticed that simple indexes seem to be created automatically for each individual field, so there is no need to create them manually.

Change SQLITE_LIMIT_SQL_LENGTH for inserting large items at the same time - Sqlite3

I'm having difficulty setting a new limit for the SQL statement length in SQLite (version 3.10.2). I read the SQLite documentation and noticed there is a function that allows changing the limit: sqlite3_limit(db, SQLITE_LIMIT_SQL_LENGTH, size)
However, this function is for the C interface.
I'd like to know whether there is a function I can use from SQL to change the maximum SQL length, because I'll be working with large SQL statements.
Has anyone run into this problem?
Thank you very much.
Carlos
There is no SQL function for that.
And the sqlite3_limit() function can only lower a connection's limits; it cannot raise them above the hard upper bound set at compile time.
To increase the maximum length of an SQL statement, you would have to recompile the SQLite library with different compile-time options (e.g. SQLITE_MAX_SQL_LENGTH).
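For reference, a small sketch of how the run-time and compile-time limits interact through the C interface (the 10,000,000-byte value is only an illustration):
/* Raising the ceiling itself requires a rebuild, e.g.:
 *   gcc -DSQLITE_MAX_SQL_LENGTH=10000000 -c sqlite3.c
 * After that, the per-connection limit can be raised up to that maximum. */
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    /* A negative value queries the current limit without changing it. */
    printf("current SQL length limit: %d\n",
           sqlite3_limit(db, SQLITE_LIMIT_SQL_LENGTH, -1));

    /* Requests above the compile-time maximum are silently capped to it. */
    sqlite3_limit(db, SQLITE_LIMIT_SQL_LENGTH, 10000000);
    printf("new SQL length limit: %d\n",
           sqlite3_limit(db, SQLITE_LIMIT_SQL_LENGTH, -1));

    sqlite3_close(db);
    return 0;
}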
