In an Azure Data Explorer database I have defined a procedure that should move aggregated data from one table into another table.
While running the procedure, the following limit is hit with an E_RUNAWAY_QUERY message:
Aggregation over string column exceeded the memory budget of 8GB during evaluation
Currently I cannot decrease the size of the data being transferred, so I wanted to work around this by increasing the limits with set truncationmaxsize=..., but when I try to include this statement in the stored procedure, the procedure fails during altering.
Is it possible to use limit-increasing statements (like set truncationmaxsize=1048576) inside stored procedures, and if so, how?
You can override those default memory consumption limits as described under Limit on memory per iterator and Limit on memory per node.
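For reference, a hedged example of raising those two limits with set statements sent by the client together with the query; the property names follow the query-limits documentation, while the values and the function name (MyAggregationFunction) are placeholders:

set maxmemoryconsumptionperiterator = 16106127360;
set max_memory_consumption_per_query_per_node = 17179869184;
MyAggregationFunction()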
The best approach would be to split the result into multiple smaller ones using the ingest-from-query commands (.set, .append, .set-or-append); you can refer to query best practices and Ingest from query, and the sketch below.
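A minimal sketch of that split-and-append pattern (the table names, time column, and aggregation are assumptions); each command moves one time slice, so no single query has to aggregate the whole string column at once:

.set-or-append AggregatedTarget <|
    SourceTable
    | where Timestamp >= datetime(2024-01-01) and Timestamp < datetime(2024-01-08)
    | summarize AggregatedValue = take_any(LargeStringColumn) by CustomerId

// Repeat the command with the next time window until the whole range is covered.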
When I try to insert data from Cosmos DB into Azure DWH, it inserts well for most of the databases, but for some it gives strange issues.
Later we found out that it is due to the size of the Cosmos DB document.
For example, one of our Cosmos DBs is 75 GB in size.
If we try to insert all the data in the initial load, it gives a Null Pointer error. But if we limit the rows, say to the first 3000, and then increment the record count by 3000 each time, it is able to insert, although it takes a significant amount of time.
Also, this is our ACC data; we are not sure about our PRD data. For some of the DBs we need to set it to 50000 rows per load, and for some we have set 3000 (as in the example above).
So is loading the data iteratively the only solution, or is there another way?
Also, how can we determine the increment to load in each iteration for new DBs to be added?
P.S. I also tried increasing the DWUs and IR cores to the maximum, but no luck.
How can I structure a Kusto query such that I can query a large table (and download it) while avoiding the memory issues like: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/querylimits#limit-on-result-set-size-result-truncation
set notruncation; only works insofar as the Kusto cluster does not run out of memory, which in my case it does.
I did not find the answers in How can i query a large result set in Kusto explorer? helpful.
What I have tried:
Using the .export command, which fails for me, and it is unclear why. Perhaps you need to be a cluster admin to run such a command? https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-export/export-data-to-storage
Cycling through row numbers; but when run n times, you do not get the right answer because the results are not the same across runs, like so:
let start = 3000000;
let end = 4000000;
table
| serialize rn = row_number()
| where rn between(start..end)
| project col_interest;
"set notruncation" is not primarily for preventing an Out-Of-Memory error, but to avoid transferring too much data over-the-wire for an un-suspected client that perhaps ran a query without a filter.
".export" into a co-located (same datacenter) storage account, using a simple format like "TSV" (without compression) has yielded the best results in my experience (billions of records/Terabytes of data in extremely fast periods of time compared to using the same client you would use for normal queries).
What was the error when using ".export"? The syntax is pretty simple, test with a few rows first:
.export to tsv (
h@"https://SANAME.blob.core.windows.net/CONTAINER/PATH;SAKEY"
) with (
includeHeaders="all"
)
<| simple QUERY | limit 5
You don't want to overload the cluster at the same time by running an inefficient query (like a serialization on a large table per your example) and trying to move the result in a single dump over the wire to your client.
Try optimizing the query first using the Kusto Explorer client's "Query analyzer" until the CPU and/or memory usage are as low as possible (ideally 100% cache hit rate; you can scale up the cluster temporarily to fit the dataset in memory as well).
You can also run the query in batches (try time filters first, since this is a time-series engine) and save each batch into an "output" table (using ".set-or-append"). This way you split the load: first use the cluster to process the dataset, then export the full "output" table into external storage.
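A hedged sketch of that batch-then-export flow (table names and time windows are assumptions; the storage URI reuses the placeholder from the example above); each .set-or-append runs as its own command, and the .export runs once the output table is complete:

.set-or-append OutputTable <|
    LargeTable
    | where Timestamp between (datetime(2024-01-01) .. datetime(2024-01-02))
    | project col_interest
// ...repeat with the next time window, then export the whole output table...

.export async to tsv (
    h@"https://SANAME.blob.core.windows.net/CONTAINER/PATH;SAKEY"
) with (
    includeHeaders="all"
)
<| OutputTable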
If for some reason you absolutely must use the same client to run the query and consume the (large) result, try using database cursors instead of serializing the whole table. It is the same idea, but pre-calculated: use a "limit XX", where "XX" is the largest chunk you can move over the wire to your client, and run the same query over and over, moving the cursor each time, until you have moved the whole dataset:
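A minimal sketch of the database-cursor primitives that suggestion relies on (table and column names are placeholders); cursors track ingestion order, so each pass reads only the records ingested between two cursor snapshots instead of re-serializing the whole table:

// Capture a database cursor value to use as the upper bound of a batch.
print CurrentCursor = cursor_current()

// Each pass reads only what was ingested after the previously saved cursor and
// at or before the newly captured one ('PreviousCursor' and 'CurrentCursor'
// stand for the values returned by the command above).
LargeTable
| where cursor_after('PreviousCursor') and cursor_before_or_at('CurrentCursor')
| project col_interest
| limit 1000000  // caps what goes over the wire per pass, per the suggestion above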
I have a requirement where I need to get only a certain attribute from the matching records on querying a DynamoDB table. I have used withSelect(Select.SPECIFIC_ATTRIBUTES).withProjectionExpression(<attribute_name>) to get that attribute. But the number of records being read by the queryPage operation is the same in both the cases (1. using withSelect and 2. without using withSelect). The only advantage is by using withSelect, these operations are being processed very quickly. But this is in turn causing a lot of DynamoDB reads. Is there any way I can read more records in a single query thereby reducing my number of DB reads?
The reason you are seeing that the number of reads is the same is due to the fact that projection expressions are applied after each item is retrieved from the storage nodes, but before it is collected into the response object. The net benefit of projection expressions is to save network bandwidth, which in turn can save latency. But it will not result in consumed capacity savings.
If you want to save consumed capacity and be able to retrieve more items per request, your only options are:
create an index and project only the attributes you need to query; this can be a local secondary index or a global secondary index, depending on whether you need to change the partition key for the index (a sketch follows at the end of this answer)
try to optimize the schema of your data stored in the table; perhaps you can compress your items, or just generally work out encodings that result in smaller documents
Some things to keep in mind if you do decide to go with an index: a local secondary index would probably work best in your example but you would need to create a new table for that (local secondary indexes can only be created when you create the table); a global secondary index would also work but only if your application can tolerate eventually consistent reads on the index (and of course, there is a higher cost associated with these).
Read more about using indexes with DynamoDB here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes.html
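A hedged sketch using the AWS SDK for Java v1 (the table, key, and attribute names are hypothetical); it adds a global secondary index that projects only the attribute you actually read, so queries against the index fetch smaller items and consume less capacity:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class AddProjectedIndex {
    public static void main(String[] args) {
        AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();

        // "MyTable", "queryKey" and "neededAttribute" are placeholders for your
        // table name, the index partition key, and the attribute you project.
        UpdateTableRequest request = new UpdateTableRequest()
            .withTableName("MyTable")
            .withAttributeDefinitions(
                new AttributeDefinition("queryKey", ScalarAttributeType.S))
            .withGlobalSecondaryIndexUpdates(new GlobalSecondaryIndexUpdate()
                .withCreate(new CreateGlobalSecondaryIndexAction()
                    .withIndexName("queryKey-needed-index")
                    .withKeySchema(new KeySchemaElement("queryKey", KeyType.HASH))
                    // Project only the attribute you actually need to read.
                    .withProjection(new Projection()
                        .withProjectionType(ProjectionType.INCLUDE)
                        .withNonKeyAttributes("neededAttribute"))
                    .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L))));

        ddb.updateTable(request);
    }
}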
I'm trying to query my database with ratingsChoices = m$distinct({'answers'}), but I get Warning: Error in : distinct too big, 16mb cap
Is there a way around this error in mongolite? I've seen some similar problems with PyMongo, etc
Is there a way around this error in mongolite?
The problem here is that the distinct database command is invoked when you call m$distinct. See MongoDB Database commands for more information.
The distinct command returns a single document, and the maximum BSON document size is 16 megabytes. So if you have lots of distinct values and/or large fields whose combined size exceeds the 16 MB the server can return, you'll get the above error message.
An alternative is to use the MongoDB Aggregation Pipeline instead of the distinct command, which fortunately mongolite supports: mongolite aggregate.
Aggregation pipeline results are returned via a cursor, which can be iterated over. This means you can fetch result sets larger than the 16 MB maximum.
For example (using MongoDB v3.6 and mongolite v2017-12-21):
uniqueName <- m$aggregate('[{"$group":{"_id":"$answers"}}]')
print(uniqueName)
Is it possible to execute an Array DML INSERT or UPDATE statement passing BLOB field data in the parameter array? And, the more important part of my question: if it is possible, will an Array DML command containing BLOB data still be more efficient than executing the commands one by one?
I have noticed that TADParam has an AsBlobs indexed property, so I assume it might be possible, but I haven't tried it yet because there is no mention of performance nor any example showing this, and because the indexed property is of type RawByteString, which is not well suited to my needs.
I'm using FireDAC and working with an SQLite database (Params.BindMode = pbByNumber, so I'm using a native SQLite INSERT with multiple VALUES). My aim is to store about 100k records containing pretty small BLOB data (about 1 kB per record) as fast as possible (at the cost of FireDAC's abstraction).
The main point in your case is that you are using an SQLite3 database.
With SQLite3, Array DML is "emulated" by FireDAC. Since it is a local instance, not a client-server instance, there is no need to prepare a bunch of rows, then send them at once to avoid network latency (as with Oracle or MS SQL).
Using Array DML may speed up your insertion process a little with SQLite3, but I doubt the gain will be very big. A good plain INSERT with binding by number will work just fine.
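For reference, a hedged sketch of what the emulated Array DML variant would look like (the table, field, and Rows[] names are assumptions; AsBlobs takes RawByteString, as you noted):

// Array DML: bind a whole batch, then execute once (FireDAC emulates this for SQLite).
FDQuery1.SQL.Text := 'INSERT INTO MyTable (Id, Data) VALUES (:1, :2)';
FDQuery1.Params.ArraySize := 1000;  // batch size is an arbitrary choice
for i := 0 to FDQuery1.Params.ArraySize - 1 do
begin
  FDQuery1.Params[0].AsIntegers[i] := Rows[i].Id;
  FDQuery1.Params[1].AsBlobs[i] := Rows[i].RawData;  // RawByteString payload
end;
FDQuery1.Execute(FDQuery1.Params.ArraySize, 0);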
The main performance tips in your case are (see the sketch right after this list):
Nest your process within a single transaction (or even better, use one transaction per 1000 rows of data);
Prepare an INSERT statement once, then re-execute it with newly bound parameters each time;
By default, FireDAC initializes SQLite3 with the fastest options (e.g. disabling LOCK), so leave those settings as they are.
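A minimal Delphi sketch applying those tips (component names, the table, and the Rows[] record source are assumptions, not FireDAC-prescribed code):

// Assumes FDConnection1 is connected to the SQLite database and
// Rows[] holds the records to insert (Id: Integer; Blob: TStream).
FDQuery1.SQL.Text := 'INSERT INTO MyTable (Id, Data) VALUES (:1, :2)';
FDQuery1.Prepared := True;            // prepare once, re-execute many times
FDConnection1.StartTransaction;       // one transaction (or one per ~1000 rows)
try
  for i := 0 to High(Rows) do
  begin
    FDQuery1.Params[0].AsInteger := Rows[i].Id;
    FDQuery1.Params[1].LoadFromStream(Rows[i].Blob, ftBlob);
    FDQuery1.ExecSQL;
  end;
  FDConnection1.Commit;
except
  FDConnection1.Rollback;
  raise;
end;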
SQLite3 is very good at BLOB processing.
From my tests, FireDAC insertion timing is pretty good, very close to direct SQLite3 access. Only reading is slower than a direct SQLite3 link, due to the overhead of the Delphi TDataSet class.