How can I structure a Kusto query such that I can query a large table (and download it) while avoiding memory issues like the result-set truncation described here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/querylimits#limit-on-result-set-size-result-truncation
set notruncation; only works insofar as the Kusto cluster does not run out of memory, which in my case it does.
I did not find the answers here helpful: How can i query a large result set in Kusto explorer?
What I have tried:
Using the .export command, which fails for me, and it is unclear why. Perhaps you need to be a cluster admin to run such a command? https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-export/export-data-to-storage
Cycling through row numbers; but when run n times, you do not get the right answer because the results are not the same from run to run, like so:
// without an explicit sort before serialize, the row order (and therefore rn) is not
// deterministic, so each run can return a different set of rows
let start = 3000000;
let end = 4000000;
table
| serialize rn = row_number()
| where rn between (start .. end)
| project col_interest;
"set notruncation" is not primarily for preventing an Out-Of-Memory error, but to avoid transferring too much data over-the-wire for an un-suspected client that perhaps ran a query without a filter.
".export" into a co-located (same datacenter) storage account, using a simple format like "TSV" (without compression) has yielded the best results in my experience (billions of records/Terabytes of data in extremely fast periods of time compared to using the same client you would use for normal queries).
What was the error when using ".export"? The syntax is pretty simple; test with a few rows first:
.export to tsv (
    h@"https://SANAME.blob.core.windows.net/CONTAINER/PATH;SAKEY"
) with (
    includeHeaders="all"
)
<| simple QUERY | limit 5
You don't want to overload the cluster by running an inefficient query (like serializing a large table, per your example) while also trying to move the result over the wire to your client in a single dump.
Try optimizing the query first using the Kusto Explorer client's "Query analyzer" until the CPU and/or memory usage are as low as possible (ideally 100% cache hit rate; you can scale up the cluster temporarily to fit the dataset in memory as well).
You can also run the query in batches (try time filters first, since this is a time-series engine) and save each batch into an "output" table using ".set-or-append". This way you split the load: first use the cluster to process the dataset, then export the full "output" table into external storage. A sketch of one batch follows below.
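For example, a minimal sketch of a single batch, assuming illustrative names (SourceTable, OutputTable, a Timestamp column) plus the col_interest column from the question:

.set-or-append OutputTable <|
    SourceTable
    | where Timestamp between (datetime(2023-01-01) .. datetime(2023-01-02))
    | project col_interest

Repeat with the next time window until the full range is covered, then move the complete OutputTable with a single .export.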
If for some reason you absolutely must use the same client to run the query and consume the (large) result, try using database cursors instead of serializing the whole table. It's the same idea, but pre-calculated: use a "limit XX", where "XX" is the largest result you can move over the wire to your client, and run the same query over and over, moving the cursor forward, until you have transferred the whole dataset:
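A minimal sketch of that cursor idea, using the table and col_interest names from the question (the cursor values are placeholders you capture from earlier passes, and the cursor functions rely on the table's ingestion-time information):

// capture the current database cursor once per pass
print Cursor = cursor_current()

// then fetch only the records ingested after the previously captured cursor
// and up to the newly captured one
table
| where cursor_after('<previous-cursor>') and cursor_before_or_at('<current-cursor>')
| project col_interest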
Related
Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use the SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric - it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.
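As an illustration, a query-time projection like the following sketch (the external table and column names are made up) only needs the referenced columns to be read, even though the "Downloaded Size" metric currently reports the total size:

external_table("LogSource")
| where Timestamp between (datetime(2021-09-01) .. 1d)
| project DeviceId, Message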
I am trying to instrument part of a Kusto function to check the execution times in different scenarios. However, I couldn't find a way to print the time before and after:
print now();
<query takes few seconds>;
print now();
Both print statements return the same value in the output. I tried running print now() as part of another function, converting it with tostring(), and adding it as another column using extend, but the value remains the same.
What are the ways to instrument query performance in Kusto Explorer / Azure Data Explorer? Is there any way to override the default now() behavior?
As @yifats mentioned in her answer, now() will return the same value in all occurrences within the same query, and this is by design.
But looking at the query duration isn't a good way to do a perf test anyway. Instead, run the two queries you want to compare alternately several times, and rather than looking at the duration of each query, go to the Query Summary tab in Kusto Explorer and look at the "Total CPU time" and "Data scanned (estimated)" - the lower the numbers, the better. Also look at the "Parallelism" (higher parallelism is better, as the same CPU time with higher parallelism results in a lower query duration).
Make sure to use a constant time window for all query runs, e.g. | where Timestamp between (datetime(2021-09-12 14:00:00) .. 1d) instead of filtering data using | where Timestamp > ago(1d). This is in order to run the query on exactly the same data - just make sure that your time window doesn't include the last few minutes, as ingestion within that timeframe might still be in progress, so two runs might need to scan different amounts of data.
Lastly, make sure you only query data in the hot-cache (otherwise you won't get consistent results, performance-wise).
This behavior is by design, as mentioned in the documentation, and cannot be overridden. You should use the .show queries command to analyze query performance.
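For example, a rough sketch of such an analysis (the exact output columns may vary by service version, and the Text filter is just illustrative):

.show queries
| where StartedOn > ago(1h) and Text has "MyFunction"
| project StartedOn, Duration, State, TotalCpu, MemoryPeak
| order by StartedOn desc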
Is it possible to execute an Array DML INSERT or UPDATE statement passing BLOB field data in the parameter array? And the more important part of my question: if it is possible, will an Array DML command containing BLOB data still be more efficient than executing the commands one by one?
I have noticed that TADParam has an AsBlobs indexed property, so I assume it might be possible, but I haven't tried this yet because there's no mention of performance and no example showing this, and because the indexed property is of type RawByteString, which is not very suitable for my needs.
I'm using FireDAC and working with a SQLite database (Params.BindMode = pbByNumber, so I'm using a native SQLite INSERT with multiple VALUES). My aim is to store about 100k records containing pretty small BLOB data (about 1 kB per record) as fast as possible (even at the cost of FireDAC's abstraction).
The main point in your case is that you are using a SQLite3 database.
With SQLite3, Array DML is "emulated" by FireDAC. Since it is a local instance, not a client-server instance, there is no need to prepare a bunch of rows, then send them at once to avoid network latency (as with Oracle or MS SQL).
Using Array DML may speed up your insertion process a little bit with SQLite3, but I doubt the gain will be very high. A good plain INSERT with binding per number will work just fine.
The main performance tips in your case are as follows (a short sketch of the first two follows the list):
Nest your process within a single transaction (or even better, use one transaction per 1000 rows of data);
Prepare an INSERT statement, then re-execute it with a bound parameter each time;
By default, FireDAC initializes SQLite3 with the fastest options (e.g. disabling LOCK), so let it be.
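A minimal SQL-level sketch of the first two tips, with made-up table and column names (FireDAC prepares the INSERT once, then re-executes it with freshly bound values; with Params.BindMode = pbByNumber the parameter markers are positional):

BEGIN TRANSACTION;
-- prepared once, then re-executed e.g. 1000 times with new values bound to the ? markers
INSERT INTO records (id, payload) VALUES (?, ?);
COMMIT;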
SQLite3 is very good at BLOB processing.
From my tests, FireDAC insertion timing is pretty good, very close to direct SQLite3 access. Only reading is slower than a direct SQLite3 link, due to the overhead of the Delphi TDataSet class.
When I open up TOAD and do a select * from table, the results (first 500 rows) come back almost instantly. But the explain plan shows a full table scan, and the table is very huge.
How come the results are so quick?
In general, Oracle does not need to materialize the entire result set before it starts returning the data (there are, of course, cases where Oracle has to materialize the result set in order to sort it before it can start returning data). Assuming that your query doesn't require the entire result set to be materialized, Oracle will start returning the data to the client process whether that client process is TOAD or SQL*Plus or a JDBC application you wrote. When the client requests more data, Oracle will continue executing the query and return the next page of results. This allows TOAD to return the first 500 rows relatively quickly even if it would ultimately take many hours for Oracle to execute the entire query and to return the last row to the client.
Toad only returns the first 500 rows for performance, but if you were to run that query through an Oracle interface, JDBC for example, it would return the entire result. My best guess is that the explain plan shows you the results for the case where it doesn't get a subset of the records; that's how I use it. I don't have a source for this other than my own experience with it.
Background: I am using a SQLite database in my Flex application. The size of the database is 4 MB and it has 5 tables:
table 1 has 2500 records
table 2 has 8700 records
table 3 has 3000 records
table 4 has 5000 records
table 5 has 2000 records
Problem: Whenever I run a select query on any table, it takes around 50 seconds to fetch data from the database tables. This has made the application quite slow and unresponsive while it fetches the data.
How can I improve the performance of the SQLite database so that the time taken to fetch data from the tables is reduced?
Thanks
As I told you in a comment, without knowing what structures your database consists of and what queries you run against the data, there is nothing we can infer to suggest why your queries take so much time.
However, here is an interesting read about indexes: Use the index, Luke!. It tells you what an index is, how you should design your indexes, and what benefits you can harvest.
Also, if you can post the queries and the table schemas and cardinalities (not the contents) maybe it could help.
Are you using asynchronous or synchronous execution modes? The difference between them is that asynchronous execution runs in the background while your application continues to run. Your application will then have to listen for a dispatched event and then carry out any subsequent operations. In synchronous mode, however, the user will not be able to interact with the application until the database operation is complete since those operations run in the same execution sequence as the application. Synchronous mode is conceptually simpler to implement, but asynchronous mode will yield better usability.
The first time SQLStatement.execute() is called on a SQLStatement instance, the statement is prepared automatically before executing. Subsequent calls will execute faster as long as the SQLStatement.text property has not changed. Reusing the same SQLStatement instance is better than creating new instances again and again. If you need to change your queries, consider using parameterized statements.
You can also use techniques such as deferring what data you need at runtime. If you only need a subset of data, pull that back first and then retrieve other data as necessary. This may depend on your application scope and what needs you have to fulfill though.
Specifying the database along with the table names will prevent the runtime from checking each database to find a matching table if you have multiple databases. It also prevents the runtime from choosing the wrong database when this isn't specified. Do SELECT email FROM main.users; instead of SELECT email FROM users; even if you only have a single database. (main is automatically assigned as the database name when you call SQLConnection.open.)
If you happen to be writing lots of changes to the database (multiple INSERT or UPDATE statements), then consider wrapping them in a transaction. The changes will be made in memory by the runtime and then written to disk in one go. If you don't use a transaction, each statement will result in multiple disk writes to the database file, which can be slow and consume lots of time.
Try to avoid any schema changes. The table definition data is kept at the start of the database file, and the runtime loads these definitions when the database connection is opened. Data added to tables is kept after the table definition data in the database file. If you make changes such as adding columns or tables, the new table definitions will be mixed in with the table data in the database file. The effect of this is that the runtime has to read the table definition data from different parts of the file rather than just the beginning. The SQLConnection.compact() method restructures the file so that the table definition data is at the beginning again, but its downside is that it can also consume much time, more so if the database file is large.
Lastly, as Benoit pointed out in his comment, consider improving your SQL queries and the table structure you're using. It would be helpful to know whether your database structure and queries are the actual cause of the slow performance. My guess is that you're using synchronous execution; if you switch to asynchronous mode, you'll see better performance, but that doesn't mean it has to stop there.
The Adobe Flex documentation online has more information on improving database performance and best practices working with local SQL databases.
You could try indexing some of the columns used in the WHERE clause of your SELECT statements (see the sketch below). You might also try minimizing usage of the LIKE keyword.
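For example, a sketch with hypothetical table and column names:

-- index a column that appears in your WHERE clauses
CREATE INDEX IF NOT EXISTS idx_records_category ON records (category);

-- this lookup can now use the index; a leading-wildcard LIKE ('%foo%') could not
SELECT id, name FROM records WHERE category = 'books';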
If you are joining your tables together, you might try simplifying the table relationships.
Like others have said, it's hard to get specific without knowing more about your schema and the SQL you are using.