Kusto shuffle strategy behavior with nested summarize/join - azure-data-explorer

While improving the performance of a Kusto query, I came across the shuffle strategy for join/summarize. I can clearly see the performance benefits of this strategy for my query, which has high cardinality for the join/summarize key.
While reading the shuffle query Kusto documentation, it seemed that the strategy would be ignored when there are nested shuffle operators:
When the shuffled operator has other shuffle-able operators, like summarize or join, the query becomes more complex and then hint.strategy=shuffle won't be applied.
My query uses nested summarize and join (with shuffle), yet I still clearly see performance gains. My query pattern:
Table1
| summarize hint.strategy=shuffle arg_max(Timestamp, *) by Device, Interface
| join hint.strategy=shuffle (Table2) on Device, Interface
Does a query like this benefit from shuffling?
Also, does the Kusto query planner always avoid a problematic shuffle if one is present? Basically, I want to rest assured that a wrongly used/authored shuffle can only cause performance issues, not data issues.

Please note that the shuffle query article suggests using hint.shufflekey when you have nested summarize/join operators, but it requires that the nested summarize/join operators share the same group-by/join key.
So in your example above, apply the following. (I'm assuming that Device has high cardinality; you can remove or keep the shuffle strategy on the summarize, since keeping or removing it gives the same behavior as long as you specify the shuffle key on the join that wraps this summarize.)
Table1
| summarize arg_max(Timestamp, *) by Device, Interface
| join hint.shufflekey=Device (Table2) on Device, Interface
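If Interface is also high-cardinality, the shuffle query article notes that hint.shufflekey can be repeated so that both keys take part in the shuffle. A sketch under that assumption (whether it helps depends on the actual cardinality of your keys):
Table1
| summarize arg_max(Timestamp, *) by Device, Interface
| join hint.shufflekey=Device hint.shufflekey=Interface (Table2) on Device, Interface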

Related

Azure data factory dataflow SELECT DISTINCT

I have a dataflow with a few joins, and when making join #5 the number of rows goes from 10,000 to 320,000 (as an example of how much the quantity increases). After that I still have more joins to make, so the dataflow is taking longer to complete.
What I do is add an Aggregate transformation after the joins to group by the fields that I will use later, using it the way I would use a SELECT DISTINCT in a query on the database, but it still takes very long to finish.
How can I make this dataflow run faster?
Should I use an Aggregate (grouping by the fields) between every join to avoid the duplicates, or just add the Aggregate after the join where the rows start to increase?
Thanks.
Can you switch to Lookups instead of Joins and then choose "run single row"? That provides the SELECT DISTINCT capability in a single step.
Also, to speed up the processing end-to-end, try bumping up to memory-optimized compute and raising the core count.

Should I use WITH instead of a JOIN on a table with a lot of data?

I have a MariaDB table which contains a lot of metadata and is very big in terms of bytes.
I have columns A and B in that table, along with other columns.
I would like to join that table with another table (stuff) in order to get column C from it.
So I have something like:
SELECT metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D;
This query sometimes takes a very long time. I suspect it's because (AFAIK, please correct me if I'm wrong) JOIN stores the result of the join in some side table, and because the metadata table is very big it has to copy a lot of data even though I don't use it. So I thought about optimizing it with WITH as follows:
WITH m AS (SELECT A, B, D FROM metadata),
     s AS (SELECT C, D FROM stuff)
SELECT * FROM m JOIN s ON m.D = s.D;
The execution plan is the same (checked using EXPLAIN), but I think it will be faster, since the side tables that WITH creates (again, AFAIK WITH also creates side tables, please correct me if I'm wrong) will be smaller and only contain the needed data.
Is my logic correct? Is there some way I can test that in MariaDB?
More likely, there is some form of cache speeding up one query or the other.
The query cache is usually recognizable by a query time of only about 1 ms. It can be turned off via SELECT SQL_NO_CACHE ... to get a timing to compare against.
The other likely cache is the buffer_pool. Data is read from disk into the buffer_pool unless it is already there. The simple workaround for strange timings is to run the query twice and take the second time.
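For the query above, a minimal way to run that comparison (SQL_NO_CACHE only disables the query cache; run the statement twice so the second timing reflects a warm buffer_pool):
SELECT SQL_NO_CACHE metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D;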
Your hypothesis that WITH creates 'small' temp tables falls apart because the work needed to read the original tables is the same with or without WITH.
Please provide SHOW CREATE TABLE for the two tables. There are a couple of datatype issues that may be involved -- big TEXTs or BLOBs.
The newly added WITH opens up the possibility of recursive CTEs (and other things), and it provides a way to materialize a temp table that is used more than once. Neither of those applies to your query, so I would not expect any performance improvement.

Filtering results from ClickHouse using values from dictionaries

I'm a little unfamiliar with ClickHouse and am still studying it by trial and error. I have a question about it.
I'm talking about the star schema of data representation, with dimensions and facts. Currently I keep everything in PostgreSQL, but OLAP queries with aggregations are starting to show bad timings, so I'm going to move some fact tables to ClickHouse. Initial tests of CH show incredible performance; however, in real life the queries must include joins to dimension tables from PostgreSQL. I know I can connect them as dictionaries.
Question: I found that using dictionaries I can make requests similar to LEFT JOINs in a good old RDBMS, i.e. values from the result set can be joined with corresponding values from the dictionary. But can they be filtered by some restriction on the dictionary keys (as in an INNER JOIN)? For example, in PostgreSQL I have a table users (id, name, ...) and in ClickHouse a table visits (user_id, source, medium, session_time, timestamp, ...) with metrics about their visits to the site. Can I make a query to CH to fetch aggregated metrics (number of daily visits for a given date range) for users whose name matches some condition (LIKE 'EVE%', for example)?
It sounds like the ODBC table function is what you're looking for. ClickHouse has a bunch of table functions which work like Postgres foreign tables. The setup is similar to dictionaries, but you gain the traditional JOIN behavior. It currently doesn't show up in the official documentation; you can refer to https://github.com/yandex/ClickHouse/blob/master/dbms/tests/integration/test_odbc_interaction/test.py#L84 instead. And in the near future (this year), ClickHouse will support the standard JOIN statement.
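A sketch of what that could look like for the users/visits example above, assuming an ODBC DSN named postgres pointing at the PostgreSQL instance and the public schema (the DSN, schema, and column names are illustrative):
-- INNER-JOIN-like filtering: keep only visits whose user matches the name predicate
SELECT toDate(timestamp) AS day, count() AS daily_visits
FROM visits
WHERE user_id IN
(
    SELECT id
    FROM odbc('DSN=postgres', 'public', 'users')
    WHERE name LIKE 'EVE%'
)
GROUP BY day
ORDER BY day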
The dictionary will basically replace the value first. As I understand it, your dictionary would be based off your users table.
Here is an example; hopefully I am understanding your question.
SELECT
    dictGetString('accountidmap', 'domain', tuple(toString(account_id))) AS domain,
    sum(session) AS sessions
FROM session_distributed
WHERE date = '2018-10-15'
  AND like(domain, '%cats%')
GROUP BY domain
This is a real query on our database, so if there is something you want to try or confirm, let me know.

Is there join number limitation in SQLite?

I'm curious about how the performance changes when adding more joins. Is there a limit on the number of joins, e.g. a value beyond which performance degrades? Thanks.
Maximum Number Of Tables In A Join
SQLite does not support joins containing more than 64 tables. This limit arises from the fact that the SQLite code generator uses bitmaps with one bit per join-table in the query optimizer.
SQLite uses a very efficient O(N²) greedy algorithm for determining the order of tables in a join and so a large join can be prepared quickly. Hence, there is no mechanism to raise or lower the limit on the number of tables in a join.
See: http://www.sqlite.org/limits.html

Oracle explain plan:Cardinality returns a huge number but the query returns no records

I have written a complex Oracle SQL query and the explain plan stats look like this:
Cost: 209,201 Bytes:187,944,150 Cardinality: 409,675
Now the DBA tuned the query and the stats look like this:
Cost: 42,996 Bytes: 89,874,138 Cardinality: 209,226
My first question is: if the numbers are lower, does that automatically mean better performance?
Which number is the most pertinent: cost, cardinality, or bytes?
My second question is: I understand cardinality is the number of rows read. But when I run the query, it returns 0 rows!
My impression was that cardinality has to be the same for two queries that are supposed to return the same result sets. Is this wrong?
Cost, bytes, cardinality... all are estimations based on inputs such as the statistics given to the optimizer, so by themselves they mean nothing, but they can give an idea. In the Oracle Performance Tuning Guide's words: "It is best to use EXPLAIN PLAN to determine an access plan, and then later prove that it is the optimal plan through testing. When evaluating a plan, examine the statement's actual resource consumption."
For the second question: theoretically, equivalent queries should return the same cardinality. Your tables' statistics may be old.
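If stale statistics are the suspect, a sketch of how to re-check from SQL*Plus (the orders table and its predicate are placeholders for the actual query's tables):
-- Show the estimated plan for the statement
EXPLAIN PLAN FOR
SELECT * FROM orders WHERE status = 'OPEN';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Refresh the optimizer statistics, then re-run EXPLAIN PLAN and compare cardinalities
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'ORDERS');
END;
/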
