Azure data factory dataflow SELECT DISTINCT - aggregate-functions

I have a dataflow with a few joins, and at join #5 the row count jumps from 10,000 to 320,000 (as an example of how much it grows). Since I still have more joins to make after that, the dataflow takes much longer to complete.
What I do is add an Aggregate transformation after the joins to group by the fields I will use later, the same way I would use a SELECT DISTINCT in a query against the database, but it still takes very long to finish.
How can I make this dataflow run faster?
Should I add an Aggregate (grouping by the fields) between every join to avoid the duplicates, or just add one Aggregate after the join where the row count starts to increase?
Thanks.

Can you switch to Lookups instead of Joins and choose "Run single row"? That provides the SELECT DISTINCT capability in a single step.
Also, to speed up the processing end-to-end, try bumping up to memory-optimized compute and raising the core count.
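To see why deduplicating before (or right after) the exploding join helps, here is a small sketch using sqlite3 in place of the dataflow. The table and column names are invented for the example; the point is that duplicate join keys on both sides multiply the output, and a DISTINCT (the Aggregate/group-by in ADF terms) on each side collapses it back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INT);
    CREATE TABLE visits (customer_id INT);
    -- three duplicate rows per side for the same key
    INSERT INTO orders VALUES (1), (1), (1);
    INSERT INTO visits VALUES (1), (1), (1);
""")

# Joining the raw tables: 3 x 3 = 9 rows for a single key
raw = conn.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN visits v ON o.customer_id = v.customer_id
""").fetchone()[0]

# Deduplicating each side first (the SELECT DISTINCT / Aggregate idea): 1 row
dedup = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT DISTINCT customer_id FROM orders) o
    JOIN (SELECT DISTINCT customer_id FROM visits) v
      ON o.customer_id = v.customer_id
""").fetchone()[0]

print(raw, dedup)  # 9 1
```

Every later join then works on the smaller, deduplicated stream, which is why placing the Aggregate immediately after the join that multiplies rows pays off.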

Related

Does clickhouse support quick retrieval of any column?

I tried to use ClickHouse to store 4 billion rows, deployed on a single machine: 48-core CPU, 256 GB of memory, mechanical hard disk.
My data has ten columns, and I want to quickly search any column through SQL statements, such as:
select * from table where key='mykeyword'; or select * from table where school='Yale';
I use ORDER BY to establish a sort key: order by (key, school, ...)
But only searches on the first field of the sort key (key) perform well. When searching on other fields, the query is very slow or even runs out of memory (the memory allocation is already large enough).
So I ask the experts: does ClickHouse support high-performance search on each column, with indexes similar to MySQL? I also tried to create a secondary index for each column, but performance did not improve.
You should try to understand how sparse primary indexes work
and how exactly the right ORDER BY clause in CREATE TABLE helps your query performance.
ClickHouse will never work the same way as MySQL.
Try to use PRIMARY KEY and ORDER BY in the CREATE TABLE statement,
and put fields with low value cardinality first in the PRIMARY KEY.
Don't try to select ALL columns:
SELECT * ...
is really an antipattern.
Moreover, a secondary data-skipping index may help you (but I'm not sure):
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes
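The underlying point (a sort key or index only accelerates lookups on the columns it covers) can be seen in any database. Here is a quick illustration with sqlite3 instead of ClickHouse, using EXPLAIN QUERY PLAN: an equality search on the indexed column uses the index, while a search on any other column falls back to a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT, school TEXT)")
conn.execute("CREATE INDEX idx_key ON t (key)")

# Search on the indexed column: the planner can use idx_key
plan_key = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE key = 'mykeyword'"
).fetchall()

# Search on a non-indexed column: full table scan
plan_school = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE school = 'Yale'"
).fetchall()

print(plan_key)
print(plan_school)
```

ClickHouse's sparse primary index works very differently from a B-tree, but the asymmetry is the same: only queries filtering on the leading columns of the ORDER BY key benefit.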

Should I use WITH instead of a JOIN on a table with a lot of data?

I have a MariaDB table which contains a lot of metadata and is very big in terms of bytes.
I have columns A and B in that table, along with other columns.
I would like to join that table with another table (stuff) in order to get column C from it.
So I have something like:
SELECT metadata.A, metadata.B, stuff.C FROM metadata JOIN
stuff on metadata.D = stuff.D
This query sometimes takes a very long time. I suspect it's because (AFAIK, please correct me if I'm wrong) the JOIN stores its result in some side table, and because the metadata table is very big it has to copy a lot of data even though I don't use it, so I thought about optimizing it with WITH as follows:
WITH m as (SELECT A,B,D FROM metadata),
s as (SELECT C,D FROM stuff)
SELECT * FROM m JOIN s ON m.D = s.D;
The execution plan is the same (checked with EXPLAIN), but I think it will be faster, since the side tables created by WITH (again, AFAIK WITH also creates side tables; please correct me if I'm wrong) will be smaller and only contain the needed columns.
Is my logic correct? Is there some way I can test that in MariaDB?
More likely, there is some form of cache speeding up one query or the other.
The Query cache is usually recognizable by a query time that is only about 1ms. It can be turned off via SELECT SQL_NO_CACHE ... to get a timing to compare against.
The other likely cache is the buffer_pool. Data is read from disk into the buffer_pool unless it is already there. The simple workaround for strange timings is to run the query twice and take the second 'time'.
Your hypothesis that WITH creates 'small' temp tables falls apart because the work needed to read the original tables is the same with or without WITH.
Please provide SHOW CREATE TABLE for the two tables. There are a couple of datatype issues that may be involved -- big TEXTs or BLOBs.
The newly-added WITH opens up the possibility of recursive CTEs (and other things). And it provides a way to materialize a temp table that is used more than once. Neither of those applies in your query, so I would not expect any performance improvement.
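One easy sanity check is that the two forms are semantically identical, so any speed difference would have to come from the plan, and EXPLAIN already showed the plans are the same. A small sqlite3 sketch (table and column names follow the question; the wide column stands in for the big metadata payload):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (A INT, B INT, D INT, payload TEXT);
    CREATE TABLE stuff (C INT, D INT);
    INSERT INTO metadata VALUES (1, 2, 10, 'big payload'),
                                (3, 4, 20, 'big payload');
    INSERT INTO stuff VALUES (100, 10), (200, 20);
""")

plain = conn.execute("""
    SELECT metadata.A, metadata.B, stuff.C
    FROM metadata JOIN stuff ON metadata.D = stuff.D
    ORDER BY 1
""").fetchall()

cte = conn.execute("""
    WITH m AS (SELECT A, B, D FROM metadata),
         s AS (SELECT C, D FROM stuff)
    SELECT m.A, m.B, s.C FROM m JOIN s ON m.D = s.D
    ORDER BY 1
""").fetchall()

print(plain == cte)  # True
```

Since optimizers merge simple single-use CTEs into the outer query, the CTE version is just a different spelling of the same join, not a pre-filtering step.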

performing table merges outside of oracle in R

If I have a set of tables that I need to extract from an Oracle server, is it always more efficient to join the tables within Oracle and have the system return the joined table, or are there cases where it would be more efficient to return two tables into R (or python) and merge them within R/Python locally?
For this discussion, let's presume that the two servers are equivalent and both have similar access to the storage systems.
I will not go into the efficiencies of joining itself, but any time you are moving data from a database into R, keep the size in mind. If the dataset after joining will be much smaller (say, after an inner join), it is probably best to join in the database. If the data will expand significantly after the join (say, a cross join), then joining after extraction might be better. If there is not much difference, my preference would be to join in the database, as it can be better optimized there. In fact, if the data is already in the database, try to do as much of the data preprocessing as possible before extracting it.

join index or collect stats which is better in Teradata

I am facing an issue with one of my FACT tables.
Through the same job,
I call a procedure to load this FACT table and then a second procedure to collect stats on it.
As part of a new requirement, I need to create a join index which will also include the above-mentioned fact table.
I believe the join index will be maintained whenever there is a change in any of the involved tables. So what will happen in the above scenario? Will my collect-stats procedure wait for the join index maintenance to complete, or will there be contention because collect stats and join index maintenance occur simultaneously?
Regards,
Anoop
The Join Index will automatically be maintained by Teradata when ETL processes add, change, or delete data in the table(s) referenced by the Join Index. The Join Index will have to be removed if you apply DDL changes to table(s) referenced in the Join Index that affect the columns participating in the Join Index or before you can DROP the table(s) referenced in the Join Index.
Statistics collection on either the Join Index or Fact table should be reserved until after the ETL for the Fact table has been completed or during a regular stats maintenance period. Whether you collect stats after each ETL process or only during a regular stats maintenance period is dependent on how much of the data in your Fact table is changing during each ETL cycle. I would hazard a guess that if you are creating a join index to improve performance of querying the fact table you likely do not need to collect stats on the same fact table after each ETL cycle unless this ETL cycle is a monthly or quarterly ETL process. Stats collection on the JI and fact table can be run in parallel. The lock required for COLLECT STATS is no higher than a READ. (It may in fact be an ACCESS lock.)
Depending on your release of Teradata, you may be able to take advantage of the THRESHOLD options to let the optimizer determine whether statistics do in fact need to be collected. I believe this was included in Teradata 14 as a stepping stone toward the automated statistics maintenance introduced in Teradata 14.10.

Is count(*) really expensive?

I have a page where I have 4 tabs displaying 4 different reports based off different tables.
I obtain the row count of each table using a select count(*) from <table> query and display number of rows available in each table on the tabs. As a result, each page postback causes 5 count(*) queries to be executed (4 to get counts and 1 for pagination) and 1 query for getting the report content.
Now my question is: are count(*) queries really expensive -- should I keep the row counts (at least those that are displayed on the tab) in the view state of page instead of querying multiple times?
How expensive are COUNT(*) queries ?
In general, the cost of COUNT(*) is proportional to the number of records satisfying the query conditions, plus the time required to prepare those records (which depends on the underlying query complexity).
In simple cases where you're dealing with a single table, there are often specific optimisations in place to make such an operation cheap. For example, doing COUNT(*) without WHERE conditions from a single MyISAM table in MySQL - this is instantaneous as it is stored in metadata.
For example, let's consider two queries:
SELECT COUNT(*)
FROM largeTableA a
Since every record satisfies the query, the COUNT(*) cost is proportional to the number of records in the table (i.e., proportional to what it returns), assuming the engine needs to visit the rows and there isn't a specific optimisation in place to handle it.
SELECT COUNT(*)
FROM largeTableA a
JOIN largeTableB b
ON a.id = b.id
In this case, the engine will most probably use HASH JOIN and the execution plan will be something like this:
Build a hash table on the smaller of the tables
Scan the larger table, looking up each record in the hash table
Count the matches as it goes.
In this case, the COUNT(*) overhead (step 3) will be negligible and the query time will be completely defined by steps 1 and 2, that is building the hash table and looking it up. For such a query, the time will be O(a + b): it does not really depend on the number of matches.
However, if there are indexes on both a.id and b.id, the MERGE JOIN may be chosen and the COUNT(*) time will be proportional to the number of matches again, since an index seek will be performed after each match.
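The hash-join counting strategy described above can be sketched in plain Python: build a hash table on the smaller input, scan the larger one probing it, and tally matches as you go. Runtime is O(a + b) regardless of how many matches there are.

```python
from collections import Counter

def hash_join_count(small_ids, large_ids):
    # Step 1: build a hash table (a multiset of join keys) on the smaller side
    build = Counter(small_ids)
    # Steps 2-3: scan the larger side, probing the hash table and counting
    matches = 0
    for key in large_ids:
        matches += build[key]
    return matches

a = [1, 2, 3, 4]       # a.id values
b = [2, 2, 3, 5]       # b.id values
print(hash_join_count(a, b))  # 3 -- pairs (2,2), (2,2), (3,3)
```

The counting step adds only a constant amount of work per probed row, which is why the COUNT(*) overhead is negligible next to building and scanning.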
You need to attach SQL Profiler or an app level profiler like L2SProf and look at the real query costs in your context before:
guessing what the problem is and trying to determine the likely benefits of a potential solution
allowing others to guess for you on the interwebs - there's lots of misinformation without citations out there, including in this thread (but not in this post :P)
When you've done that, it'll be clear what the best approach is - i.e., whether the SELECT COUNT is dominating things or not, etc.
And having done that, you'll also know whether any changes you choose to do have had a positive or a negative impact.
As others have said, COUNT(*) always physically counts rows, so if you can do that once and cache the result, that's certainly preferable.
If you benchmark and determine that the cost is negligible, you don't (currently) have a problem.
If it turns out to be too expensive for your scenario, you could make your pagination 'fuzzy', as in "Showing 1 to 500 of approx. 30,000", by using
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('sometable') AND indid < 2
which will return an approximation of the number of rows (it's approximate because it's not updated until a CHECKPOINT).
If the page gets slow, one thing you can look at is minimizing the number of database roundtrips, if at all possible. Even if your COUNT(*) queries are O(1), if you're doing enough of them, that could certainly slow things down.
Instead of setting up and executing 5 separate queries one at a time, run the SELECT statements in a single batch and process the 5 results at once.
I.e., if you're using ADO.NET, do something like this (error checking omitted for brevity; non-looped/non-dynamic for clarity):
string sql = "SELECT COUNT(*) FROM Table1; SELECT COUNT(*) FROM Table2;";
SqlCommand cmd = new SqlCommand(sql, connection);
SqlDataReader dr = cmd.ExecuteReader();
// Defaults to first result set
dr.Read();
int table1Count = (int)dr[0];
// Move to second result set
dr.NextResult();
dr.Read();
int table2Count = (int)dr[0];
If you're using an ORM of some sort, such as NHibernate, there should be a way to enable automatic query batching.
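If your driver doesn't support batched result sets, another way to cut roundtrips is to fold all the counts into one SELECT with scalar subqueries. A sketch with sqlite3 (table names are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (x INT);
    CREATE TABLE Table2 (x INT);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (1);
""")

# One roundtrip, one row, one column per count
table1_count, table2_count = conn.execute("""
    SELECT (SELECT COUNT(*) FROM Table1),
           (SELECT COUNT(*) FROM Table2)
""").fetchone()

print(table1_count, table2_count)  # 3 1
```

This keeps the per-query setup and network latency to a single trip, which is the same goal the batched ADO.NET version achieves.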
COUNT(*) can be particularly expensive, as it may result in loading (and paging) an entire table where you may only need a count on a primary key (in some implementations it is optimised).
From the sound of it, you are causing a table load operation each time, which is slow, but unless it is running noticeably slowly, or causing some sort of problem, don't optimise: premature and unnecessary optimisation can cause a great deal of trouble!
A count on an indexed primary key will be much faster, but with the costs of having an index this may provide no benefit.
All I/O is expensive and if you can accomplish the task without it, you should. But if it's needed, I wouldn't worry about it.
You mention storing the counts in view state, certainly an option, as long as the behavior of the code is acceptable when that count is wrong because the underlying records are gone or have been added to.
This depends on what you are doing with the data in this table. If it changes very often and you need all the counts every time, maybe you could create a trigger that fills another table consisting only of counts for this table. If you need this data separately, maybe you could just execute "select count(*)..." for only one particular table. This just came to mind instantly, but there are other ways to speed this up, I'm sure. Cache the data, maybe? :)
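The trigger idea from the answer above can be sketched like this (shown with sqlite3; trigger syntax in MariaDB or SQL Server differs, and the table names are invented for the example): keep a one-row counts table in sync on insert and delete, so reading the count becomes a cheap key lookup instead of a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (id INTEGER PRIMARY KEY);
    CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, n INT);
    INSERT INTO row_counts VALUES ('reports', 0);

    -- Keep row_counts in sync with changes to reports
    CREATE TRIGGER reports_ins AFTER INSERT ON reports
        BEGIN UPDATE row_counts SET n = n + 1 WHERE tbl = 'reports'; END;
    CREATE TRIGGER reports_del AFTER DELETE ON reports
        BEGIN UPDATE row_counts SET n = n - 1 WHERE tbl = 'reports'; END;
""")

conn.executemany("INSERT INTO reports VALUES (?)", [(1,), (2,), (3,)])
conn.execute("DELETE FROM reports WHERE id = 1")

# The tab can now read the count without scanning the reports table
n = conn.execute(
    "SELECT n FROM row_counts WHERE tbl = 'reports'").fetchone()[0]
print(n)  # 2
```

The trade-off is a small write amplification on every insert/delete in exchange for O(1) reads, which fits the page-tabs scenario where counts are read far more often than rows change.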
