Teradata aggregation computation: locally vs. globally

I have two tables, Table1 and Table2, both with a primary index of col1, col2, col3 and col4.
I join these two tables and do a GROUP BY on a set of columns that includes the PI of the tables.
Can someone tell me why the explain plan says "Aggregate Intermediate Results are computed globally"
rather than locally? My understanding is that when the GROUP BY columns contain all the PI columns,
aggregate results are computed locally rather than globally.
select
A.col1
,A.col2
,A.col3
,A.col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
Below is the Explain plan for the Query
1) First, we lock a distinct DATEBASE_NAME."pseudo table" for read on a
RowHash to prevent global deadlock for DATEBASE_NAME.S.
2) Next, we lock a distinct DATEBASE_NAME."pseudo table" for write on a
RowHash to prevent global deadlock for
DATEBASE_NAME.TARGET_TABLE.
3) We lock a distinct DATEBASE_NAME."pseudo table" for read on a RowHash
to prevent global deadlock for DATEBASE_NAME.E.
4) We lock DATEBASE_NAME.S for read, we lock
DATEBASE_NAME.TARGET_TABLE for write, and we lock
DATEBASE_NAME.E for read.
5) We do an all-AMPs JOIN step from DATEBASE_NAME.S by way of a RowHash
match scan with no residual conditions, which is joined to
DATEBASE_NAME.E by way of a RowHash match scan. DATEBASE_NAME.S and
DATEBASE_NAME.E are left outer joined using a merge join, with
condition(s) used for non-matching on left table ("(NOT
(DATEBASE_NAME.S.col1 IS NULL )) AND ((NOT
(DATEBASE_NAME.S.col2 IS NULL )) AND ((NOT
(DATEBASE_NAME.S.col3 IS NULL )) AND (NOT
(DATEBASE_NAME.S.col4 IS NULL ))))"), with a join condition of (
"(DATEBASE_NAME.S.col1 = DATEBASE_NAME.E.col1) AND
((DATEBASE_NAME.S.col2 = DATEBASE_NAME.E.col2) AND
((DATEBASE_NAME.S.col3 = DATEBASE_NAME.E.col3) AND
(DATEBASE_NAME.S.col4 = DATEBASE_NAME.E.col4 )))"). The input
table DATEBASE_NAME.S will not be cached in memory. The result goes
into Spool 3 (all_amps), which is built locally on the AMPs. The
result spool file will not be cached in memory. The size of Spool
3 is estimated with low confidence to be 675,301,664 rows (
812,387,901,792 bytes). The estimated time for this step is 3
minutes and 37 seconds.
6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 (
DATEBASE_NAME.S.col1 ,DATEBASE_NAME.S.col2
,DATEBASE_NAME.S.col3 ,DATEBASE_NAME.S.col4
,DATEBASE_NAME.E.col5
,DATEBASE_NAME.S.col6 ,DATEBASE_NAME.S.col7
,DATEBASE_NAME.S.col8 ,DATEBASE_NAME.S.col9). Aggregate
Intermediate Results are computed globally, then placed in Spool 4.
The aggregate spool file will not be cached in memory. The size
of Spool 4 is estimated with low confidence to be 506,476,248 rows
(1,787,354,679,192 bytes). The estimated time for this step is 1
hour and 1 minute.
7) We do an all-AMPs MERGE into DATEBASE_NAME.TARGET_TABLE
from Spool 4 (Last Use). The size is estimated with low
confidence to be 506,476,248 rows. The estimated time for this
step is 33 hours and 12 minutes.
8) We spoil the parser's dictionary cache for the table.
9) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.

If you just use col1, col2, col3, col4 to aggregate, then it would be aggregated locally.
More details at this URL:
http://www.teradataforum.com/teradata/20040526_133730.htm
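For illustration, a sketch of that variant (grouping only on the PI columns); per the answer above, the EXPLAIN for this form should report the aggregate intermediate results as computed locally:
select
A.col1
,A.col2
,A.col3
,A.col4
,SUM(col10)
,COUNT(col11)
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
group by A.col1,A.col2,A.col3,A.col4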

I believe that is because of the intermediate spool: the GROUP BY uses columns from that spool, not from the original tables. I was able to get the aggregate intermediate results computed locally by using a volatile table.
Essentially, I took the spool from step 5, gave it a name, and enforced a PI on it. Since the PI of the volatile table is the same as that of the initial tables, generating the volatile table is also a local AMP operation.
CREATE VOLATILE TABLE x AS
(
SELECT
A.col1
,A.col2
,A.col3
,A.col4
,col5
,col6
,col7
,col8
,col9
--,SUM(col10)
--,COUNT(col11)
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
--group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
)
WITH DATA PRIMARY INDEX (col1, col2, col3, col4)
ON COMMIT PRESERVE ROWS
;
SELECT
col1
,col2
,col3
,col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
from
x
GROUP BY
col1,col2,col3,col4,col5,col6,col7,col8,col9
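As a quick check (my sketch, not part of the original answer), running EXPLAIN on the aggregation over the volatile table should now report "Aggregate Intermediate Results are computed locally":
EXPLAIN
SELECT
col1,col2,col3,col4,col5,col6,col7,col8,col9
,SUM(col10)
,COUNT(col11)
from x
GROUP BY col1,col2,col3,col4,col5,col6,col7,col8,col9;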

Can I speed up calculations between R and Sqlite by using data.tables?

I have a sqlite database of about 1.4 million rows and 16 columns.
I have to run an operation on 80,000 IDs:
Get all rows associated with that ID
Convert to R Date objects and sort by date
Calculate the difference between the two most recent dates
For each ID I have been querying SQLite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire SQLite table into a data.table?
It heavily depends on how you are working on that problem.
Normally, loading the whole query result into memory and then doing the operations will be faster, from what I have experienced and seen in benchmark graphics (I cannot show you a benchmark right now). It also makes sense logically, because you have to repeat several operations multiple times on multiple data.frames. As you can see, fetching 80k rows in one go is pretty fast, faster than three separate fetches of roughly 26,000 rows each.
However, you could have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, each on a separate core.
Here you can find information on how to do this:
http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
If you're doing all that in R and fetching rows from the database 80,000 times in a loop... you'll probably have better results doing it all in one go in sqlite instead.
Given a skeleton table like:
CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
(2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
a query like:
SELECT id
, julianday(first_ts)
- julianday((SELECT max(d2.timestamp)
FROM data AS d2
WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) as first_ts FROM data GROUP BY id) AS d
ORDER BY id;
will give you
id days_difference
---------- ---------------
1 6.0
2 1.0
An alternative for modern versions of sqlite (3.25 or newer) (EDIT: On a test database with 16 million rows and 80000 distinct ids, it runs considerably slower than the above one, so you don't want to actually use it):
WITH cte AS
(SELECT id, timestamp
, lead(timestamp, 1) OVER id_by_ts AS next_ts
, row_number() OVER id_by_ts AS rn
FROM data
WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;
(The index is essential for performance for both versions. Probably want to run ANALYZE on the table at some point after it's populated and your index(es) are created, too.)
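A sketch of that follow-up, assuming the skeleton schema above: ANALYZE gathers the statistics, and EXPLAIN QUERY PLAN lets you confirm the index is actually being used.
ANALYZE;
EXPLAIN QUERY PLAN
SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id;
-- Expect the plan to mention the data_idx_id_time index (e.g. a covering index scan) rather than a plain full-table scan.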

how to query by row number in oracle

Is there any way to query directly by row number in a table in Oracle? In other words, to achieve the same effect as an ordinary array lookup in a language like C or Java. I've not yet tried virtual columns.
For instance, the following is an example of an efficient query, but it wastes disk space:
create table ary (row_position_id number(10) NOT NULL,
datum binary_float NOT NULL);
declare i pls_integer;
begin
for i in 0..10000000
loop
insert into ary values (i, dbms_random.normal());
end loop;
commit;
end;
create unique index ary_rp on ary(row_position_id);
Now, I'm going to create a set of query values to store in another "parameter" table:
create table query_values (qval number(10) NOT NULL);
declare i pls_integer;
begin
for i in 0..10000
loop
insert into query_values values (abs(mod(dbms_random.random, 10000000)));
end loop;
commit;
end;
Now, having these query values, I'm going to query the original table:
select d.* from ary d where exists (select 0 from query_values v
where d.row_position_id = v.qval);
Now, this query would be fine -- it would use INDEX UNIQUE SCAN and TABLE access by ROWID. The problem I have is that the row_position_id takes up as much space in the table blocks as the actual data (the DATUM column).
I am aware of Index-organized tables and also Virtual Columns (which cannot be used with IOTs). And, of course, things like ROWNUM and ROW_NUMBER are irrelevant here (unless I'm misunderstanding something).
Also worth pointing out, this table is static data -- once loaded, it will never change. I would likely do an ALTER TABLE ARY READ ONLY;
What I would really like is:
create table ary (datum binary_float not null);
-- load rows in a specific order
-- efficiently query this table by implicit row position
Thanks very much!
Henry
I think you're going to want to keep the extra column. Here's why:
As you said, ROWNUM and ROW_NUMBER are not applicable here because they are generated as rows are returned in the query; they will not tell you anything about insert order.
What about ROWID? ROWID is just where a row is stored - again, from the docs:
The data object number of the object
The data block in the data file in which the row resides
The position of the row in the data block (first row is 0)
The data file in which the row resides (first file is 1). The file number is relative to the tablespace.
The "position in the data block" sounds interesting, but you would have no idea what order of the data blocks that were inserted (Oracle could use whatever datablocks it can quickly make use of) so this would not be a reliable option, and even so, you'd be having to parse ROWIDs which are not human readable (e.g. in 12g they look like this: *BAGAASMCwQL+ )
Another option is ORA_ROWSCN which is interesting in that it does give you some idea of order, in terms of the system change number. However, it doesn't come for free. Just to start, you have to create your table with the ROWDEPENDENCIES option and as per docs:
ROWDEPENDENCIES Specify ROWDEPENDENCIES if you want to enable
row-level dependency tracking. This setting is useful primarily to
allow for parallel propagation in replication environments. It
increases the size of each row by 6 bytes.
The other catch is that you would have to follow each inserted row with a commit so that each row gets a different SCN.
If you're willing to go this far, you'll still have to convert the rows to have indexes (starting with, say, 0 or 1) that you can use to join to other tables.
Here's a quick sample of what it would involve:
DROP TABLE temp;
CREATE TABLE temp
( a number(10)
, b varchar2(10)
)
ROWDEPENDENCIES
;
-- one commit after all rows
INSERT INTO temp VALUES (1, 'A');
INSERT INTO temp VALUES (2, 'B');
INSERT INTO temp VALUES (3, 'C');
INSERT INTO temp VALUES (4, 'D');
INSERT INTO temp VALUES (5, 'E');
INSERT INTO temp VALUES (6, 'F');
COMMIT;
SELECT X.*, ROWNUM
FROM (SELECT T.*
, ORA_ROWSCN
FROM TEMP T
ORDER BY ORA_ROWSCN
) x
;
A B ORA_ROWSCN ROWNUM
1 A 2272340 1
2 B 2272340 2
6 F 2272340 3
4 D 2272340 4
5 E 2272340 5
3 C 2272340 6
Whoops. Those rows are definitely not in the order they came in.
Now using one commit per row:
TRUNCATE TABLE temp;
INSERT INTO temp VALUES (1, 'A');
COMMIT;
INSERT INTO TEMP VALUES (2, 'B');
COMMIT;
INSERT INTO temp VALUES (3, 'C');
COMMIT;
INSERT INTO temp VALUES (4, 'D');
COMMIT;
INSERT INTO temp VALUES (5, 'E');
COMMIT;
INSERT INTO temp VALUES (6, 'F');
COMMIT;
SELECT X.*, ROWNUM
FROM (SELECT T.*
, ORA_ROWSCN
FROM TEMP T
ORDER BY ORA_ROWSCN
) x
;
A B ORA_ROWSCN ROWNUM
1 A 2272697 1
2 B 2272699 2
3 C 2272701 3
4 D 2272703 4
5 E 2272705 5
6 F 2272707 6
Better. But if you've got a significant number of rows it's not going to go in fast. (I think this is what you would do if you intentionally wanted to slow down your inserts. ;) )
I think that's about as good as you'll get trying to get around using your own column, BUT there is still hope to economize storage: you can do away with the table + index and just go with an index-organized table. It's basically an index that you query directly.
It's just this easy:
CREATE TABLE TEMP2
( A NUMBER(10)
, B VARCHAR2(10)
, CONSTRAINT PK_CONSTRAINT PRIMARY KEY (A)
)
ORGANIZATION INDEX
;
There are other parameters you'll want to consider for this as well, but for more info check out... the docs.
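To tie this back to the original example, a hedged sketch reusing the asker's names (ary_iot is a name I made up): the ary table could itself be created as an IOT, so row_position_id lives only in the index structure and the original lookup query stays an efficient primary-key access, with no separate table plus index:
create table ary_iot (
row_position_id number(10) not null,
datum binary_float not null,
constraint ary_iot_pk primary key (row_position_id)
)
organization index;

select d.* from ary_iot d where exists (select 0 from query_values v
where d.row_position_id = v.qval);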

improve complex query in Teradata

I would like to know how to get all the rows from table1 that have a matching row in table3.
The structure of the tables is:
table1:
k1 k2
table2:
k1 k2 t1 t2 date type
table3:
t1 t2 date status
The conditions are:
k1 and k2 have to match with the corresponding columns in table2.
In table2 I will only check those rows where date='today' and type='a'.
That can return 0, 1 or many rows in table2.
Looking at t1 and t2 from table 2, I get the rows that match in table3.
If in table3 date='today' and status='ok', I will return the original row from table1, that is, k1 and k2.
How can I do this query (inner joins, exists, whatever), taking into account that the three tables have millions of rows, so it must be as optimal as possible?
I have a query, which is correct for sure, but there are too many conditions for Teradata to come up with the answer quickly. Too many joins, I think.
I would not consider three tables and a few million rows a complex query.
In Teradata you usually don't have to think that much about JOIN/IN/EXISTS; all will be rewritten to joins internally. But there is a one-to-many-to-one relation here, so you should avoid a plain join, as it would need a final DISTINCT.
Better use IN or EXISTS instead:
SELECT
K1,K2
FROM Table1
WHERE (K1,K2) IN
(
SELECT K1,K2
FROM Table2
WHERE datecol = CURRENT_DATE
AND typecol = 'a'
AND (T1,T2) IN
(
SELECT T1,T2
FROM Table3
WHERE datecol = CURRENT_DATE
AND status = 'ok'
)
)
Regarding the actual plan: if the necessary statistics exist, the optimizer should choose a good plan; check the confidence levels in EXPLAIN. You can also run DIAGNOSTIC HELPSTATS ON FOR SESSION; before running EXPLAIN to see if there are missing stats (see the sketch below).
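A sketch of that check, using the same query as above (the DIAGNOSTIC and EXPLAIN statements are standard Teradata; adjust to your session):
DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT K1,K2
FROM Table1
WHERE (K1,K2) IN
(
SELECT K1,K2
FROM Table2
WHERE datecol = CURRENT_DATE
AND typecol = 'a'
AND (T1,T2) IN
(
SELECT T1,T2
FROM Table3
WHERE datecol = CURRENT_DATE
AND status = 'ok'
)
);
-- The tail of the EXPLAIN output will then list the statistics the optimizer recommends collecting.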
Something like the following should work.
SELECT
Table1.*
FROM
Table1
INNER JOIN Table2 ON
Table1.K1 = Table2.K1 AND
Table1.K2 = Table2.K2 AND
Table2.date = CURRENT_DATE and
Table2.type = 'a'
INNER JOIN Table3 ON
Table2.T1 = Table3.T1 AND
Table2.T2 = Table3.T2 AND
Table3.date = CURRENT_DATE and
Table3.status = 'ok'
Update:
Speaking more to the optimization part of the question, the execution steps that Teradata will most likely take here are:
In parallel, it will select all records from Table1, records from Table2 where the date is CURRENT_DATE and the type is 'a', and records from Table3 where the date is CURRENT_DATE and the status is 'ok'.
It will then join the results from the SELECT of Table2 to the results of the SELECT from table1.
It will then join the results from that to the results from the SELECT of table3.
You can get more information by putting EXPLAIN before your SELECT query. The result returned from the database will be an explanation of how your Teradata server will execute the query, which can be very enlightening when trying to optimize a big, slow query.
Unfortunately the steps above are the best you can hope for. Parallel execution of all three tables with the filters applied, and then a join of the results. With big data, the slowest part of a query is often the join, so filtering before you get to that step is a big plus.
There's more that can be done to optimize, like making sure your indexes are in order and collecting statistics, especially on fields you will be filtering on (a sketch follows). But without the admin access to do that, your hands are tied.
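For reference, a hedged sketch of what that statistics collection could look like (column names taken from the question; syntax per Teradata 14+, adjust for your version and privileges):
COLLECT STATISTICS COLUMN (datecol), COLUMN (typecol) ON Table2;
COLLECT STATISTICS COLUMN (datecol), COLUMN (status) ON Table3;
COLLECT STATISTICS COLUMN (K1, K2) ON Table1;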

Optimization of Oracle Query

I have the following query:
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1, A2,A3,A4,A5,A6,COUNT(A7) AS total,A8
FROM (
select a.* from table1 a
left join (select * from table_reject where name = 'smith') b on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
GROUP BY
CUBE(A1, A2,A3,A4,A5,A6,A8)
)INN
WHERE
INN.A1 IS NOT NULL AND
INN.A2 IS NULL AND
INN.A3 IS NULL AND
INN.A4 IS NULL AND
INN.A5 IS NULL AND
INN.A6 is NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
Total number records in table1 is around 8 million.
My problem is I need to optimize the above query in the best possible way. I did try to create an index on column A8 of table1; creating the index helped decrease the cost of the query, but the execution time is more or less the same as when there was no index on table1.
Any help would be appreciated.
Thanks
A CUBE operation on a large data set is really expensive, so you need to check whether you really need all that data in the inner query. I see you are doing COUNT in the inner query and then, in the outer query, you have a SUM of counts. In other words: get the row count of A7 for every combination of A1-A8 (excluding A7), then SUM only the selected combinations filtered by the WHERE clause. We can surely optimize this by limiting the CUBE to certain columns, but the very obvious things I have noticed so far are as follows.
If you use the query below and have the right indexes on Table1 and Table_reject, then both queries can utilize the indexes and reduce the data set that needs to be joined and further processed.
I am not 100% sure, but partial CUBE processing is possible and needs to be checked.
Clustered index --> Table1 needs one on A8, and Table_Reject needs a clustered index on NAME.
Non-clustered index --> Table1 needs one on A3, A9 and Table_reject needs one on B3, B2.
SELECT qry1.*
FROM
(
SELECT A1, A2,A3,A4,A5,A6,A7,A8,A9
FROM table1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
)qry1
LEFT JOIN
(
select B3,B2,ID
from table_reject
where name = 'smith'
)qry2
ON qry1.A3 = qry2.B3 and qry1.A9=qry2.B2
WHERE qry2.ID IS NULL
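For reference, a sketch of the suggested indexes in Oracle terms (the index names are made up; Oracle has no "clustered index" as such, so ordinary B-tree indexes, or an IOT, are the closest equivalents):
CREATE INDEX table1_a8_ix ON table1 (A8);
CREATE INDEX table1_a3_a9_ix ON table1 (A3, A9);
CREATE INDEX table_reject_ix ON table_reject (name, B3, B2);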
EDIT1:
I tried to find out what the difference in the CUBE operator's result would be if you run it on all columns versus only the columns you need in the result set. What I found is that, the way CUBE works, you do not need to perform CUBE on all columns, because in the end you only care about the combinations generated by CUBE where A1 and A8 are NOT NULL.
Try this link and see the output.
Query 1 and Query 2 are just the innermost queries, to compare the CUBE result sets.
Query 3 and Query 4 are the same query that you are trying, and you can see the results are the same in both cases.
DECLARE #NEXT_DAY DATE = NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1,COUNT(A7) AS total,A8
FROM (
select a.a1,a.a7,a.a8
from table1 a
left join (select * from table_reject where name = 'smith') b
on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= #NEXT_DAY
GROUP BY
CUBE(A1,A8)
)INN
WHERE INN.A1 IS NOT NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
EDIT3
As I mentioned in the comment, this is a Round 3 update; I cannot change the comment, but I meant Edit 3 instead of Round 3.
The new change in your query is adding the WHERE A8 >= #NEXT_DAY condition in the innermost left-join select (where A8 >= #NEXT_DAY AND B.ID is null) as well. That has improved the selection very much.
In your last comment you mentioned that the query is taking 30-35 seconds and that, as you change the value of A8, it keeps increasing. But along with the execution time you didn't mention how much data is in the result set. Why is that important? Because if a query returns 5M rows as its final result set, it is going to spend 90% of the time just dropping that data onto the UI, an output file, or whatever output method you are using. Actual performance should be measured by how soon the query starts returning the first couple of rows, because by that time the optimizer has already decided on the execution plan and the DB is executing it. That said, I agree that if a query returns 100 rows and takes 10 seconds, then something may be wrong with the execution plan.
To demo that, I created dummy data and ran your query against it.
I have a table Test_CubeData with 9M rows in it, with the same number of columns and data types you described for your Table1. I have a second table Table_Reject with 80K rows, with the number of columns and their data types figured out from the query. To test the extreme case for this table, the name column has only one value, "smith", and ID is null for all 80K rows, so the only column values that can affect the inner left-join result are B2 and B3.
In these tests I do not have any index on either table; both are heaps. You can see the results still come back in a few seconds with an acceptable amount of data in the result set. As the result data set grows, the completion time increases. If I create the indexes described above, it will give me an Index Seek operation for all these tested cases, but at a certain point that index will also be exhausted and become an Index Scan.
One sure example would be if my filter value for the A8 column were the smallest date value that exists in that column. In that case the optimizer will see that all 9M rows need to participate in the inner select and the CUBE, and a lot of data will get processed in memory, which is expected. On the other hand, let's look at another example. I have 32,873 unique values in the A8 column, almost equally distributed among the 9M rows, so there are 260 to 300 rows per single A8 value. Now if I execute the query for any single value (smallest, largest, or anything in between), the query execution time should not change.
Notice the highlighted text in each image below, which indicates which A8 filter value was chosen, that only the important columns are in the select list instead of using *, the A8 filter added in the inner left-join query, the execution plan showing the TableScan operation on both tables, the query execution time in seconds, and the total number of rows returned by the query.
I hope this clears up some doubts about the performance of your query and helps you set the right expectations.
(Screenshots referenced: Table Row Counts, TableScan_InnerLeftJoin, TableScan_FullQuery_248Rows, TableScan_FullQuery_5K, TableScan_FullQuery_56K, TableScan_FullQuery_480k.)
You're calculating a potentially very large CUBE result on seven columns, and then discarding all the results except those that are logically just a GROUP BY on column A1.
I suggest that you rewrite the query to just group by A1.
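A hedged sketch of what that rewrite could look like (it keeps the anti-join and date filter from the original query, and assumes the A2-A6 NULL tests in the outer WHERE were only meant to pick out the per-(A1, A8) CUBE groupings rather than to match real NULLs in the data):
SELECT a.A1, COUNT(a.A7) AS sum_total
FROM table1 a
LEFT JOIN (SELECT B2, B3, ID
FROM table_reject
WHERE name = 'smith') b
ON a.A3 = b.B3 AND a.A9 = b.B2
WHERE b.ID IS NULL
AND a.A1 IS NOT NULL
AND a.A8 >= NEXT_DAY(trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')),'SUN')
GROUP BY a.A1
ORDER BY sum_total DESC;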

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-SQL COUNT(DISTINCT ...) operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1).
This is going to be repeated over every column in every table in the DB. That includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding indexes for every column is not a good option.
This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing Results (in seconds):
Method 20_Columns 5_Columns 1_Column (Large) 1_Column (Small)
1) Count/GroupBy 10.8 4.8 2.8 0.14
2) CountDistinct 12.4 4.8 3 0.7
3) dense_rank 226 30 6 4.33
4) Count+Max 98.5 44 16 12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference between them) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps the gains that would be achieved by limiting the number of table scans are small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
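A quick worked example (the table and column names here are hypothetical): if AccountType holds the values 1, 2 and NULL, the inner GROUP BY returns three rows (1, 2, NULL) and COUNT(*) reports 3, so the NULL is counted as one distinct value.
SELECT COUNT(*) AS distinct_incl_null
FROM (SELECT AccountType
FROM dbo.Customers
GROUP BY AccountType) AS s;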
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on the OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
distinct coalesce(
your_table.column_1, your_table.column_2
-- cast them if the columns are not the same type
)
) as COUNT_TEST
The coalesce function helps you combine two columns, replacing null values.
I used this in my case and it gave the correct result.
Not sure this would be the fastest, but it might be worth testing. Use CASE to give NULL a value. Clearly you would need to select a value for NULL that would not occur in the real data. According to the query plan this would be a dead heat with the COUNT(*) (GROUP BY) solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach can also count more than one column
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]
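Since the question mentions repeating this over every column of every table (driven from ASP.NET), here is a hedged sketch of generating the per-column statements from INFORMATION_SCHEMA, using the COUNT(DISTINCT) + MAX(CASE ...) variant above; the schema filter is an assumption:
SELECT 'SELECT ''' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME + ''' AS col_name, '
+ 'COUNT(DISTINCT ' + QUOTENAME(COLUMN_NAME) + ') + '
+ 'MAX(CASE WHEN ' + QUOTENAME(COLUMN_NAME) + ' IS NULL THEN 1 ELSE 0 END) AS distinct_incl_null '
+ 'FROM ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) + ';' AS query_text
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo';  -- assumption: restrict to one schema
The generated statements can then be executed one at a time from the application code.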
