Calculating median with a GROUP BY in MariaDB - mariadb

I'm running MariaDB 10.3 which supports the MEDIAN function, and in simple cases this function seems to work well. However I'm having trouble making it work in complex situations. Currently I'm trying to calculate the number, mean and median value for each of several types of item. Here's a MWE:
CREATE TABLE items (type INTEGER, value INTEGER);
INSERT INTO items (type, value) VALUES (1,1), (1,3), (1,4), (2,10), (2,32), (2,42);
SELECT type, COUNT(*) AS count, AVG(value) AS mean,
MEDIAN(value) OVER(PARTITION BY type) AS median FROM items GROUP BY type;
DROP TABLE items;
This calculates the count and mean correctly, but not the median:
type
count
mean
median
1
3
2.6667
1.0000000000
2
3
28.0000
10.0000000000
I expect it's some silly error in my SQL, but I can't for the life of me see it, and no errors or warnings are reported. I've tried dropping the PARTITION BY type which changes the medians returned but they're still not correct. Removing the OVER clause results in a syntax error.
I don't understand why this isn't working. If I just do the following, the median is calculated correctly, though with lots of duplicate lines, but adding GROUP BY type gives the erroneous values above.
SELECT type, MEDIAN(value) OVER(PARTITION BY type) AS median FROM items;

This query:
SELECT type,
COUNT(*) AS count,
AVG(value) AS mean
FROM items
GROUP BY type;
is an aggregation query and if you add to it the window function MEDIAN() it will be calculated on the results that it returns which are:
type
count
mean
1
3
2.6667
2
3
28.0000
with an arbitrary value for value (which seems to be the first of each partition).
What you should do is get the median in a subquery and then aggregate:
SELECT type,
COUNT(*) AS count,
AVG(value) AS mean,
median
FROM (
SELECT *, MEDIAN(value) OVER (PARTITION BY type) AS median
FROM items
) t
GROUP BY type, median;
Or, use only window functions:
SELECT DISTINCT type,
COUNT(*) OVER (PARTITION BY type) AS count,
AVG(value) OVER (PARTITION BY type) AS mean,
MEDIAN(value) OVER (PARTITION BY type) AS median
FROM items;
See the demo.

Related

SQLite: Using COALESCE inside a CASE statement

I have two tables: one with record of a person with initial number, and a second one with records of changes to this number.
During a join, I do coalesce(latest_of_series, initial) to get a singular number per person. So good so far.
I also group this numbers into a groups, on order these groups separately. I know I can do:
select
coalesce(latest, initial) as final,
case
when coalesce(latest, inital) > 1 and coalesce(latest, inital) < 100 then 'group 1'
-- other cases
end as group
-- rest of the query
but that's of course horribly unreadable.
I tried:
select
coalesce(latest_of_series, initial_if_no_series) as value,
case
when value > 1 and value < 100 then 'group 1'
-- rest of the cases
end as group
-- rest of the query
but then the sqlite complains that there's no column "value"
Is there really no way of using previous result of coalesce as a "variable"?
That's not an SQLite limitation. That's an SQL limitation.
All the column names are decided as one. You can't define a column in line 2 of your query and then refer to it in line 3 of your query. All columns derive from the tables you select, each on their own, they can't "see" each other.
But you can use nested queries.
select
value,
case
when value >= 1 and value < 100 then 'group 1'
when value >= 100 and value < 200 then 'group 2'
else 'group 3'
end value_group
from
(
select
coalesce(latest_of_series, initial_if_no_series) as value
from
my_table
group by
user_id
) v
This way, the columns of the inner query can be decided as one, and the columns of the outer query can be decided as one. It might ever be faster, depending on the circumnstances.

Are there in guarantees about the non-aggregated columns in a GROUP BY query?

Let's assume we have the following table and query in SQLite:
id
val
parent
letter
1
3
10
a
2
3
10
b
3
0
10
c
4
5
20
d
SELECT id, MAX(val), parent, letter FROM table GROUP BY parent
Are there any guarantees about the value of id? In MySQL there is even a mode which forbids selecting non-aggregated values. If there is no such guarantee, is it possible to somehow get a single row per parent?
id
MAX(val)
parent
letter
1†
3
10
a
4
5
20
d
† or 2 (does not matter as long as letter is from the same row)
This behavior is covered in SELECT/Simple Select Processing/Side note: Bare columns in an aggregate queries.
In your query the columns id and letter, which are not aggregated and are not included in the GROUP BY clause, are called bare columns.
Because you use the MAX() aggregate function, the values of these 2 columns:
... take values from the input row which also contains the minimum or
maximum
But, since there may exist more than 1 rows with the maximum val for the same parent:
There is still an ambiguity if two or more of the input rows have the
same minimum or maximum value
This means that for your sample data there is no guarantee that for parent = 10 you will get the row with id = 1 in the results.
You may get the row with id = 2 which also contains the maximum val.
Assuming that in such a case, where for the same parent there may exist more than 1 rows with the maximum val, you want the row with the minimum id, you can do it with window functions:
SELECT id, val, parent, letter
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY parent ORDER BY val DESC, id) rn
FROM tablename
)
WHERE rn = 1
or:
SELECT DISTINCT
FIRST_VALUE(id) OVER (PARTITION BY parent ORDER BY val DESC, id) id,
MAX(val) OVER (PARTITION BY parent) val,
parent,
FIRST_VALUE(letter) OVER (PARTITION BY parent ORDER BY val DESC, id) letter
FROM tablename
See the demo.
From the Official doc
Worth a careful read. The bold is what you are looking for.
Aggregate Queries Can Contain Non-Aggregate Result Columns That Are
Not In The GROUP BY Clause
In most SQL implementations, output columns of an aggregate query may
only reference aggregate functions or columns named in the GROUP BY
clause. It does not make good sense to reference an ordinary column in
an aggregate query because each output row might be composed from two
or more rows in the input table(s).
SQLite does not enforce this restriction. The output columns from an
aggregate query can be arbitrary expressions that include columns not
found in GROUP BY clause. This feature has two uses:
With SQLite (but not any other SQL implementation that we know of) if an aggregate query contains a single min() or max() function,
then the values of columns used in the output are taken from the row where the min() or max() value was achieved. If two or more rows have
the same min() or max() value, then the columns values will be chosen
arbitrarily from one of those rows.
...
Now, with your query
SELECT id, MAX(val), parent, letter FROM table GROUP BY parent
You basically have id and letter that are not aggregated and not in the GROUP BY.
Which means that the values returned for them are those from the row(s) matching the MAX(val).
And in your case, there are 2 of them
id val parent letter
1 3 10 a
2 3 10 b
Which means that sqlite will arbitrarily (randomly) return the values of either row 1 or 2. So no, you have no guarantees.

Count(case when) redshift sql - receiving groupby error

I'm trying to do a count(case when) in Amazon Redshift.
Using this reference, I wrote:
select
sfdc_account_key,
record_type_name,
vplus_stage,
vplus_stage_entered_date,
site_delivered_date,
case when vplus_stage = 'Lost' then -1 else 0 end as stage_lost_yn,
case when vplus_stage = 'Lost' then 2000 else 0 end as stage_lost_revenue,
case when vplus_stage = 'Lost' then datediff(month,vplus_stage_entered_date,CURRENT_DATE) else 0 end as stage_lost_months_since,
count(case when vplus_stage = 'Lost' then 1 else 0 end) as stage_lost_count
from shared.vplus_enrollment_dim
where record_type_name = 'APM Website';
But I'm getting this error:
[42803][500310] [Amazon](500310) Invalid operation: column "vplus_enrollment_dim.sfdc_account_key" must appear in the GROUP BY clause or be used in an aggregate function; java.lang.RuntimeException: com.amazon.support.exceptions.ErrorException: [Amazon](500310) Invalid operation: column "vplus_enrollment_dim.sfdc_account_key" must appear in the GROUP BY clause or be used in an aggregate function;
Query was running fine before I added the count. I'm not sure what I'm doing wrong here -- thanks!
You can not have an aggregate function (sum, count etc) without group by
The syntax is like this
select a, count(*)
from table
group by a (or group by 1 in Redshift)
In your query you need to add
group by 1,2,3,4,5,6,7,8
because you have 8 columns other than count
Since I don't know your data and use case I can not tell you it will give you the right result, but SQL will be syntactically correct.
The basic rule is:
If you are using an aggregate function (eg COUNT(...)), then you must supply a GROUP BY clause to define the grouping
Exception: If all columns are aggregates (eg SELECT COUNT(*), AVG(sales) FROM table)
Any columns that are not aggregate functions must appear in the GROUP BY (eg SELECT year, month, AVG(sales) FROM table GROUP BY year, month)
Your query has a COUNT() aggregate function mixed-in with non-aggregate values, which is giving rise to the error.
In looking at your query, you probably don't want to group on all of the columns (eg stage_lost_revenue and stage_lost_months_since don't look like likely grouping columns). You might want to mock-up a query result to figure out what you actually want from such a query.

How to get the total quantity of results using count(*)?

i need to get the total quantity of results for each person but i get ...
resultado
MY QUERY..
select t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion as departamento_trabaja, t.fecha,count(*)
from fulltime.timbre t, fulltime.empleado e, fulltime.departamento d
where d.depa_id=e.depa_id and t.codigo_empleado=e.codigo_empleado and
trunc(t.fecha) between trunc(to_date('15/02/2017','dd/mm/yyyy')) and trunc(to_date('14/03/2017','dd/mm/yyyy'))
group by t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion, t.fecha
Expected data...
NOMBRE | APELLIDO | DEPARTAMENTO_TRABAJA | VECES_MARCADAS(count)
MARIA TARCILA IGLESIAS BECERRA ALCALDIA 4
KATHERINE TATIANA SEGOVIA FERNANDEZ ALCALDIA 10
FREDDY AGUSTIN VALDIVIESO VALLEJO ALCALDIA 3
UPDATE..
select e.nombre,e.apellido,d.descripcion as departamento_trabaja,COUNT(*)
from fulltime.timbre t, fulltime.empleado e, fulltime.departamento d
where d.depa_id=e.depa_id and t.codigo_empleado=e.codigo_empleado and
trunc(t.fecha) between trunc(to_date('15/02/2017','dd/mm/yyyy')) and trunc(to_date('14/03/2017','dd/mm/yyyy'))
group by t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion, t.fecha
You should only select and group by the non-aggregate columns you actually want to count against. At the moment you're including the fecha_hora_timbre and fechacolumns in each row, so you're counting the unique combinations of those columns as well as the name/department information you actually want to count.
select e.nombre, e.apellido, d.descripcion as departamento_trabaja,
count(*) a veces_marcadas
from fulltime.timbre t
join fulltime.empleado e on t.codigo_empleado=e.codigo_empleado
join fulltime.departamento d on d.depa_id=e.depa_id
where t.fecha >= to_date('15/02/2017','dd/mm/yyyy')
and t.fecha < to_date('15/03/2017','dd/mm/yyyy')
group by e.nombre, e.apellido, d.descripcion
I've removed the extra columns. Notice that they have gone from both the select list and the group-by clause. If you have a non-aggregate column in the select list that isn't in the group-by you'll get an ORA-00937 error; but if you have a column in the group-by that isn't in the select list then it will still group by that even though you can't see it and you just won't get the results you expect.
I've also changed from old-style join syntax to modern syntax. And I've changed the date comparison; firstly because doing trunc() as part of trunc(to_date('15/02/2017','dd/mm/yyyy')) is pointless - you already know the time part is midnight, so the trunc doesn't achieve anything. But mostly so that if there is an index on fecha that index can be used. If you do trunc(f.techa) then the value of every column value has to be truncated, which stops the index being used (unless you have a function-based index). As between in inclusive, using >= and < with one day later on the higher limit should have the same effect overall.

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-Sql Count Distinct operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1.
This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.
This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing Results (in seconds):
Method 20_Columns 5_Columns 1_Column (Large) 1_Column (Small)
1) Count/GroupBy 10.8 4.8 2.8 0.14
2) CountDistinct 12.4 4.8 3 0.7
3) dense_rank 226 30 6 4.33
4) Count+Max 98.5 44 16 12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference in between then) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps because the gains that would be achieved by limiting the number of table scans is small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
distinct coalesce(
your_table.column_1, your_table.column_2
-- cast them if you want replace value from column are not same type
)
) as COUNT_TEST
Function coalesce help you combine two columns with replace not null values.
I used this in mine case and success with correctly result.
Not sure this would be the fastest but might be worth testing. Use case to give null a value. Clearly you would need to select a value for null that would not occur in the real data. According to the query plan this would be a dead heat with the count(*) (group by) solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach can also count more than one column
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]

Resources