Using MEDIAN with GROUP BY - mariadb

As of MariaDB 10.3.3 there exist MEDIAN function.
Unfortunately there is a little problem when I try to use it with GROUP BY statement (currently using v10.3.9).
Given following table:
CREATE TABLE testmed
(
id INT NOT NULL auto_increment,
PRIMARY KEY(id),
group_id INT NOT NULL DEFAULT 0,
score INT NOT NULL DEFAULT 0
);
Filling it up with some data:
INSERT INTO testmed (group_id, score)
VALUES (1,1), (1,2), (1,2), (1,2), (1,3), (2,5), (2,7), (2,9), (2,11), (2,11);
Now I am getting different results with and without GROUP BY in query:
MariaDB [test]> SELECT group_id, score, MEDIAN(score) OVER (PARTITION BY group_id) FROM testmed;
+----------+-------+--------------------------------------------+
| group_id | score | MEDIAN(score) OVER (PARTITION BY group_id) |
+----------+-------+--------------------------------------------+
| 1 | 1 | 2.0000000000 |
| 1 | 2 | 2.0000000000 |
| 1 | 2 | 2.0000000000 |
| 1 | 2 | 2.0000000000 |
| 1 | 3 | 2.0000000000 |
| 2 | 5 | 9.0000000000 |
| 2 | 7 | 9.0000000000 |
| 2 | 9 | 9.0000000000 |
| 2 | 11 | 9.0000000000 |
| 2 | 11 | 9.0000000000 |
+----------+-------+--------------------------------------------+
10 rows in set (0.000 sec)
MariaDB [test]> SELECT group_id, score, MEDIAN(score) OVER (PARTITION BY group_id) FROM testmed GROUP BY group_id;
+----------+-------+--------------------------------------------+
| group_id | score | MEDIAN(score) OVER (PARTITION BY group_id) |
+----------+-------+--------------------------------------------+
| 1 | 1 | 1.0000000000 |
| 2 | 5 | 5.0000000000 |
+----------+-------+--------------------------------------------+
First one is correct, but why it is not working properly with GROUP BY.
Currently I am using query nesting like that:
MariaDB [test]> SELECT * FROM (SELECT group_id, score, MEDIAN(score) OVER (PARTITION BY group_id) FROM testmed) t GROUP BY group_id;
+----------+-------+--------------------------------------------+
| group_id | score | MEDIAN(score) OVER (PARTITION BY group_id) |
+----------+-------+--------------------------------------------+
| 1 | 1 | 2.0000000000 |
| 2 | 5 | 9.0000000000 |
+----------+-------+--------------------------------------------+
2 rows in set (0.000 sec)
but it feels so wrong doing it that way.
What is right way to do it?

Your second query is technically invalid:
SELECT
group_id,
score,
MEDIAN(score) OVER (PARTITION BY group_id)
FROM testmed
GROUP BY group_id;
The reason it is invalid is because you are selecting score which does not appear in the GROUP BY clause. The issue here is which value of score do you intend the database to use for each group_id? What appears to be happening here is that MariaDB is arbitrarily choosing the minimum value of score. But since there is only a single score value, the median just returns that single value.
Keep in mind that analytic functions are evaluated after the GROUP BY aggregation takes place. I think this is the query you were intending to run:
SELECT DISTINCT
group_id,
MEDIAN(score) OVER (PARTITION BY group_id) score_median
FROM testmed;
If this doesn't work, because MariaDB doesn't like using DISTINCT with MEDIAN, then you can try subquerying:
SELECT DISTINCT
group_id,
score_median
FROM
(
SELECT
group_id,
MEDIAN(score) OVER (PARTITION BY group_id) score_median
FROM testmed
) t;

Related

Sqlite / populate new column that ranks the existing rows

I've a SQLite database table with the following columns:
| day | place | visitors |
-------------------------------------
| 2021-05-01 | AAA | 20 |
| 2021-05-01 | BBB | 10 |
| 2021-05-01 | CCC | 3 |
| 2021-05-02 | AAA | 5 |
| 2021-05-02 | BBB | 7 |
| 2021-05-02 | CCC | 2 |
Now I would like to introduce a column 'rank' which indicates the rank according to the visitors each day. Expected table would look like:
| day | place | visitors | Rank |
------------------------------------------
| 2021-05-01 | AAA | 20 | 1 |
| 2021-05-01 | BBB | 10 | 2 |
| 2021-05-01 | CCC | 3 | 3 |
| 2021-05-02 | AAA | 5 | 2 |
| 2021-05-02 | BBB | 7 | 1 |
| 2021-05-02 | CCC | 2 | 3 |
Populating the data for the new column Rank can be done with a program like (Pseudocode).
for each i_day in all_days:
SELECT
ROW_NUMBER () OVER (ORDER BY `visitors` DESC) Day_Rank, place
FROM mytable
WHERE `day` = 'i_day'
for each i_place in all_places:
UPDATE mytable
SET rank= Day_Rank
WHERE `Day`='i_day'
AND place = 'i_place'
Since this line by line update is quite inefficient, I'm searching how to optimize this with a SQL sub query in combination with the UPDATE.
(does not work so far...)
for each i_day in all_days:
UPDATE mytable
SET rank= (
SELECT
ROW_NUMBER () OVER (ORDER BY `visitors` DESC) Day_Rank
FROM mytable
WHERE `day` = 'i_day'
)
Typically, this can be done with a subquery that counts the number of rows with visitors greater than the value of visitors of the current row:
UPDATE mytable
SET Day_Rank = (
SELECT COUNT(*) + 1
FROM mytable m
WHERE m.day = mytable.day AND m.visitors > mytable.visitors
);
Note that the result is actually what RANK() would return, if there are ties in the values of visitors.
See the demo.
Or, you could calculate the rankings with ROW_NUMBER() in a CTE and use it in a subquery:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY visitors DESC) rn
FROM mytable
)
UPDATE mytable
SET Day_Rank = (SELECT rn FROM cte c WHERE (c.day, c.place) = (mytable.day, mytable.place));
See the demo.
Or, if your versipn of SQLite is 3.33.0+ you can use the join-like UPDATE...FROM... syntax:
UPDATE mytable AS m
SET Day_Rank = t.rn
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY visitors DESC) rn
FROM mytable
) t
WHERE (t.day, t.place) = (m.day, m.place);

SQLite time duration calculation from rows

I want to calculate duration between rows with datetime data in SQLite.
Let's consider this for the base data (named intervals):
| id | date | state |
| 1 | 2020-07-04 10:11 | On |
| 2 | 2020-07-04 10:22 | Off |
| 3 | 2020-07-04 11:10 | On |
| 4 | 2020-07-04 11:25 | Off |
I'd like to calculate the duration for both On and Off state:
| Total On | 26mins |
| Total Off | 48mins |
Then I wrote this query:
SELECT
"Total " || interval_start.state AS state,
(SUM(strftime('%s', interval_end.date)-strftime('%s', interval_start.date)) / 60) || "mins" AS duration
FROM
intervals interval_start
INNER JOIN
intervals interval_end ON interval_end.id =
(
SELECT id FROM intervals WHERE
id > interval_start.id AND
state = CASE WHEN interval_start.state = 'On' THEN 'Off' ELSE 'On' END
ORDER BY id
LIMIT 1
)
GROUP BY
interval_start.state
However if the base data is a not in strict order:
| id | date | state |
| 1 | 2020-07-04 10:11 | On |
| 2 | 2020-07-04 10:22 | On | !!!
| 3 | 2020-07-04 11:10 | On |
| 4 | 2020-07-04 11:25 | Off |
My query will calculate wrong, as it will pair the only Off date with each On dates and sum them together.
Desired behavior should result something like this:
| Total On | 74mins |
| Total Off | 0mins | --this line can be omitted, or can be N/A
I have two questions:
How can I rewrite the query to handle these wrong data situations?
I feel my query is not the best in terms of performance, is it possible to improve it?
Use a CTE where you return only the starting rows of each state and then aggregate:
with cte as (
select *, lead(id) over (order by date) next_id
from (
select *, lag(state) over (order by date) prev_state
from intervals
)
where state <> coalesce(prev_state, '')
)
select c1.state,
sum(strftime('%s', c2.date) - strftime('%s', c1.date)) / 60 || 'mins' duration
from cte c1 inner join cte c2
on c2.id = c1.next_id
group by c1.state
See the demos: 1 and 2

MariaDB How to group by 2 or more columns combined

I want to get 1 entry per day per hour from my MariaDB database.
I have a table structured like this (with some more columns):
+------------+-----------+
| dayOfMonth | hourOfDay |
+------------+-----------+
Let's assume this table is filled like this:
+------------+-----------+
| dayOfMonth | hourOfDay |
+------------+-----------+
| 11 | 0 |
| 11 | 0 |
| 11 | 1 |
| 12 | 0 |
| 12 | 0 |
| 12 | 1 |
+------------+-----------+
What I want to get is this(in fact all columns) (Every hourOfDay for each dayOfMonth):
+------------+-----------+
| dayOfMonth | hourOfDay |
+------------+-----------+
| 11 | 0 |
| 11 | 1 |
| 12 | 0 |
| 12 | 1 |
+------------+-----------+
I was able to achieve this with this statement, but it would become way too long if I want to do this for an entire month:
(SELECT * FROM table WHERE dayOfMonth = 11 GROUP BY hourOfDay)
UNION
(SELECT * FROM table WHERE dayOfMonth = 12 GROUP BY hourOfDay)
You can group by dayOfMonth, hourOfDay:
SELECT dayOfMonth, hourOfDay
FROM table
GROUP BY dayOfMonth, hourOfDay
ORDER BY dayOfMonth, hourOfDay
This way you can't select other columns (if they exist), only aggregate on them with MIN(), MAX(), AVG() etc.
Or use DISTINCT:
SELECT DISTINCT dayOfMonth, hourOfDay
FROM table
ORDER BY dayOfMonth, hourOfDay
Your question is unclear. This will transform your initial data into your proposed data:
SELECT DISTINCT
dayOfMonth, hourOfDay
FROM tbl;
"Every hourOfDay" -- do you want all hours 24 rows per day? Of so, see the "sequence table" (eg, seq_0_to_23) feature in MariaDB.

Optimizing query that looks at a specific time window each day

This is a followup to my previous question
Optimizing query to get entire row where one field is the maximum for a group
I'll change the names from what I used there to make them a little more memorable, but these don't represent my actual use-case (so don't estimate the number of records from them).
I have a table with a schema like this:
OrderTime DATETIME(6),
Customer VARCHAR(50),
DrinkPrice DECIMAL,
Bartender VARCHAR(50),
TimeToPrepareDrink TIME(6),
...
I'd like to extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day. So for instance I'd want results like
Date | Customer | OrderTime | MaxPrice | Bartender | ...
-------+----------+-------------+------------+-----------+-----
1/1/18 | Alice | 1/1/18 3:45 | 13.15 | Jane | ...
1/1/18 | Bob | 1/1/18 5:12 | 9.08 | Jane | ...
1/1/18 | Carol | 1/1/18 4:45 | 20.00 | Tarzan | ...
1/2/18 | Alice | 1/2/18 3:45 | 13.15 | Jane | ...
1/2/18 | Bob | 1/2/18 5:57 | 6.00 | Tarzan | ...
1/2/18 | Carol | 1/2/18 3:13 | 6.00 | Tarzan | ...
...
The table has an index on OrderTime, and contains tens of billions of records. (My customers are heavy drinkers).
Thanks to the previous question I'm able to extract this for a specific day pretty easily. I can do something like:
SELECT * FROM orders b
INNER JOIN (
SELECT Customer, MAX(DrinkPrice) as MaxPrice
FROM orders
WHERE OrderTime >= '2018-01-01 15:00'
AND OrderTime <= '2018-01-01 18:00'
GROUP BY Customer
) AS a
ON a.Customer = b.Customer
AND a.MaxPrice = b.DrinkPrice
WHERE b.OrderTime >= '2018-01-01 15:00'
AND b.OrderTime <= '2018-01-01 18:00';
This query runs in less than a second. The explain plan looks like this:
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| id| select_type | table | type | possible_keys | key | ref | Extra |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| 1 | PRIMARY | b | range | OrderTime | OrderTime | NULL | Using index condition |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | b.Customer,b.Price | |
| 2 | DERIVED | orders | range | OrderTime | OrderTime | NULL | Using index condition; Using temporary; Using filesort |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
I can also get the information about the relevant rows for my query:
SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0;
This query also runs in less than a second. Here's how its explain plan looks:
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
The problem now is retrieving the remaining fields from the table. I tried adapting the trick from before, like so:
SELECT * FROM
orders a
INNER JOIN
(SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0) b
ON a.OrderTime >= TIMESTAMP(b.Date, '15:00:00')
AND a.OrderTime <= TIMESTAMP(b.Date, '18:00:00')
AND a.Customer = b.Customer;
However, for reasons I don't understand, the database chooses to execute this in a way that takes forever. Explain plan:
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| 1 | PRIMARY | a | ALL | OrderTime | NULL | NULL | |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | a.Customer | Using where |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 2 | DERIVED | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 4 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
Questions:
What is going on here?
How can I fix it?
To extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day I would use row_number() over() within a case expression evaluating the hour of day, like this:
CREATE TABLE mytable(
Date DATE
,Customer VARCHAR(10)
,OrderTime DATETIME
,MaxPrice NUMERIC(12,2)
,Bartender VARCHAR(11)
);
note changes were made to OrderTime
INSERT INTO mytable(Date,Customer,OrderTime,MaxPrice,Bartender)
VALUES
('1/1/18','Alice','1/1/18 13:45',13.15,'Jane')
, ('1/1/18','Bob' ,'1/1/18 15:12', 9.08,'Jane')
, ('1/2/18','Alice','1/2/18 13:45',13.15,'Jane')
, ('1/2/18','Bob' ,'1/2/18 15:57', 6.00,'Tarzan')
, ('1/2/18','Carol','1/2/18 13:13', 6.00,'Tarzan')
;
The suggested query is this:
select
*
from (
select
*
, case when hour(OrderTime) between 15 and 18 then
row_number() over(partition by `Date`, customer
order by MaxPrice DESC)
else null
end rn
from mytable
) d
where rn = 1
;
and the result will give access to all columns you include in the derived table.
Date | Customer | OrderTime | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | -:
0001-01-18 | Bob | 0001-01-18 15:12:00 | 9.08 | Jane | 1
0001-02-18 | Bob | 0001-02-18 15:57:00 | 6.00 | Tarzan | 1
To help display how this works, running the derived table subquery:
select
*
, case when hour(OrderTime) between 15 and 18 then
row_number() over(partition by `Date`, customer order by MaxPrice DESC)
else null
end rn
from mytable
;
produces this interim resultset:
Date | Customer | OrderTime | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | ---:
0001-01-18 | Alice | 0001-01-18 13:45:00 | 13.15 | Jane | null
0001-01-18 | Bob | 0001-01-18 15:12:00 | 9.08 | Jane | 1
0001-02-18 | Alice | 0001-02-18 13:45:00 | 13.15 | Jane | null
0001-02-18 | Bob | 0001-02-18 15:57:00 | 6.00 | Tarzan | 1
0001-02-18 | Carol | 0001-02-18 13:13:00 | 6.00 | Tarzan | null
db<>fiddle here
The task seems to be a "groupwise-max" problem. Here's one approach, involving only 2 'queries' (the inner one is called a "derived table").
SELECT x.OrderDate, x.Customer, b.OrderTime,
x.MaxPrice, b.Bartender
FROM
(
SELECT DATE(OrderTime) AS OrderDate,
Customer,
Max(Price) AS MaxPrice
FROM tbl
WHERE TIME(OrderTime) BETWEEN '15:00' AND '18:00'
GROUP BY OrderDate, Customer
) AS x
JOIN tbl AS b
ON b.OrderDate = X.OrderDate
AND b.customer = x.Customer
AND b.Price = x.MaxPrice
WHERE TIME(b.OrderTime) BETWEEN '15:00' AND '18:00'
ORDER BY x.OrderDate, x.Customer
Desirable index:
INDEX(Customer, Price)
(There's no good reason to be using MyISAM.)
Billions of new rows per day
This adds new wrinkles. That's upwards of a terabyte of additional disk space needed each and every day?
Is it possible to summarize the data? The goal here is to add summary info as the new data comes in, and never have to re-scan the billions of old data. This may also let you remove all the secondary indexes on the Fact table.
Normalization will help shrink the table size, hence speeding up the queries. Bartender and Customer are prime candidates for such -- perhaps a SMALLINT UNSIGNED (2 bytes; 65K values) for the former and MEDIUMINT UNSIGNED (3 bytes, 16M) for the latter. That would probably shrink by 50% the 5 columns you currently show. You may get a 2x speedup on many operations after normalizing.
Normalization is best done by 'staging' the data -- Load the data into a temporary table, normalize within it, summarize it, then copy into the main Fact table.
See http://mysql.rjweb.org/doc.php/summarytables
and http://mysql.rjweb.org/doc.php/staging_table
Before getting back to the question of optimizing the one query, we need to see the schema, the data flow, whether things can be normalized, whether summary tables can be effective, etc. I would hope to have the 'answer' for the query to be mostly digested in a summary table. Sometimes this leads to a 10x speedup.

Get number of distinct records with same key

Let's assume I have a table A:
| ID | B_ID | C | column 1 | ... | column x|
| 1 | 24 | 44 | xxxxxxx
| 2 | 25 | 55 | xxxxxxx
| 3 | 25 | 66 | xxxxxxx (data in all other columns are the same)
| 4 | 26 | 77 | xxxxxxx
| 4 | 26 | 78 | xxxxxxx
| 4 | 26 | 79 | xxxxxxx
I want to get highest number of distinct records with same B_ID (and I also want to know B_ID where this occurs). So in this example I want to get values 3 and 26.
What would be best approach to achieve this?
The best approach is a simple aggregation + ordering + limiting the number of rows returned to 1:
select b_id, no_of_records
from (
select b_id, count(1) as no_of_records
from table_A
group by b_id
order by no_of_records desc
)
where rownum <= 1;
This, of course, returns 1 row even if you have multiple groups with the same highest number of records. If you want all groups with the same highest number of records, then the approach is slightly different ...
select b_id, no_of_records
from (
select b_id, count(1) as no_of_records,
rank() over (partition by null order by count(1) desc) as rank$
from table_A
group by b_id
)
where rank$ <= 1;

Resources