SQLite time duration calculation from rows - sqlite

I want to calculate duration between rows with datetime data in SQLite.
Let's consider this for the base data (named intervals):
| id | date | state |
| 1 | 2020-07-04 10:11 | On |
| 2 | 2020-07-04 10:22 | Off |
| 3 | 2020-07-04 11:10 | On |
| 4 | 2020-07-04 11:25 | Off |
I'd like to calculate the total duration of both the On and Off states:
| Total On | 26mins |
| Total Off | 48mins |
Then I wrote this query:
SELECT
"Total " || interval_start.state AS state,
(SUM(strftime('%s', interval_end.date)-strftime('%s', interval_start.date)) / 60) || "mins" AS duration
FROM
intervals interval_start
INNER JOIN
intervals interval_end ON interval_end.id =
(
SELECT id FROM intervals WHERE
id > interval_start.id AND
state = CASE WHEN interval_start.state = 'On' THEN 'Off' ELSE 'On' END
ORDER BY id
LIMIT 1
)
GROUP BY
interval_start.state
However, if the base data is not in strict On/Off order:
| id | date | state |
| 1 | 2020-07-04 10:11 | On |
| 2 | 2020-07-04 10:22 | On | !!!
| 3 | 2020-07-04 11:10 | On |
| 4 | 2020-07-04 11:25 | Off |
My query calculates the wrong result, because it pairs the single Off row with each of the On rows and sums the differences.
The desired result would look something like this:
| Total On | 74mins |
| Total Off | 0mins | --this line can be omitted, or can be N/A
I have two questions:
How can I rewrite the query to handle these wrong data situations?
I feel my query is not the best in terms of performance, is it possible to improve it?

Use a CTE where you return only the starting rows of each state and then aggregate:
with cte as (
select *, lead(id) over (order by date) next_id
from (
select *, lag(state) over (order by date) prev_state
from intervals
)
where state <> coalesce(prev_state, '')
)
select c1.state,
(sum(strftime('%s', c2.date) - strftime('%s', c1.date)) / 60) || 'mins' duration
from cte c1 inner join cte c2
on c2.id = c1.next_id
group by c1.state
See the demos: 1 and 2
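To check the approach against both sample data sets from the question, here is a minimal sketch that runs the same CTE through Python's built-in sqlite3 module. The helper name total_durations and the in-memory database are assumptions made for illustration, and the query needs SQLite 3.25 or newer for window functions. The division is wrapped in parentheses because || binds more tightly than / in SQLite.

```python
import sqlite3

# Same idea as the CTE above: keep only the first row of each run of equal
# states, pair each kept row with the next kept row, then sum per state.
QUERY = """
WITH cte AS (
    SELECT *, LEAD(id) OVER (ORDER BY date) AS next_id
    FROM (
        SELECT *, LAG(state) OVER (ORDER BY date) AS prev_state
        FROM intervals
    )
    WHERE state <> COALESCE(prev_state, '')
)
SELECT c1.state,
       (SUM(strftime('%s', c2.date) - strftime('%s', c1.date)) / 60) || 'mins' AS duration
FROM cte c1
JOIN cte c2 ON c2.id = c1.next_id
GROUP BY c1.state
ORDER BY c1.state DESC
"""


def total_durations(rows):
    """Load (id, date, state) rows into an in-memory table and run QUERY."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE intervals (id INTEGER PRIMARY KEY, date TEXT, state TEXT)")
    con.executemany("INSERT INTO intervals VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


# First data set from the question: strict On/Off alternation.
strict = [
    (1, "2020-07-04 10:11", "On"),
    (2, "2020-07-04 10:22", "Off"),
    (3, "2020-07-04 11:10", "On"),
    (4, "2020-07-04 11:25", "Off"),
]

# Second data set from the question: On repeated three times.
repeated = [
    (1, "2020-07-04 10:11", "On"),
    (2, "2020-07-04 10:22", "On"),
    (3, "2020-07-04 11:10", "On"),
    (4, "2020-07-04 11:25", "Off"),
]

print(total_durations(strict))    # [('On', '26mins'), ('Off', '48mins')]
print(total_durations(repeated))  # [('On', '74mins')]
```

In the second data set no Off row has a successor, so the Off total is simply absent from the result, which matches the "can be omitted" option in the question.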

Related

Sqlite / populate new column that ranks the existing rows

I've a SQLite database table with the following columns:
| day | place | visitors |
-------------------------------------
| 2021-05-01 | AAA | 20 |
| 2021-05-01 | BBB | 10 |
| 2021-05-01 | CCC | 3 |
| 2021-05-02 | AAA | 5 |
| 2021-05-02 | BBB | 7 |
| 2021-05-02 | CCC | 2 |
Now I would like to introduce a column 'rank' which indicates the rank according to the visitors each day. Expected table would look like:
| day | place | visitors | Rank |
------------------------------------------
| 2021-05-01 | AAA | 20 | 1 |
| 2021-05-01 | BBB | 10 | 2 |
| 2021-05-01 | CCC | 3 | 3 |
| 2021-05-02 | AAA | 5 | 2 |
| 2021-05-02 | BBB | 7 | 1 |
| 2021-05-02 | CCC | 2 | 3 |
Populating the data for the new column Rank can be done with a program like this (pseudocode):
for each i_day in all_days:
SELECT
ROW_NUMBER () OVER (ORDER BY `visitors` DESC) Day_Rank, place
FROM mytable
WHERE `day` = 'i_day'
for each i_place in all_places:
UPDATE mytable
SET rank= Day_Rank
WHERE `Day`='i_day'
AND place = 'i_place'
Since this line by line update is quite inefficient, I'm searching how to optimize this with a SQL sub query in combination with the UPDATE.
(does not work so far...)
for each i_day in all_days:
UPDATE mytable
SET rank= (
SELECT
ROW_NUMBER () OVER (ORDER BY `visitors` DESC) Day_Rank
FROM mytable
WHERE `day` = 'i_day'
)
Typically, this can be done with a subquery that counts the number of rows with visitors greater than the value of visitors of the current row:
UPDATE mytable
SET Day_Rank = (
SELECT COUNT(*) + 1
FROM mytable m
WHERE m.day = mytable.day AND m.visitors > mytable.visitors
);
Note that the result is actually what RANK() would return, if there are ties in the values of visitors.
See the demo.
Or, you could calculate the rankings with ROW_NUMBER() in a CTE and use it in a subquery:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY visitors DESC) rn
FROM mytable
)
UPDATE mytable
SET Day_Rank = (SELECT rn FROM cte c WHERE (c.day, c.place) = (mytable.day, mytable.place));
See the demo.
Or, if your version of SQLite is 3.33.0+ you can use the join-like UPDATE...FROM... syntax:
UPDATE mytable AS m
SET Day_Rank = t.rn
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY visitors DESC) rn
FROM mytable
) t
WHERE (t.day, t.place) = (m.day, m.place);
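As a quick check of the first variant (the correlated COUNT(*) + 1 subquery), here is a minimal sketch using Python's sqlite3 module. The table definition and the rows for 2021-05-03 are assumptions added only to show how ties behave; the other rows are the ones from the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE mytable (day TEXT, place TEXT, visitors INTEGER, Day_Rank INTEGER)"
)
con.executemany(
    "INSERT INTO mytable (day, place, visitors) VALUES (?, ?, ?)",
    [
        ("2021-05-01", "AAA", 20),
        ("2021-05-01", "BBB", 10),
        ("2021-05-01", "CCC", 3),
        ("2021-05-02", "AAA", 5),
        ("2021-05-02", "BBB", 7),
        ("2021-05-02", "CCC", 2),
        # Hypothetical extra day with a tie, not part of the question's data.
        ("2021-05-03", "AAA", 7),
        ("2021-05-03", "BBB", 7),
        ("2021-05-03", "CCC", 2),
    ],
)

# One statement updates every row: the rank is 1 plus the number of rows
# on the same day that have strictly more visitors.
con.execute(
    """
    UPDATE mytable
    SET Day_Rank = (
        SELECT COUNT(*) + 1
        FROM mytable m
        WHERE m.day = mytable.day AND m.visitors > mytable.visitors
    )
    """
)

ranks = {
    (day, place): rank
    for day, place, rank in con.execute("SELECT day, place, Day_Rank FROM mytable")
}
for key in sorted(ranks):
    print(key, ranks[key])
```

On the tied day both AAA and BBB get rank 1 and CCC gets rank 3, which is the RANK() behavior mentioned above; the ROW_NUMBER() variants would instead assign 1 and 2 to the tied rows in an unspecified order.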

MariaDB How to group by 2 or more columns combined

I want to get 1 entry per day per hour from my MariaDB database.
I have a table structured like this (with some more columns):
+------------+-----------+
| dayOfMonth | hourOfDay |
+------------+-----------+
Let's assume this table is filled like this:
+------------+-----------+
| dayOfMonth | hourOfDay |
+------------+-----------+
| 11 | 0 |
| 11 | 0 |
| 11 | 1 |
| 12 | 0 |
| 12 | 0 |
| 12 | 1 |
+------------+-----------+
What I want to get is this (in fact with all columns): every hourOfDay for each dayOfMonth, once.
+------------+-----------+
| dayOfMonth | hourOfDay |
+------------+-----------+
| 11 | 0 |
| 11 | 1 |
| 12 | 0 |
| 12 | 1 |
+------------+-----------+
I was able to achieve this with this statement, but it would become way too long if I want to do this for an entire month:
(SELECT * FROM table WHERE dayOfMonth = 11 GROUP BY hourOfDay)
UNION
(SELECT * FROM table WHERE dayOfMonth = 12 GROUP BY hourOfDay)
You can group by dayOfMonth, hourOfDay:
SELECT dayOfMonth, hourOfDay
FROM table
GROUP BY dayOfMonth, hourOfDay
ORDER BY dayOfMonth, hourOfDay
This way you can't select other columns (if they exist), only aggregate on them with MIN(), MAX(), AVG() etc.
Or use DISTINCT:
SELECT DISTINCT dayOfMonth, hourOfDay
FROM table
ORDER BY dayOfMonth, hourOfDay
Your question is unclear. This will transform your initial data into your proposed data:
SELECT DISTINCT
dayOfMonth, hourOfDay
FROM tbl;
"Every hourOfDay": do you want all 24 hours, that is, 24 rows per day? If so, see the "sequence table" (e.g., seq_0_to_23) feature in MariaDB.
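The difference between the GROUP BY and DISTINCT variants can be seen in a small sketch. It uses Python's sqlite3 module as a stand-in for MariaDB (both statements are plain SQL), and the table name readings and the extra column value are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "readings" and "value" are hypothetical names; the question only shows
# dayOfMonth and hourOfDay and mentions "some more columns".
con.execute("CREATE TABLE readings (dayOfMonth INTEGER, hourOfDay INTEGER, value INTEGER)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(11, 0, 5), (11, 0, 7), (11, 1, 3), (12, 0, 9), (12, 0, 1), (12, 1, 4)],
)

# GROUP BY on both columns: one row per (day, hour); other columns are
# only reachable through aggregates such as MIN() or COUNT().
grouped = con.execute(
    """
    SELECT dayOfMonth, hourOfDay, MIN(value), COUNT(*)
    FROM readings
    GROUP BY dayOfMonth, hourOfDay
    ORDER BY dayOfMonth, hourOfDay
    """
).fetchall()

# DISTINCT gives the same (day, hour) pairs, without any aggregates.
distinct = con.execute(
    """
    SELECT DISTINCT dayOfMonth, hourOfDay
    FROM readings
    ORDER BY dayOfMonth, hourOfDay
    """
).fetchall()

print(grouped)   # [(11, 0, 5, 2), (11, 1, 3, 1), (12, 0, 1, 2), (12, 1, 4, 1)]
print(distinct)  # [(11, 0), (11, 1), (12, 0), (12, 1)]
```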

Default value for LAG function in MariaDB

I'm trying to build a view which allows me to track the difference between paid values at two consecutive month_ids. When a figure is missing however, that would be because it's the first entry and therefore has a paid amount of 0. At present, I'm using the below to represent the previous figure since the [,default] argument has not been implemented in MariaDB.
CASE WHEN (
NOT(policy_agent_month.policy_agent_month_id IS NOT NULL
AND LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id ) IS NULL)) THEN
LAG(days_paid, 1) OVER ( PARTITION BY claim_id ORDER BY month_id)
ELSE
0
END
The problem I have with this is that I have about 30 variables which this function needs to be applied over and it makes my code unreadable and very clunky. Is there a better solution?
Why use WITH?
SELECT province, tot_pop,
tot_pop - COALESCE(
(LAG(tot_pop) OVER (ORDER BY tot_pop ASC)),
0) AS delta
FROM provinces
ORDER BY tot_pop asc;
+---------------------------+----------+---------+
| province | tot_pop | delta |
+---------------------------+----------+---------+
| Nunavut | 14585 | 14585 |
| Yukon | 21304 | 6719 |
| Northwest Territories | 24571 | 3267 |
| Prince Edward Island | 63071 | 38500 |
| Newfoundland and Labrador | 100761 | 37690 |
| New Brunswick | 332715 | 231954 |
| Nova Scotia | 471284 | 138569 |
| Saskatchewan | 622467 | 151183 |
| Manitoba | 772672 | 150205 |
| Alberta | 2481213 | 1708541 |
| British Columbia | 3287519 | 806306 |
| Quebec | 5321098 | 2033579 |
| Ontario | 10071458 | 4750360 |
+---------------------------+----------+---------+
13 rows in set (0.00 sec)
However, it is not cheap (at least in MySQL 8.0);
the table has 13 rows, yet
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
MySQL 8.0:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_read_rnd | 89 |
| Handler_read_rnd_next | 52 |
| Handler_write | 26 |
(and others)
MariaDB 10.3:
| Handler_read_rnd | 77 |
| Handler_read_rnd_next | 42 |
| Handler_tmp_write | 13 |
| Handler_update | 13 |
You can use a CTE (Common Table Expression) in MariaDB 10.2+ to pre-compute frequently used expressions and name them for later use:
with
x as ( -- first we compute the CTE that we name "x"
select
*,
coalesce(
LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id),
123456
) as prev_month -- this expression gets the name "prev_month"
from my_table -- or a simple/complex join here
)
select -- now the main query
prev_month
from x
... -- rest of your query here where "prev_month" is computed.
In the main query prev_month has the lag value, or the default value 123456 when it's null.
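To see the COALESCE(LAG(...), default) pattern on data shaped like the question's, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module as a stand-in for MariaDB (the window-function syntax used here is the same), and the table name payments and its rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table; the question only names claim_id, month_id and days_paid.
con.execute("CREATE TABLE payments (claim_id INTEGER, month_id INTEGER, days_paid INTEGER)")
con.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 25), (1, 3, 40), (2, 2, 5), (2, 3, 5)],
)

# The CTE names the "previous value or 0" expression once, so the main
# query can reuse it without repeating the window definition.
query = """
WITH x AS (
    SELECT claim_id, month_id, days_paid,
           COALESCE(
               LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id),
               0
           ) AS prev_days_paid
    FROM payments
)
SELECT claim_id, month_id, days_paid - prev_days_paid AS delta
FROM x
ORDER BY claim_id, month_id
"""

deltas = con.execute(query).fetchall()
print(deltas)  # [(1, 1, 10), (1, 2, 15), (1, 3, 15), (2, 2, 5), (2, 3, 0)]
```

The first row of each claim has no predecessor, so LAG() returns NULL there and COALESCE() substitutes 0, which makes the delta equal to the first paid amount.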

Optimizing query that looks at a specific time window each day

This is a followup to my previous question
Optimizing query to get entire row where one field is the maximum for a group
I'll change the names from what I used there to make them a little more memorable, but these don't represent my actual use-case (so don't estimate the number of records from them).
I have a table with a schema like this:
OrderTime DATETIME(6),
Customer VARCHAR(50),
DrinkPrice DECIMAL,
Bartender VARCHAR(50),
TimeToPrepareDrink TIME(6),
...
I'd like to extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day. So for instance I'd want results like
Date | Customer | OrderTime | MaxPrice | Bartender | ...
-------+----------+-------------+------------+-----------+-----
1/1/18 | Alice | 1/1/18 3:45 | 13.15 | Jane | ...
1/1/18 | Bob | 1/1/18 5:12 | 9.08 | Jane | ...
1/1/18 | Carol | 1/1/18 4:45 | 20.00 | Tarzan | ...
1/2/18 | Alice | 1/2/18 3:45 | 13.15 | Jane | ...
1/2/18 | Bob | 1/2/18 5:57 | 6.00 | Tarzan | ...
1/2/18 | Carol | 1/2/18 3:13 | 6.00 | Tarzan | ...
...
The table has an index on OrderTime, and contains tens of billions of records. (My customers are heavy drinkers).
Thanks to the previous question I'm able to extract this for a specific day pretty easily. I can do something like:
SELECT * FROM orders b
INNER JOIN (
SELECT Customer, MAX(DrinkPrice) as MaxPrice
FROM orders
WHERE OrderTime >= '2018-01-01 15:00'
AND OrderTime <= '2018-01-01 18:00'
GROUP BY Customer
) AS a
ON a.Customer = b.Customer
AND a.MaxPrice = b.DrinkPrice
WHERE b.OrderTime >= '2018-01-01 15:00'
AND b.OrderTime <= '2018-01-01 18:00';
This query runs in less than a second. The explain plan looks like this:
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| id| select_type | table | type | possible_keys | key | ref | Extra |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| 1 | PRIMARY | b | range | OrderTime | OrderTime | NULL | Using index condition |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | b.Customer,b.Price | |
| 2 | DERIVED | orders | range | OrderTime | OrderTime | NULL | Using index condition; Using temporary; Using filesort |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
I can also get the information about the relevant rows for my query:
SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0;
This query also runs in less than a second. Here's how its explain plan looks:
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
The problem now is retrieving the remaining fields from the table. I tried adapting the trick from before, like so:
SELECT * FROM
orders a
INNER JOIN
(SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0) b
ON a.OrderTime >= TIMESTAMP(b.Date, '15:00:00')
AND a.OrderTime <= TIMESTAMP(b.Date, '18:00:00')
AND a.Customer = b.Customer;
However, for reasons I don't understand, the database chooses to execute this in a way that takes forever. Explain plan:
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| 1 | PRIMARY | a | ALL | OrderTime | NULL | NULL | |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | a.Customer | Using where |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 2 | DERIVED | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 4 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
Questions:
What is going on here?
How can I fix it?
To extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day I would use row_number() over() within a case expression evaluating the hour of day, like this:
CREATE TABLE mytable(
Date DATE
,Customer VARCHAR(10)
,OrderTime DATETIME
,MaxPrice NUMERIC(12,2)
,Bartender VARCHAR(11)
);
Note: changes were made to the OrderTime values.
INSERT INTO mytable(Date,Customer,OrderTime,MaxPrice,Bartender)
VALUES
('1/1/18','Alice','1/1/18 13:45',13.15,'Jane')
, ('1/1/18','Bob' ,'1/1/18 15:12', 9.08,'Jane')
, ('1/2/18','Alice','1/2/18 13:45',13.15,'Jane')
, ('1/2/18','Bob' ,'1/2/18 15:57', 6.00,'Tarzan')
, ('1/2/18','Carol','1/2/18 13:13', 6.00,'Tarzan')
;
The suggested query is this:
select
*
from (
select
*
, case when hour(OrderTime) between 15 and 18 then
row_number() over(partition by `Date`, customer
order by MaxPrice DESC)
else null
end rn
from mytable
) d
where rn = 1
;
and the result will give access to all columns you include in the derived table.
Date | Customer | OrderTime | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | -:
0001-01-18 | Bob | 0001-01-18 15:12:00 | 9.08 | Jane | 1
0001-02-18 | Bob | 0001-02-18 15:57:00 | 6.00 | Tarzan | 1
To help display how this works, running the derived table subquery:
select
*
, case when hour(OrderTime) between 15 and 18 then
row_number() over(partition by `Date`, customer order by MaxPrice DESC)
else null
end rn
from mytable
;
produces this interim resultset:
Date | Customer | OrderTime | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | ---:
0001-01-18 | Alice | 0001-01-18 13:45:00 | 13.15 | Jane | null
0001-01-18 | Bob | 0001-01-18 15:12:00 | 9.08 | Jane | 1
0001-02-18 | Alice | 0001-02-18 13:45:00 | 13.15 | Jane | null
0001-02-18 | Bob | 0001-02-18 15:57:00 | 6.00 | Tarzan | 1
0001-02-18 | Carol | 0001-02-18 13:13:00 | 6.00 | Tarzan | null
db<>fiddle here
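One thing to watch with the CASE-wrapped ROW_NUMBER(): the numbering is assigned across every row in the partition, including orders outside happy hour, so a more expensive drink ordered earlier in the day would take rn = 1 and the happy-hour row would be dropped. A possible variant, sketched here as an assumption rather than as the answer's method, filters on the time window first and ranks afterwards. It runs on SQLite through Python's sqlite3 module as a stand-in for MariaDB, and the sample rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (OrderTime TEXT, Customer TEXT, DrinkPrice REAL, Bartender TEXT)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        # Bob's most expensive drink is ordered before happy hour starts.
        ("2018-01-01 13:00:00", "Bob", 20.00, "Tarzan"),
        ("2018-01-01 15:12:00", "Bob", 9.08, "Jane"),
        ("2018-01-01 16:30:00", "Bob", 4.50, "Jane"),
        ("2018-01-01 15:45:00", "Alice", 13.15, "Jane"),
        ("2018-01-02 17:57:00", "Bob", 6.00, "Tarzan"),
    ],
)

# Filter to the 15:00-18:00 window in WHERE, so ROW_NUMBER() only ever
# sees happy-hour orders; OrderTime breaks ties between equal prices.
query = """
SELECT OrderDate, Customer, OrderTime, DrinkPrice, Bartender
FROM (
    SELECT date(OrderTime) AS OrderDate, Customer, OrderTime, DrinkPrice, Bartender,
           ROW_NUMBER() OVER (
               PARTITION BY date(OrderTime), Customer
               ORDER BY DrinkPrice DESC, OrderTime
           ) AS rn
    FROM orders
    WHERE time(OrderTime) BETWEEN '15:00:00' AND '18:00:00'
) AS d
WHERE rn = 1
ORDER BY OrderDate, Customer
"""

best = con.execute(query).fetchall()
for row in best:
    print(row)
```

Bob's 20.00 order at 13:00 is ignored, and his 9.08 order at 15:12 is returned for 2018-01-01. In MariaDB the same shape would use DATE(), TIME() and the real table; whether the optimizer can still use the OrderTime index for such a filter is not shown by this sketch.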
The task seems to be a "groupwise-max" problem. Here's one approach, involving only 2 'queries' (the inner one is called a "derived table").
SELECT x.OrderDate, x.Customer, b.OrderTime,
x.MaxPrice, b.Bartender
FROM
(
SELECT DATE(OrderTime) AS OrderDate,
Customer,
Max(Price) AS MaxPrice
FROM tbl
WHERE TIME(OrderTime) BETWEEN '15:00' AND '18:00'
GROUP BY OrderDate, Customer
) AS x
JOIN tbl AS b
ON DATE(b.OrderTime) = x.OrderDate
AND b.customer = x.Customer
AND b.Price = x.MaxPrice
WHERE TIME(b.OrderTime) BETWEEN '15:00' AND '18:00'
ORDER BY x.OrderDate, x.Customer
Desirable index:
INDEX(Customer, Price)
(There's no good reason to be using MyISAM.)
Billions of new rows per day
This adds new wrinkles. That's upwards of a terabyte of additional disk space needed each and every day?
Is it possible to summarize the data? The goal here is to add summary info as the new data comes in, and never have to re-scan the billions of old rows. This may also let you remove all the secondary indexes on the Fact table.
Normalization will help shrink the table size, hence speeding up the queries. Bartender and Customer are prime candidates for such -- perhaps a SMALLINT UNSIGNED (2 bytes; 65K values) for the former and MEDIUMINT UNSIGNED (3 bytes, 16M) for the latter. That would probably shrink by 50% the 5 columns you currently show. You may get a 2x speedup on many operations after normalizing.
Normalization is best done by 'staging' the data -- Load the data into a temporary table, normalize within it, summarize it, then copy into the main Fact table.
See http://mysql.rjweb.org/doc.php/summarytables
and http://mysql.rjweb.org/doc.php/staging_table
Before getting back to the question of optimizing the one query, we need to see the schema, the data flow, whether things can be normalized, whether summary tables can be effective, etc. I would hope to have the 'answer' for the query to be mostly digested in a summary table. Sometimes this leads to a 10x speedup.

Using Ifnull in Subquery SQLite

I have these two tables, members and water_meter:
members
id | name
=========
1 | Dani
2 | Dina
3 | Roni
water_meter
id | member_id | date | start | finish | paid | paid_at
===+============+===========+=======+===========+=======+=====================+
1 | 1 |2014-07-01 | 12.3 | 38.7 | 1 | 2014-12-29 18:28:30
2 | 2 |2014-07-01 | 57.2 | 64.3 | 0 | null
3 | 3 |2014-07-01 | 14.6 | 52.3 | 0 | null
These members need to pay for their water usage every month. What I want is for the 'start' value of each month to be the 'finish' value from the previous month. This is my query to check water usage for August:
SELECT m.id, m.name,
ifnull(t.start, (SELECT ifnull(finish, 0) FROM members m2
LEFT JOIN water_meter t2 ON m2.id = t2.member_id AND t2.date = '2014-07-01') ) as start,
t.finish, paid
FROM members m
LEFT JOIN water_meter t ON m.id = t.member_id AND t.date = '2014-08-01'
Result :
id | name | start | finish |
===+========+========+=========+
1 | Dani | 38.7 | null |
2 | Dina | 38.7 | null |
3 | Roni | 38.7 | null |
As you can see, the "start" value is not right. What is the right query for this case?
What I want is this:
id | name | start | finish |
===+========+========+=========+
1 | Dani | 38.7 | null |
2 | Dina | 64.3 | null |
3 | Roni | 52.3 | null |
Check : http://sqlfiddle.com/#!7/29a4c/2
The inner query is missing a WHERE condition that ties it to the current member, so it returns the same value for every row. Add one:
SELECT m.id, m.name,
ifnull(t.start,
(SELECT ifnull(finish, 0) FROM members m2
LEFT JOIN water_meter t2
ON m2.id = t2.member_id AND t2.date = '2014-07-01'
where m2.id = m.id)) as start,
t.finish, paid
FROM members m
LEFT JOIN water_meter t ON m.id = t.member_id AND t.date = '2014-08-01'
WHERE m.active = 1
I don't like the query itself, but it produces the output you wanted.
A slightly better solution (no subqueries, which may be slow on a large dataset):
select
members.id,
name,
coalesce(wm_cur.start, wm_prev.finish),
wm_cur.finish
from members
left join water_meter wm_cur
on members.id = wm_cur.member_id
and wm_cur.date between '2014-08-01' and date('2014-08-01','start of month','+1 month','-1 day')
left join water_meter wm_prev
on members.id = wm_prev.member_id
and wm_prev.date between '2014-07-01' and date('2014-07-01','start of month','+1 month','-1 day')
where members.active = 1
You can replace coalesce with ifnull if you wish. This version also matches any date within the month, not only the first day, which may or may not be what you want.
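The join-based approach can be tried against the question's sample data with a minimal sketch using Python's sqlite3 module. It matches on the exact first-of-month date, as the question does, and derives the previous month with date(?, '-1 month'). The active column used in the answers is not part of the schema shown in the question, so it is left out here; the final fallback to 0 is also an assumption carried over from the question's ifnull(finish, 0).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE water_meter (
        id INTEGER PRIMARY KEY, member_id INTEGER, date TEXT,
        start REAL, finish REAL, paid INTEGER, paid_at TEXT
    );
    INSERT INTO members VALUES (1, 'Dani'), (2, 'Dina'), (3, 'Roni');
    INSERT INTO water_meter VALUES
        (1, 1, '2014-07-01', 12.3, 38.7, 1, '2014-12-29 18:28:30'),
        (2, 2, '2014-07-01', 57.2, 64.3, 0, NULL),
        (3, 3, '2014-07-01', 14.6, 52.3, 0, NULL);
    """
)

# cur is the requested month, prev is the month before it; each join is
# tied to the member, so every member gets their own previous finish.
query = """
SELECT m.id, m.name,
       COALESCE(cur.start, prev.finish, 0) AS start,
       cur.finish
FROM members m
LEFT JOIN water_meter cur
       ON cur.member_id = m.id AND cur.date = ?
LEFT JOIN water_meter prev
       ON prev.member_id = m.id AND prev.date = date(?, '-1 month')
ORDER BY m.id
"""

month = "2014-08-01"
usage = con.execute(query, (month, month)).fetchall()
print(usage)  # [(1, 'Dani', 38.7, None), (2, 'Dina', 64.3, None), (3, 'Roni', 52.3, None)]
```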

Resources