I would like to compute a moving average over data in a SQLite table. I found several method in MySQL, but couldn't find an efficient one in SQLite.
In SQL, I think something like this should do it (however, I was not able to try it...) :
SELECT date, value,
avg(value) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) as MovingAverageWindow7
FROM t ORDER BY date;
However, I see two drawbacks :
This does not seems to work on sqlite
If data are not continuous for few dates on preceding/following rows, it computes a moving average on a window which is wider than what I actually want since it is only based on the number of surrounding rows. Thus, a date condition should be added
Indeed, I would like it to compute the average of 'value' at each date, over +/-3 days (weekly moving average) or +/-15 days (monthly moving average)
Here is an example data set :
CREATE TABLE t ( date DATE, value INTEGER );
INSERT INTO t (date, value) VALUES ('2018-02-01', 8);
INSERT INTO t (date, value) VALUES ('2018-02-02', 2);
INSERT INTO t (date, value) VALUES ('2018-02-05', 5);
INSERT INTO t (date, value) VALUES ('2018-02-06', 4);
INSERT INTO t (date, value) VALUES ('2018-02-07', 1);
INSERT INTO t (date, value) VALUES ('2018-02-10', 6);
INSERT INTO t (date, value) VALUES ('2018-02-11', 0);
INSERT INTO t (date, value) VALUES ('2018-02-12', 2);
INSERT INTO t (date, value) VALUES ('2018-02-13', 1);
INSERT INTO t (date, value) VALUES ('2018-02-14', 3);
INSERT INTO t (date, value) VALUES ('2018-02-15', 11);
INSERT INTO t (date, value) VALUES ('2018-02-18', 4);
INSERT INTO t (date, value) VALUES ('2018-02-20', 1);
INSERT INTO t (date, value) VALUES ('2018-02-21', 5);
INSERT INTO t (date, value) VALUES ('2018-02-28', 10);
INSERT INTO t (date, value) VALUES ('2018-03-02', 6);
INSERT INTO t (date, value) VALUES ('2018-03-03', 7);
INSERT INTO t (date, value) VALUES ('2018-03-04', 3);
INSERT INTO t (date, value) VALUES ('2018-03-08', 5);
INSERT INTO t (date, value) VALUES ('2018-03-09', 6);
INSERT INTO t (date, value) VALUES ('2018-03-15', 1);
INSERT INTO t (date, value) VALUES ('2018-03-16', 3);
INSERT INTO t (date, value) VALUES ('2018-03-25', 5);
INSERT INTO t (date, value) VALUES ('2018-03-31', 1);
Window functions were added in version 3.25.0 (2018-09-15). With the RANGE frame type added in version 3.28.0 (2019-04-16), you can now do:
SELECT date, value,
avg(value) OVER (
ORDER BY CAST (strftime('%s', date) AS INT)
RANGE BETWEEN 3 * 24 * 60 * 60 PRECEDING
AND 3 * 24 * 60 * 60 FOLLOWING
) AS MovingAverageWindow7
FROM t ORDER BY date;
I think I actually found a solution :
SELECT date, value,
(SELECT AVG(value) FROM t t2
WHERE datetime(t1.date, '-3 days') <= datetime(t2.date) AND datetime(t1.date, '+3 days') >= datetime(t2.date)
) AS MAVG
FROM t t1
GROUP BY strftime('%Y-%m-%d', date);
I don't know if it is the most efficient way, but it seems to work
Edit :
Applied to my real database containing 20 000 rows, a weekly moving average over two parameters takes approximately 1 minute to be calculated.
I see two options there :
There is a more efficient way to compute this with SQLite
I compute the moving average in Python after extracting data from SQLite
One approach is to create a intermediate table that maps each date to the groups it belong to.
CREATE TABLE groups (date DATE, daygroup DATE);
INSERT INTO groups
SELECT date, strftime('%Y-%m-%d', datetime(date, '-1 days')) AS daygroup
FROM t;
INSERT INTO groups
SELECT date, strftime('%Y-%m-%d', datetime(date, '-2 days')) AS daygroup
FROM t;
INSERT INTO groups
SELECT date, strftime('%Y-%m-%d', datetime(date, '-3 days')) AS daygroup
FROM t;
INSERT INTO groups
SELECT date, strftime('%Y-%m-%d', datetime(date, '+1 days')) AS daygroup
FROM t;
INSERT INTO groups
SELECT date, strftime('%Y-%m-%d', datetime(date, '+2 days')) AS daygroup
FROM t;
INSERT INTO groups
SELECT date, strftime('%Y-%m-%d', datetime(date, '+3 days')) AS daygroup
FROM t;
INSERT INTO groups
SELECT date, date AS daygroup FROM t;
You get for example,
SELECT * FROM groups WHERE date = '2018-02-05'
date daygroup
2018-02-05 2018-02-04
2018-02-05 2018-02-03
2018-02-05 2018-02-02
2018-02-05 2018-02-06
2018-02-05 2018-02-07
2018-02-05 2018-02-08
2018-02-05 2018-02-05
indicating that '2018-02-05' belongs to groups '2018-02-02' to '2018-02-08'. If a date belongs to a group, then the value of the data joins the calculation of moving average for the group.
With this, calculating the moving average becomes straightforward:
SELECT
d.date, d.value, c.ma
FROM
t AS d
INNER JOIN
(SELECT
b.daygroup,
avg(a.value) AS ma
FROM
t AS a
INNER JOIN
groups AS b
ON a.date = b.date
GROUP BY b.daygroup) AS c
ON
d.date = c.daygroup
Note that the number of rows of intermediate table is 7 times as large as that of original table, it grows proportionately as taking wider the window. This should be acceptable unless you have much larger table.
I also experimented with 20 000 rows.
The insert query took 1.5s and select query took 0.5s on my laptop.
ADDED, perhaps better.
An alternative that does not require intermediate table.
The query below merges the table with itself, in a way 3 days lag is allowed, then takes average.
SELECT
t1.date, avg(t2.value) AS MVG
FROM
t AS t1
INNER JOIN
t AS t2
ON
datetime(t1.date, '-3 days') <= datetime(t2.date)
AND
datetime(t1.date, '+3 days') >= datetime(t2.date)
GROUP BY
t1.date
;
Related
It is actually possible to use # (the at sign) with sqlite to be able to use a calculated value as a constant in an other query ?
I am using a variable(a total) that i calculated previously to get an other variable (a proportion) over two time periods.
Total amout of sale
Proportion of sale between the first semester and second semester.
I copy the first query to get the constant and i had the first query to the second.
The answer is no BUT:-
This could possibly be done in a single query.
Consider this simple demo with hopefully easy to understand all-in-one queries:-
First the sales table:-
i.e. 2 columns semester and amount
10 rows in total so 1000 is the total amount
6 rows are S1 (amount is 600) so 60%
4 rows are S2 (amount is 400) so 40%
Created and populated using:-
CREATE TABLE IF NOT EXISTS sales (semester TEXT, amount REAL);
INSERT INTO sales VALUES('S1',100),('S1',100),('S1',100),('S1',100),('S1',100),('S1',100),('S2',100),('S2',100),('S2',100),('S2',100);
So you could use an all-in-one query such as:-
SELECT
(SELECT sum(amount) FROM sales) AS total,
(SELECT sum(amount) FROM sales WHERE semester = 'S1') AS s1total,
((SELECT sum(amount) FROM sales WHERE semester = 'S1') / (SELECT sum(amount) FROM sales)) * 100 AS s1prop,
(SELECT sum(amount) FROM sales WHERE semester = 'S2') AS s2total,
((SELECT sum(amount) FROM sales WHERE semester = 'S2') / (SELECT sum(amount) FROM sales)) * 100 AS s2prop
;
This would result in
i.e. s1prop and s2prop the expected results (the other columns may be useful)
An alternative, using a CTE (Common Table Expressions) that does the same could be:-
WITH cte_total(total,s1total,s2total) AS (SELECT
(SELECT sum(amount) FROM sales),
(SELECT sum(amount) FROM sales WHERE semester = 'S1'),
(SELECT sum(amount) FROM sales WHERE semester = 'S2')
)
SELECT total, s1total, (s1total / total) * 100 AS s1prop, s2total, (s2total / total) * 100 AS s2prop FROM cte_total;
you can have multiple CTE's and gather data from other tables or even being passed as parameters. They can be extremely useful and would even allow values to be accessed throughout.
e.g.
Here's an example where a 2nd cte is added (as the first cte) that mimics passing 3 dates (instead of the hard coded values ?'s could be coded and the parameters passed via parameter binding).
As the sales table has no date for the sale a literal value has been coded, this would be normally be the column with the sale date instead of WHERE '2023-01-01' /*<<<<< would be the column that holds the date */
the hard coded date has purposefully been used so result in the BETWEEN clause resulting in true.
if the date column did exist then WHERE criteria for the semester could then be by between the respective dates for the semester.
The example:-
WITH
dates AS (SELECT
'2023-01-01' /*<<<<< ? and can then be passed as bound parameter*/ AS startdate,
'2023-03-01' /*<<<<< ? and can then be passed as bound parameter*/ AS semester2_start,
'2023-05-30' /*<<<<< ? and can then be passed as bound parameter*/as enddate
),
cte_total(total,s1total,s2total) AS (SELECT
(SELECT sum(amount) FROM sales
WHERE '2023-01-01' /*<<<<< would be the column that holds the date */
BETWEEN (SELECT startdate FROM dates)
AND (SELECT enddate FROM dates)),
(SELECT sum(amount) FROM sales WHERE semester = 'S1'),
(SELECT sum(amount) FROM sales WHERE semester = 'S2')
)
SELECT total, s1total, (s1total / total) * 100 AS s1prop, s2total, (s2total / total) * 100 AS s2prop FROM cte_total;
I have an ending balance of $5000. I need to create a running balance, but adjust the first row to show the ending balance then sum the rest, so it will look like a bank statement. Here is what I have for the running balance but how can I adjust row 1 to not show a sum of the first row, but the ending balance instead.
with BalBefore as (
select *
from transactions
where ACCT_NAME = 'Real Solutions'
ORDER BY DATE DESC
)
select
DATE,
amount,
'$' || printf("%.2f", sum(AMOUNT) over (order by ROW_ID)) as Balance
from BalBefore;
This gives me"
DATE AMOUNT BALANCE
9/6/2019 -31.00 $-31.00 <- I need this balance to be replaced with $5000 and have the rest
9/4/2019 15.00 $-16.00 sum as normal.
9/4/2019 15.00 $-1.00
9/3/2019 -16.00 $-17.00
I have read many other questions, but I couldn't find one that I could understand so I thought I would post a simpler question.
The following is not short and sweet, but using the WITH statement and CTEs, I hope that the logic is apparent. Multiple CTEs are defined which refer to each other to make the overall query more readable. Altogether the goal was just to add a beginning balance record that could be :
/*
DROP TABLE IF EXISTS data;
CREATE temp TABLE data (
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
date DATETIME NOT NULL,
amount NUMERIC NOT NULL
);
INSERT INTO data
(date, amount)
VALUES
('2019-09-03', -16.00),
('2019-09-04', 15.00),
('2019-09-04', 15.00),
('2019-09-06', -31.00)
;
*/
WITH
initial_filter AS (
SELECT id, date, amount
FROM data
--WHERE ACCT_NAME = 'Real Solutions'
),
prepared AS (
SELECT *
FROM initial_filter
UNION ALL
SELECT
9223372036854775807 as id, --largest signed integer
(SELECT MAX(date) FROM initial_filter) AS FinalDate,
-(5000.00) --ending balance (negated for summing algorithm)
),
running AS (
SELECT
id,
date,
amount,
SUM(-amount) OVER
(ORDER BY date DESC, id DESC
RANGE UNBOUNDED PRECEDING
EXCLUDE CURRENT ROW) AS balance
FROM prepared
ORDER BY date DESC, id DESC
)
SELECT *
FROM running
WHERE id != 9223372036854775807
ORDER BY date DESC, id DESC;
This produces the following
id date amount balance
4 2019-09-06 -31.00 5000
3 2019-09-04 15.00 5031
2 2019-09-04 15.00 5016
1 2019-09-03 -16.00 5001
UPDATE: The first query was not producing the correct balances. The beginning balance row and the windowing function (i.e. OVER clause) were updated to accurately sum over the correct amounts.
Note: The balance on each row is determined completely from the previous rows, not from the current row's amount, because this works backward from an ending balance, not forward from the previous row balance.
There is table that has a date and cnt column e.g.
timestamp cnt
------------------
1547015021 14
1547024080 2
This table can be created using :-
DROP TABLE IF EXISTS roundit_base;
CREATE TABLE IF NOT EXISTS roundit_base (timestamp INTEGER, cnt INTEGER);
INSERT INTO roundit_base VALUES (1547015021,14),(1547024080,2);
The result should be the sum of the cnt column of rows that are the closest timestamp to a list of supplied timestamps, e.g. the supplied data could be
1546905600 - 0
1546992000 - 0
1547078400 - 0
...
The result should be along the lines of
1546905600 - 0
1546992000 - 14
1547078400 - 2
That is two columns:-
the timestamp from the list of supplied timestamps, that the respective rows from the database are closest to and
the sum of the cnt column those rows on a per supplied timestamp
Although the results are different from the expected results in that the calculations used places both 1547015021 and 1547024080 as being closest to the suplied timestamp of 1546992000;
The following could be the basis of an SQLite based solution :-
WITH
-- The supplied list of timestamps
v (cv,dflt) AS (
VALUES (1546905600,0),(1546992000,0),(1547078400,0)
),
-- Join the two sets calculating the difference
cte1 AS (
SELECT *, abs(cv - timestamp) AS diff FROM roundit_base INNER JOIN v
),
-- Find the closest (smallest difference) for each timestamp
cte2 AS (
SELECT *, min(diff) FROM cte1 GROUP BY timestamp
)
-- For each compartive value sum the counts allocated/assigned (timestamps) to that
SELECT cv,
CASE
WHEN
(SELECT sum(cnt) FROM cte2 WHERE cv = v.cv) IS NOT NULL
THEN
(SELECT sum(cnt) FROM cte2 WHERE cv = v.cv)
ELSE 0
END AS cnt
FROM v;
;
The above results in :-
SQLITE3
Task: get a data set that contains the following data - SEE NOTES BESIDE COLUMNS
SELECT DISTINCT DateTime(Rounded, 'unixepoch') AS RoundedDate, -- Rounded DateTime to the floor hour
Count() AS Count, -- Count of items that registered within the above time
CAST (avg(Speed) AS INT) AS AverageSpeed, -- Average table.Speed column data within the defined datetime
Count() AS SpeederCount -- ?? WTF? [pseudo constraints: if Speed > Speedlimit then +1]
FROM RawSpeedLane AS sl
INNER JOIN
SpeedLaneSearchData AS slsd ON slsd.ParentId = sl.Id
INNER JOIN
Projects AS p ON p.ProjectId = sl.ProjectId
WHERE sl.ProjectId = 72
GROUP BY RoundedDate;
The SQL above is currently gives me all the data I need, EXECPT for the last column.
This last column is supposed to be the count of records where that pass specific criteria. The only way I have found to successfully do this is to build a sub query... Cool? okay, but the problem is the sub query takes 4 minutes to run because well... I suck at SQL :P No matter how many different ways I've tried to write it, it still takes forever.
Here is the long, but working version.
SELECT DISTINCT RoundedDate,
Count() AS Count,
CAST (avg(Speed) AS INT) AS AverageSpeed,
(
SELECT count()
FROM RawSpeedLane AS slr
WHERE slr.ProjectId = 72 AND
datetime( ( (strftime('%s', Start) - (strftime('%M', Start) * 60 + strftime('%S', Start) ) ) ), 'unixepoch') = sl.RoundedDate AND
Speed > p.SpeedLimit
)
AS SpeederCount
FROM SpeedLaneReportDataView AS sl
INNER JOIN
Projects AS p ON p.ProjectId = sl.ProjectId
WHERE sl.ProjectId = 72
GROUP BY RoundedDate;
I currently just tried this for the last column
(select Count() where sl.Speed > p.SpeedLimit)
but as expected, i got 1s and 0s im not really sure on what to do here. Any hints or help that lead me in the right direction is very much appreciated.
I don't think SQLite has an IIF but CASE works.
This is a response to Backs answer, but I can't comment yet.
SELECT DISTINCT DateTime(Rounded, 'unixepoch') AS RoundedDate, -- Rounded DateTime to the floor hour
Count() AS Count, -- Count of items that registered within the above time
CAST (avg(Speed) AS INT) AS AverageSpeed, -- Average table.Speed column data within the defined datetime
SUM(CASE WHEN Speed > SpeedLimit THEN 1 ELSE 0 END) AS SpeederCount
FROM RawSpeedLane AS sl
With SUM and IIF:
SELECT DISTINCT DateTime(Rounded, 'unixepoch') AS RoundedDate, -- Rounded DateTime to the floor hour
Count() AS Count, -- Count of items that registered within the above time
CAST (avg(Speed) AS INT) AS AverageSpeed, -- Average table.Speed column data within the defined datetime
SUM(IIF(Speed > SpeedLimit, 1, 0)) AS SpeederCount
FROM RawSpeedLane AS sl
We observed that Sqlite returns always only one row if we apply group on a subquery and do not use an aggregation operation such as count or sum.
Here is a toy example:
Given table
CREATE TABLE ExampleTable (
id INT PRIMARY KEY,
rank INT NOT NULL
);
with data
INSERT INTO ExampleTable(id, rank) VALUES (1, 1);
INSERT INTO ExampleTable(id, rank) VALUES (2, 2);
INSERT INTO ExampleTable(id, rank) VALUES (3, 2);
the query
SELECT rank, COUNT(*) FROM (select id, rank from ExampleTable) GROUP BY rank;
returns
rank|count
2|2
1|1
However, without the COUNT operation Sqlite returns only 1 row.
SELECT rank FROM (select id, rank from ExampleTable) GROUP BY rank;
=>
rank
1
Is this is a bug or an expected behavior?