cumulative sum of differences - sqlite

Is it possible to compute a running total in SQLite, the way Excel does?
Here's a picture of what I want to do, but in SQLite:

SQLite (before version 3.25) does not have window functions, so you have to construct a set-based computation.
You want to get the sum of all previous differences, i.e., the sum of the differences in all rows that have the same or a smaller ID:
SELECT id,
       d,
       k,
       (SELECT SUM(d - k)
        FROM MyTable AS T2
        WHERE T2.id <= MyTable.id
       ) AS cumulative_sum
FROM MyTable
ORDER BY id;
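Since version 3.25, SQLite does have window functions, which compute the same running total in a single pass; a minimal sketch against the same assumed MyTable schema:

SELECT id,
       d,
       k,
       SUM(d - k) OVER (ORDER BY id) AS cumulative_sum
FROM MyTable
ORDER BY id;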

Related

Sqlite3 GROUPBY with HAVING and ORDER BY not working

I have a SQL query that calculates the correct output, and even does the ORDER BY correctly, but it is not executing the HAVING clause. Am I doing something wrong? It is not filtering the percentages at all.
select table1.name as theme,
       printf("%.2f",
              cast(count(table2.name) as float) /
              (select count(table2.name)
               from table1
               join table2
               where table1.id = table2.theme_id) * 100) as percentage
from table1
join table2
where table1.id = table2.theme_id
group by table1.id
having percentage >= 5.00
order by percentage desc;
The problem is that printf() returns a string, so the calculated column percentage is a string, and you compare it to 5.00, which is a number.
This comparison can't give you what you expect because it is not a comparison between numbers.
One way to solve this is to drop printf() and use round(), which returns a number instead:
select table1.name as theme,
       round(cast(count(table2.name) as float) /
             (select count(table2.name)
              from table1
              join table2
              where table1.id = table2.theme_id) * 100, 2) as percentage
from table1
join table2
where table1.id = table2.theme_id
group by table1.id
having percentage >= 5.00
order by percentage desc;
or cast percentage to float:
having cast(percentage as float) >= 5.00
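The effect is easy to demonstrate: under SQLite's comparison rules, any TEXT value sorts after any number, so a string percentage always passes the >= test. A quick sanity check, assuming nothing beyond SQLite itself:

select printf('%.2f', 4.5) >= 5.00;  -- 1: '4.50' is TEXT, and TEXT > any number
select round(4.5, 2) >= 5.00;        -- 0: a genuine numeric comparison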

will converting from dateFrom/dateTo to period data type improve performance?

I have a really slow query and I'm trying to speed it up.
I have a target date range (dateFrom/dateTo) defined in a table with only one row I need to use as a limit against a table with millions of rows. Is there a best practice for this?
I started with one table with one row with dateFrom and dateTo fields. I can limit the rows in the large table by CROSS JOINing it with the small table and using the WHERE clause, like:
select count(*)
from tblOneRow o, tblBig b
where o.dateFrom < b.dateTo
  and o.dateTo >= b.dateFrom
or I can inner join the tables on the date range, like:
select count(*)
from tblOneRow o
inner join tblBig b
  on o.dateFrom < b.dateTo
 and o.dateTo >= b.dateFrom
but I thought if I changed my single-row table to use one field with a PERIOD data type instead of two fields with DATE data types, it could improve the performance. Is this a reasonable assumption? The explain isn't showing a time difference if I change it to:
select count(*)
from tblOneRow o
inner join tblBig b
  on begin(o.date) < b.dateTo
 and end(o.date) >= b.dateFrom
or if I convert the small table's date range to a PERIOD data type and join ON P_INTERSECT, like:
select count(*)
from tblOneRow o
inner join tblBig b
  on o.date p_intersect period(b.dateFrom, b.dateTo + 1) is not null
to help the parsing engine with this join, would I need to define the fields on the large table with a period data type instead of two dates? I can't do that as I don't own that table, but if that's the case, I'll give up on improving performance with this method.
Thanks for your help.
I don't expect any difference between the first three Selects; the Explain should be the same, a product join (the optimizer should expect exactly one row, but as it's duplicated, the estimated size should be the number of AMPs in your system). The last Select should be worse because you apply a calculation (OVERLAPS would be more appropriate, but probably not better).
One way to improve this single-row cross join would be a view (select date '...' as dateFrom, date '...' as dateTo) instead of the single-row table. This should resolve the dates and result in hard-coded dateFrom/To instead of a product join.
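A sketch of such a view (the view name and the dates are placeholders to adapt):

replace view vOneRow as
select date '2023-01-01' as dateFrom,
       date '2023-12-31' as dateTo;

Joining against vOneRow then gives the optimizer literal dates to work with instead of a product join.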
It's similar when you switch to scalar subqueries:
select count(*)
from tblBig b
where (select min(dateFrom) from tblOneRow) < b.dateTo
  and (select min(dateTo) from tblOneRow) >= b.dateFrom

Getting median of column values in each group

I have a table containing user_id, movie_id, rating. These are all INT, and ratings range from 1-5.
I want to get the median rating and group it by user_id, but I'm having some trouble doing this.
My code at the moment is:
SELECT AVG(rating)
FROM (SELECT rating
      FROM movie_data
      ORDER BY rating
      LIMIT 2 - (SELECT COUNT(*) FROM movie_data) % 2
      OFFSET (SELECT (COUNT(*) - 1) / 2
              FROM movie_data));
However, this seems to return the median value of all the ratings. How can I group this by user_id, so I can see the median rating per user?
The following gives the required median:
DROP TABLE IF EXISTS movie_data2;

CREATE TEMPORARY TABLE movie_data2 AS
SELECT user_id, rating FROM movie_data ORDER BY user_id, rating;

SELECT a.user_id, a.rating
FROM (SELECT user_id, rowid, rating
      FROM movie_data2) a
JOIN (SELECT user_id,
             CAST((MIN(rowid) + MAX(rowid)) / 2 AS INT) AS midrow
      FROM movie_data2
      GROUP BY user_id) c
  ON a.rowid = c.midrow;
The logic is straightforward but the code is not beautified. Given encouragement or comments I will improve it. In a nutshell, the trick is to use SQLite's rowid.
This is not easily possible because SQLite does not allow correlated subqueries to refer to outer values in the LIMIT/OFFSET clauses.
Add WHERE clauses for the user_id to all three subqueries, and execute them for each user ID.
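A sketch of that per-user form, following the original query, with the user ID passed as a parameter (here :uid):

SELECT AVG(rating)
FROM (SELECT rating
      FROM movie_data
      WHERE user_id = :uid
      ORDER BY rating
      LIMIT 2 - (SELECT COUNT(*) FROM movie_data WHERE user_id = :uid) % 2
      OFFSET (SELECT (COUNT(*) - 1) / 2
              FROM movie_data
              WHERE user_id = :uid));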
SELECT user_id, AVG(rating)
FROM movie_data
GROUP BY user_id
ORDER BY AVG(rating);
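As with the running total above, window functions (SQLite 3.25+) allow a grouped median in a single statement; a sketch assuming the same movie_data table:

SELECT user_id, AVG(rating) AS median_rating
FROM (SELECT user_id, rating,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY rating) AS rn,
             COUNT(*) OVER (PARTITION BY user_id) AS cnt
      FROM movie_data)
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY user_id;

For an odd count both expressions select the single middle row; for an even count they select the two middle rows, which AVG() then averages.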

Split-apply-combine in SQLite

Is there an SQLite equivalent of R's by, i.e., the split-apply-combine strategy?
Specifically, I have a table with columns firm,flag. firm is an integer that takes on a few hundred values (a firm id), flag is an integer taking on the values {0,1}. There are hundreds of entries per firm. I would like to compute the mean of flag for each firm, then store that in the same table (not efficient, I know, as each value will be repeated multiple times).
You could use a subquery:
UPDATE MyTable
SET FlagAverage = (SELECT AVG(flag)
                   FROM MyTable AS T2
                   WHERE T2.firm = MyTable.firm);
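This assumes a FlagAverage column already exists on MyTable (the column and table names are just the answer's example); if it doesn't, add it first. AVG() returns a floating-point value, so REAL is the natural type:

ALTER TABLE MyTable ADD COLUMN FlagAverage REAL;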

How to know if a row doesn't exist?

I have the following query:
SELECT rowid FROM table1 ORDER BY RANDOM() LIMIT 1
I also have another table (table3) with columns table1_id and table2_id; table1_id links to a row in table1, and table2_id links to a row in another table.
I want my query to return only those results that are referenced in table3, i.e., rows whose table1 rowid appears in the table1_id column. There may be no rows in table3 at all referring to a certain table1 rowid, in which case I don't want to receive it.
How can I achieve this goal?
Update: I tried the following query, which doesn't work:
SELECT rowid FROM table1
WHERE rowid IN (SELECT table1_id FROM table3 WHERE table1_id = table1.rowid)
ORDER BY RANDOM() LIMIT 1
SELECT rowid FROM table1
WHERE rowid IN ( SELECT DISTINCT table1_id FROM table3 )
ORDER BY RANDOM() LIMIT 1;
This query means "choose at random a row from table1 which has an entry in table3".
Every row in table1 has an equal likelihood of being selected (thanks to DISTINCT), as long as it is referenced at least once in table3.
If you are trying to get more than one result, then you should remove the "ORDER BY RANDOM() LIMIT 1" clause.
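An equivalent formulation uses EXISTS instead of IN, expressing the same semi-join without needing DISTINCT (a sketch against the same tables):

SELECT rowid FROM table1
WHERE EXISTS (SELECT 1 FROM table3 WHERE table3.table1_id = table1.rowid)
ORDER BY RANDOM() LIMIT 1;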
Assuming you want to select more than just a rowid, you need to SELECT from a JOIN between the tables you're interested in. SQLite doesn't implement the full set of standard JOIN types, so you'll need to rework your query to use a LEFT OUTER JOIN.
SELECT table1.rowid, table1.other_field
FROM table3
LEFT OUTER JOIN table1 ON table3.table1_id = table1.rowid
ORDER BY RANDOM()
LIMIT 1;
