SQLite: improve CASE WHEN and GROUP BY performance

I'm optimizing a query in SQLite3.
It uses CASE WHEN, GROUP BY, and COUNT, but it is very slow (about 14 seconds).
Here is my database file information:
size: about 2 GB
rows: about 3 million
columns: 55
What can I do to optimize the query's performance?
Is there a better query for this result?
Please help me. Thanks.
select
  case
    when score = 100 then 'A'
    when score < 100 and score >= 40 then 'B'
    else 'C'
  end as range,
  count(*) as count
from grade_info
where type < 9 and
  (date >= '2019-07-09 00:00:00' and date <= '2019-07-09 23:59:59') and
  is_new = 1
group by
  case
    when score = 100 then 'A'
    when score < 100 and score >= 40 then 'B'
    else 'C'
  end;
Table grade_info has a multi-column index: (type, date, is_new, score).
The conditions on the columns (type, date, is_new) are always used in this query. Here is the EXPLAIN QUERY PLAN result.
selectid | order | from | detail
---------+-------+------+-----------------------------------------------------------------
0        | 0     | 0    | SEARCH TABLE grade_info USING INDEX idx_03 (type<?) (~2777 rows)
0        | 0     | 0    | USE TEMP B-TREE FOR GROUP BY
And I want a result like this:
A | 5124
B | 124
C | 12354

As Shawn suggests, try changing the index to have the date column as the first column:
CREATE INDEX [idx_cover] ON [grade_info] ([date], [is_new], [type], [score]);
SQLite allows references to aliased expressions in the WHERE and GROUP BY clauses, so you can simply say GROUP BY range rather than repeating the CASE expression. This probably won't change the efficiency, but it makes the query shorter and more readable.
If you run ANALYZE as MikeT suggests, the execution plan should change to say "COVERING INDEX...". If I understand correctly, that indicates that the entire query can be executed by traversing the single multi-column index without going back to the table data.
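For example, a minimal way to check this (assuming the idx_cover index above has been created) is to refresh the planner statistics and re-run the plan over the same query:
ANALYZE;

EXPLAIN QUERY PLAN
select
  case
    when score = 100 then 'A'
    when score < 100 and score >= 40 then 'B'
    else 'C'
  end as range,
  count(*) as count
from grade_info
where type < 9 and
  (date >= '2019-07-09 00:00:00' and date <= '2019-07-09 23:59:59') and
  is_new = 1
group by range;
-- The detail column should now report something like:
--   SEARCH TABLE grade_info USING COVERING INDEX idx_cover (date>? AND date<?)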
Try date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59'.
Finally, CASE ... WHEN is short-circuited, so place the more likely cases first to avoid unnecessary comparisons. Also eliminate redundant conditions: if a previous WHEN has already ruled out a range, there is no need to re-test that range in the next condition. (If you have already ruled out score = 100, then it is unnecessary to check score < 100, since it will of course be less than 100... assuming all scores are in the range 0 to 100.) For instance, if scores are uniformly distributed, the following could be faster, possibly eliminating over 17,000 conditional checks.
SELECT
  CASE
    WHEN score < 40 THEN 'C'
    WHEN score < 100 THEN 'B' -- already tested to be >= 40
    ELSE 'A'                  -- already tested to be >= 100
  END AS range,
  count(*) AS count
FROM grade_info
WHERE type < 9 AND
  (date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59') AND
  is_new = 1
GROUP BY range;

Related

SQLite: Using COALESCE inside a CASE statement

I have two tables: one with a record of a person with an initial number, and a second one with records of changes to this number.
During a join, I do coalesce(latest_of_series, initial) to get a single number per person. So far so good.
I also group these numbers into groups, so I can order these groups separately. I know I can do:
select
  coalesce(latest, initial) as final,
  case
    when coalesce(latest, initial) > 1 and coalesce(latest, initial) < 100 then 'group 1'
    -- other cases
  end as group
-- rest of the query
but that's of course horribly unreadable.
I tried:
select
  coalesce(latest_of_series, initial_if_no_series) as value,
  case
    when value > 1 and value < 100 then 'group 1'
    -- rest of the cases
  end as group
-- rest of the query
but then SQLite complains that there's no column "value".
Is there really no way of using previous result of coalesce as a "variable"?
That's not an SQLite limitation. That's an SQL limitation.
All the column names are decided as one. You can't define a column in line 2 of your query and then refer to it in line 3 of your query. All columns derive from the tables you select from, each on its own; they can't "see" each other.
But you can use nested queries.
select
  value,
  case
    when value >= 1 and value < 100 then 'group 1'
    when value >= 100 and value < 200 then 'group 2'
    else 'group 3'
  end value_group
from
(
  select
    coalesce(latest_of_series, initial_if_no_series) as value
  from
    my_table
  group by
    user_id
) v
This way, the columns of the inner query can be decided as one, and the columns of the outer query can be decided as one. It might even be faster, depending on the circumstances.
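If you find it more readable, the same idea can also be written with a CTE (WITH clause), which SQLite supports; this is just an equivalent sketch of the nested query above:
with v as
(
  select
    coalesce(latest_of_series, initial_if_no_series) as value
  from
    my_table
  group by
    user_id
)
select
  value,
  case
    when value >= 1 and value < 100 then 'group 1'
    when value >= 100 and value < 200 then 'group 2'
    else 'group 3'
  end value_group
from v;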

Oracle 11g PLSQL - Splitting A Record Out Into Constituent Records Over Time - Row Generating

I have a dataset (a view) that has a numeric field "WR_EST_MHs". If that field exceeds a certain number of man-hours (120 or 60, depending on two other fields' values), I need to split it out into constituent records and spread those hours over future weeks.
The OH_UG_Key and 1kMCM_Flag fields determine the threshold for splitting. For example, if the OH_UG = 1 AND 1kMCM_Flag = 'N' and the WR_EST_MHs > 120, then spread the WR_EST_MHs value over as many records as is necessary, in 120 MH increments, changing only the WRSchedDate and WRSchedDate_Key fields (advancing each by one week).
Each OH_UG / 1kMCM_Flag / WR_EST_MHs scenario is as follows:
This is an example of what I need to do:
I thought that something like this might work, but I haven't worked with levels before:
with cte as
  (Select * from "STJOF"."vfactScheduledWAWork")
select WR_Key, WP_Key, WRShedDate, DistSA_Key_Hash, CrewHQ_Key_Hash, Priority_Key_Hash, JobType_Key_Hash, WRStatus_Key_Hash, PerfBy_Key, OHUG_Key, 1kMCM_Flag, WR_EST_MHs
from cte cross join table(cast(multiset(select level from dual
                                        connect by level >= WR_EST_MHs / 120
                                       ) as sys.odcinumberlist))
order by WR_Key;
I also thought this could be done with a "tally table" which I have a little experience with. I really don't know where to begin on this one.
So I would say that a "Tally Table" will work if it is applied correctly. (Or, in this case, a tally view.)
First, break the logic for the hour breakout into a function so we don't have case when everywhere like so:
CREATE OR REPLACE FUNCTION get_hour_breakout(in_ohug_key IN NUMBER, in_1kmcm_flag IN varchar2, in_tot_hours IN number)
  RETURN number
IS
  hours number;
BEGIN
  hours :=
    case when in_ohug_key=2 and in_1kmcm_flag='N' and in_tot_hours>60 then 60 else
      case when in_ohug_key=2 and in_1kmcm_flag='Y' and in_tot_hours>60 and in_tot_hours<=120 then 60 else
        case when in_ohug_key=2 and in_1kmcm_flag='Y' and in_tot_hours>120 then 120 else
          120
        end
      end
    end;
  RETURN(hours);
END get_hour_breakout;
This way, if the hour breakout logic changes, it can be tweaked in one place.
Second, join to a dynamic "tally" view like so:
select wr_key,
       WP_Key,
       wrscheddate + idxkey.nnn*7 wrscheddate,
       to_char(wrscheddate + idxkey.nnn*7, 'yyyymmdd') WRSchedDate_Key,
       OHUG_Key,
       kMCM_Flag,
       case when (wr_est_mhs - idxkey.nnn*get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)) >= get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
            then get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
            else wr_est_mhs - idxkey.nnn*get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
       end wr_est_mhs
from yourView vwrk
     inner join (SELECT ROWNUM-1 nnn
                 FROM ( SELECT 1 just_a_column
                        FROM dual
                        CONNECT BY LEVEL <= 52
                      )
                ) idxkey
     on vwrk.wr_est_mhs/get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs) > idxkey.nnn
By using CONNECT BY LEVEL we, in effect, generate a set of zero-indexed rows; then, by joining to it where the hours divided by the breakout are greater than that index, we get one or more rows for each source row.
For example, if the function returns 120 and the hours are 100 you get a single row, so it stays 1 to 1. If the function returns 120 and the hours are 500, however, you get 5 rows because 500/120=4.1666666…, which in the join gives rows 4,3,2,1,0. Then the rest is simple math to determine the number of hours per breakout.
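As a quick standalone check (the 500 and 120 values are just the example above), the inline tally produces these index rows:
SELECT nnn
FROM (SELECT ROWNUM - 1 AS nnn
      FROM dual
      CONNECT BY LEVEL <= 52)
WHERE 500 / 120 > nnn;
-- returns nnn = 0, 1, 2, 3, 4 (five rows)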
This could also be improved by moving the function call into the lower view so it is only evaluated once per row. And the inline tally view could be made into its own view, depending on the maintainability you need to build into it.
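For instance, the inline tally could be pulled out into its own view, along these lines (the name tally_52 is just a placeholder):
CREATE OR REPLACE VIEW tally_52 AS
  SELECT ROWNUM - 1 AS nnn
  FROM dual
  CONNECT BY LEVEL <= 52;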

How to get the total quantity of results using count(*)?

I need to get the total number of results for each person, but instead I get the result in the attached screenshot (resultado).
MY QUERY:
select t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion as departamento_trabaja, t.fecha,count(*)
from fulltime.timbre t, fulltime.empleado e, fulltime.departamento d
where d.depa_id=e.depa_id and t.codigo_empleado=e.codigo_empleado and
trunc(t.fecha) between trunc(to_date('15/02/2017','dd/mm/yyyy')) and trunc(to_date('14/03/2017','dd/mm/yyyy'))
group by t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion, t.fecha
Expected data:
NOMBRE            | APELLIDO           | DEPARTAMENTO_TRABAJA | VECES_MARCADAS(count)
MARIA TARCILA     | IGLESIAS BECERRA   | ALCALDIA             | 4
KATHERINE TATIANA | SEGOVIA FERNANDEZ  | ALCALDIA             | 10
FREDDY AGUSTIN    | VALDIVIESO VALLEJO | ALCALDIA             | 3
UPDATE..
select e.nombre,e.apellido,d.descripcion as departamento_trabaja,COUNT(*)
from fulltime.timbre t, fulltime.empleado e, fulltime.departamento d
where d.depa_id=e.depa_id and t.codigo_empleado=e.codigo_empleado and
trunc(t.fecha) between trunc(to_date('15/02/2017','dd/mm/yyyy')) and trunc(to_date('14/03/2017','dd/mm/yyyy'))
group by t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion, t.fecha
You should only select and group by the non-aggregate columns you actually want to count against. At the moment you're including the fecha_hora_timbre and fecha columns in each row, so you're counting the unique combinations of those columns as well as the name/department information you actually want to count.
select e.nombre, e.apellido, d.descripcion as departamento_trabaja,
count(*) as veces_marcadas
from fulltime.timbre t
join fulltime.empleado e on t.codigo_empleado=e.codigo_empleado
join fulltime.departamento d on d.depa_id=e.depa_id
where t.fecha >= to_date('15/02/2017','dd/mm/yyyy')
and t.fecha < to_date('15/03/2017','dd/mm/yyyy')
group by e.nombre, e.apellido, d.descripcion
I've removed the extra columns. Notice that they have gone from both the select list and the group-by clause. If you have a non-aggregate column in the select list that isn't in the group-by you'll get an ORA-00937 error; but if you have a column in the group-by that isn't in the select list then it will still group by that even though you can't see it and you just won't get the results you expect.
I've also changed from old-style join syntax to modern syntax. And I've changed the date comparison; firstly because doing trunc() as part of trunc(to_date('15/02/2017','dd/mm/yyyy')) is pointless - you already know the time part is midnight, so the trunc doesn't achieve anything. But mostly so that if there is an index on fecha, that index can be used. If you do trunc(t.fecha) then every row's value has to be truncated, which stops the index being used (unless you have a function-based index). As BETWEEN is inclusive, using >= and < with one day later as the upper limit should have the same effect overall.
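If you did want to keep comparing on trunc(t.fecha), a function-based index along these lines would let that comparison use an index (a sketch; the index name is made up):
CREATE INDEX timbre_fecha_trunc_ix ON fulltime.timbre (TRUNC(fecha));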

sqlite query - select all older than X days, not Y newest

My SQLite table has the following format (all columns are non-null, non-unique INTEGERs, for example):
time type data
1436268660 0 ...
1436268661 1 ...
1436268662 0 ...
1436268666 2 ...
1436268668 1 ...
Sometimes I need to delete all rows of each type which are older than some time but I need to leave 5 newest from each type even if they are older than that specific time.
In other words, to leave 5 newest rows of each type and also all newer than specified time (if there is more than 5) and to delete the rest.
So, if the specified time is X and type 0 has 20 rows newer than X, nothing is done for type 0 (all are new enough).
Also if the specified time is X and type 0 has 5 rows all older than X, nothing is done (there is not more than 5 of them).
But if there is for example 7 entries and at least 2 of them are older than X, then those 2 oldest are deleted.
What I have so far is this query, but it is not correct. It just deletes all rows older than X when there are more than 5 of that type. If they are all older than X, nothing is left.
DELETE FROM table WHERE rowid IN
(SELECT table.rowid FROM table JOIN
(SELECT type FROM table GROUP BY type HAVING COUNT(*) > 5)
USING (type) WHERE TIME < 14362685399);
As you can see, the situation is a little more complicated: the "type" described above is in reality a unique combination of multiple columns (you can replace it with type1, type2, type3), but I guess that is not so important for the solution.
Thank you for any help.
time type0 type1 type2 data
1436268660 0 0 0 ...
1436268661 1 1 1 ...
1436268662 0 0 0 ...
1436268666 2 2 2 ...
1436268668 1 1 1 ...
Edit: Basically I need to delete all rows NOT in: (newer than X) UNION (5 latest entries for each type). I just don't know how to build the result for "5 latest entries for each type".
Try this. I've renamed your table to t.
DELETE FROM t WHERE rowid IN(
SELECT a.rowid FROM t a
WHERE time < 14362685399
AND a.rowid NOT IN (
SELECT b.rowid FROM t b
WHERE a.type = b.type
ORDER BY b.time DESC
LIMIT 5
)
);
Note that this may not be very efficient on large data, due to the correlated subquery which will be evaluated each time it is required (possibly once per distinct type in your table, or maybe even once per row in the table, depending on how the query is executed).
As an aside, in a SQL variant that supports it, this would probably be better achieved with a window function. For example, in Postgres.
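For illustration, here is a hedged sketch of that approach using ROW_NUMBER(), which recent SQLite versions (3.25+) also support; it keeps the same table name t and the rowid trick:
DELETE FROM t
WHERE time < 14362685399
  AND rowid IN (
    SELECT rowid FROM (
      SELECT rowid,
             ROW_NUMBER() OVER (PARTITION BY type ORDER BY time DESC) AS rn
      FROM t
    )
    WHERE rn > 5
  );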

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-SQL COUNT(DISTINCT ...) operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1).
This is going to be repeated over every column in every table in the DB. That includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding indexes for every column is not a good option.
This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing results (in seconds):
Method            | 20_Columns | 5_Columns | 1_Column (Large) | 1_Column (Small)
1) Count/GroupBy  | 10.8       | 4.8       | 2.8              | 0.14
2) CountDistinct  | 12.4       | 4.8       | 3                | 0.7
3) dense_rank     | 226        | 30        | 6                | 4.33
4) Count+Max      | 98.5       | 44        | 16               | 12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference between them) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps this is because the gains that would be achieved by limiting the number of table scans are small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
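A tiny illustration of the difference (the table t and its values are made up):
-- Suppose t(ColumnName) contains 1, 2, 2, NULL
SELECT COUNT(DISTINCT ColumnName) FROM t;                              -- 2 (NULL ignored)
SELECT COUNT(*) FROM (SELECT ColumnName FROM t GROUP BY ColumnName) s; -- 3 (the NULL group is counted)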
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
distinct coalesce(
your_table.column_1, your_table.column_2
-- cast them if you want replace value from column are not same type
)
) as COUNT_TEST
The coalesce function helps you combine two columns, replacing null values. I used this in my case and it gave the correct result.
Not sure this would be the fastest, but it might be worth testing. Use CASE to give NULL a value. Clearly you would need to pick a value for NULL that does not occur in the real data. According to the query plan, this would be a dead heat with the count(*)/group-by solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach can also count more than one column
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]
