SQLite query - delete rows older than X days, but keep the Y newest of each type

My SQLite table has the following format (all columns are NOT NULL, non-unique INTEGERs, for example):
time type data
1436268660 0 ...
1436268661 1 ...
1436268662 0 ...
1436268666 2 ...
1436268668 1 ...
Sometimes I need to delete all rows of each type that are older than some time, but I need to keep the 5 newest of each type even if they are older than that specific time.
In other words, keep the 5 newest rows of each type plus all rows newer than the specified time (if there are more than 5), and delete the rest.
So, if the specified time is X and type 0 has 20 rows newer than X, nothing is done for type 0 (all are new enough).
Also, if the specified time is X and type 0 has only 5 rows, all older than X, nothing is done (there are no more than 5 of them).
But if there are, for example, 7 entries and at least 2 of them are older than X, then those 2 oldest are deleted.
What I have so far is this query, but it is not correct. It just deletes all rows older than X when there are more than 5 of that type, so if they are all older than X, nothing is left.
DELETE FROM table WHERE rowid IN
(SELECT table.rowid FROM table JOIN
(SELECT type FROM table GROUP BY type HAVING COUNT(*) > 5)
USING (type) WHERE TIME < 14362685399);
As you can see, the situation is a little more complicated, as the "type" described above is in reality a unique combination of multiple columns (you can replace it with type1, type2, type3), but I guess that is not so important for the solution.
Thank you for any help.
time type0 type1 type2 data
1436268660 0 0 0 ...
1436268661 1 1 1 ...
1436268662 0 0 0 ...
1436268666 2 2 2 ...
1436268668 1 1 1 ...
Edit: Basically I need to delete all rows NOT in: (newer than X) UNION (5 latest entries for each type). I just don't know how to create the result set "5 latest entries for each type".

Try this. I've renamed your table to t.
DELETE FROM t WHERE rowid IN (
    SELECT a.rowid FROM t a
    WHERE time < 14362685399          -- older than X
      AND a.rowid NOT IN (
          -- the 5 newest rows of this row's type
          SELECT b.rowid FROM t b
          WHERE a.type = b.type
          ORDER BY b.time DESC
          LIMIT 5
      )
);
Note that this may not be very efficient on large data, due to the correlated subquery which will be evaluated each time it is required (possibly once per distinct type in your table, or maybe even once per row in the table, depending on how the query is executed).
As an aside, in a SQL variant that supports window functions (Postgres, for example), this would probably be better done with one; a rough sketch follows.
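A minimal sketch of that approach, assuming a dialect with window functions. rowid is used as the key column here to match the SQLite schema (in Postgres you would substitute the table's primary key), and the literal timestamp stands in for X:
DELETE FROM t WHERE rowid IN (
    SELECT rowid FROM (
        -- rank rows within each type, newest first
        SELECT rowid, time,
               ROW_NUMBER() OVER (PARTITION BY type ORDER BY time DESC) AS rn
        FROM t
    ) ranked
    -- remove only rows that are both outside the newest 5 and older than X
    WHERE rn > 5 AND time < 14362685399
);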

Related

Oracle 11g PLSQL - Splitting A Record Out Into Constituent Records Over Time - Row Generating

I have a dataset (a view) that has a numeric field "WR_EST_MHs". If that field exceeds a certain number of man-hours (120 or 60, depending on two other fields' values), I need to split it out into constituent records and spread those hours over future weeks.
The OH_UG_Key and 1kMCM_Flag fields determine the threshold for splitting. For example, if the OH_UG = 1 AND 1kMCM_Flag = 'N' and the WR_EST_MHs > 120, then spread the WR_EST_MHs value over as many records as is necessary, in 120 MH increments, changing only the WRSchedDate and WRSchedDate_Key fields (advancing each by one week).
Each OH_UG / 1kMCM_Flag / WR_EST_MHs scenario is as follows:
This is an example of what I need to do:
I thought that something like this might work, but I haven't worked with levels before:
with cte as
  (select * from "STJOF"."vfactScheduledWAWork")
select WR_Key, WP_Key, WRShedDate, DistSA_Key_Hash, CrewHQ_Key_Hash, Priority_Key_Hash, JobType_Key_Hash, WRStatus_Key_Hash, PerfBy_Key, OHUG_Key, 1kMCM_Flag, WR_EST_MHs
from cte cross join table(cast(multiset(select level from dual
                                        connect by level >= WR_EST_MHs / 120
                                       ) as sys.odcinumberlist))
order by WR_Key;
I also thought this could be done with a "tally table" which I have a little experience with. I really don't know where to begin on this one.
So I would say that a "Tally Table" will work if it is applied correctly. (Or, in this case, a tally view.)
First, break the logic for the hour breakout into a function so we don't have case when everywhere like so:
CREATE OR REPLACE FUNCTION get_hour_breakout(in_ohug_key IN NUMBER, in_1kmcm_flag IN VARCHAR2, in_tot_hours IN NUMBER)
RETURN NUMBER
IS
  hours NUMBER;
BEGIN
  hours :=
    case when in_ohug_key = 2 and in_1kmcm_flag = 'N' and in_tot_hours > 60 then 60 else
      case when in_ohug_key = 2 and in_1kmcm_flag = 'Y' and in_tot_hours > 60 and in_tot_hours <= 120 then 60 else
        case when in_ohug_key = 2 and in_1kmcm_flag = 'Y' and in_tot_hours > 120 then 120 else
          120
        end
      end
    end;
  RETURN(hours);
END get_hour_breakout;
This way, if the hour breakout logic changes, it can be tweaked in one place.
Second, join to a dynamic "tally" view like so:
select wr_key,
       WP_Key,
       wrscheddate + idxkey.nnn*7 wrscheddate,
       to_char(wrscheddate + idxkey.nnn*7, 'yyyymmdd') WRSchedDate_Key,
       OHUG_Key,
       kMCM_Flag,
       case when (wr_est_mhs - idxkey.nnn*get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)) >= get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
            then get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
            else wr_est_mhs - idxkey.nnn*get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
       end wr_est_mhs
from yourView vwrk
     inner join (SELECT ROWNUM - 1 nnn
                 FROM (SELECT 1 just_a_column
                       FROM dual
                       CONNECT BY LEVEL <= 52
                      )
                ) idxkey
       on vwrk.wr_est_mhs / get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs) > idxkey.nnn
By using CONNECT BY LEVEL we, in effect, generate a set of zero-indexed rows; then, by joining on the condition that the hours divided by the breakout are greater than that index, we get the appropriate number of rows for each record.
For example, if the function returns 120 and the hours are 100 you get a single row, so it stays 1 to 1. If the function returns 120 and the hours are 500, however, you get 5 rows because 500/120=4.1666666…, which in the join gives rows 4,3,2,1,0. Then the rest is simple math to determine the number of hours per breakout.
This could also be improved by moving the function call into the lower view so it is only evaluated once per row. And the inline tally view could be made into its own view, depending on how maintainable you need this to be; a sketch of that follows.
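For instance, the inline tally could be pulled out into its own view; a rough sketch (the view name tally_52 and the fixed 52-row limit are just carried over from the inline version above):
CREATE OR REPLACE VIEW tally_52 AS
SELECT ROWNUM - 1 AS nnn
FROM (SELECT 1 AS just_a_column
      FROM dual
      CONNECT BY LEVEL <= 52);
The main query would then join to tally_52 idxkey instead of the inline subquery.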

SQLite: improve CASE WHEN and GROUP BY performance

I'm optimizing my query using SQLite3.
It uses CASE WHEN expressions, GROUP BY, and COUNT.
But the query is very slow (about 14 seconds).
Here is my database file information:
size: about 2 GB
rows: about 3 million
columns: 55 columns
What can I do to optimize the query's performance?
Is there a better query for this result?
Please help me. Thanks.
select
case
when score = 100 then 'A'
when score < 100 and score >= 40 then 'B'
else 'C'
end as range,
count(*) as count
from grade_info
where type < 9 and
(date >= '2019-07-09 00:00:00' and date <= '2019-07-09 23:59:59') and
is_new = 1
group by
case
when score = 100 then 'A'
when score < 100 and score >= 40 then 'B'
else 'C'
end;
Table grade_info has multi-column index: (type, date, is_new, score)
The conditions for columns (type, date, is_new) are always used in this query. Here is the explain query plan result.
selectid | order | from | detail
--------------------------------
0 0 0 SEARCH TABLE grade_info USING INDEX idx_03 (type<?) (~2777 rows)
0 0 0 USE TEMP B-TREE FOR GROUP BY
and I want the result like this.
A | 5124
B | 124
C | 12354
As Shawn suggests, try changing the index to have the date column as the first column:
CREATE INDEX [idx_cover] ON [grade_info] ([date], [is_new], [type], [score]);
SQLite allows references to aliased expressions in the WHERE and GROUP BY clauses, so you can simply say GROUP BY range rather than repeating the CASE expression. This probably won't change the efficiency, but it makes the query shorter and more readable.
If you run ANALYZE as MikeT suggests, the execution plan should change to say "COVERING INDEX...". If I understand correctly, that indicates that the entire query can be executed by traversing the single multi-column index without going back to the table data.
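For example, a quick check might look like this (a sketch; the exact plan wording varies between SQLite versions):
-- gather statistics so the query planner can choose the covering index
ANALYZE;
-- then re-run EXPLAIN QUERY PLAN on the query; the plan should now read
-- something like: SEARCH TABLE grade_info USING COVERING INDEX idx_cover (...)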
Try date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59'.
Finally, CASE ... WHEN is evaluated in order and short-circuits, so place the more likely cases first to avoid unnecessary comparisons. Also eliminate redundant conditions: if a previous WHEN has already ruled out a range, there is no need to re-check it in the next one. (If you have already ruled out score = 100, there is no need to also check score < 100, assuming all scores are guaranteed to fall in the range 0 to 100.) For instance, if scores are uniformly distributed, the following could be faster, possibly eliminating 17,000+ conditional checks.
SELECT
CASE
WHEN score < 40 then 'C'
WHEN score < 100 then 'B' -- already tested to be >= 40
ELSE 'A' -- already tested to be >= 100
END AS range,
count(*) AS count
FROM grade_info
WHERE type < 9 AND
(date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59') AND
is_new = 1
GROUP BY
range;

Will SQLite hoist invariant constraints?

I have been looking for solutions to a similar question, like
https://forums.asp.net/t/1887910.aspx?Inner+join+2+tables+but+return+all+if+1+table+empty
Someone proposed a solution like the following
SELECT
*
FROM
TableA AS a
LEFT JOIN TableB AS b ON b.A_Id = a.A_Id
WHERE
b.A_Id IS NOT null
OR
NOT EXISTS (SELECT TOP 1 A_Id FROM TableB)
Does SQLite run not exists (select top 1 A_Id from TableB) for each row? This condition is the same for all rows. I was wondering if the query engine could hoist it to the very top; otherwise, it would be quite costly.
I tried the following
EXPLAIN QUERY PLAN
select test.testID, test.name, parameter.name
from test
left join parameter
on test.testID = parameter.testID
where
parameter.testID is not null
or not exists (select testId from parameter limit 1)
It reports
selectid order from detail
0 0 0 SCAN TABLE test (~1M rows)
0 1 1 search table parameter using automatic covering index (testID=?) (~ 7 rows)
0 0 0 EXECUTE SCALAR SUBQUERY 1
1 0 0 SCAN TABLE parameter (~1M rows)
Does it mean SQLite does not hoist (select testId from parameter limit 1)?
The documentation says:
A correlated subquery is reevaluated each time its result is required. An uncorrelated subquery is evaluated only once and the result reused as necessary.
Your subquery is not correlated, because it does not contain references to columns in the outer query. (If it were, the EXPLAIN QUERY PLAN output would show EXECUTE CORRELATED SCALAR SUBQUERY ....)
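For contrast, a correlated variant of the same query would look something like the sketch below (just to illustrate the difference, not a recommendation): because the subquery references test.testID from the outer query, SQLite would re-evaluate it as needed, and the plan would show EXECUTE CORRELATED SCALAR SUBQUERY.
select test.testID, test.name, parameter.name
from test
left join parameter
on test.testID = parameter.testID
where
parameter.testID is not null
or not exists (select 1 from parameter p where p.testID = test.testID);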

Oracle query to count rows based on value from next record

Input values to the query : 1-20
Values in the database : 4,5, 15,16
I would like a query that gives me results as following
Value - Count
===== - =====
1 - 3
6 - 9
17 - 3
So basically: first generate continuous numbers from 1 to 20, then count the available numbers in each gap.
I wrote a query but I cannot get it to fully work:
with avail_ip as (
SELECT (0) + LEVEL AS val
FROM DUAL
CONNECT BY LEVEL < 20),
grouped_tab as (
select val,lead(val,1,0) over (order by val) next_val
from avail_ip u
where not exists (
select 'x' from (select 4 val from dual) b
where b.val=u.val) )
select
val,next_val-val difference,
count(*) over (partition by next_val-val) avail_count
from grouped_tab
order by 1
It gives me counts, but I am not sure how to compress the output to just the three rows.
I was not able to add the complete query; I kept getting an "error occurred while submission" message. For some reason it does not like the union clause, so I am attaching the query as an image :(
More details of the exact requirement:
I am writing an IP management module and I need to find the available (free) IP addresses within an IP block. The block could be /16 or /24 or even /12. To make it even more challenging, I also support IPv6, so there will be a lot more numbers to manage. All issued IP addresses are stored in decimal format. So my thought is to first generate all IP decimals within the block range, from the network address to the broadcast address. For example, a /24 would have 255 addresses and a /16 would have 64K.
Second, find all used addresses within the block, and work out the number of available addresses from each starting IP. So in the above example, starting at 1, 3 addresses are available; starting at 6, 9 are available.
My last concern is that the query should run fast enough to work through millions of numbers.
And sorry again, if my original question was not clear enough.
Similar sort of idea to what you tried:
with all_values as (
select :start_val + level - 1 as val
from dual
connect by level <= (:end_val - :start_val) + 1
),
missing_values as (
select val
from all_values
where not exists (select null from t42 where id = val)
),
chains as (
select val,
val - (row_number() over (order by val) + :start_val - 1) as chain
from missing_values
)
select min(val), count(*) - 1 as gap_count
from chains
group by chain
order by min(val);
With start_val as 1 and end_val as 20, and your data in table t42, that gets:
MIN(VAL) GAP_COUNT
---------- ----------
1 3
6 9
17 4
I've made end_val inclusive though; not sure if you want it to be inclusive or exclusive. And I've perhaps made it more flexible than you need - your version also assumes you're always starting from 1.
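If you did want end_val to be exclusive, only the row generator needs to change; a sketch:
select :start_val + level - 1 as val
from dual
connect by level <= :end_val - :start_val
Everything else in the query stays the same.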
The all_values CTE is basically the same as yours, generating all the numbers between the start and end values - 1 to 20 (inclusive!) in this case.
The missing_values CTE removes the values that are in the table, so you're left with 1,2,3,6,7,8,9,10,11,12,13,14,17,18,19,20.
The chains CTE does the magic part. This gets the difference between each value and where you would expect it to be in a contiguous list. The difference - what I've called 'chain' - is the same for all contiguous missing values; 1,2,3 all get 0, 6 to 14 all get 2, and 17 to 20 all get 4. That chain value can then be used to group by, and you can use the aggregate count and min to get the answer you need.
SQL Fiddle of a simplified version that is specifically for 1-20, showing the data from each intermediate step. This would work for any upper limit, just by changing the 20, but assumes you'll always start from 1.

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-SQL COUNT DISTINCT operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1).
This is going to be repeated over every column in every table in the DB. That includes hundreds of tables, some of which have over 1M rows. Because this needs to be done for every single column, adding indexes for every column is not a good option.
This will be done as part of an ASP.NET site, so integration with code logic is also OK (i.e. this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing Results (in seconds):
Method              20_Columns  5_Columns  1_Column (Large)  1_Column (Small)
1) Count/GroupBy          10.8        4.8               2.8              0.14
2) CountDistinct          12.4        4.8               3                0.7
3) dense_rank            226         30                 6                4.33
4) Count+Max              98.5       44                16               12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference between them) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps this is because the gains from limiting the number of table scans are small compared to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it doesn't seem to scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale it just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that counts the number of distinct values and adds 1 if there are any NULLs in the column (using a subquery):
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
    distinct coalesce(
        your_table.column_1, your_table.column_2
        -- cast them if the two columns are not the same type
    )
) as COUNT_TEST
The COALESCE function lets you combine two columns, replacing null values in one with values from the other.
I used this in my case and it gave the correct result.
Not sure this would be the fastest, but it might be worth testing. Use CASE to give NULL a value; clearly you would need to pick a value for NULL that does not occur in the real data. According to the query plan, this is a dead heat with the COUNT(*)/GROUP BY solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach you can also count more than one column:
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]
