This has been driving me nuts because I feel like it should be possible. But I'm admittedly not a huge SQL whiz.
I have an sqlite3 table that looks essentially like this:
id date amount
-- ---- ------
51 2018.10.01 10.0
52 2018.11.15 100.0
53 2018.11.15 20.0
54 2018.09.10 -30.0
(At least, these are the pertinent fields; the others have been left out).
What I want to do is generate a running total of the amount column, but with the data sorted by date.
I'm aware of the 'join the table with itself' trick for calculating a running total. So if I wanted a new running total value for each id (which is a unique field), I can do this:
select T2.id, T2.date, T2.amount, sum(T2.amount)
from Transactions T1
inner join Transactions T2
on T1.id >= T2.id
group by T1.id
And I get this:
"51" "2018.10.01" "10.0" "10.0"
"52" "2018.11.15" "100.0" "110.0"
"53" "2018.11.15" "20.0" "130.0"
"54" "2018.09.10" "-30.0" "100.0"
Running total correct.
But if I want a running total on this data in date order, it breaks down. This is close:
select T1.id, T2.date, T2.amount, sum(T2.amount)
from Transactions T1
inner join Transactions T2
on T1.date >= T2.date
group by T1.date
Except that it over-counts (and combines) the amount values in the two rows where date is 2018.11.15. Presumably because the on T1.date >= T2.date clause applies to both rows twice each.
"54" "2018.09.10" "-30.0" "-30.0"
"51" "2018.09.10" "-30.0" "-20.0"
"53" "2018.09.10" "-30.0" "200.0"
As I see it, this technique will only work if the join is performed on a field whose values are both unique and sorted. Once I sort the table by date, the unique id values are out of order and no longer usable.
So then I thought -- maybe sort the table by date first, then add a temporary column of unique sorted numbers. Simply the row number would do.
Unfortunately, this appears to be a version of sqlite that does not support any of row_number(), rownum or the over clause.
I'm aware of this technique for generating row numbers:
select id, date,
(select count(*) from Transactions T1 where T1.id <= T2.id)
from Transactions T2
"51" "2018.10.01" "1"
"52" "2018.11.15" "2"
"53" "2018.11.15" "3"
"54" "2018.09.10" "4"
But in no amount of fiddling around have I been able to figure out a way to:
First sort the table by date
Then use the count(*) technique to generate the unique row numbers
Then join the table with itself to create the running total
in a single SQL statement.
Hope this makes sense. Thanks for any thoughts anyone might have.
If you're using SQLite 3.25 or newer, window functions make this easy. Example:
First, populate a table with your sample data:
CREATE TABLE example(id INTEGER PRIMARY KEY, date TEXT, amount REAL);
INSERT INTO example VALUES(51,'2018-10-01',10.0);
INSERT INTO example VALUES(52,'2018-11-15',100.0);
INSERT INTO example VALUES(53,'2018-11-15',20.0);
INSERT INTO example VALUES(54,'2018-09-10',-30.0);
(Note that I changed the date format to one that the sqlite date and time functions understand, as that makes life easier as soon as you want to do something more complicated than sorting them).
The query
SELECT *, sum(amount) OVER (ORDER BY date, id) AS running_total
FROM example
ORDER BY date, id;
produces:
id date amount running_total
---------- ---------- ---------- -------------
54 2018-09-10 -30.0 -30.0
51 2018-10-01 10.0 -20.0
52 2018-11-15 100.0 80.0
53 2018-11-15 20.0 100.0
If you're using an older version, you really should consider upgrading for more reasons than just having window functions.
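For SQLite builds older than 3.25, the self-join trick from the question can still be made to work by breaking date ties with the unique id. The following is a minimal sketch, not part of the original answer: it drives the example table above through Python's built-in sqlite3 module, and the helper name running_total_by_date is made up for illustration.

```python
import sqlite3


def running_total_by_date(conn):
    """Running total in (date, id) order without window functions."""
    query = """
        SELECT t1.id, t1.date, t1.amount, sum(t2.amount) AS running_total
        FROM example AS t1
        JOIN example AS t2
          -- t2 counts toward t1's total when its date is earlier,
          -- or the dates tie and its id is not larger.
          ON t2.date < t1.date
          OR (t2.date = t1.date AND t2.id <= t1.id)
        GROUP BY t1.id, t1.date, t1.amount
        ORDER BY t1.date, t1.id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example(id INTEGER PRIMARY KEY, date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO example VALUES (?, ?, ?)",
    [
        (51, "2018-10-01", 10.0),
        (52, "2018-11-15", 100.0),
        (53, "2018-11-15", 20.0),
        (54, "2018-09-10", -30.0),
    ],
)

rows = running_total_by_date(conn)
for row in rows:
    print(row)
```

Because (date, id) is unique, every row sees exactly the rows at or before it, so the two 2018-11-15 rows are no longer merged or double-counted. The join is quadratic in the number of rows, so the window-function query is likely the better choice whenever it is available.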
Thank you Shawn -- you put me on the track to an answer.
It looks like the most recent beta version of DB Browser for SQLite does support window functions (I presume because the latest version of SQLite itself does).
Problem solved!
I have a sqlite database of about 1.4 million rows and 16 columns.
I have to run an operation on 80,000 id's :
Get all rows associated with that id
convert to R date object and sort by date
calculate difference between 2 most recent dates
For each id I have been querying sqlite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire sqlite table into a data.table ?
It heavily depends on how you are working on that problem.
Normally, loading the whole query result into memory and then doing the operations there will be faster, from what I have experienced and seen in graphs; I cannot show you a benchmark right now. Logically it makes sense, because otherwise you have to repeat several operations multiple times on multiple data.frames. As you can see here, 80k rows are pretty fast, faster than 3x 26xxx rows.
However, you could have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, each on its own core.
Here you can find information how to do this:
http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
If you're doing all that in R and fetching rows from the database 80,000 times in a loop... you'll probably have better results doing it all in one go in sqlite instead.
Given a skeleton table like:
CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
(2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
a query like:
SELECT id
, julianday(first_ts)
- julianday((SELECT max(d2.timestamp)
FROM data AS d2
WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) as first_ts FROM data GROUP BY id) AS d
ORDER BY id;
will give you
id days_difference
---------- ---------------
1 6.0
2 1.0
An alternative for modern versions of sqlite (3.25 or newer) (EDIT: On a test database with 16 million rows and 80000 distinct ids, it runs considerably slower than the above one, so you don't want to actually use it):
WITH cte AS
(SELECT id, timestamp
, lead(timestamp, 1) OVER id_by_ts AS next_ts
, row_number() OVER id_by_ts AS rn
FROM data
WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;
(The index is essential for performance for both versions. Probably want to run ANALYZE on the table at some point after it's populated and your index(es) are created, too.)
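To make the "one go" idea concrete, here is a hedged sketch that sends the correlated-subquery version to SQLite once and reads back one row per id, instead of issuing one query per id. It uses Python's built-in sqlite3 module purely for illustration (the same single-query pattern should carry over to R through DBI's dbGetQuery); the helper name latest_gap_per_id is an assumption, and the data is the skeleton table from above.

```python
import sqlite3

# The correlated-subquery version from the answer above, unchanged in logic.
DIFF_QUERY = """
    SELECT id
         , julianday(first_ts)
           - julianday((SELECT max(d2.timestamp)
                        FROM data AS d2
                        WHERE d.id = d2.id
                          AND d2.timestamp < d.first_ts)) AS days_difference
    FROM (SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id) AS d
    ORDER BY id
"""


def latest_gap_per_id(conn):
    """One round trip: (id, days between the two most recent dates)."""
    return conn.execute(DIFF_QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data(id INTEGER, timestamp TEXT);
    INSERT INTO data VALUES
        (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
        (2, '2019-04-15'), (2, '2019-04-14');
    CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
""")

gaps = latest_gap_per_id(conn)
print(gaps)
```

An id with only one row comes back with a NULL (None) difference, which the caller would need to handle.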
I know there are many topics regarding this question but none actually helped me solve my problem. I am still sort of new when it comes to databases and I came across this problem.
I have a table named tests which contains two columns: id and date.
I want to calculate the average difference of days between a couple values.
Say select date from tests where id=1 which will provide me with a list of dates. I want to calculate the avg difference between those days.
Table "tests"
1|2018-03-13
1|2018-03-01
2|2018-03-13
2|2018-03-01
3|2018-03-13
3|2018-03-01
1|2018-03-17
2|2018-03-17
3|2018-03-17
Select date from tests where id=1
2018-03-13
2018-03-01
2018-03-17
Now I am looking to calculate the average difference in days between those three dates.
Can really use some help, thank you!
Edit:
Sorry for being unclear, I'll clarify my question.
So student one had a test on the 01/03, then on the 13/03 and then on the 17/03. What I want to calculate is the avg difference in days between test to test, so:
Diff between first to second is 12 days. Diff between second to third is 4 days.
12+4 divided by two, since we have two gaps, is 8.
I am looking to calculate the average difference in days between those three dates.
And by average difference we mean "take the average of the absolute value of the difference between all dates". That's (12 + 16 + 4) / 3 or 10.6667.
We need all combinations of dates. For this we need a self-join with no repeats. That's accomplished by picking a field and using on with a < or >.
select t1.date, t2.date
from tests as t1
join tests as t2 on t1.id = t2.id and t1.date < t2.date
where t1.id = 1;
2018-03-01|2018-03-13
2018-03-01|2018-03-17
2018-03-13|2018-03-17
Now that we have all combinations, we can take the difference. But not by simply subtracting the dates, SQLite doesn't support that. First, convert them to Julian Days.
sqlite> select julianday(t1.date), julianday(t2.date) from tests as t1 join tests as t2 on t1.id = t2.id and t1.date < t2.date where t1.id = 1;
2458178.5|2458190.5
2458178.5|2458194.5
2458190.5|2458194.5
Now that we have numbers we can take the absolute value of the difference and do an average.
select avg(abs(julianday(t1.date) - julianday(t2.date)))
from tests as t1
join tests as t2 on t1.id = t2.id and t1.date < t2.date
where t1.id = 1;
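As a quick check of the arithmetic, the sketch below loads the sample rows from the question into an in-memory database and runs the query above through Python's built-in sqlite3 module. The variable names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tests(id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO tests VALUES (?, ?)",
    [
        (1, "2018-03-13"), (1, "2018-03-01"),
        (2, "2018-03-13"), (2, "2018-03-01"),
        (3, "2018-03-13"), (3, "2018-03-01"),
        (1, "2018-03-17"), (2, "2018-03-17"), (3, "2018-03-17"),
    ],
)

# Every unordered pair of dates for id 1 appears once (t1.date < t2.date),
# giving differences of 12, 16 and 4 days.
(avg_pairwise,) = conn.execute("""
    SELECT avg(abs(julianday(t1.date) - julianday(t2.date)))
    FROM tests AS t1
    JOIN tests AS t2 ON t1.id = t2.id AND t1.date < t2.date
    WHERE t1.id = 1
""").fetchone()

print(round(avg_pairwise, 4))
```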
UPDATE
What I want to calculate is the avg difference in days between test to test, so: Diff between first to second is 12 days. Diff between second to third is 4 days. Then (12+4)/2=8 which should be the result.
For this twist on the problem you want to compare each row with the next one. You want a table like this:
2018-03-01|2018-03-13
2018-03-13|2018-03-17
Other databases have features like window or lag to accomplish this. SQLite versions before 3.25 don't have that. Again, we'll use a self-join, but we have to do it per row. This is a correlated subquery.
select t1.date as date, (
select t2.date
from tests t2
where t1.id = t2.id and t2.date > t1.date
order by t2.date
limit 1
) as next
from tests t1
where id = 1
and next is not null
The subquery-as-column finds the next date for each row.
This is a bit unwieldy, so let's turn it into a view. Then we can use it as a table. Just take out the where id = 1 so it's generally useful.
create view test_and_next as
select t1.id, t1.date as date, (
select t2.date
from tests t2
where t1.id = t2.id and t2.date > t1.date
order by t2.date
limit 1
) as next
from tests t1
where next is not null
Now we can treat test_and_next as a table with the columns id, date, and next. Then it's the same as before: turn them into Julian Days, subtract, and take the average.
select avg(julianday(next) - julianday(date))
from test_and_next
where id = 1;
Note that this will go sideways when you have two rows with the same date: there's no way for SQL to know which is the "next" one. For example, if there were two tests for ID 1 on "2018-03-13" they'll both choose "2018-03-17" as the "next" one.
2018-03-01|2018-03-13
2018-03-13|2018-03-17
2018-03-13|2018-03-17
I'm not sure how to fix this.
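One possible way around the duplicate-date problem is to break ties with SQLite's implicit rowid, so that each row has exactly one successor. This is only a sketch under two assumptions that are not in the original answer: the tests table is an ordinary rowid table, and two tests on the same day should count as a gap of 0 days. The helper name avg_gap is hypothetical.

```python
import sqlite3

AVG_GAP_QUERY = """
    SELECT avg(julianday(nxt) - julianday(date))
    FROM (
        SELECT t1.id, t1.date,
               (SELECT t2.date
                FROM tests AS t2
                WHERE t2.id = t1.id
                  -- "later" means a later date, or the same date
                  -- with a larger rowid (the tie-breaker)
                  AND (t2.date > t1.date
                       OR (t2.date = t1.date AND t2.rowid > t1.rowid))
                ORDER BY t2.date, t2.rowid
                LIMIT 1) AS nxt
        FROM tests AS t1
    )
    WHERE id = ? AND nxt IS NOT NULL
"""


def avg_gap(conn, test_id):
    """Average days between consecutive tests for one id."""
    return conn.execute(AVG_GAP_QUERY, (test_id,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tests(id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO tests VALUES (?, ?)",
    [(1, "2018-03-13"), (1, "2018-03-01"), (1, "2018-03-17")],
)
gap_without_duplicate = avg_gap(conn, 1)   # gaps: 12 and 4 days

# Add a second test on 2018-03-13; the gaps become 12, 0 and 4 days.
conn.execute("INSERT INTO tests VALUES (1, '2018-03-13')")
gap_with_duplicate = avg_gap(conn, 1)

print(gap_without_duplicate, gap_with_duplicate)
```

If same-day tests should instead be collapsed into one, selecting distinct (id, date) pairs before computing the gaps would be the alternative.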
I would like to know how to get all the rows from table1 that have a matching row in table3.
The structure of the tables is:
table1:
k1 k2
table2:
k1 k2 t1 t2 date type
table3:
t1 t2 date status
The conditions are:
k1 and k2 have to match with the corresponding columns in table2.
In table2 I will only check those rows where date='today' and type='a'.
That can return 0, 1 or many rows in table2.
Looking at t1 and t2 from table 2, I get the rows that match in table3.
If in table3 date='today' and status='ok', I will return the original row from table1, this is, k1 and k2.
How can I write this query (inner joins, exists, whatever), taking into account that the three tables have millions of rows, so it must be as efficient as possible?
I have a query that I am sure is correct, but there are too many conditions for Teradata to come up with the answer. Too many joins, I think.
I would not consider three tables and a few millions of rows a complex query.
In Teradata you usually don't have to think that much about join/in/exists; all will be rewritten to joins internally. But there is a one-to-many-to-one relation, so you should avoid a join, as this will need a final DISTINCT.
Better use IN or EXISTS instead:
SELECT
K1,K2
FROM Table1
WHERE (K1,K2) IN
(
SELECT K1,K2
FROM Table2
WHERE datecol = CURRENT_DATE
AND typecol = 'a'
AND (T1,T2) IN
(
SELECT T1,T2
FROM Table3
WHERE datecol = CURRENT_DATE
AND status = 'ok'
)
)
Regarding the actual plan: if the necessary statistics exist, the optimizer should choose a good plan; check the confidence levels in Explain. You can also run diagnostic helpstats on for session; before running Explain to see if there are missing stats.
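The reason a plain join needs a final DISTINCT can be seen on a toy data set. The sketch below runs in SQLite through Python's built-in sqlite3 module, not in Teradata, so it only illustrates the semantics, not the Teradata plan. The rows are made up, a fixed TODAY value stands in for CURRENT_DATE, and EXISTS is used in place of the row-value IN shown above.

```python
import sqlite3

TODAY = "2024-01-15"  # hypothetical stand-in for CURRENT_DATE

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1(k1 INTEGER, k2 INTEGER);
    CREATE TABLE table2(k1 INTEGER, k2 INTEGER, t1 INTEGER, t2 INTEGER,
                        datecol TEXT, typecol TEXT);
    CREATE TABLE table3(t1 INTEGER, t2 INTEGER, datecol TEXT, status TEXT);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 1), (2, 2)])
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 10, 10, TODAY, "a"),   # two matching table2 rows for key (1, 1)
    (1, 1, 11, 11, TODAY, "a"),
    (2, 2, 12, 12, TODAY, "b"),   # wrong type, must not match
])
conn.executemany("INSERT INTO table3 VALUES (?, ?, ?, ?)", [
    (10, 10, TODAY, "ok"), (11, 11, TODAY, "ok"), (12, 12, TODAY, "ok"),
])

# Plain joins: table1 row (1, 1) is returned once per matching table2 row.
join_rows = conn.execute("""
    SELECT a.k1, a.k2
    FROM table1 AS a
    JOIN table2 AS b ON a.k1 = b.k1 AND a.k2 = b.k2
    JOIN table3 AS c ON b.t1 = c.t1 AND b.t2 = c.t2
    WHERE b.datecol = :today AND b.typecol = 'a'
      AND c.datecol = :today AND c.status = 'ok'
""", {"today": TODAY}).fetchall()

# Nested EXISTS: each table1 row is returned at most once.
exists_rows = conn.execute("""
    SELECT a.k1, a.k2
    FROM table1 AS a
    WHERE EXISTS (
        SELECT 1 FROM table2 AS b
        WHERE b.k1 = a.k1 AND b.k2 = a.k2
          AND b.datecol = :today AND b.typecol = 'a'
          AND EXISTS (
              SELECT 1 FROM table3 AS c
              WHERE c.t1 = b.t1 AND c.t2 = b.t2
                AND c.datecol = :today AND c.status = 'ok'))
""", {"today": TODAY}).fetchall()

print(join_rows)
print(exists_rows)
```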
Something like the following should work.
SELECT
Table1.*
FROM
Table1
INNER JOIN Table2 ON
Table1.K1 = Table2.K1 AND
Table1.K2 = Table2.K2 AND
Table2.date = CURRENT_DATE and
Table2.type = 'a'
INNER JOIN Table3 ON
Table2.T1 = Table3.T1 AND
Table2.T2 = Table3.T2 AND
Table3.date = CURRENT_DATE and
Table3.status = 'ok'
Update:
Speaking more to the optimization part of the question. The execution steps that Teradata will most likely take here are:
In parallel it will select all records from Table1, Records from Table2 where the date is CURRENT_DATE and the type is a, and Records from Table3 where the date is CURRENT_DATE and the status is OK.
It will then join the results from the SELECT of Table2 to the results of the SELECT from table1.
It will then join the results from that to the results from the SELECT of table3.
You can get more information by putting EXPLAIN before your SELECT query. The results returned from the database will be an explanation of how your Teradata server will execute the query, which can be very enlightening when trying to optimize a big slow query.
Unfortunately the steps above are the best you can hope for. Parallel execution of all three tables with the filters applied, and then a join of the results. With big data, the slowest part of a query is often the join, so filtering before you get to that step is a big plus.
There's more that can be done to optimize like making sure your Indexes are in order and Collecting statistics, especially on fields where you will be filtering. But without the admin access to do that, your hands are tied.
I have the following query:
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1, A2,A3,A4,A5,A6,COUNT(A7) AS total,A8
FROM (
select a.* from table1 a
left join (select * from table_reject where name = 'smith') b on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
GROUP BY
CUBE(A1, A2,A3,A4,A5,A6,A8)
)INN
WHERE
INN.A1 IS NOT NULL AND
INN.A2 IS NULL AND
INN.A3 IS NULL AND
INN.A4 IS NULL AND
INN.A5 IS NULL AND
INN.A6 is NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
Total number records in table1 is around 8 million.
My problem is that I need to optimize the above query as much as possible. I tried creating an index on column A8 of table1; the index decreased the cost of the query, but the execution time is more or less the same as when there was no index on table1.
Any help would be appreciated.
Thanks
A CUBE operation on a large data set is really expensive, so you need to check whether you really need all that data in the inner query. You are doing a COUNT in the inner query and then a SUM of those counts in the outer query. In other words: give me the count of A7 for every combination of A1-A8 (excluding A7), then SUM only the combinations selected by the WHERE clause. We can certainly optimize this by limiting the CUBE to certain columns, but the most obvious things I have noticed so far are as follows.
If you use the query below and have the right indexes on Table1 and Table_reject, then both subqueries can use an index and reduce the data set that needs to be joined and processed further.
I am not 100% sure, but partial CUBE processing should be possible; I need to check that.
Clustered index --> Table1 needs one on A8, and Table_Reject needs one on NAME.
Non-clustered index --> Table1 needs one on A3, A9, and Table_reject needs one on B3, B2.
SELECT qry1.*
FROM
(
SELECT A1, A2,A3,A4,A5,A6,A7,A8,A9
FROM table1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
)qry1
LEFT JOIN
(
select B3,B2,ID
from table_reject
where name = 'smith'
)qry2
ON qry1.A3 = qry2.B3 and qry1.A9=qry2.B2
WHERE qry2.ID IS NULL
EDIT1:
I tried to find out what the difference in the CUBE result would be if you apply it to all columns versus only the columns you need in the result set. What I found is that, given the way CUBE works, you do not need to perform CUBE on all columns, because in the end you only care about the combinations generated by CUBE where A1 and A8 are NOT NULL.
Try this link and see the output.
Query1 and Query2 are just the innermost queries, to compare the CUBE result sets.
Query3 and Query4 are the same query that you are trying, and you can see the results are the same in both cases.
DECLARE #NEXT_DAY DATE = NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1,COUNT(A7) AS total,A8
FROM (
select a.a1,a.a7,a.a8
from table1 a
left join (select * from table_reject where name = 'smith') b
on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= #NEXT_DAY
GROUP BY
CUBE(A1,A8)
)INN
WHERE INN.A1 IS NOT NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
EDIT3
As I mentioned in the comment, this is a Round3 update. I cannot change the comment, but I meant Edit3 instead of Round3.
The new change in your query is adding the WHERE A8 >= #NEXT_DAY condition to the innermost left join select, so that it reads where A8 >= #NEXT_DAY AND B.ID is null. That has improved the selection very much.
In your last comment you mentioned that the query takes 30-35 seconds and that the time keeps increasing as you change the value of A8. But along with the execution time you didn't mention how much data is in the result set. Why is that important? Because if my query returns 5M rows as its final result set, it is going to spend 90% of its time just delivering that data to the UI, output file, or whatever output method you are using. Actual performance should be measured by how soon the query starts returning the first couple of rows, because by that time the optimizer has already decided on the execution plan and the DB is executing it. I do agree that if a query returns 100 rows and takes 10 seconds, then something may be wrong with the execution plan.
To demonstrate that, I created dummy data and ran your query against it.
I have a table Test_CubeData with 9M rows, with the same number of columns and the same data types you described for your Table1. I have a second table, Table_Reject, with 80K rows; its columns and data types I inferred from the query. To test the extreme case for this table, the name column has only one value, "smith", and ID is null for all 80K rows, so the column values that can affect the inner left join result are B2 and B3.
In these tests I do not have any index on either table; both are heaps. You can see that the results still arrive within a few seconds for an acceptable amount of data in the result set. As my result set grows, the completion time increases. If I create the indexes described above, I get an Index Seek operation for all of these test cases, but at a certain point that index will also be exhausted and the operation becomes an Index Scan.
One sure example would be if my filter value for the A8 column is the smallest date value that exists in that column. In that case the optimizer will see that all 9M rows need to participate in the inner select and the CUBE, and a lot of data will be processed in memory, which is expected. On the other hand, let's look at another example. I have 32873 unique values in the A8 column, and those values are almost equally distributed among the 9M rows, so for each A8 value there are 260 to 300 rows. Now if I execute the query for any single value, smallest, largest, or anything in between, the query execution time should not change.
Notice the highlighted text in each image below, which indicates: the value chosen for the A8 filter, that only the important columns are in the select list instead of *, the A8 filter added to the inner left join query, the execution plan showing the TableScan operation on both tables, the query execution time in seconds, and the total number of rows returned by the query.
I hope that this clears up some doubts about the performance of your query and helps you set the right expectations.
[Images: Table Row Counts; TableScan_InnerLeftJoin; TableScan_FullQuery_248Rows; TableScan_FullQuery_5K; TableScan_FullQuery_56K; TableScan_FullQuery_480k]
You're calculating a potentially very large cube result on seven columns, and then discarding all the results except those that are logically just a group_by on column A1.
I suggest that you rewrite the query to just group by A1.
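A rough sketch of what that rewrite could look like is given below. It runs against toy data in SQLite through Python's built-in sqlite3 module rather than Oracle, so several things are assumptions: A8 is stored as ISO date text, the CUTOFF literal stands in for the NEXT_DAY(...) expression, and NOT EXISTS replaces the left join plus B.ID IS NULL filter (equivalent only if table_reject.ID is never NULL). For comparison, it also computes the two-level form (count per A1 and A8, then sum per A1), which is what the CUBE query keeps after its outer filter.

```python
import sqlite3

CUTOFF = "2013-09-22"  # hypothetical stand-in for the NEXT_DAY(...) value

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1(A1 TEXT, A3 INTEGER, A7 INTEGER, A8 TEXT, A9 INTEGER);
    CREATE TABLE table_reject(ID INTEGER, name TEXT, B2 INTEGER, B3 INTEGER);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", [
    ("x", 1, 100, "2013-09-23", 1),
    ("x", 1, 101, "2013-09-23", 1),
    ("x", 2, 102, "2013-09-24", 2),    # rejected: matches a 'smith' row
    ("y", 3, 103, "2013-09-25", 3),
    ("y", 3, None, "2013-09-25", 3),   # NULL A7 is not counted
    ("y", 3, 104, "2013-09-01", 3),    # before the cutoff
    (None, 4, 105, "2013-09-26", 4),   # NULL A1 is filtered out
])
conn.executemany("INSERT INTO table_reject VALUES (?, ?, ?, ?)", [
    (1, "smith", 2, 2),
    (2, "jones", 3, 3),
])

FILTER = """
    FROM table1 AS a
    WHERE a.A8 >= :cutoff
      AND a.A1 IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM table_reject AS b
                      WHERE b.name = 'smith'
                        AND b.B3 = a.A3 AND b.B2 = a.A9)
"""

# Single-level aggregation: just group by A1.
simple = conn.execute(
    "SELECT a.A1, count(a.A7) AS sum_total" + FILTER +
    " GROUP BY a.A1 ORDER BY sum_total DESC, a.A1",
    {"cutoff": CUTOFF},
).fetchall()

# Two-level aggregation: count per (A1, A8), then sum per A1.
two_level = conn.execute(
    "SELECT A1, sum(total) AS sum_total FROM ("
    " SELECT a.A1 AS A1, a.A8 AS A8, count(a.A7) AS total" + FILTER +
    " GROUP BY a.A1, a.A8"
    ") GROUP BY A1 ORDER BY sum_total DESC, A1",
    {"cutoff": CUTOFF},
).fetchall()

print(simple)
print(two_level)
```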
The Transact-SQL Count Distinct operation counts all distinct non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1).
This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.
This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing Results (in seconds):
Method              20_Columns  5_Columns  1_Column (Large)  1_Column (Small)
------------------  ----------  ---------  ----------------  ----------------
1) Count/GroupBy    10.8        4.8        2.8               0.14
2) CountDistinct    12.4        4.8        3                 0.7
3) dense_rank       226         30         6                 4.33
4) Count+Max        98.5        44         16                12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference between them) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps because the gains that would be achieved by limiting the number of table scans is small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
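The difference is easy to see on a small column that contains NULLs. The sketch below uses SQLite through Python's built-in sqlite3 module instead of SQL Server, on the assumption that NULL handling in COUNT and GROUP BY is the same for this purpose. The helper name distinct_counts and the sample table are made up for illustration.

```python
import sqlite3


def distinct_counts(conn, table, column):
    """Return (COUNT(DISTINCT col), GROUP BY count, COUNT(DISTINCT)+NULL flag)."""
    # Identifiers cannot be bound as parameters, so they are interpolated;
    # only pass trusted table and column names.
    col, tbl = f'"{column}"', f'"{table}"'
    plain = conn.execute(
        f"SELECT COUNT(DISTINCT {col}) FROM {tbl}"
    ).fetchone()[0]
    grouped = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {col} FROM {tbl} GROUP BY {col}) AS s"
    ).fetchone()[0]
    with_flag = conn.execute(
        f"SELECT COUNT(DISTINCT {col})"
        f" + MAX(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) FROM {tbl}"
    ).fetchone()[0]
    return plain, grouped, with_flag


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample(c TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?)",
    [("a",), ("b",), ("a",), (None,), (None,)],
)

counts = distinct_counts(conn, "sample", "c")
print(counts)
```

COUNT(DISTINCT c) sees only 'a' and 'b', while the GROUP BY form also gets one group for NULL.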
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
distinct coalesce(
your_table.column_1, your_table.column_2
-- cast them if you want replace value from column are not same type
)
) as COUNT_TEST
The coalesce function returns the first non-null value among its arguments, so a NULL in column_1 is replaced by the value from column_2.
I used this in my case and got the correct result.
Not sure this would be the fastest but might be worth testing. Use case to give null a value. Clearly you would need to select a value for null that would not occur in the real data. According to the query plan this would be a dead heat with the count(*) (group by) solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach you can also count more than one column in a single query:
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]
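The caveat about picking a placeholder that never occurs in the real data can be demonstrated directly. This is a sketch in SQLite through Python's built-in sqlite3 module, not SQL Server, with a made-up table in which the placeholder string 'dbNullValue' happens to appear as a genuine value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testNullVal(testNull TEXT)")
conn.executemany(
    "INSERT INTO testNullVal VALUES (?)",
    [("dbNullValue",), (None,), ("x",)],   # the placeholder is real data here
)

# Placeholder approach: NULL is mapped onto 'dbNullValue', which collides
# with the existing row and merges two distinct values into one.
sentinel_count = conn.execute("""
    SELECT COUNT(DISTINCT CASE WHEN testNull IS NULL
                               THEN 'dbNullValue' ELSE testNull END)
    FROM testNullVal
""").fetchone()[0]

# GROUP BY approach: NULL gets its own group, no placeholder needed.
group_by_count = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT testNull FROM testNullVal GROUP BY testNull) AS s
""").fetchone()[0]

print(sentinel_count, group_by_count)
```

Here the placeholder version undercounts by one, so the approach is only safe when the chosen value is known to be absent from the column.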