So I've been looking at this for the past week and learning. I'm used to SQL Server, not SQLite. I understand rowid now, and that if I have an "id" column of my own (declared INTEGER PRIMARY KEY, for convenience) it is actually an alias for rowid. I've done running totals in SQL Server using ROW_NUMBER, but that doesn't seem to be an option with SQLite. The most useful post was...
How do I calculate a running SUM on a SQLite query?
My issue is that it only works as long as I keep adding data at the "bottom" of the table. I put "bottom" in quotes because my display of the data is always sorted on some other column, such as a month. In other words, if I insert a new record for a missing month, it gets inserted with a higher "id" (i.e. rowid), and my running total below that month now needs to reflect the new data for all subsequent months. This means I cannot order by "id".
With SQL Server, ROW_NUMBER took care of my sequencing: in the SELECT below, where I use a.id > running.id, I would have used a.rownum > running.rownum.
Here's my table
CREATE TABLE `Test` (
`id` INTEGER,
`month` INTEGER,
`year` INTEGER,
`value` INTEGER,
PRIMARY KEY(`id`)
);
Here's my query
WITH RECURSIVE running (id, month, year, value, rt) AS
(
SELECT id, month, year, value, value
FROM Test AS row1
WHERE row1.id = (SELECT a.id FROM Test AS a ORDER BY a.id LIMIT 1)
UNION ALL
SELECT rowN.id, rowN.month, rowN.year, rowN.value, (rowN.value + running.rt)
FROM Test AS rowN
INNER JOIN running ON rowN.id = (
SELECT a.id FROM Test AS a WHERE a.id > running.id ORDER BY a.id LIMIT 1
)
)
SELECT * FROM running
I can order my CTE by year, month, id, similar to what is suggested in the original example I linked above. However, unless I'm mistaken, that solution relies on the records in the table already being ordered by year, month, id. If I'm right, then inserting an earlier "month" will break it, because that row's "id" will be the largest of all the rowids.
I'd appreciate it if someone could set me straight.
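For reference, a minimal sketch of one alternative that avoids ordering by "id" altogether, assuming SQLite 3.25 or later (which added window functions); the table and columns are the ones from the question:
SELECT id, month, year, value,
       SUM(value) OVER (ORDER BY year, month, id) AS rt
FROM Test
ORDER BY year, month, id;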
I've looked at many answers on SO concerning situations related to this, but I must not be understanding them well, as I didn't manage to get anything to work.
I have a table with the following columns:
timestamp (PK), type (STRING), val (INT)
I need to get the most recent 20 entries from each type and average the val column. I also need the COUNT() as there may be fewer than 20 rows for some of the types.
I can do the following if I want to get the average of ALL rows for each type:
SELECT type, COUNT(val), AVG(val)
FROM user_data
GROUP BY type
But I want to limit each group COUNT() to 20.
From here I tried the following:
SELECT type, (
SELECT AVG(val) AS ave
FROM (
SELECT val
FROM user_data AS ud2
WHERE ud2.timestamp = ud.timestamp
ORDER BY ud2.timestamp DESC
LIMIT 20
)
) AS ave
FROM user_data AS ud
GROUP BY type
But the returned average is not correct. The values it returns are as if the statement is only returning the average of a single row for each group (it doesn't change regardless of the LIMIT).
Using SQLite (3.25 or later, which added window functions), you may consider the ROW_NUMBER function in a subquery to filter down to the most recent entries per type before computing the average and count.
SELECT
type,
AVG(val),
COUNT(1)
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY type
ORDER BY timestamp DESC
) rn
FROM
user_data
) t
WHERE rn <= 20
GROUP BY type
Consider the following schema and table:
CREATE TABLE IF NOT EXISTS `names` (
`id` INTEGER,
`name` TEXT,
PRIMARY KEY(`id`)
);
INSERT INTO `names` VALUES (1,'zulu');
INSERT INTO `names` VALUES (2,'bene');
INSERT INTO `names` VALUES (3,'flip');
INSERT INTO `names` VALUES (4,'rossB');
INSERT INTO `names` VALUES (5,'albert');
INSERT INTO `names` VALUES (6,'zuse');
INSERT INTO `names` VALUES (7,'rossA');
INSERT INTO `names` VALUES (8,'juss');
I access this table with the following query:
SELECT *
FROM names
ORDER BY name
LIMIT 10
OFFSET 4;
Where OFFSET 4 is used because it is the position (in the ordered list) of the first occurrence of an 'R%' name. This returns:
1="7" "rossA"
2="4" "rossB"
3="1" "zulu"
4="6" "zuse"
My question is: is there an SQL statement which can return that OFFSET value (in the 'R' case above it's 4) given a starting first letter, please? (I don't really want to resort to stepping() through the results, counting rows, until the first 'R%' is reached!)
I've tried the following without success:
SELECT MIN(ROWID)
FROM
(
SELECT *
FROM names
ORDER BY name
)
WHERE name LIKE 'R%'
It always returns a single row of NULL data.
As background, this table is a phone book list and I want to provide a subset of results (from the main table) back to the caller, starting at an initial-letter offset.
Just count the rows before the string of interest:
select count(*) from names where name < 'r';
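If it helps, SQLite accepts an expression (including a scalar subquery) for LIMIT/OFFSET, so that count can be dropped straight into the paging query from the question; a sketch:
SELECT *
FROM names
ORDER BY name
LIMIT 10
OFFSET (SELECT COUNT(*) FROM names WHERE name < 'r');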
The following shows a number of options. Basically, your issue is that the sub-query doesn't return the rowid, hence NULL as the minimum. However, there is no need to use the rowid directly, as the id column is an alias of the rowid, so that could be used:-
SELECT name, id, MIN(rowid), min(id) -- shows how rowid and id are the same
FROM
(
SELECT rowid, * -- returns rowid from the subquery so min(rowid) now works
FROM names
ORDER BY name
)
WHERE name LIKE 'R%' ORDER BY id ASC LIMIT 1 -- Will effectively do the same (no need for the sub-query)
Extra columns added for demonstration.
As such your query could be :-
SELECT min(rowid) FROM names where name LIKE 'R%';
Or :-
SELECT min(id) FROM names where name LIKE 'R%';
You could also use :-
SELECT id FROM names WHERE name LIKE 'R%' ORDER BY id ASC LIMIT 1;
Or :-
SELECT rowid FROM names WHERE name LIKE 'R%' ORDER BY id ASC LIMIT 1;
I have created a subset of the pg_table_def table with table_name, col_name and data_type. I have also added a column active with 'Y' as the value for some of the rows. Let us call this table config. The config table looks like below:
table_name column_name
interaction_summary name_id
tag_transaction name_id
interaction_summary direct_preference
bulk_sent email_image_click
crm_dm web_le_click
Now I want to be able to map the table names from this table to the actual tables and fetch values for the corresponding columns. name_id will be the key here, which is available in all the tables. My output should look like below:
name_id direct_preference email_image_click web_le_click
1 Y 1 2
2 N 1 2
The solution needs to be dynamic, so that if the table list grows tomorrow the new tables are picked up automatically. Since I am new to Redshift, any help is appreciated. I am also considering doing the same via R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull in any new table that comes in and use its columns for regression analysis in R.
I made this work by using the LISTAGG function and concatenation, and then wrote the output to a dataframe in R. This dataframe has 'n' rows, each holding a different SELECT query.
Below is the format:
df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
    select distinct table_name,
           listagg(col_name, ',') within group (order by col_name)
             over (partition by table_name) as col_names
    from attribute_config
    where active = 'Y'
    order by table_name
  )
  group by 1")))
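Each row of df then holds one generated query; with the sample config rows shown in the question it would contain something along the lines of:
select direct_preference,name_id from interaction_summary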
Once done, I assigned every row of this dataframe to a new dataframe and fetched the output as below:
df1 <- tbl(conn,sql(df[1,]))
I know this is a roundabout solution, but it works! It fetches about 17M records in under a second.
I have a really slow query and I'm trying to speed it up.
I have a target date range (dateFrom/dateTo) defined in a table with only one row I need to use as a limit against a table with millions of rows. Is there a best practice for this?
I started with one table with one row with dateFrom and dateTo fields. I can limit the rows in the large table by CROSS JOINing it with the small table and using the WHERE clause, like:
select
count(*)
from
tblOneRow o, tblBig b
where
o.dateFrom < b.dateTo and
o.dateTo >= b.dateFrom
or I can inner join the tables on the date range, like:
select
count(*)
from
tblOneRow o inner join
tblBig b on
o.dateFrom < b.dateTo and
o.dateTo >= b.dateFrom
but I thought that if I changed my single-row table to use one field with a PERIOD data type instead of two DATE fields, it could improve performance. Is this a reasonable assumption? The EXPLAIN isn't showing a time difference if I change it to:
select
count(*)
from
tblOneRow o inner join
tblBig b on
begin(o.date) < b.dateTo and
end(o.date) >= b.dateFrom
or if I convert the small table's date range to a PERIOD data type and join using P_INTERSECT, like:
select
count(*)
from
tblOneRow o inner join
tblBig b on
o.date p_intersect period(b.dateFrom, b.dateTo + 1) is not null
To help the parsing engine with this join, would I need to define the fields on the large table with a PERIOD data type instead of two dates? I can't do that, as I don't own that table, but if that's the case I'll give up on improving performance with this method.
Thanks for your help.
I don't expect any difference between the first three Selects; the Explain should be the same, a product join (the optimizer should expect exactly one row, but as it's duplicated the estimated size should be the number of AMPs in your system). The last Select should be worse, because you apply a calculation (OVERLAPS would be more appropriate, but probably not better).
One way to improve this single-row cross join would be a View (select date '...' as dateFrom, date '...' as dateTo) instead of the single-row table. This should resolve the dates and result in hard-coded dateFrom/dateTo instead of a product join.
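A minimal sketch of such a view (the view name and the dates are placeholders, not values from the question):
REPLACE VIEW vDateRange AS
SELECT DATE '2024-01-01' AS dateFrom,
       DATE '2024-12-31' AS dateTo;
tblBig would then be joined to vDateRange exactly as in the queries above.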
Similarly, when you switch to scalar subqueries:
select
count(*)
from
tblBig b
where
(select min(dateFrom) from tblOneRow) < b.dateTo
and
(select min(dateTo) from tblOneRow) >= b.dateFrom
SQLite doesn't have a row number function. My database, however, could have several thousand records. I need to sort a table on a date (the date field is actually an INTEGER) and then return a specific range of rows. So if I wanted all the rows from 600 to 800, I need to somehow create a row number and limit the results to fall within my desired range. I cannot use rowid or any auto-incremented ID field, because the data is inserted with random dates. The closest I can get is this:
CREATE TABLE Test (ID INTEGER, Name TEXT, DateRecorded INTEGER);
Insert Into Test (ID, Name, DateRecorded) Values (5,'fox', 400);
Insert Into Test (ID, Name, DateRecorded) Values (1,'rabbit', 100);
Insert Into Test (ID, Name, DateRecorded) Values (10,'ant', 800);
Insert Into Test (ID, Name, DateRecorded) Values (8,'deer', 300);
Insert Into Test (ID, Name, DateRecorded) Values (6,'bear', 200);
SELECT ID,
Name,
DateRecorded,
(SELECT COUNT(*)
FROM Test AS t2
WHERE t2.DateRecorded > t1.DateRecorded) AS RowNum
FROM Test AS t1
where RowNum > 2
ORDER BY DateRecorded Desc;
This will work, except it's really ugly: the SELECT COUNT(*) is carried out for every row encountered, so if I have several thousand rows, performance will be very poor.
This is what the LIMIT/OFFSET clauses are for:
SELECT *
FROM Test
ORDER BY DateRecorded DESC
LIMIT 200 OFFSET 600
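For completeness, on SQLite 3.25 or later the same paging can also be written with ROW_NUMBER, which keeps the row number available in the output; a sketch against the Test table above, matching LIMIT 200 OFFSET 600 (rows 601 to 800):
SELECT ID, Name, DateRecorded
FROM (
    SELECT ID, Name, DateRecorded,
           ROW_NUMBER() OVER (ORDER BY DateRecorded DESC) AS RowNum
    FROM Test
)
WHERE RowNum BETWEEN 601 AND 800;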