Getting median of column values in each group - sqlite

I have a table containing user_id, movie_id, rating. These are all INT, and ratings range from 1-5.
I want to get the median rating and group it by user_id, but I'm having some trouble doing this.
My code at the moment is:
SELECT AVG(rating)
FROM (SELECT rating
FROM movie_data
ORDER BY rating
LIMIT 2 - (SELECT COUNT(*) FROM movie_data) % 2
OFFSET (SELECT (COUNT(*) - 1) / 2
FROM movie_data));
However, this seems to return the median value of all the ratings. How can I group this by user_id, so I can see the median rating per user?

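The following gives the median per user (for a user with an even number of ratings it returns the lower of the two middle values rather than their average):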
DROP TABLE IF EXISTS movie_data2;
CREATE TEMPORARY TABLE movie_data2 AS
SELECT user_id, rating FROM movie_data order by user_id, rating;
SELECT a.user_id, a.rating FROM (
SELECT user_id, rowid, rating
FROM movie_data2) a JOIN (
SELECT user_id, cast(((min(rowid)+max(rowid))/2) as int) as midrow FROM movie_data2 b
GROUP BY user_id
) c ON a.rowid = c.midrow
;
The logic is straightforward but the code is not beautified. Given encouragement or comments I will improve it. In a nutshell, the trick is to use rowid of SQLite.

This is not easily possible because SQLite does not allow correlated subqueries to refer to outer values in the LIMIT/OFFSET clauses.
Add WHERE clauses for the user_id to all three subqueries, and execute them for each user ID.
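As a rough sketch of that suggestion, the per-user loop could be driven from application code. The example below uses Python's standard sqlite3 module; the helper name median_per_user and the sample rows are made up for illustration, and the SQL is the query from the question with a WHERE clause added to each of the three subqueries.

```python
import sqlite3

# The question's median query, restricted to one user. The three ?
# placeholders are all bound to the same user ID.
MEDIAN_FOR_USER = """
SELECT AVG(rating)
FROM (SELECT rating
      FROM movie_data
      WHERE user_id = ?
      ORDER BY rating
      LIMIT 2 - (SELECT COUNT(*) FROM movie_data WHERE user_id = ?) % 2
      OFFSET (SELECT (COUNT(*) - 1) / 2
              FROM movie_data
              WHERE user_id = ?))
"""


def median_per_user(conn):
    """Return {user_id: median rating}, executing one query per user ID."""
    user_ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT user_id FROM movie_data ORDER BY user_id")]
    return {
        uid: conn.execute(MEDIAN_FOR_USER, (uid, uid, uid)).fetchone()[0]
        for uid in user_ids
    }


# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_data (user_id INT, movie_id INT, rating INT)")
conn.executemany(
    "INSERT INTO movie_data VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 3), (1, 12, 5),               # odd number of ratings
     (2, 10, 2), (2, 11, 3), (2, 12, 4), (2, 13, 5)],  # even number of ratings
)
medians = median_per_user(conn)
print(medians)
```

For an even number of ratings the query averages the two middle values, so the result is a float even though the ratings are integers.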

SELECT user_id,AVG(rating)
FROM movie_data
GROUP BY user_id
ORDER BY rating

Related

How to Average the most recent X entries with GROUP BY

I've looked at many answers on SO concerning situations related to this but I must not be understanding them too well as I didn't manage to get anything to work.
I have a table with the following columns:
timestamp (PK), type (STRING), val (INT)
I need to get the most recent 20 entries from each type and average the val column. I also need the COUNT() as there may be fewer than 20 rows for some of the types.
I can do the following if I want to get the average of ALL rows for each type:
SELECT type, COUNT(val), AVG(val)
FROM user_data
GROUP BY type
But I want to limit each group COUNT() to 20.
From here I tried the following:
SELECT type, (
SELECT AVG(val) AS ave
FROM (
SELECT val
FROM user_data AS ud2
WHERE ud2.timestamp = ud.timestamp
ORDER BY ud2.timestamp DESC
LIMIT 20
)
) AS ave
FROM user_data AS ud
GROUP BY type
But the returned average is not correct. The values it returns are as if the statement is only returning the average of a single row for each group (it doesn't change regardless of the LIMIT).
With SQLite 3.25 or later, which supports window functions, you can use ROW_NUMBER() in a subquery to pick out the most recent entries per type before computing the average and count.
SELECT
type,
AVG(val),
COUNT(1)
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY type
ORDER BY timestamp DESC
) rn
FROM
user_data
) t
WHERE rn <=20
GROUP BY type
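To try the ROW_NUMBER() approach on a small data set, here is a minimal sketch using Python's standard sqlite3 module. It assumes the bundled SQLite is 3.25 or newer; the helper name recent_stats, the parameterised limit and the sample rows are illustrative additions, not part of the original answer.

```python
import sqlite3

# Same idea as the query above, with the row limit passed as a parameter.
RECENT_AVG = """
SELECT type, AVG(val), COUNT(*)
FROM (
    SELECT type, val,
           ROW_NUMBER() OVER (
               PARTITION BY type
               ORDER BY timestamp DESC
           ) AS rn
    FROM user_data
) AS t
WHERE rn <= ?
GROUP BY type
ORDER BY type
"""


def recent_stats(conn, n):
    """Average and count of the n most recent rows of each type."""
    return conn.execute(RECENT_AVG, (n,)).fetchall()


# Hypothetical sample data: four rows of type 'a', one row of type 'b'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_data (timestamp INTEGER PRIMARY KEY, type TEXT, val INT)")
conn.executemany(
    "INSERT INTO user_data VALUES (?, ?, ?)",
    [(1, "a", 100), (2, "a", 10), (3, "a", 20), (4, "a", 30), (5, "b", 7)],
)
rows = recent_stats(conn, 3)
print(rows)
```

With a limit of 3, the oldest 'a' row (val 100) is left out of the average, and type 'b' reports a count of 1 because it has fewer rows than the limit.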

SQLite Running Total Without Relying on RowId sequence

So I've been looking at this for the past week and learning. I'm used to SQL Server not SQLite. I understand RowId now, and that if I have an "id" column of my own (for convenience) it will actually use RowId. I've done running totals in SQL Server using ROW_NUMBER, but that doesn't seem to be an option with SQLite. The most useful post was...
How do I calculate a running SUM on a SQLite query?
My issue is that it works as long as I have data that I will keep adding to at the "bottom" of the table. I say "bottom" and not bottom because my display of the data is always sorted based on some other column such as a month. So in other words if I insert a new record for a missing month it will get inserted with a higher "id" (aka "_RowId_"). My running total below that month now needs to reflect this new data for all subsequent months. This means I cannot order by "id".
With SQL Server, ROW_NUMBER took care of my sequencing because in the select where I use a.id > running.id, I would have used a.rownum > running.rownum
Here's my table
CREATE TABLE `Test` (
`id` INTEGER,
`month` INTEGER,
`year` INTEGER,
`value` INTEGER,
PRIMARY KEY(`id`)
);
Here's my query
WITH RECURSIVE running (id, month, year, value, rt) AS
(
SELECT id, month, year, value, value
FROM Test AS row1
WHERE row1.id = (SELECT a.id FROM Test AS a ORDER BY a.id LIMIT 1)
UNION ALL
SELECT rowN.id, rowN.month, rowN.year, rowN.value, (rowN.value + running.rt)
FROM Test AS rowN
INNER JOIN running ON rowN.id = (
SELECT a.id FROM Test AS a WHERE a.id > running.id ORDER BY a.id LIMIT 1
)
)
SELECT * FROM running
I can order my CTE with year,month,id similar to how it is suggested in original example I linked above. However unless I'm mistaken that example solution relies on records in the table already ordered by year, month, id. If I'm right if I insert an earlier "month", then it will break because the "id" will have the largest value of all the _RowId_s.
Appreciate if someone can set me straight.
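One possible way to avoid depending on insertion order, sketched here under the assumption that (year, month, id) is the order the running total should follow, is to compute each row's total with a correlated subquery over all rows that sort at or before it. The query text and sample rows below are illustrative and not taken from the linked post; on SQLite 3.25 or later, SUM(value) OVER (ORDER BY year, month, id) would be an alternative.

```python
import sqlite3

# Running total ordered by (year, month, id) instead of by id alone, so a
# late insert for an earlier month still lands in the right position.
RUNNING_TOTAL = """
SELECT t.id, t.month, t.year, t.value,
       (SELECT SUM(p.value)
        FROM Test AS p
        WHERE p.year < t.year
           OR (p.year = t.year AND p.month < t.month)
           OR (p.year = t.year AND p.month = t.month AND p.id <= t.id)
       ) AS rt
FROM Test AS t
ORDER BY t.year, t.month, t.id
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Test (
        id INTEGER,
        month INTEGER,
        year INTEGER,
        value INTEGER,
        PRIMARY KEY(id)
    )
""")
# Hypothetical data: February is inserted last, so it gets the highest id.
conn.executemany(
    "INSERT INTO Test (month, year, value) VALUES (?, ?, ?)",
    [(1, 2020, 10), (3, 2020, 30), (2, 2020, 20)],
)
rows = conn.execute(RUNNING_TOTAL).fetchall()
for row in rows:
    print(row)
```

The February row has id 3 but still appears between January and March, and March's running total includes it.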

Count columns in a table where columns are same and then subtract from a column in another table

Let's pretend that we have a supermarket.
We have a table called Sales where every record is one article, so if we scan 3 articles we will have 3 rows with the following columns: ArticleId and Amount, where Amount is always 1.
And then we have a table called Articles, which has the columns ArticleId and AvailableAmount.
When the sale is done we need to count the records in the Sales table for each article and then subtract that count from the article's AvailableAmount.
I'm thinking of something like this, but I don't know if I'm on the right track:
UPDATE Articles
SET
AvailableAmount = AvailableAmount - (
Select ArticleId,Count(*) From Sales Group by ArticleId HAVING Count(*) > 1
)
WHERE
ArticleId in(Select distinct ArticleId FROM Sales)
This query is almost correct, but
the subquery must return only one column,
HAVING Count(*) > 1 does not make sense, and
the subquery must return only one value, so you need a correlated subquery:
UPDATE Articles
SET AvailableAmount = AvailableAmount -
(SELECT COUNT(*)
FROM Sales
WHERE ArticleId = Articles.ArticleId)
WHERE ArticleId IN (SELECT ArticleId
FROM Sales)
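A quick way to check the corrected UPDATE is to run it against a throwaway in-memory database. The sketch below uses Python's standard sqlite3 module; the column types and sample rows are assumptions made for the example.

```python
import sqlite3

# The correlated subquery counts the Sales rows of the article being updated.
UPDATE_STOCK = """
UPDATE Articles
SET AvailableAmount = AvailableAmount -
    (SELECT COUNT(*)
     FROM Sales
     WHERE Sales.ArticleId = Articles.ArticleId)
WHERE ArticleId IN (SELECT ArticleId
                    FROM Sales)
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Articles (ArticleId INTEGER PRIMARY KEY, AvailableAmount INTEGER)")
conn.execute("CREATE TABLE Sales (ArticleId INTEGER, Amount INTEGER)")

# Hypothetical stock: article 1 is sold three times, article 2 once,
# article 3 not at all.
conn.executemany("INSERT INTO Articles VALUES (?, ?)", [(1, 10), (2, 5), (3, 8)])
conn.executemany("INSERT INTO Sales VALUES (?, 1)", [(1,), (1,), (1,), (2,)])

conn.execute(UPDATE_STOCK)
stock = dict(conn.execute("SELECT ArticleId, AvailableAmount FROM Articles"))
print(stock)
```

Article 3 has no sales, so the outer WHERE clause leaves its row untouched.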

How to select data where sum is greater than x

I am very new to SQL and am using SQLite 3 to run basket analysis on sales data.
The relevant columns are the product ID, a unique transaction ID (which identifies the basket) and the product quantity. Where a customer has bought more than one product type, the unique transaction ID is repeated.
I want to select only baskets where the customer has bought more than 1 item.
Is there any way on SQLite to select the unique transaction ID and the sum of the quantity, but only for unique transaction IDs where the quantity is more than one?
So far I have tried:
select uniqID, sum(qty) from salesdata where sum(qty) > 1 group by uniqID;
But SQLite gives me the error 'misuse of aggregate: sum()'
Sorry if this is a simple question but I am struggling to find any relevant information by googling!
Try
select uniqID, sum(qty) from salesdata group by uniqID having sum(qty) > 1
WHERE filters individual rows before they are grouped, so it cannot be applied to aggregate functions; in this case you could only use WHERE on uniqID.
If you want to put a condition on the result you get from GROUP BY, you must use HAVING.
select uniqID, sum(qty) as sumqty from salesdata group by uniqID having sumqty > 1
You can use any condition with HAVING, just as you normally would with WHERE:
having sumqty = 1, having sumqty < 1, having sumqty IN (1,2,3), etc.
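The difference between the two clauses can be seen in a small experiment. This is a sketch using Python's standard sqlite3 module; the productID column name and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salesdata (uniqID TEXT, productID TEXT, qty INT)")
# Hypothetical baskets: t1 holds a single item, t2 and t3 hold three each.
conn.executemany(
    "INSERT INTO salesdata VALUES (?, ?, ?)",
    [("t1", "p1", 1), ("t2", "p1", 1), ("t2", "p2", 2), ("t3", "p3", 3)],
)

# The WHERE version from the question is rejected when it is prepared.
try:
    conn.execute(
        "select uniqID, sum(qty) from salesdata "
        "where sum(qty) > 1 group by uniqID")
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)
print(where_error)

# The HAVING version filters the groups after the sums are computed.
baskets = conn.execute(
    "select uniqID, sum(qty) as sumqty from salesdata "
    "group by uniqID having sumqty > 1 order by uniqID").fetchall()
print(baskets)
```

Basket t1 is dropped because its total quantity is 1.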

SQL query for finding the first, second and third highest numbers

What is an example query to retrieve the first, second and third largest number from a database table using SQL Server?
You can sort by your value in descending order and take the top 3.
SELECT TOP 3 YourVal FROM YourTable ORDER BY YourVal DESC
Or if you wanted each result separate,
first number :
SELECT TOP 1 YourVal FROM YourTable ORDER BY YourVal DESC
second number:
SELECT TOP 1 YourVal FROM YourTable
WHERE YourVal not in (SELECT TOP 1 YourVal FROM YourTable ORDER BY YourVal DESC)
ORDER BY YourVal DESC
third number:
SELECT TOP 1 YourVal FROM YourTable
WHERE YourVal not in (SELECT TOP 2 YourVal FROM YourTable ORDER BY YourVal DESC)
ORDER BY YourVal DESC
assuming YourVal is unique
EDIT : following on from OPs comment
to get the nth value, select the TOP 1 that isn't in the TOP (n-1), so fifth can be chosen by:
SELECT TOP 1 YourVal FROM YourTable
WHERE YourVal not in (SELECT TOP 4 YourVal FROM YourTable ORDER BY YourVal DESC)
ORDER BY YourVal DESC
The proposed SELECT TOP n ... ORDER BY key will work, but be aware that you might get unexpected results if the column you're sorting on is not unique: when several rows tie on the sort key, which of them end up in the top n is not guaranteed.
Sudhakar,
It may be easier to use ROW_NUMBER() or DENSE_RANK() for some of these questions. For example, to find YourVal and other columns from the fifth row in order of YourVal DESC:
WITH TRanked AS (
SELECT *,
ROW_NUMBER() OVER (
ORDER BY YourVal DESC, yourPrimaryKey
) AS rk
FROM YourTable
)
SELECT YourVal, otherColumns
FROM TRanked
WHERE rk = 5;
If you want all rows with the fifth largest distinct YourVal value, just change ROW_NUMBER() to DENSE_RANK().
One really big advantage to these functions is the fact that you can immediately change a "the nth highest YourVal" query to a "the nth highest YourVal for each otherColumn" query just by adding PARTITION BY otherColumn to the OVER clause.
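To illustrate the PARTITION BY variant, here is a small runnable sketch. It runs on SQLite (3.25 or later) through Python's standard sqlite3 module rather than on SQL Server, on the assumption that this particular window syntax is accepted by both; the table contents are made up.

```python
import sqlite3

# nth highest distinct YourVal within each otherColumn group.
NTH_PER_GROUP = """
WITH TRanked AS (
    SELECT otherColumn, YourVal,
           DENSE_RANK() OVER (
               PARTITION BY otherColumn
               ORDER BY YourVal DESC
           ) AS rk
    FROM YourTable
)
SELECT DISTINCT otherColumn, YourVal
FROM TRanked
WHERE rk = ?
ORDER BY otherColumn
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable ("
    "yourPrimaryKey INTEGER PRIMARY KEY, otherColumn TEXT, YourVal INT)")
# Hypothetical rows; group 'x' has a tie for the highest value.
conn.executemany(
    "INSERT INTO YourTable (otherColumn, YourVal) VALUES (?, ?)",
    [("x", 5), ("x", 5), ("x", 3), ("x", 1), ("y", 9), ("y", 8)],
)
second_highest = conn.execute(NTH_PER_GROUP, (2,)).fetchall()
print(second_highest)
```

Because DENSE_RANK() gives tied values the same rank, the two rows with value 5 in group 'x' both get rank 1 and the second highest distinct value is 3.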
In some DBMSs the TOP keyword is not available, so how can this be done there? The queries below use the ROWNUM pseudocolumn, as found in Oracle. Suppose we need to find the 3rd largest salary in the employee table. First we select the distinct salaries from the table in descending order:
select distinct salary from employee order by salary desc
Now among the salaries selected we need top 3 salaries, for that we write:
select salary from (select distinct salary from employee order by salary desc) where rownum<=3 order by salary
This gives the top 3 salaries in ascending order, which puts the third largest salary in the first position. The final task is to pick out that first row:
select salary from (select salary from (select distinct salary from employee order by salary desc) where rownum<=3 order by salary) where rownum=1
This gives the third largest number. If there is any mistake in the query, please let me know. To get the nth largest number in general, replace 3 with n in the query above:
select salary from (select salary from (select distinct salary from employee order by salary desc) where rownum<=n order by salary) where rownum=1
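SQLite has neither TOP nor ROWNUM, but the same result can be sketched there with LIMIT and OFFSET. The example below uses Python's standard sqlite3 module; the helper name nth_largest_salary and the sample salaries are illustrative assumptions.

```python
import sqlite3

# Skip the n-1 largest distinct salaries, then take the next one.
NTH_LARGEST = """
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET ?
"""


def nth_largest_salary(conn, n):
    """Return the nth largest distinct salary, or None if there are fewer."""
    row = conn.execute(NTH_LARGEST, (n - 1,)).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INT)")
# Hypothetical data with a duplicated top salary.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("a", 100), ("b", 300), ("c", 300), ("d", 200), ("e", 50)],
)
third = nth_largest_salary(conn, 3)
missing = nth_largest_salary(conn, 5)
print(third, missing)
```

The duplicated salary of 300 counts once because of DISTINCT, so the third largest is 100, and asking for a fifth distinct salary returns None.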
If you have a table called Orders and 3 columns Id, ProductId and Quantity then to retrieve the top 3 highest quantities your query would look like:
SELECT TOP 3 [Id], [ProductId], [Quantity] FROM [Orders] ORDER BY [Quantity] DESC
or if you just want the quantity column:
SELECT TOP 3 [Quantity] FROM [Orders] ORDER BY [Quantity] DESC
This works perfectly!
select top 1 * from Employees where EmpId in
(
select top 3 EmpId from Employees order by EmpId desc
) order by EmpId;
If you would like to get the 2nd, 3rd or 4th highest, just change top 3 to the appropriate number.
