I have the following requirement: I have a table in the following format.
And this is what I want it to be transformed into:
Basically, I want the number of users for each combination of activities.
I want this format because I want to build a TreeMap visualization from it.
This is what I have done so far.
First, find the number of users for each activity grouping:
WITH lookup AS
(
SELECT listagg(name,',') AS groupings,
processed_date,
guid
FROM warehouse.test
GROUP BY processed_date,
guid
)
SELECT groupings AS activity_groupings,
LENGTH(groupings) -LENGTH(REPLACE(groupings,',','')) + 1 AS count,
processed_date,
COUNT( guid) AS users
FROM lookup
GROUP BY processed_date,
groupings
I put the results in a separate table.
Then I split and coalesce like this:
SELECT NULLIF(SPLIT_PART(groupings,',', 1),'') AS grouping_1,
COALESCE(NULLIF(SPLIT_PART(groupings,',', 2),''), grouping_1) AS grouping_2,
COALESCE(NULLIF(SPLIT_PART(groupings,',', 3),''), grouping_2, grouping_1) AS grouping_3,
num_users
FROM warehouse.groupings) AS expr_qry
GROUP BY grouping_1,
grouping_2,
grouping_3
The problem is that the first query takes more than 90 minutes to execute, as I have more than 250M rows.
There must be a better, more efficient way to do this.
Any pointers would be greatly appreciated.
Thanks
You do not need to use complex string manipulation functions (LISTAGG(), SPLIT_PART()). You can achieve what you're after with the ROW_NUMBER() function and simple aggregates.
-- Create sample data
CREATE TEMP TABLE test_data (id, guid, name)
AS SELECT 1::INT, 1::INT, 'cooking'
UNION ALL SELECT 2::INT, 1::INT, 'cleaning'
UNION ALL SELECT 3::INT, 2::INT, 'washing'
UNION ALL SELECT 4::INT, 4::INT, 'cooking'
UNION ALL SELECT 6::INT, 5::INT, 'cooking'
UNION ALL SELECT 7::INT, 3::INT, 'cooking'
UNION ALL SELECT 8::INT, 3::INT, 'cleaning'
;
-- Assign a row number to each name per guid
WITH name_order AS (
SELECT guid
, name
, ROW_NUMBER() OVER(PARTITION BY guid ORDER BY id) row_n
FROM test_data
) -- Use MAX() to collapse each guid's data to 1 row
, groupings AS (
SELECT guid
, MAX(CASE WHEN row_n = 1 THEN name END) grouping_1
, MAX(CASE WHEN row_n = 2 THEN name END) grouping_2
FROM name_order
GROUP BY guid
) -- Count the guids per each grouping
SELECT grouping_1
, COALESCE(grouping_2, grouping_1) AS grouping_2
, COUNT(guid) num_users
FROM groupings
GROUP BY 1,2
;
-- Output
grouping_1 | grouping_2 | num_users
------------+------------+-----------
washing | washing | 1
cooking | cleaning | 2
cooking | cooking | 2
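To try the pivot approach above without a warehouse, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. It assumes a bundled SQLite of 3.25 or newer (needed for ROW_NUMBER()); SQLite is only a stand-in here, and the function name count_groupings is made up for illustration.

```python
import sqlite3

def count_groupings(conn):
    """Pivot each guid's first two names into columns, then count guids per pair."""
    query = """
    WITH name_order AS (
        SELECT guid, name,
               ROW_NUMBER() OVER (PARTITION BY guid ORDER BY id) AS row_n
        FROM test_data
    ), groupings AS (
        SELECT guid,
               MAX(CASE WHEN row_n = 1 THEN name END) AS grouping_1,
               MAX(CASE WHEN row_n = 2 THEN name END) AS grouping_2
        FROM name_order
        GROUP BY guid
    )
    SELECT grouping_1,
           COALESCE(grouping_2, grouping_1) AS grouping_2,
           COUNT(guid) AS num_users
    FROM groupings
    GROUP BY 1, 2
    ORDER BY 1, 2
    """
    return conn.execute(query).fetchall()

# Same sample rows as the CREATE TEMP TABLE statement above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (id INTEGER, guid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test_data VALUES (?, ?, ?)",
    [(1, 1, "cooking"), (2, 1, "cleaning"), (3, 2, "washing"),
     (4, 4, "cooking"), (6, 5, "cooking"), (7, 3, "cooking"),
     (8, 3, "cleaning")],
)

rows = count_groupings(conn)
for row in rows:
    print(row)
```

A third activity per user would need one more MAX(CASE ...) column and one more COALESCE, so the number of grouping columns has to be decided up front.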
This is based on a Khan Academy course. I have 2 SQLite tables:
CREATE TABLE table1 (id STRING PRIMARY KEY, charge_id TEXT, amount INTEGER, currency TEXT, country STRING);
INSERT INTO table1
( id, charge_id, amount, currency, country) VALUES
('0xb01', '0x1', 2000, 'USD', 'USA'),
('0x0a1', '0x1', 500, 'USD', 'USA'),
('0x0c1', '0x1', 1000, 'CAD', 'USA'),
('0xs31', '0x4', 1000, 'YEN', 'CA');
CREATE TABLE table2 (id STRING PRIMARY KEY, charge_id TEXT, value TEXT);
INSERT INTO table2
( id, charge_id, value ) VALUES
('0x34s', '0x1', '123 main street'),
('0x3ze', '0x1', 'merchant-id-001'),
('0x3w2', '0x2', 'zip-code-90210' ),
('0x35k', '0x2', 'merchant-id-002');
I want to SELECT the amount, currency and country from table 1 (Charges) and join with table 2 (Metadata) based on the id. Charges uses ID, while Metadata stores meta tags, with an identifier [id] equal to the charge [id] from Charges. I want to group the total amount by currency for each merchant_id, including only those charges that were made in the USA.
Step-by-step pseudo code:
(1) find all charges in the USA (Charges country)
(2) match all charge_ids from Charges (id) to charges in Metadata (id)
(3) separate each charge by the merchant_id (Metadata value)
(4) display the total amount, currency by merchant_id (amount, Charges currency, value)
This is difficult because:
(1) I want to select from Charges and
(2) join to Metadata by the [id]
(3) but each Metadata record only has the charge_id and a metadata tag, which would match the merchant_id with the charge
The query result I would like is:
value (merchant id)   currency   total amount
merchant-id-001       usd        2500
merchant-id-001       cad        1000
merchant-id-002       yen        200
merchant-id-002       cad        50
Currently I have this query but it does not seem to be working:
select table1.amount, table1.currency, table1.country, count(*)
from table1
LEFT JOIN table1
UNION ALL
SELECT table2.value
FROM CHARGES_table2
LEFT JOIN table2
ON table1.id = table2.id
WHERE table1.country = 'USA'
GROUP BY table2.value
I am getting errors on union parameters: 2,1
Read the grammar & other documentation for the expressions you are using. The arguments to UNION are two SELECTs & it can have a final ORDER BY. Here's the parse:
    select table1.amount, table1.currency, table1.country, count(*)
    from table1
    LEFT JOIN table1
UNION ALL
    SELECT table2.value
    FROM CHARGES_table2
    LEFT JOIN table2
    ON table1.id = table2.id
    WHERE table1.country = 'USA'
    GROUP BY table2.value
UNION is putting its arguments' rows into one table so it also requires that their columns agree in number & have compatible types. Here the numbers disagree.
There is also no table1 in scope in the second SELECT. Taken in isolation that is an error too, but it is moot given the UNION error.
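The column-count rule can be checked in isolation. The sketch below uses Python's sqlite3 module with two throwaway UNION ALL statements (not the asker's tables); try_query is a hypothetical helper that returns either the rows or the error text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def try_query(sql):
    """Return the rows, or the error text if SQLite rejects the statement."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"

# Two columns on the left, one on the right: rejected when the statement
# is compiled, before any row is read.
mismatched = try_query("SELECT 1, 2 UNION ALL SELECT 3")

# The same number of columns on both sides: accepted.
matched = try_query("SELECT 1, 2 UNION ALL SELECT 3, 4")

print(mismatched)
print(matched)
```

The query in the question has four columns in its first SELECT and one in its second, so it fails the same check.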
I have a setup like so:
Movies (
movieId INTEGER PRIMARY KEY,
title TEXT,
year INTEGER
)
Rentals (
cardNo INTEGER,
movieId INTEGER,
date DATE,
rating INTEGER,
PRIMARY KEY(cardNo, movieID, date),
FOREIGN KEY (cardNo) REFERENCES Customers,
FOREIGN KEY (movieId) REFERENCES Movies
)
and I want to figure out which movie(s) were rented the most times in a given year if and only if the movie was released that year.
For example: If movie_x was rented the most in 2003 but was not also released in 2003, then it cannot count. If movie_y was both released in 2003 and rented the most (of the movies released that year) in 2003 then it does count.
I am thinking I need to set up a temporary table that stores the movieId and the count(movieId) so that I can then perform a select max() on the count, but I am unsure how to go about it.
I am using Python, so I can store the movieId of the max() in a variable and then check the original Movies table to match it to the title of the movie, if that helps.
The strategy used in this answer is to join the Rentals and Movies tables together on matching movieId and year. This serves to discard any records from the Rentals table which did not occur in the same year a movie was released.
We can aggregate such a join, which would then generate year/movie rentals counts for the entire database. But, since you only want movies having the highest rental count for a given year, we need to do more work. In this case, we can find the highest rental count for each year (see subquery t2 below), and join to the subquery described earlier.
SELECT
t1.movieId,
t1.title,
t1.year,
t1.num_rentals
FROM
(
SELECT
m.movieId,
m.title,
m.year,
COUNT(*) AS num_rentals
FROM Rentals r
INNER JOIN Movies m
ON r.movieId = m.movieId AND CAST(SUBSTR(r.date, 1, 4) AS INTEGER) = m.year
GROUP BY
m.movieId,
m.title,
m.year
) t1
INNER JOIN
(
SELECT year, MAX(num_rentals) AS max_num_rentals
FROM
(
SELECT
m.year,
COUNT(*) AS num_rentals
FROM Rentals r
INNER JOIN Movies m
ON r.movieId = m.movieId AND CAST(SUBSTR(r.date, 1, 4) AS INTEGER) = m.year
GROUP BY
m.movieId,
m.year
) t
GROUP BY year
) t2
ON t1.year = t2.year AND t1.num_rentals = t2.max_num_rentals
-- WHERE t1.year = 2003
ORDER BY
t1.year;
This answer will report all years, along with all movies released in that year having the highest rental counts. In the case of ties for two or more movies in a given year, all tied movies would be reported.
Note that with analytic (window) functions, which SQLite supports as of version 3.25, the query could be greatly simplified.
Here's a slightly different approach, using CTEs instead of nested subqueries.
WITH first_year_rentals(movieid, title, rentals, year) AS
(SELECT m.movieid, m.title, count(*), m.year
FROM movies AS m
JOIN rentals AS r ON m.movieid = r.movieid AND m.year = strftime('%Y', r.date)
GROUP BY m.movieid)
, maximums(year, maxrent) AS
(SELECT year, max(rentals)
FROM first_year_rentals
GROUP BY year)
SELECT movieid, title, rentals, f.year AS year
FROM first_year_rentals AS f
JOIN maximums AS m ON f.year = m.year AND m.maxrent = f.rentals
ORDER BY f.year, title;
A CTE (Common Table Expression) is like a view that only exists for the one statement. Very handy for organizing a statement with multiple queries. The first one generates results that count the number of times each movie was rented in the year it came out. The second one is the highest rental count for each year's new releases. Then it's just a matter of joining the two CTEs and limiting the results to just rows where the rental count equals the highest for that movie's release year.
Edit:
Tested using these tables and data:
CREATE TABLE Movies (
movieId INTEGER PRIMARY KEY,
title TEXT,
year INTEGER
);
INSERT INTO Movies VALUES(1,'a good movie',2003);
INSERT INTO Movies VALUES(2,'a better movie',2003);
INSERT INTO Movies VALUES(3,'the best movie',2004);
INSERT INTO Movies VALUES(4,'the worst movie',2004);
CREATE TABLE Rentals (
cardNo INTEGER,
movieId INTEGER,
date DATE,
rating INTEGER,
PRIMARY KEY(cardNo, movieID, date),
-- FOREIGN KEY (cardNo) REFERENCES Customers,
FOREIGN KEY (movieId) REFERENCES Movies
);
INSERT INTO Rentals VALUES(1,1,'2003-01-01',NULL);
INSERT INTO Rentals VALUES(1,2,'2003-01-01',NULL);
INSERT INTO Rentals VALUES(1,3,'2006-01-01',NULL);
INSERT INTO Rentals VALUES(2,1,'2003-01-01',NULL);
INSERT INTO Rentals VALUES(2,3,'2004-01-01',NULL);
INSERT INTO Rentals VALUES(2,2,'2004-01-01',NULL);
INSERT INTO Rentals VALUES(3,2,'2003-01-01',NULL);
INSERT INTO Rentals VALUES(3,1,'2005-01-01',NULL);
INSERT INTO Rentals VALUES(3,4,'2004-01-01',NULL);
INSERT INTO Rentals VALUES(4,2,'2003-01-01',NULL);
INSERT INTO Rentals VALUES(4,4,'2004-01-01',NULL);
INSERT INTO Rentals VALUES(5,1,'2003-01-01',NULL);
Giving:
movieid     title            rentals     year
----------  ---------------  ----------  ----------
2           a better movie   3           2003
1           a good movie     3           2003
4           the worst movie  2           2004
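Since the question mentions Python, here is a minimal sketch that loads the test data above into an in-memory SQLite database and runs the two-CTE query through the sqlite3 module. The function name top_first_year_rentals is made up for illustration, and the explicit CAST around strftime() is a small addition that makes the integer comparison visible in the query text.

```python
import sqlite3

def top_first_year_rentals(conn):
    """Most-rented movie(s) per release year, counting only first-year rentals."""
    return conn.execute("""
        WITH first_year_rentals(movieid, title, rentals, year) AS (
            SELECT m.movieId, m.title, COUNT(*), m.year
            FROM Movies AS m
            JOIN Rentals AS r
              ON m.movieId = r.movieId
             AND m.year = CAST(strftime('%Y', r.date) AS INTEGER)
            GROUP BY m.movieId, m.title, m.year
        ), maximums(year, maxrent) AS (
            SELECT year, MAX(rentals) FROM first_year_rentals GROUP BY year
        )
        SELECT f.movieid, f.title, f.rentals, f.year
        FROM first_year_rentals AS f
        JOIN maximums AS m ON f.year = m.year AND f.rentals = m.maxrent
        ORDER BY f.year, f.title
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("CREATE TABLE Rentals (cardNo INTEGER, movieId INTEGER, date DATE, rating INTEGER)")
conn.executemany("INSERT INTO Movies VALUES (?, ?, ?)", [
    (1, "a good movie", 2003), (2, "a better movie", 2003),
    (3, "the best movie", 2004), (4, "the worst movie", 2004),
])
conn.executemany("INSERT INTO Rentals VALUES (?, ?, ?, NULL)", [
    (1, 1, "2003-01-01"), (1, 2, "2003-01-01"), (1, 3, "2006-01-01"),
    (2, 1, "2003-01-01"), (2, 3, "2004-01-01"), (2, 2, "2004-01-01"),
    (3, 2, "2003-01-01"), (3, 1, "2005-01-01"), (3, 4, "2004-01-01"),
    (4, 2, "2003-01-01"), (4, 4, "2004-01-01"), (5, 1, "2003-01-01"),
])

winners = top_first_year_rentals(conn)
for movie_id, title, rentals, year in winners:
    print(year, title, rentals)
```

The titles come back with the rows, so there is no need to store the winning movieId in a variable and look the title up separately.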
Further edits:
The mention of analytic functions in the other answer reminded me; sqlite does have them these days (Added in 3.25)! So...
WITH first_year_rentals(movieid, title, rentals, maxrentals, year) AS
(SELECT m.movieid
, m.title
, count(*)
, max(count(*)) OVER (PARTITION BY m.year)
, m.year
FROM movies AS m
JOIN rentals AS r ON m.movieid = r.movieid AND m.year = strftime('%Y', r.date)
GROUP BY m.movieid)
SELECT movieid, title, rentals, year
FROM first_year_rentals
WHERE rentals = maxrentals
ORDER BY year, title;
It uses a window function to combine the two CTEs from the first query into a single one. (There might be an even better way; I'm not super fluent with them yet).
And a different version using the rank suggestion:
WITH first_year_rentals(movieid, title, rentals, ranking, year) AS
(SELECT m.movieid
, m.title
, count(*)
, rank() OVER (PARTITION BY m.year ORDER BY count(*) DESC)
, m.year
FROM movies AS m
JOIN rentals AS r ON m.movieid = r.movieid AND m.year = strftime('%Y', r.date)
GROUP BY m.movieid)
SELECT movieid, title, rentals, year
FROM first_year_rentals
WHERE ranking = 1
ORDER BY year, title;
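For comparison, here is a sketch of a slightly different arrangement of the rank idea: it counts first-year rentals in one CTE and ranks them in a second, so the aggregate is not nested inside the window clause. It assumes SQLite 3.25 or newer; the tiny dataset and the function name ranked_first_year_rentals are invented for illustration.

```python
import sqlite3

def ranked_first_year_rentals(conn):
    """Rank movies within their release year by first-year rental count."""
    return conn.execute("""
        WITH counts AS (
            SELECT m.movieId AS movieid, m.title AS title,
                   m.year AS year, COUNT(*) AS rentals
            FROM Movies AS m
            JOIN Rentals AS r
              ON m.movieId = r.movieId
             AND m.year = CAST(strftime('%Y', r.date) AS INTEGER)
            GROUP BY m.movieId, m.title, m.year
        ), ranked AS (
            SELECT movieid, title, rentals, year,
                   RANK() OVER (PARTITION BY year ORDER BY rentals DESC) AS ranking
            FROM counts
        )
        SELECT movieid, title, rentals, year
        FROM ranked
        WHERE ranking = 1
        ORDER BY year, title
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("CREATE TABLE Rentals (cardNo INTEGER, movieId INTEGER, date TEXT)")
# Hypothetical data: A and B tie in 2003, C is the only 2004 release.
conn.executemany("INSERT INTO Movies VALUES (?, ?, ?)",
                 [(1, "A", 2003), (2, "B", 2003), (3, "C", 2004)])
conn.executemany("INSERT INTO Rentals VALUES (?, ?, ?)", [
    (1, 1, "2003-02-01"), (2, 1, "2003-03-01"),
    (1, 2, "2003-02-01"), (3, 2, "2003-06-01"), (2, 2, "2004-01-01"),
    (1, 3, "2004-05-01"), (2, 3, "2005-01-01"),
])

top = ranked_first_year_rentals(conn)
print(top)
```

Because RANK() gives tied rows the same rank, both 2003 movies are returned; ROW_NUMBER() would instead pick one of them arbitrarily.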
I'm finding it hard to get my head around this problem, and I couldn't find any answers to this specific problem anywhere:
Say I have a table like this, I'm just using fruit as an example:
Fruit | Date | Value
=================================
Apple | 1 | other_random_value
Apple | 2 | some_value_1
Apple | 3 | some_value_2
Pear | 1 | other_random_value
Pear | 2 | unexpected_value_1
Pear | 3 | some_value_2
Everything will be ordered by Fruit, then Date.
Basically, if the last row (for each fruit) is some_value_2, but the one preceding it is not some_value_1, I want to match just those fruits (i.e. in this case, Pear).
So I always expect some_value_2 to come after a row with a certain value for that particular fruit, and if it doesn't, I want to flag errors against those particular fruits. It would also be nice to match cases where nothing precedes some_value_2, though if this is too complicated I could match that separately and just check that some_value_2 is not the first row, which I don't imagine would be a difficult query.
EDIT: Also, being able to match any consecutive rows where the preceding value is unexpected would be nice, though I mainly care about the last 2 rows. So if being able to match all consecutive rows results in a simpler and better performing query, then I might go with that. I'm going to be doing an INSERT at the same time (into an alert table), so if I could flag it as an ERROR if it's the last two rows and a WARNING if it's not, that would be really nifty. Though I wouldn't know where to start with writing a query that does that. Also having a query that performs well is a must, as I will be using this across a large dataset.
EDIT:
This is what I used in the end, it's quite slow, but if I index Date, it's not so bad:
SELECT c.Id AS CId, c.Fruit AS CFruit,
c.Date AS CDate, c.Value AS CValue,
(SELECT Id
FROM fruits
WHERE Fruit = c.Fruit
AND Date >= c.Date
AND Id > c.Id
ORDER BY Date, Id) AS NId, n.Fruit AS NFruit,
n.Date AS NDate, n.Value AS NValue
FROM fruits AS c
JOIN fruits AS n ON n.Id = NId
ORDER BY c.Date, c.Id
I might try Joachim's method again at some point, as I realised I'm getting a lot of results I don't really care much about. Or I might even try incorporating the two somehow and delegate to INFO/ERROR as appropriate...
Solved: I used the same SELECT statement that I used to get NId, and used SELECT COUNT(*) instead of SELECT Id. This told me the number of results after the current one. Then I just used a CASE operator to turn it into a boolean field called Latest :). So I effectively combined Nicolas' and Joachim's methods. Performance still seems OK, probably because SQLite caches the results.
SQLite is (as far as I know) a bit low on efficient operators for this, so this is the best I can come up with for now :)
SELECT Fruit FROM fruits
WHERE ( SELECT COUNT(*) FROM fruits f
WHERE f.fruit=fruits.fruit
AND f.date > fruits.date ) = 1
AND fruits.value <> 'some_value_1'
INTERSECT
SELECT Fruit FROM fruits
WHERE ( SELECT COUNT(*) FROM fruits f
WHERE f.fruit=fruits.fruit
AND f.date > fruits.date ) = 0
AND fruits.value = 'some_value_2'
An SQLfiddle to test with.
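As a local alternative, the query can be run against the sample rows from the question with Python's sqlite3 module. This is a minimal sketch; the Id column and the variable name flagged are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fruits (Id INTEGER PRIMARY KEY, Fruit TEXT, Date INTEGER, Value TEXT)"
)
conn.executemany(
    "INSERT INTO fruits (Fruit, Date, Value) VALUES (?, ?, ?)",
    [("Apple", 1, "other_random_value"), ("Apple", 2, "some_value_1"),
     ("Apple", 3, "some_value_2"), ("Pear", 1, "other_random_value"),
     ("Pear", 2, "unexpected_value_1"), ("Pear", 3, "some_value_2")],
)

# First branch: the second-to-last row (exactly one later row) is not some_value_1.
# Second branch: the last row (no later rows) is some_value_2.
flagged = [row[0] for row in conn.execute("""
    SELECT Fruit FROM fruits
    WHERE (SELECT COUNT(*) FROM fruits f
           WHERE f.Fruit = fruits.Fruit AND f.Date > fruits.Date) = 1
      AND fruits.Value <> 'some_value_1'
    INTERSECT
    SELECT Fruit FROM fruits
    WHERE (SELECT COUNT(*) FROM fruits f
           WHERE f.Fruit = fruits.Fruit AND f.Date > fruits.Date) = 0
      AND fruits.Value = 'some_value_2'
""")]

print(flagged)
```

A fruit whose only row is some_value_2 has no second-to-last row, so this query does not flag it; that case would need the separate check mentioned in the question.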
I named the table fruits. This query gets you the preceding date for a 'key' (fruit + date)
select fruit, date, value currvalue,
(select max(date) precedingDate
from fruits p
where p.fruit = c.fruit
and p.date < c.date) precedingdate
from fruits c ;
From there we can get the preceding value for each key
select f2.fruit, f2.date, f2.value currvalue,
f2.precedingdate, f1.value precedingvalue
from
(select fruit, date, value,
(select max(date)
from fruits p
where p.fruit = c.fruit
and p.date < c.date) precedingdate
from fruits c) f2
join fruits f1
on f1.fruit = f2.fruit and f1.date = f2.precedingdate ;
For all the rows that have a previous row, you get both the current and preceding date and the current and preceding value.
Edit: we add an id, used to choose between rows when several share the same previous date (see comment below)
I will be using intermediate views for the sake of clarity but you could write one big query.
As before, what's the previous date :
create view VFruitsWithPreviousDate
as select fruit, date, value, id,
(select max(date)
from fruits p
where p.fruit = c.fruit
and p.date < c.date) previousdate
from fruits c ;
What's the previous id :
create view VFruitsWithPreviousId
as select fruit, date, value,
(select max(id)
from fruits f
where v.fruit = f.fruit AND
v.previousdate = f.date) previousID
from VFruitsWithPreviousDate v ;
A query for all consecutive rows :
select f.*, v.value
from fruits f
join VFruitsWithPreviousId v on f.id = v.previousid ;
You can then add the condition WHERE v.value = 'some_value_2' AND f.value != 'some_value_1' (in this join, v is the current row and f is the row preceding it)
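The two views can be exercised end to end with Python's sqlite3 module. This is a sketch on the sample rows from the question, with an assumed integer id column; in the final query cur is the row coming from the view and prev is the row its previousid points back to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fruits (id INTEGER PRIMARY KEY, fruit TEXT, date INTEGER, value TEXT);
INSERT INTO fruits (fruit, date, value) VALUES
    ('Apple', 1, 'other_random_value'),
    ('Apple', 2, 'some_value_1'),
    ('Apple', 3, 'some_value_2'),
    ('Pear',  1, 'other_random_value'),
    ('Pear',  2, 'unexpected_value_1'),
    ('Pear',  3, 'some_value_2');

-- Latest earlier date for the same fruit (NULL for a fruit's first row).
CREATE VIEW VFruitsWithPreviousDate AS
    SELECT fruit, date, value, id,
           (SELECT MAX(date) FROM fruits p
            WHERE p.fruit = c.fruit AND p.date < c.date) AS previousdate
    FROM fruits c;

-- Highest id among the rows on that previous date.
CREATE VIEW VFruitsWithPreviousId AS
    SELECT fruit, date, value,
           (SELECT MAX(id) FROM fruits f
            WHERE f.fruit = v.fruit AND f.date = v.previousdate) AS previousid
    FROM VFruitsWithPreviousDate v;
""")

# Pair each row with its predecessor and keep the unexpected transitions.
suspicious = conn.execute("""
    SELECT cur.fruit, prev.value, cur.value
    FROM VFruitsWithPreviousId cur
    JOIN fruits prev ON prev.id = cur.previousid
    WHERE cur.value = 'some_value_2' AND prev.value != 'some_value_1'
""").fetchall()

print(suspicious)
```

Rows with no predecessor have a NULL previousid and drop out of the inner join, so a some_value_2 that appears first for a fruit is not reported here.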
SELECT gameratingstblx245v.gameid,avg( gameratingstblx245v.rating ) as avgrating, count(gameratingstblx245v.rating) as count,gamedata.name ,gamedata.gameinfo
FROM gameratingstblx245v
LEFT JOIN gamedata ON gamedata.id = gameratingstblx245v.game_id
WHERE gameratingstblx245v.game_id=gameratingstblx245v.game_id
GROUP BY gameid
ORDER BY avg( gameratingstblx245v.rating ) DESC LIMIT 0,8
Table gameratingstblx245v - gameid, rating
Table gamedata - id, gameinfo, name, releasedate
This is the query I am currently using to extract data from the two tables gamedata and gameratingstblx245v. It takes the average of all the ratings in gameratingstblx245v, sorted in descending order of average rating, and also pulls the related info for the selected gameids from gamedata.
What I now want to extract is the top average ratings from gameratingstblx245v, but only for games whose releasedate in gamedata is within the last 90 days.
Help would be appreciated. Thanks
Here's how I'd design that query:
SELECT d.id, d.name, d.gameinfo,
AVG(r.rating) AS avgrating, COUNT(r.rating) AS count
FROM gamedata d
LEFT JOIN gameratingstblx245v r ON (d.id = r.game_id)
WHERE d.releasedate BETWEEN NOW() - INTERVAL 90 DAY AND NOW()
GROUP BY d.id
ORDER BY avgrating DESC LIMIT 0,8;
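To experiment with the 90-day filter without a MySQL server, the sketch below reproduces the idea in SQLite through Python's sqlite3 module. Note the swap: SQLite has no NOW() or INTERVAL, so date('now', '-90 days') is used instead, and releasedate is assumed to be stored as ISO text (YYYY-MM-DD). The sample games, ratings and the num_ratings alias are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gamedata (id INTEGER PRIMARY KEY, name TEXT, gameinfo TEXT, releasedate TEXT);
CREATE TABLE gameratingstblx245v (game_id INTEGER, rating INTEGER);

-- Release dates are relative to today so the example does not go stale.
INSERT INTO gamedata VALUES
    (1, 'recent hit',  'info', date('now', '-10 days')),
    (2, 'recent flop', 'info', date('now', '-30 days')),
    (3, 'old classic', 'info', date('now', '-200 days'));
INSERT INTO gameratingstblx245v VALUES (1, 5), (1, 4), (2, 2), (3, 5), (3, 5);
""")

recent_top = conn.execute("""
    SELECT d.id, d.name, AVG(r.rating) AS avgrating, COUNT(r.rating) AS num_ratings
    FROM gamedata d
    LEFT JOIN gameratingstblx245v r ON d.id = r.game_id
    WHERE d.releasedate BETWEEN date('now', '-90 days') AND date('now')
    GROUP BY d.id
    ORDER BY avgrating DESC
    LIMIT 8
""").fetchall()

print(recent_top)
```

The old game has the highest average rating but is filtered out by the date range before the averages are computed.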