calculate percentages with postgresql join queries - percentage

I am trying to calculate percentages by joining 3 tables data to get the percentages of positive_count, negative_count, neutral_count of each user's tweets. I have succeeded in getting positive, negative and neutral counts, but failing to get the same as percentages instead of counts. Here is the query to get counts:
SELECT
t1.u_id,count() as total_tweets_count ,
(
SELECT count() from t1,t2,t3 c
WHERE
t1.u_id='18839785' AND
t1.u_id=t2.u_id AND
t2.ts_id=t3.ts_id AND
t3.sentiment='Positive'
) as pos_count ,
(
SELECT count() from t1,t2,t3
WHERE
t1.u_id='18839785' AND
t1.u_id=t2.u_id AND
t2.ts_id=t3.ts_id AND
t3.sentiment='Negative'
) as neg_count ,
(
SELECT count() from t1,t2,t3
WHERE
t1.u_id='18839785' AND
t1.u_id=t2.u_id AND
t2.ts_id=t3.ts_id AND
t3.sentiment='Neutral'
) as neu_count
FROM t1,t2,t3
WHERE
t1.u_id='18839785' AND
t1.u_id=t2.u_id AND
t2.ts_id=t3.ts_id
GROUP BY t1.u_id;
**OUTPUT:**
u_id | total_tweets_count | pos_count | neg_count | neu_count
-----------------+--------------------+-----------+-----------+-------
18839785| 88 | 38 | 25 | 25
(1 row)
Now I want the same in percentages instead of counts. I have written the query in the following way but failed.
SELECT
total_tweets_count,pos_count,
round((pos_count * 100.0) / total_tweets_count, 2) AS pos_per,neg_count,
round((neg_count * 100.0) / total_tweets_count, 2) AS neg_per,
neu_count, round((neu_count * 100.0) / total_tweets_count, 2) AS neu_per
FROM (
SELECT
count(*) as total_tweets_count,
count(
a.u_id='18839785' AND
a.u_id=b.u_id AND
b.ts_id=c.ts_id AND
c.sentiment='Positive'
) AS pos_count,
count(
a.u_id='18839785' AND
a.u_id=b.u_id AND
b.ts_id=c.ts_id AND
c.sentiment='Negative'
) AS neg_count,
count(
a.u_id='18839785' AND
a.u_id=b.u_id AND
b.ts_id=c.ts_id AND
c.sentiment='Neutral') AS neu_count
FROM t1,t2, t3
WHERE
a.u_id='18839785' AND
a.u_id=b.u_id AND
b.ts_id=c.ts_id
GROUP BY a.u_id
) sub;
Can anyone help me out in achieving as percentages for each user data as below?
u_id | total_tweets_count | pos_count | neg_count | neu_count
------------------+--------------------+-----------+-----------+-----
18839785| 88 | 43.18 | 28.4 | 28.4
(1 row)

I am not entirely sure what you are looking for.
For starters, you can simplify your query by using conditional aggregation instead of three scalar subqueries (which btw. do not need to repeat the where condition on a.u_id)
You state you want to "count for all users", so you need to remove the WHERE clause in the main query. The simplification also gets rid of the repeated WHERE condition.
select u_id,
total_tweets_count,
pos_count,
round((pos_count * 100.0) / total_tweets_count, 2) AS pos_per,
neg_count,
round((neg_count * 100.0) / total_tweets_count, 2) AS neg_per,
neu_cont,
round((neu_count * 100.0) / total_tweets_count, 2) AS neu_per
from (
SELECT
t1.u_id,
count(*) as total_tweets_count,
count(case when t3.sentiment='Positive' then 1 end) as pos_count,
count(case when t3.sentiment='Negative' then 1 end) as neg_count,
count(case when t3.sentiment='Neutral' then 1 end) as neu_count
FROM t1
JOIN t2 ON t1.u_id=t2.u_id
JOIN t3 t2.ts_id=t3.ts_id
-- no WHERE condition on the u_id here
GROUP BY t1.u_id
) t
Note that I replaced the outdated, ancient and fragile implicit joins in the WHERE clause with "modern" explicit JOIN operators
With a more up-do-date Postgres version, the expression count(case when t3.sentiment='Positive' then 1 end) as pos_count can also be re-written to:
count(*) filter (where t3.sentiment='Positive') as pos_count
which is a bit more readable (and understandable I think).
In your query you can achieve the repetition of the global WHERE condition on the u_id by using a co-related subquery, e.g.:
(
SELECT count(*)
FROM t1 inner_t1 --<< use different aliases than in the outer query
JOIN t2 inner_t2 ON inner_t2.u_id = inner_t1.u_id
JOIN t3 inner_t3 ON inner_t3.ts_id = inner_t2.ts_id
-- referencing the outer t1 removes the need to repeat the hardcoded ID
WHERE innter_t1.u_id = t1.u_id
) as pos_count
The repetition of the table t1 isn't necessary either, so the above could be re-written to:
(
SELECT count(*)
FROM t2 inner_t2
JOIN t3 inner_t3 ON inner_t3.ts_id = inner_t2.ts_id
WHERE inner_t2.u_id = t1.u_id --<< this references the outer t1 table
) as pos_count
But the version with conditional aggregation will still be a lot faster than using three scalar sub-queries (even if you remove the unnecessary repetition of the t1 table).

Related

SQLite query JOIN , CONCAT, subquery

I have been trying to use LEFT OUTER JOIN, GROUP BY and (failing) to use the CONCAT || function to get the maximum score for the best book of the decade but have had no luck finding out when two books get the same max score in the same decade.
I need the output below:
There are 2 tables:
Table 1: bookName
Schema: uniqueBookNameId, BookName, yearPublished (from 1901 to 2022)
Table 2: bookRating
Schema: uniqueBookNameId, bookRating
Use a CTE where you join the tables and rank the books with RANK() window function.
Then filter the results to get the top books of each decade:
WITH cte AS (
SELECT r.bookRating,
n.BookName,
n.yearPublished / 10 * 10 || 's' AS Decade,
RANK() OVER (PARTITION BY n.yearPublished / 10 * 10 ORDER BY r.bookRating DESC) AS rnk
FROM bookName n INNER JOIN bookRating r
ON r.uniqueBookNameId = n.uniqueBookNameId
)
SELECT DISTINCT bookRating, BookName, Decade
FROM cte
WHERE rnk = 1
ORDER BY Decade;
or:
WITH cte AS (
SELECT r.bookRating,
n.BookName,
n.yearPublished / 10 * 10 || 's' AS Decade,
RANK() OVER (PARTITION BY n.yearPublished / 10 * 10 ORDER BY r.bookRating DESC) AS rnk
FROM bookName n INNER JOIN bookRating r
ON r.uniqueBookNameId = n.uniqueBookNameId
)
SELECT bookRating, GROUP_CONCAT(DISTINCT BookName) AS BookName, Decade
FROM cte
WHERE rnk = 1
GROUP BY Decade
ORDER BY Decade;

SQLite: Calculate how a counter has increased in current day and week

I have a SQLite database with a counter and timestamp in unixtime as showed below:
+---------+------------+
| counter | timestamp |
+---------+------------+
| | 1582933500 |
| 1 | |
+---------+------------+
| 2 | 1582933800 |
+---------+------------+
| ... | ... |
+---------+------------+
I would like to calculate how 'counter' has increased in current day and current week.
It is possible in a SQLite query?
Thanks!
Provided you have SQLite version >= 3.25.0 the SQLite window functions will help you achieve this.
Using the LAG function to retrieve the value from the previous record - if there is none (which will be the case for the first row) a default value is provided, that is same as current row.
For the purpose of demonstration this code:
SELECT counter, timestamp,
LAG (timestamp, 1, timestamp) OVER (ORDER BY counter) AS previous_timestamp,
(timestamp - LAG (timestamp, 1, timestamp) OVER (ORDER BY counter)) AS diff
FROM your_table
ORDER BY counter ASC
will give this result:
1 1582933500 1582933500 0
2 1582933800 1582933500 300
In a CTE get the min and max timestamp for each day and join it twice to the table:
with cte as (
select date(timestamp, 'unixepoch', 'localtime') day,
min(timestamp) mindate, max(timestamp) maxdate
from tablename
group by day
)
select c.day, t2.counter - t1.counter difference
from cte c
inner join tablename t1 on t1.timestamp = c.mindate
inner join tablename t2 on t2.timestamp = c.maxdate;
With similar code get the results for each week:
with cte as (
select strftime('%W', date(timestamp, 'unixepoch', 'localtime')) week,
min(timestamp) mindate, max(timestamp) maxdate
from tablename
group by week
)
select c.week, t2.counter - t1.counter difference
from cte c
inner join tablename t1 on t1.timestamp = c.mindate
inner join tablename t2 on t2.timestamp = c.maxdate;

SQL - Performing additional math calculations on results obtained from aggregate functions within a query

I am fairly new to writing SQL queries. I have used a subquery so that I could obtain the results of aggregate functions applied to 2 different tables. Furthermore, I would like to obtain the ratio between the results from these 2 aggregate functions. In other words, Result of Aggregate function 1 / Result of Aggregate function 2.
INPUT:
TABLE USERS
Id
1
2
3
4
5
6
7
TABLE EVENTS
User_Id Event_Name
1 View_User_Profile
1 View_User_Activity
1 View_User_Profile
2 View_User_Activity
3 View_User_Activity
4 View_User_Profile
5 View_User_Activity
7 View_User_Activity
This is my code so far:
SELECT COUNT(*) AS Number_of_Users,
(SELECT COUNT(DISTINCT U.Id) AS Number_of_Users_Viewed_Profile
FROM dsv1069.Users U Left Join dsv1069.Events E
ON U.Id = E.user_id
WHERE E.Event_Name = 'view_user_profile') AS Number_of_Users_Viewed_Profile
FROM dsv1069.Users
RESULTS:
Number_of_Users: 7
Number_of_Users_Viewed_Profile: 2
EXPECTED OUTPUT:
Number_of_Users: 7
Number_of_Users_Viewed_Profile: 2
PERCENT OF USERS VIEWED PROFILE: 28.6%
ISSUE: What my code doesn't do so far is calculate the ratio 2/7 = 28.6%
I have done lots of searches on aggregate functions but can't find any information on how to use the results from those functions as part of the query. Thank you for any assistance!
I believe that you can obtain the column Number_of_Users_Viewed_Profile without the join:
SELECT COUNT(DISTINCT user_id)
FROM dsv1069.Events
WHERE Event_Name = 'view_user_profile'
So this is equivalent to your query:
SELECT
COUNT(*) AS Number_of_Users,
(
SELECT COUNT(DISTINCT user_id)
FROM dsv1069.Events
WHERE Event_Name = 'View_User_Profile'
) AS Number_of_Users_Viewed_Profile
FROM dsv1069.Users
You can get the ratio column with a CTE:
WITH cte AS (
SELECT
COUNT(*) AS Number_of_Users,
(
SELECT COUNT(DISTINCT user_id)
FROM Events
WHERE Event_Name = 'View_User_Profile'
) AS Number_of_Users_Viewed_Profile
FROM Users
)
SELECT *, ROUND(100.0 * Number_of_Users_Viewed_Profile / Number_of_Users, 1) AS ratio
FROM cte
See the demo.
Or without the CTE:
SELECT
t1.Number_of_Users,
t2.Number_of_Users_Viewed_Profile,
ROUND(100.0 * t2.Number_of_Users_Viewed_Profile / t1.Number_of_Users, 1) AS ratio
FROM (SELECT COUNT(*) AS Number_of_Users FROM Users) AS t1
CROSS JOIN (
SELECT COUNT(DISTINCT user_id) AS Number_of_Users_Viewed_Profile
FROM Events
WHERE Event_Name = 'View_User_Profile'
) AS t2
See the demo.
Results:
| Number_of_Users | Number_of_Users_Viewed_Profile | ratio |
| --------------- | ------------------------------ | ----- |
| 7 | 2 | 28.6 |

Selecting the n'th range/island of rows where columns have a common value?

I need to select all rows (for a range) which have a common value within a column.
For example (starting from the last row)
I try to select all of the rows where _user_id == 1 until _user_id != 1 ?
In this case resulting in selecting rows [4, 5, 6]
+------------------------+
| _id _user_id amount |
+------------------------+
| 1 1 777 |
| 2 2 1 |
| 3 2 11 |
| 4 1 10 |
| 5 1 100 |
| 6 1 101 |
+------------------------+
/*Create the table*/
CREATE TABLE IF NOT EXISTS t1 (
_id INTEGER PRIMARY KEY AUTOINCREMENT,
_user_id INTEGER,
amount INTEGER);
/*Add the datas*/
INSERT INTO t1 VALUES(1, 1, 777);
INSERT INTO t1 VALUES(2, 2, 1);
INSERT INTO t1 VALUES(3, 2, 11);
INSERT INTO t1 VALUES(4, 1, 10);
INSERT INTO t1 VALUES(5, 1, 100);
INSERT INTO t1 VALUES(6, 1, 101);
/*Check the datas*/
SELECT * FROM t1;
1|1|777
2|2|1
3|2|11
4|1|10
5|1|100
6|1|101
In my attempt I use Common Table Expressions to group the results of _user_id. This gives the index of the last row containing a unique value (eg. SELECT _id FROM t1 GROUP BY _user_id LIMIT 2; will produce: [6, 3])
I then use those two values to select a range where LIMIT 1 OFFSET 1 is the lower end (3) and LIMIT 1 is the upper end (6)
WITH test AS (
SELECT _id FROM t1 GROUP BY _user_id LIMIT 2
) SELECT * FROM t1 WHERE _id BETWEEN 1+ (
SELECT * FROM test LIMIT 1 OFFSET 1
) and (
SELECT * FROM test LIMIT 1
);
Output:
4|1|10
5|1|100
6|1|101
This appears to work ok at selecting the last "island" but what I really need is a way to select the n'th island.
Is there a way to generate a query capable of producing outputs like these when provided a parameter n?:
island (n=1):
4|1|10
5|1|100
6|1|101
island (n=2):
2|2|1
3|2|11
island (n=3):
1|1|777
Thanks!
SQL tables are unordered, so the only way to search for islands is to search for consecutive _id values:
WITH RECURSIVE t1_with_islands(_id, _user_id, amount, island_number) AS (
SELECT _id,
_user_id,
amount,
1
FROM t1
WHERE _id = (SELECT max(_id)
FROM t1)
UNION ALL
SELECT t1._id,
t1._user_id,
t1.amount,
CASE WHEN t1._user_id = t1_with_islands._user_id
THEN island_number
ELSE island_number + 1
END
FROM t1
JOIN t1_with_islands ON t1._id = (SELECT max(_id)
FROM t1
WHERE _id < t1_with_islands._id)
)
SELECT *
FROM t1_with_islands
ORDER BY _id;

Retrieve a table to tallied numbers, best way

I have query that runs as part of a function which produces a one row table full of counts, and averages, and comma separated lists like this:
select
(select
count(*)
from vw_disp_details
where round = 2013
and rating = 1) applicants,
(select
count(*)
from vw_disp_details
where round = 2013
and rating = 1
and applied != 'yes') s_applicants,
(select
LISTAGG(discipline, ',')
WITHIN GROUP (ORDER BY discipline)
from (select discipline,
count(*) discipline_number
from vw_disp_details
where round = 2013
and rating = 1
group by discipline)) disciplines,
(select
LISTAGG(discipline_count, ',')
WITHIN GROUP (ORDER BY discipline)
from (select discipline,
count(*) discipline_count
from vw_disp_details
where round = 2013
and rating = 1
group by discipline)) disciplines_count,
(select
round(avg(util.getawardstocols(application_id,'1','AWARD_NAME')), 2)
from vw_disp_details
where round = 2013
and rating = 1) average_award_score,
(select
round(avg(age))
from vw_disp_details
where round = 2013
and rating = 1) average_age
from dual;
Except that instead of 6 main sub-queries there are 23.
This returns something like this (if it were a CSV):
applicants | s_applicants | disciplines | disciplines_count | average_award_score | average_age
107 | 67 | "speed,accuracy,strength" | 3 | 97 | 23
Now I am programmatically swapping out the "rating = 1" part of the where clauses for other expressions. They all work rather quickly except for the "rating = 1" one which takes about 90 seconds to run and that is because the rating column in the vw_disp_details view is itself compiled by a sub-query:
(SELECT score
FROM read r,
eval_criteria_lookup ecl
WHERE r.criteria_id = ecl.criteria_id
AND r.application_id = a.lgo_application_id
AND criteria_description = 'Overall Score'
AND type = 'ABC'
) reader_rank
So when the function runs this extra query seems to slow everything down dramatically.
My question is, is there a better (more efficient) way to run a query like this that is basically just a series of counts and averages, and how can I refactor to optimize the speed so that the rating = 1 query doesn't take 90 seconds to run.
You could choose to MATERIALIZE the vw_disp_details VIEW. That would pre-calculate the value of the rating column. There are various options for how up-to-date a materialized view is kept, you would probably want to use the ON COMMIT clause so that vw_disp_details is always correct.
Have a look at the official documentation and see if that would work for you.
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6002.htm
Do all most of your queries in only one. Instead of doing:
select
(select (count(*) from my_tab) as count_all,
(select avg(age) from my_tab) as avg_age,
(select avg(mypkg.get_award(application_id) from my_tab) as_avg-app_id
from dual;
Just do:
select count(*), avg(age),avg(mypkg.get_award(application_id)) from my_tab;
And then, maybe you can do some union all for the other results. But this step all by itself should help.
I was able to solve this issue by doing two things: creating a new view that displayed only the results I needed, which gave me marginal gains in speed, and in that view moving the where clause of the sub-query that caused the lag into the where clause of the view and tacking on the result of the sub-query as column in the view. This still returns the same results thanks to the fact that there are always going to be records in the table the sub-query accessed for each row of the view query.
SELECT
a.application_id,
util.getstatus (a.application_id) status,
(SELECT score
FROM applicant_read ar,
eval_criteria_lookup ecl
WHERE ar.criteria_id = ecl.criteria_id
AND ar.application_id = a.application_id
AND criteria_description = 'Overall Score' //THESE TWO FIELDS
AND type = 'ABC' //ARE CRITERIA_ID = 15
) score
as.test_total test_total
FROM application a,
applicant_scores as
WHERE a.application_id = as.application_id(+);
Became
SELECT
a.application_id,
util.getstatus (a.application_id) status,
ar.score,
as.test_total test_total
FROM application a,
applicant_scores as,
applicant_read ar
WHERE a.application_id = as.application_id(+)
AND ar.application_id = a.application_id(+)
AND ar.criteria_id = 15;

Resources