SQL select column and count from the next column - count

My table structure for referrals is as follows; the field ref is unique:
ID  pid  ref  ref_by
1   1    k    NAN
2   2    l    k
3   3    m    k
4   4    n    l
And the user table is:
id  name
1   john
2   Bob
3   Tim
4   Rob
I need to get the pid, name, and ref, with the count of referrals in the next column. Based on the number of referrals, each user is assigned points at a constant 100 per referral, so the result should look like this:
pid  name  ref  number_of_referals  points_earned
1    john  k    2                   200
2    Bob   l    1                   100
3    Tim   m    0                   0
4    Rob   n    0                   0

You need two joins: the first from the users table to referrals, and the second to a subquery that groups and counts the referrals:
select
  r.pid, u.name, r.ref,
  coalesce(c.counter, 0) as number_of_referals,
  coalesce(c.counter, 0) * 100 as points_earned
from users u
inner join referrals r on r.pid = u.id
left join (
  select ref_by, count(*) as counter
  from referrals
  group by ref_by
) c on c.ref_by = r.ref
order by r.pid
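Since ref is unique, the same result can also be reached with a single aggregation instead of a derived table; a minimal sketch of that equivalent variant:
select
  r.pid, u.name, r.ref,
  count(r2.id) as number_of_referals,       -- counts 0 when no referral matches
  count(r2.id) * 100 as points_earned
from users u
inner join referrals r on r.pid = u.id
left join referrals r2 on r2.ref_by = r.ref
group by r.pid, u.name, r.ref
order by r.pid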

Related

Group by having clause in teradata

A Teradata table looks like this:
Group_categ  id
A            1
A            2
A            3
A            5
A            8
A            9
B            11
C            1
C            2
C            3
C            4
I need to collapse it like this:
Group_categ  min_id  max_id
A            1       2
A            3       5
A            8       9
B            11      11
C            1       2
C            3       4
Seems you want to combine two consecutive rows into a single row:
SELECT Group_categ, Min(id), Max(id)
FROM
 (
   SELECT
     Group_categ, id,
     -- assign the same value to two consecutive rows: 0,0,1,1,2,2,...
     -- -> used in the outer GROUP BY
     (Row_Number() Over (PARTITION BY Group_categ ORDER BY id) - 1) / 2 AS grp
   FROM mytab
 ) AS dt
GROUP BY Group_categ, grp
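For Group_categ A, for instance, the derived table dt contains the rows below; the outer GROUP BY then collapses each grp value into one output row with its Min(id) and Max(id):
Group_categ  id  grp
A            1   0
A            2   0
A            3   1
A            5   1
A            8   2
A            9   2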

SQLite - Count Distinct with three column groupings ignores last grouping

I'm working on a database about directors from US companies. One specific table I've created from a query contains all directors who were in a company 5 years before it declared bankruptcy (query below):
CREATE TABLE IF NOT EXISTS y_query AS
SELECT
  A.dir_id, A.linked_dir_ir, A.bankrupt_co_id, A.co_name, A.event_date,
  B.conn_co_id, B.conn_co_name, B.conn_co_type, B.conn_co_start_date, B.conn_co_end_date,
  (CASE WHEN conn_co_start_date >= event_date THEN 1 ELSE 0 END) AS dir_hired
FROM (
  SELECT
    C.dir_id, C.linked_dir_ir, C.conn_co_id AS bankrupt_co_id, C.overlap_start_date, C.overlap_end_date, C.conn_co_type,
    D.co_name, D.filing_date, D.event_date
  FROM director_network C
  INNER JOIN company_bankruptcy D
    ON C.conn_co_id = D.co_id
  WHERE (C.overlap_end_date >= DATE(D.event_date, '-5 years')) AND
        (C.overlap_end_date <= D.event_date)
) A
LEFT OUTER JOIN company_network B
  ON A.dir_id = B.dir_id;
(linked_dir_ir should read linked_dir_id but I'm on a slow computer and it would take ~1hr to change the column name)
So, that table is fine; the query takes a while to run, but it works as intended. Now I need to count the number of directors (linked_dir_ir) a given director (dir_id) was associated with (i.e., each row is a connection), for each bankrupt company (bankrupt_co_id) the director was in and each connected company (conn_co_id). There can be many rows connecting a pair of directors, because a new entry is made whenever either of them receives a promotion and so on.
(A few sample rows of y_query were shown as an image.)
So, I thought this query would work, but I'm running into problems:
SELECT dir_id, bankrupt_co_id, conn_co_id, COUNT(DISTINCT linked_dir_ir) as conn_dirs
FROM y_query
WHERE bankrupt_co_id != conn_co_id
GROUP BY dir_id, bankrupt_co_id, conn_co_id;
I am not sure why, but this query disregards the last grouping column (conn_co_id): the result is the same for any given dir_id and bankrupt_co_id, where it should also vary based on conn_co_id. A sample of the result (it only changes when dir_id or bankrupt_co_id changes) was shown as an image.
The result is the same as if I'd only grouped by dir_id and bankrupt_co_id, when it should be different for each conn_co_id. I've done a lot of research into GROUP BY statements and how they can be tricky, but I haven't been able to crack this one. I'd greatly appreciate any help on this!
It's hard to reproduce your result, but your query with multiple groupings seems to be fine. See the example below:
CREATE TABLE test (dir_id INTEGER, bankrupt_co_id INTEGER, conn_co_id INTEGER, linked_dir_id INTEGER);
with some dummy data (the INSERT below recreates exactly the rows shown in the output that follows):
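-- rows copied verbatim from the SELECT output below
INSERT INTO test VALUES
  (1,1,1,1), (1,1,1,2), (1,1,1,4), (1,1,1,5), (1,1,1,6), (1,1,1,7),
  (1,1,2,1), (1,1,2,2), (1,1,2,3), (1,2,2,1), (3,3,2,1), (3,1,2,1),
  (3,2,2,1), (3,2,2,4), (1,1,1,3), (1,1,4,4);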
select * from test;
dir_id      bankrupt_co_id  conn_co_id  linked_dir_id
----------  --------------  ----------  -------------
1           1               1           1
1           1               1           2
1           1               1           4
1           1               1           5
1           1               1           6
1           1               1           7
1           1               2           1
1           1               2           2
1           1               2           3
1           2               2           1
3           3               2           1
3           1               2           1
3           2               2           1
3           2               2           4
1           1               1           3
1           1               4           4
The query including your conn_co_id results in:
SELECT dir_id, bankrupt_co_id, conn_co_id, COUNT(DISTINCT linked_dir_id) AS conn_dirs
FROM test
WHERE bankrupt_co_id != conn_co_id
GROUP BY dir_id, bankrupt_co_id, conn_co_id;
dir_id      bankrupt_co_id  conn_co_id  conn_dirs
----------  --------------  ----------  ----------
1           1               2           3
1           1               4           1
3           1               2           1
3           3               2           1
whereas without conn_co_id in the GROUP BY, the top two rows are combined (the conn_co_id shown for that row is then taken from an arbitrary member of the group, since SQLite permits bare columns in aggregate queries):
SELECT dir_id, bankrupt_co_id, conn_co_id, COUNT(DISTINCT linked_dir_id) AS conn_dirs
FROM test
WHERE bankrupt_co_id != conn_co_id
GROUP BY dir_id, bankrupt_co_id;
dir_id      bankrupt_co_id  conn_co_id  conn_dirs
----------  --------------  ----------  ----------
1           1               4           4
3           1               2           1
3           3               2           1

Generate matrix of unique user-item cross-product combinations

I am trying to create a cross-product matrix of unique users in R. I searched for it on SO but could not find what I was looking for. Any help is appreciated.
I have a large dataframe (over a million) and a sample is shown:
df <- data.frame(Products=c('Product a', 'Product b', 'Product a',
                            'Product c', 'Product b', 'Product c'),
                 Users=c('user1', 'user1', 'user2', 'user1',
                         'user2', 'user3'))
Output of df is:
   Products Users
1 Product a user1
2 Product b user1
3 Product a user2
4 Product c user1
5 Product b user2
6 Product c user3
I would like to see two matrices:
The first one will show the number of unique users that had either product (OR), so the output will be something like:
          Product a Product b Product c
Product a                   2         3
Product b         2                   3
Product c         3         3
The second matrix will be the number of unique users that had both products (AND):
          Product a Product b Product c
Product a                   2         1
Product b         2                   1
Product c         1         1
Any help is appreciated.
Thanks
UPDATE:
Here is more clarity: Product a is used by user1 and user2, Product b is used by user1 and user2, and Product c is used by user1 and user3. So in the first matrix, the Product a / Product b cell will be 2, since there are 2 unique users between them; similarly, Product a / Product c will be 3. Whereas in the second matrix they would be 2 and 1, since I want the intersection.
Thanks
Try
lst <- split(df$Users, df$Products)
ln <- length(lst)
m1 <- matrix(0, ln, ln, dimnames=list(names(lst), names(lst)))
# fill the lower triangle with the pairwise union counts
m1[lower.tri(m1, diag=FALSE)] <- combn(seq_along(lst), 2,
       FUN = function(x) length(unique(unlist(lst[x]))))
# mirror into the upper triangle (t() keeps this correct for any ln)
m1[upper.tri(m1)] <- t(m1)[upper.tri(m1)]
m1
#           Product a Product b Product c
# Product a         0         2         3
# Product b         2         0         3
# Product c         3         3         0
Or using outer
f1 <- function(u, v) length(unique(unlist(c(lst[[u]], lst[[v]]))))
res <- outer(seq_along(lst), seq_along(lst), FUN = Vectorize(f1)) * !diag(ln)
dimnames(res) <- rep(list(names(lst)), 2)
res
#           Product a Product b Product c
# Product a         0         2         3
# Product b         2         0         3
# Product c         3         3         0
For the second case (note this relies on each Products/Users pair occurring at most once in df; with repeated pairs, tcrossprod would count rows rather than unique users):
tcrossprod(table(df)) * !diag(3)
#            Products
# Products    Product a Product b Product c
#   Product a         0         2         1
#   Product b         2         0         1
#   Product c         1         1         0

How do I find the highest common number for each group in SQLite?

Here is my table example:
LETTER  NUMBER
a       1
a       2
a       4
b       1
b       2
b       3
c       1
c       2
c       3
d       1
d       2
d       3
e       1
e       2
e       3
The result I want:
LETTER  NUMBER
a       2
b       2
c       2
d       2
e       2
The highest number that matches an 'a' is 4, while it's 3 for the other letters. However, the highest number they all have in common is 2; that is why the result table has 2 for NUMBER.
Does anyone know how I can accomplish this?
Let's call your table l. Here's a horribly inefficient solution:
select l.LETTER, max(l.NUMBER)
from l
where (select count(distinct LETTER) from l)
    = (select count(distinct l2.LETTER) from l as l2 where l2.NUMBER = l.NUMBER)
group by l.LETTER;
Kind of a mess, huh?
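A somewhat tidier (and likely cheaper) variant, sketched against the same table l, is to compute once the set of numbers that every letter has, then take each letter's maximum from that set:
SELECT LETTER, MAX(NUMBER) AS NUMBER
FROM l
WHERE NUMBER IN (
    -- numbers that appear under every distinct letter
    SELECT NUMBER
    FROM l
    GROUP BY NUMBER
    HAVING COUNT(DISTINCT LETTER) = (SELECT COUNT(DISTINCT LETTER) FROM l)
)
GROUP BY LETTER;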

SELECTing "first" (as determined by ORDER BY) row FROM near-duplicate rows (as determined by GROUP BY, HAVING, COUNT) within SQLite

I have a problem which is a bit beyond me (I'm really awfully glad I'm a Beta) involving duplicates (so GROUP BY, HAVING, COUNT), compounded by keeping the solution within the standard functions that came with SQLite. I am using the sqlite3 module from Python.
Example table workers, Columns:
* ID: integer, auto-incrementing
* ColA: integer
* ColB: varchar(20)
* UserType: varchar(20)
* LoadMe: Boolean
(Yes, SQLite's datatypes are nominal)
My data table, Workers, at start looks like:
ID  ColA  ColB  UserType  LoadMe
1   1     a     Alpha     0
2   1     b     Beta      0
3   2     a     Alpha     0
4   2     a     Beta      0
5   2     b     Delta     0
6   2     b     Alpha     0
7   1     a     Delta     0
8   1     b     Epsilon   0
9   1     c     Gamma     0
10  4     b     Delta     0
11  5     a     Alpha     0
12  5     a     Beta      0
13  5     b     Gamma     0
14  5     a     Alpha     0
I would like to enable, for Loading onto trucks at a new factory, all workers who have unique combinations of ColA and ColB. For those duplicates (twins, triplets, etc., perhaps via Bokanovsky's Process) where a combination of ColA and ColB has more than one worker, I would like to select only one from each set. To make the problem harder, I would additionally like to make that one-per-set selection on the basis of UserType, in some form of ORDER BY: I may wish to select the first "duplicate" with a UserType of "Alpha," to work on a frightfully clever problem, or ORDER BY UserType DESC, that I may issue an order for black tunics for the lowest of the workers.
You can see that IDs 9, 10, and 13 have unique combinations of ColA and ColB and are most easily identified. The 1-a, 1-b, 2-a, 2-b, and 5-a combinations, however, have duplicates within them.
My current process, as it stands so far:
0) Everyone comes with a unique ID number. This is done at birth.
1) SET all Workers to LoadMe = 1.
UPDATE Workers
SET LoadMe = 1
2) Find my duplicates based on their similarity in two columns (GROUP BY ColA, ColB):
SELECT Wk1.*
FROM Workers AS Wk1
INNER JOIN (
    SELECT ColA, ColB
    FROM Workers
    GROUP BY ColA, ColB
    HAVING COUNT(*) > 1
) AS Wk2
    ON Wk1.ColA = Wk2.ColA
    AND Wk1.ColB = Wk2.ColB
ORDER BY ColA, ColB
3) SET all of my duplicates to LoadMe = 0.
UPDATE Workers
SET LoadMe = 0
WHERE ID IN (
    SELECT Wk1.ID
    FROM Workers AS Wk1
    INNER JOIN (
        SELECT ColA, ColB
        FROM Workers
        GROUP BY ColA, ColB
        HAVING COUNT(*) > 1
    ) AS Wk2
        ON Wk1.ColA = Wk2.ColA
        AND Wk1.ColB = Wk2.ColB
)
4) For each set of duplicates in my GROUP BY, ORDERed BY UserType, SELECT only one, the first in the list, to have LoadMe SET to 1.
This table would look like:
ID  ColA  ColB  UserType  LoadMe
1   1     a     Alpha     1
2   1     b     Beta      1
3   2     a     Alpha     1
4   2     a     Beta      0
5   2     b     Delta     0
6   2     b     Alpha     1
7   1     a     Delta     0
8   1     b     Epsilon   0
9   1     c     Gamma     1
10  4     b     Delta     1
11  5     a     Alpha     1
12  5     a     Beta      0
13  5     b     Gamma     1
14  5     a     Alpha     0
ORDERed BY ColA, ColB, UserType, then ID, broken out by the GROUP BY columns (and finally spaced for clarity), that same data might look like:
ID  ColA  ColB  UserType  LoadMe
1   1     a     Alpha     1
7   1     a     Delta     0

2   1     b     Beta      1
8   1     b     Epsilon   0

9   1     c     Gamma     1

3   2     a     Alpha     1
4   2     a     Beta      0

6   2     b     Alpha     1
5   2     b     Delta     0

10  4     b     Delta     1

11  5     a     Alpha     1
14  5     a     Alpha     0
12  5     a     Beta      0

13  5     b     Gamma     1
I am confounded on the last step and feel like an Epsilon-minus semi-moron. I had previously been pulling the duplicates out of the database into program space and working within Python, but this situation arises not infrequently and I would like to more permanently solve this.
I like to break a problem like this up a bit. The first step is to identify the unique ColA,ColB pairs:
SELECT ColA,ColB FROM Workers GROUP BY ColA,ColB
Now for each of these pairs you want to find the highest priority record. A join won't work because you'll end up with multiple records for each unique pair but a subquery will work:
SELECT ColA,ColB,
(SELECT id FROM Workers w1
WHERE w1.ColA=w2.ColA AND w1.ColB=w2.ColB
ORDER BY UserType LIMIT 1) AS id
FROM Workers w2 GROUP BY ColA,ColB;
You can change the ORDER BY clause in the subquery to control the priority. LIMIT 1 ensures that the subquery yields only one record; without it, SQLite silently uses the first row the scalar subquery returns, which is not guaranteed to be the one you want.
The result of this query is a list of records to be loaded, with ColA, ColB, id. I would probably work directly from that and get rid of LoadMe, but if you want to keep it you could do this:
BEGIN TRANSACTION;
UPDATE Workers SET LoadMe = 0;
UPDATE Workers SET LoadMe = 1
WHERE id IN (
    SELECT (SELECT id FROM Workers w1
            WHERE w1.ColA = w2.ColA AND w1.ColB = w2.ColB
            ORDER BY UserType LIMIT 1) AS id
    FROM Workers w2 GROUP BY ColA, ColB);
COMMIT;
That clears the LoadMe flag and then sets it to 1 for each of the records returned by our last query. The transaction guarantees that this all takes place or fails as one step and never leaves your LoadMe fields in an inconsistent state.
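One refinement worth noting: when two duplicates share a UserType (IDs 11 and 14 are both 5-a Alphas), ORDER BY UserType alone leaves the pick arbitrary, so adding ID as a tiebreaker makes it deterministic. And assuming a SQLite new enough to ship window functions (3.25+), the one-per-group pick can also be sketched with ROW_NUMBER():
BEGIN TRANSACTION;
UPDATE Workers SET LoadMe = 0;
UPDATE Workers SET LoadMe = 1
WHERE ID IN (
    SELECT ID FROM (
        -- number the rows within each ColA/ColB set, best candidate first
        SELECT ID, ROW_NUMBER() OVER (PARTITION BY ColA, ColB
                                      ORDER BY UserType, ID) AS rn
        FROM Workers)
    WHERE rn = 1);
COMMIT;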
