I have a table like this:
user_id | subscription_id
-------------------------
1 | 1
1 | 2
2 | 3
2 | 4
3 | 1
3 | 2
4 | 3
5 | 3
What I want to do is count, for each user, how many other users have exactly the same set of subscriptions:
user_id | same_subscriptions
----------------------------
1 | 1
2 | 0
3 | 1
4 | 1
5 | 1
Is this even possible? How can I achieve this...
Best I managed to do is get a table like this with group_concat:
user_id | subscriptions
-----------------------
1 | 1,2
2 | 3,4
3 | 1,2
4 | 3
5 | 3
This is how I achieved it:
SELECT A.user_id, GROUP_CONCAT(B.subscription_id)
FROM Subscriptions A
LEFT JOIN Subscriptions B ON A.user_id = B.user_id
GROUP BY A.user_id;
The aggregate function GROUP_CONCAT() does not help here, because in SQLite it does not accept an ORDER BY clause, so the concatenated lists cannot be compared safely.
But you can use the GROUP_CONCAT() window function instead:
SELECT user_id,
       COUNT(*) OVER (PARTITION BY subs) - 1 AS same_subscriptions
FROM (
  SELECT user_id,
         GROUP_CONCAT(subscription_id) OVER (PARTITION BY user_id ORDER BY subscription_id) AS subs,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY subscription_id DESC) AS rn
  FROM Subscriptions
)
WHERE rn = 1
ORDER BY user_id;
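If you want to try this locally, here is a minimal setup (table and column names taken from the question) that the query above runs against:
CREATE TABLE Subscriptions (user_id INTEGER, subscription_id INTEGER);
INSERT INTO Subscriptions (user_id, subscription_id) VALUES
  (1, 1), (1, 2),
  (2, 3), (2, 4),
  (3, 1), (3, 2),
  (4, 3),
  (5, 3);
-- running the query above against this table returns the results below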
Results:
user_id | same_subscriptions
----------------------------
1 | 1
2 | 0
3 | 1
4 | 1
5 | 1
Suppose I have 2 tables in an sqlite3 database:
table1
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
+----+
table2
+----+-----------+------+
| id | table1_id | col1 |
+----+-----------+------+
|  1 |         1 | A    |
|  2 |         1 | B    |
|  3 |         1 | C    |
|  4 |         2 | A    |
|  5 |         2 | C    |
|  6 |         2 | D    |
|  7 |         2 | E    |
|  8 |         3 | A    |
|  9 |         3 | D    |
| 10 |         3 | E    |
+----+-----------+------+
Expected result
I would like to return all the items from table1 which have associated col1 values of both D and E, namely:
+----+
| id |
+----+
|  2 |
|  3 |
+----+
How can I achieve this using sqlite3?
If table1_id is a foreign key referencing table1.id, then table1 is not needed at all.
You should filter table2, keeping only the rows that contain 'D' or 'E' in col1, and group by table1_id.
Then set the condition in the HAVING clause:
SELECT table1_id AS id
FROM table2
WHERE col1 IN ('D', 'E')
GROUP BY table1_id
HAVING COUNT(*) = 2 -- the number of values in the IN list
If there are duplicates in col1 for each table1_id change to:
HAVING COUNT(DISTINCT col1) = 2
Or with a CTE:
WITH cte(col1) AS (VALUES ('D'), ('E'))
SELECT table1_id AS id
FROM table2
WHERE col1 IN cte
GROUP BY table1_id
HAVING COUNT(*) = (SELECT COUNT(*) FROM cte)
Try this:
select table1_id from table2 t where t.col1 in ('D','E') group by table1_id having count(distinct t.col1) = 2;
You could also inner join with table1 to verify that the record exists in both tables.
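A sketch of that join variant (same HAVING idea as in the first answer; it assumes table1_id references table1.id):
SELECT t1.id
FROM table1 AS t1
JOIN table2 AS t2 ON t2.table1_id = t1.id
WHERE t2.col1 IN ('D', 'E')
GROUP BY t1.id
HAVING COUNT(DISTINCT t2.col1) = 2;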
Suppose a respondent (id) is asked to make a binary (discrete) choice, selecting either 1 or 2, in each of five tasks (t = 1,2,3,4,5), giving a panel dataset with five observations per respondent.
If a respondent selects choice 1, the outcome is a fixed value (say, always 30); if the respondent selects choice 2, the outcome depends on which treatment the respondent is in (each respondent is randomly assigned to exactly one treatment). Say there are four treatments, and each treatment has a sequence of five outcomes used when choice 2 is selected.
That is,
treat1= 1,2,3,4,5
treat2= 6,7,8,9,10
treat3= 11,12,13,14,15
treat4= 16,17,18,19,20
For example, under treat1: if a respondent selects choice 2 in the first task, the outcome is 1. If the respondent selects choice 1 in the second task, the outcome is 30 (as always). If the respondent then selects choice 2 in the third task, the outcome is 2 (and not 3). That is, if choice 2 is selected for the first time under treat1, pick the first value from the treat1 sequence; if choice 2 is selected for the second time, pick the second value from the treat1 sequence, and so on.
The outcome looks like the below.
+----+---+-----------+--------+---------+
| id | t | treatment | choice | outcome |
+----+---+-----------+--------+---------+
|  1 | 1 |         1 |      2 |       1 |
|  1 | 2 |         1 |      1 |      30 |
|  1 | 3 |         1 |      2 |       2 |
|  1 | 4 |         1 |      1 |      30 |
|  1 | 5 |         1 |      2 |       3 |
|  2 | 1 |         3 |      1 |      30 |
|  2 | 2 |         3 |      2 |      11 |
|  2 | 3 |         3 |      2 |      12 |
|  2 | 4 |         3 |      1 |      30 |
|  2 | 5 |         3 |      2 |      13 |
|  3 | 1 |         2 |      2 |       6 |
|  3 | 2 |         2 |      1 |      30 |
|  3 | 3 |         2 |      1 |      30 |
|  3 | 4 |         2 |      1 |      30 |
|  3 | 5 |         2 |      2 |       7 |
|  4 | 1 |         4 |      1 |      30 |
|  4 | 2 |         4 |      1 |      30 |
|  4 | 3 |         4 |      1 |      30 |
|  4 | 4 |         4 |      2 |      16 |
|  4 | 5 |         4 |      1 |      30 |
|  5 | 1 |         2 |      1 |      30 |
|  5 | 2 |         2 |      1 |      30 |
|  5 | 3 |         2 |      1 |      30 |
|  5 | 4 |         2 |      1 |      30 |
|  5 | 5 |         2 |      2 |       6 |
|  . | . |         . |      . |       . |
+----+---+-----------+--------+---------+
Since my data has thousands of observations, I was wondering what would be an efficient way to generate the variable outcome.
The id, t, treatment, and choice variables are available in my dataset.
Any thoughts would be appreciated. Thanks.
Another possible approach is to organize the treatment outcomes into a lookup data.table, then do a join and update by reference where choice == 2:
# sequence number of the choice == 2 rows within each id
DT[choice==2, ri := rowid(id)]
# look up the treatment value for that sequence number
DT[choice==2, outcome := treat[.SD, on=.(treatment, ri), val]]
# set outcome to 30 for choice == 1
DT[choice==1, outcome := 30]
# delete the helper column
DT[, ri := NULL]
data:
library(data.table)
treat <- data.table(treatment = rep(1:4, each = 5),
                    ri = rep(1:5, times = 4),
                    val = 1:20)
DT <- fread("id,t,treatment,choice,outcome
1,1,1,2,1
1,2,1,1,30
1,3,1,2,2
1,4,1,1,30
1,5,1,2,3")
DT[, outcome := NULL]
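Running the steps above on this sample reproduces the outcome column from the question:
DT
#    id t treatment choice outcome
# 1:  1 1         1      2       1
# 2:  1 2         1      1      30
# 3:  1 3         1      2       2
# 4:  1 4         1      1      30
# 5:  1 5         1      2       3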
You did not provide any sample data, so I first create some fake data.
Data
set.seed(1)
treat_lkp <- list(trt1 = 1:5, trt2 = 6:10, trt3 = 11:15, trt4 = 16:20)
d_in <- expand.grid(task = 1:5, id = 1:5)
d_in$treatment <- paste0("trt", d_in$id %% 4 + 1)
d_in$choice <- sample(2, NROW(d_in), TRUE)
tidyverse solution
I use a simple tidyverse solution.
library(purrr)
library(dplyr)
d_out <- d_in %>%
  group_by(id) %>%
  mutate(task_new = cumsum(choice == 2)) %>%
  ungroup() %>%
  mutate(outcome = {
    l <- treat_lkp[as.character(d_in$treatment)]
    pmap_dbl(list(task = task_new, choice = choice, set = l),
             function(task, choice, set)
               ifelse(choice == 1, 30, set[task]))
  })
head(d_out)
# # A tibble: 6 x 6
#    task    id treatment choice task_new outcome
#   <int> <int> <chr>      <int>    <int>   <dbl>
# 1     1     1 trt2           1        0      30
# 2     2     1 trt2           1        0      30
# 3     3     1 trt2           2        1       6
# 4     4     1 trt2           2        2       7
# 5     5     1 trt2           1        2      30
# 6     1     2 trt3           2        1      11
Explanation
First you create a list l with the relevant lookup values for the outcome (which depend on treatment). Then you loop over task_new, choice and the corresponding lookup set to pick either 30 (for choice == 1) or the right lookup value from l.
Update
Taking the comment into account, we first need to create a task_new variable that holds the correct position: the first choice == 2 for a respondent should map to position 1, the second to position 2, and so on. So we group_by(id) and build that counter with cumsum(choice == 2). We then use task_new in the mutate call after ungrouping the data.
I have a table like this:
COL1 | COL2
------------
1 | NULL
2 | NULL
3 | NULL
4 | NULL
How can I use SQL to update COL2 so that it holds the running total of COL1 over all rows up to and including the current one? Like this:
COL1 | COL2
------------
1 | 1
2 | 3
3 | 6
4 | 10
Thanks.
Got the answer from my colleague (assume the table is named abc and also has an id column that defines the row order):
UPDATE abc
SET col2 = (
  SELECT temp.t
  FROM (SELECT abc.id, SUM(def.col1) AS t
        FROM abc JOIN abc AS def ON def.id <= abc.id
        GROUP BY abc.id) AS temp
  WHERE abc.id = temp.id
);
Or we can use this:
REPLACE INTO abc
SELECT abc.id, abc.col1, SUM(r2.col1) AS col2
FROM abc JOIN abc AS r2 ON r2.id <= abc.id
GROUP BY abc.id;
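For what it's worth, the same running total can also be written as a single correlated subquery; this sketch works in SQLite (MySQL does not allow selecting from the table being updated, which is likely why the derived-table wrapper above is needed there):
UPDATE abc
SET col2 = (SELECT SUM(b.col1) FROM abc AS b WHERE b.id <= abc.id);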
I have multiple dataframes like the ones below, each with a unique Id per row. I am trying to find the rows that appear in at least two of the dataframes and collect them into a new dataframe.
For example, the row with Id = 2 appears in all three dataframes, and the row with Id = 3 appears in df1 and df3.
I want a loop that finds these common rows and creates a new dataframe from them.
df1 <- data.frame(Id=c(1,2,3,4),a=c(0,1,0,2),b=c(1,0,1,0),c=c(0,0,4,0))
df2 <- data.frame(Id=c(7,2,5,9),a=c(4,1,9,2),b=c(1,0,1,5),c=c(3,0,7,0))
df3 <- data.frame(Id=c(5,3,2,6),a=c(9,0,1,5),b=c(1,1,0,0),c=c(7,4,0,0))
> df1
Id | a | b | c
---|---|---|---
 1 | 0 | 1 | 0
 2 | 1 | 0 | 0
 3 | 0 | 1 | 4
 4 | 2 | 0 | 0

> df2
Id | a | b | c
---|---|---|---
 7 | 4 | 1 | 3
 2 | 1 | 0 | 0
 5 | 9 | 1 | 7
 9 | 2 | 5 | 0

> df3
Id | a | b | c
---|---|---|---
 5 | 9 | 1 | 7
 3 | 0 | 1 | 4
 2 | 1 | 0 | 0
 6 | 5 | 0 | 0

> expected_output
Id | a | b | c
---|---|---|---
 5 | 9 | 1 | 7
 3 | 0 | 1 | 4
 2 | 1 | 0 | 0
Note: Id is unique within each dataframe.
Also, I want to remove from the original dataframes the duplicated rows that were used to create the new dataframe.
Since no ID appears twice in the same table, we can tabulate the IDs and keep any found twice:
library(data.table)
DTs = lapply(list(df1,df2,df3), data.table)
Id_keep = rbindlist(lapply(DTs, `[`, j = "Id"))[, .N, by=Id][N >= 2L, Id]
DT_keep = Reduce(funion, DTs)[Id %in% Id_keep]
# Id a b c
# 1: 2 1 0 0
# 2: 3 0 1 4
# 3: 5 9 1 7
Your data should be in an object like DTs to begin with, not a bunch of separate named objects.
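For example, if the frames really do exist as separate objects named df1, df2, df3, one way to gather them into such a list (a sketch) is:
# collect the separately named frames into one named list of data.tables
DTs = lapply(mget(c("df1", "df2", "df3")), as.data.table)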
How it works
To get a sense of how it works, examine intermediate objects like
list(df1,df2,df3)
lapply(DTs, `[`, j = "Id")
Reduce(funion, DTs)
Also, read the help files, like ?lapply, ?rbindlist, ?funion.
Combine all of the data frames:
combined <- rbind(df1, df2, df3)
Extract the duplicates:
duplicate_rows <- unique(combined[duplicated(combined), ])
(duplicated(combined) returns a logical vector flagging rows that are duplicates of earlier rows)
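With the three example dataframes from the question, this returns the same rows as the expected output (the row names come from the combined dataframe):
duplicate_rows
#    Id a b c
# 6   2 1 0 0
# 9   5 9 1 7
# 10  3 0 1 4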
SELECT * FROM MyTable
a | b
---+----
1 | 2
2 | 10
2 | 5
3 | 10
I want each number to only appear once in each column, i.e. the result should be:
a | b
---+----
1 | 2
2 | 10
Is this possible?
I suggest adding an ID column. Then do this:
select a, b from mytable m1 where
a not in (select m2.a from mytable m2 where m2.id < m1.id)
and
b not in (select m3.b from mytable m3 where m3.id < m1.id)
If you don't want the result to be decided by the order of the IDs, you can add another column (such as a sequence) and use it in the comparisons instead.
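If this is SQLite and MyTable is an ordinary rowid table, the implicit rowid can play the role of that ID column, so no schema change is needed (a sketch of the same idea):
select a, b from MyTable m1 where
a not in (select m2.a from MyTable m2 where m2.rowid < m1.rowid)
and
b not in (select m3.b from MyTable m3 where m3.rowid < m1.rowid);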