GROUP BY where items in associated table exist in sqlite3

Suppose I have 2 tables in an sqlite3 database:
table1
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
+----+
table2
+----+-----------+------+
| id | table1_id | col1 |
+----+-----------+------+
|  1 |         1 | A    |
|  2 |         1 | B    |
|  3 |         1 | C    |
|  4 |         2 | A    |
|  5 |         2 | C    |
|  6 |         2 | D    |
|  7 |         2 | E    |
|  8 |         3 | A    |
|  9 |         3 | D    |
| 10 |         3 | E    |
+----+-----------+------+
Expected result
I would like to return all the items from table1 which have associated col1 values of both D and E, namely:
+----+
| id |
+----+
|  2 |
|  3 |
+----+
How can I achieve this using sqlite3?

If table1_id is a foreign key referencing id of table1, then table1 is not needed at all.
Filter table2 down to the rows whose col1 contains 'D' or 'E', group by table1_id,
and then set the condition in the HAVING clause:
SELECT table1_id AS id
FROM table2
WHERE col1 IN ('D', 'E')
GROUP BY table1_id
HAVING COUNT(*) = 2 -- the number of values in the IN list
If there can be duplicates in col1 for a given table1_id, change to:
HAVING COUNT(DISTINCT col1) = 2
Or with a CTE:
WITH cte(col1) AS (VALUES ('D'), ('E'))
SELECT table1_id AS id
FROM table2
WHERE col1 IN cte
GROUP BY table1_id
HAVING COUNT(*) = (SELECT COUNT(*) FROM cte)

Try this:
select t.table1_id from table2 t where t.col1 in ('D', 'E') group by t.table1_id having count(distinct t.col1) = 2;
You could also inner join with table1 to verify that the record exists in both tables; see the sketch below.
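A minimal sketch of that join variant (assuming the table1/table2 schema shown above; not part of the original answer):
SELECT t1.id
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t2.table1_id = t1.id
WHERE t2.col1 IN ('D', 'E')
GROUP BY t1.id
HAVING COUNT(DISTINCT t2.col1) = 2;
For the sample data this returns ids 2 and 3, matching the expected result.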

Related

Create a new column based on another column's value in a data.table in R

Let's say I have a data.table like this:
| x | y |
| - | - |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
And I need to create another column z based on the value of x: if x >= 1 and x <= 3 then the value should be 1, else 0.
| x | y | z |
| - | - | - |
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 0 |
| 5 | 5 | 0 |
I was trying to use dt[, z:= ,]
But I'm not sure how to add the conditions to the function
Use the ifelse function
dt[, z := ifelse(x >= 1 & x <= 3, 1, 0)]
Or, more directly, you can coerce the logical condition to integer (TRUE becomes 1, FALSE becomes 0):
dt[, z := as.integer(x >= 1 & x <= 3)]
Or using data.table's helper %between%:
dt[, z := as.integer(x %between% c(1, 3))]

Trying to count similar users from sqlite database

I have a table like this:
user_id | subscription_id
-------------------------
1 | 1
1 | 2
2 | 3
2 | 4
3 | 1
3 | 2
4 | 3
5 | 3
What I want to do is count, for each user, how many other users have exactly the same subscriptions:
user_id | same_subscriptions
----------------------------
1 | 1
2 | 0
3 | 1
4 | 1
5 | 1
Is this even possible? How can I achieve this...
Best I managed to do is get a table like this with group_concat:
user_id | subscriptions
-----------------------
1 | 1,2
2 | 3,4
3 | 1,2
4 | 3
5 | 3
This is how I achieved it:
SELECT A.user_id, group_concat(B.subscription_id)
FROM Subscriptions A LEFT JOIN Subscriptions B ON
A.user_id=B.user_id GROUP BY A.user_id;
The aggregate function GROUP_CONCAT() does not help in this case, because in SQLite it does not support an ORDER BY clause, so the concatenated lists cannot be compared safely.
But you can use the GROUP_CONCAT() window function instead:
SELECT user_id,
       COUNT(*) OVER (PARTITION BY subs) - 1 AS same_subscriptions
FROM (
  SELECT user_id,
         GROUP_CONCAT(subscription_id) OVER (PARTITION BY user_id ORDER BY subscription_id) AS subs,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY subscription_id DESC) AS rn
  FROM Subscriptions
)
WHERE rn = 1
ORDER BY user_id;
Results:
user_id | same_subscriptions
----------------------------
1       | 1
2       | 0
3       | 1
4       | 1
5       | 1

Creating a variable based on data in predefined vectors and discrete outcomes

Suppose a respondent (id) is asked to make a binary (discrete) choice, selecting either 1 or 2, in each of five tasks (t = 1, 2, 3, 4, 5), giving a panel dataset with five observations per respondent.
If a respondent selects choice 1, the outcome is a fixed value (let's say always 30). If a respondent selects choice 2, the outcome depends on which treatment the respondent is in (there is only one treatment per respondent, since each respondent is randomly assigned to a single treatment). Let's say there are four treatments (a vector), and in each treatment there are five possible outcomes if choice 2 is selected.
That is,
treat1= 1,2,3,4,5
treat2= 6,7,8,9,10
treat3= 11,12,13,14,15
treat4= 16,17,18,19,20
For example, in the case of treat1, if a respondent selects choice 2 in the first task, the outcome is 1. In the second task the respondent selects choice 1, so the outcome is 30 (as always). In the third task, if the respondent selects choice 2, the outcome is 2 (and not 3). That is, if choice 2 is selected for the first time in treat1, pick the first value from the treat1 sequence; if choice 2 is selected for the second time, pick the second value from the treat1 sequence, and so on.
The outcome looks like the below.
+----+---+-----------+--------+---------+
| id | t | treatment | choice | outcome |
+----+---+-----------+--------+---------+
|  1 | 1 |         1 |      2 |       1 |
|  1 | 2 |         1 |      1 |      30 |
|  1 | 3 |         1 |      2 |       2 |
|  1 | 4 |         1 |      1 |      30 |
|  1 | 5 |         1 |      2 |       3 |
|  2 | 1 |         3 |      1 |      30 |
|  2 | 2 |         3 |      2 |      11 |
|  2 | 3 |         3 |      2 |      12 |
|  2 | 4 |         3 |      1 |      30 |
|  2 | 5 |         3 |      2 |      13 |
|  3 | 1 |         2 |      2 |       6 |
|  3 | 2 |         2 |      1 |      30 |
|  3 | 3 |         2 |      1 |      30 |
|  3 | 4 |         2 |      1 |      30 |
|  3 | 5 |         2 |      2 |       7 |
|  4 | 1 |         4 |      1 |      30 |
|  4 | 2 |         4 |      1 |      30 |
|  4 | 3 |         4 |      1 |      30 |
|  4 | 4 |         4 |      2 |      16 |
|  4 | 5 |         4 |      1 |      30 |
|  5 | 1 |         2 |      1 |      30 |
|  5 | 2 |         2 |      1 |      30 |
|  5 | 3 |         2 |      1 |      30 |
|  5 | 4 |         2 |      1 |      30 |
|  5 | 5 |         2 |      2 |       6 |
|  . | . |         . |      . |       . |
|  . | . |         . |      . |       . |
|  . | . |         . |      . |       . |
|  . | . |         . |      . |       . |
|  . | . |         . |      . |       . |
+----+---+-----------+--------+---------+
Since my data has thousands of observations, I was wondering what would be an efficient way to generate the variable outcome.
The id, t, treatment, and choice variables are available in my dataset.
Any thoughts would be appreciated. Thanks.
Another possible approach is to organize the treatments into a data.table, then do a join and update by reference where choice == 2:
#the sequence of treatment when choice==2
DT[choice==2, ri := rowid(id)]
#look up treatment for the sequence
DT[choice==2, outcome := treat[.SD, on=.(treatment, ri), val]]
#set outcome to 30 for choice=1
DT[choice==1, outcome := 30]
#delete column
DT[, ri := NULL]
data:
library(data.table)
treat <- data.table(treatment = rep(1:4, each = 5),
                    ri = rep(1:5, times = 4),
                    val = 1:20)
DT <- fread("id,t,treatment,choice,outcome
1,1,1,2,1
1,2,1,1,30
1,3,1,2,2
1,4,1,1,30
1,5,1,2,3")
DT[, outcome := NULL]
You did not provide any sample data, so I first create some fake data.
Data
set.seed(1)
treat_lkp <- list(trt1 = 1:5, trt2 = 6:10, trt3 = 11:15, trt4 = 16:20)
d_in <- expand.grid(task = 1:5, id = 1:5)
d_in$treatment <- paste0("trt", d_in$id %% 4 + 1)
d_in$choice <- sample(2, NROW(d_in), TRUE)
tidyverse solution
I use a simple tidyverse solution.
library(purrr)
library(dplyr)
d_out <- d_in %>%
  group_by(id) %>%
  mutate(task_new = cumsum(choice == 2)) %>%
  ungroup() %>%
  mutate(outcome = {
    l <- treat_lkp[as.character(d_in$treatment)]
    pmap_dbl(list(task = task_new, choice = choice, set = l),
             function(task, choice, set)
               ifelse(choice == 1, 30, set[task]))
  })
head(d_out)
# # A tibble: 6 x 6
# task id treatment choice task_new outcome
# <int> <int> <chr> <int> <int> <dbl>
# 1 1 1 trt2 1 0 30
# 2 2 1 trt2 1 0 30
# 3 3 1 trt2 2 1 6
# 4 4 1 trt2 2 2 7
# 5 5 1 trt2 1 2 30
# 6 1 2 trt3 2 1 11
Explanation
You first create a list l with the relevant lookup values for your outcome (which depend on treatment). Then you loop over task_new, choice and the lookup set to pick either 30 (for choice == 1) or the right lookup value from l.
Update
Taking the comment into account, we now first need to create a task_new variable which holds the correct position: the first choice == 2 should result in 1, the second in 2, and so on. So we group_by id and add the counter via cumsum. We use task_new in the mutate call after we ungroup the data.

How to update a column with accumulated total of another column in SQLite?

I have a table like this:
COL1 | COL2
------------
1 | NULL
2 | NULL
3 | NULL
4 | NULL
How can I use SQL to update COL2 so that it holds the accumulated total of all rows up to and including the current one? Like this:
COL1 | COL2
------------
1 | 1
2 | 3
3 | 6
4 | 10
Thanks.
Got the answer from my colleague (assume the table name is abc and that its rows are ordered by an id column):
UPDATE abc SET col2 = (
  SELECT temp.t
  FROM (SELECT abc.id, SUM(def.col1) AS t
        FROM abc JOIN abc AS def ON def.id <= abc.id
        GROUP BY abc.id) AS temp
  WHERE abc.id = temp.id
);
Or we can use this:
REPLACE INTO abc
SELECT abc.id, r2.col1, SUM(r2.col1) AS col2
FROM abc JOIN abc AS r2 ON r2.id <= abc.id
GROUP BY abc.id;
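For reference, on SQLite 3.25+ the same running total can be computed with a window function, and on 3.33+ it can be written back with UPDATE ... FROM. A minimal sketch, assuming the same id column the queries above rely on (an alternative technique, not part of the original answer):
UPDATE abc
SET col2 = totals.running
FROM (SELECT id, SUM(col1) OVER (ORDER BY id) AS running
      FROM abc) AS totals
WHERE abc.id = totals.id;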

Impute character values in column [duplicate]

This question already has an answer here:
fill in NA based on the last non-NA value for each group in R [duplicate]
(1 answer)
Closed 5 years ago.
My data looks like this:
Item | Category
A | 1
A |
A |
A | 1
A |
A |
A | 1
B | 2
B |
B |
B | 2
B |
B |
B | 2
B |
B |
I want to impute values and fill the blank cells of the "Category" column with the value corresponding to each "Item". The end result should be like this:
Item | Category
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
How can I do this in R?
We can use fill from tidyverse
library(tidyverse)
df1 %>%
fill(Category)
