Let's say I have a data table like this:
| x | y |
| - | - |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
And I need to create another column z based on the value of x: if x >= 1 and x <= 3, then the value should be 1, else 0.
| x | y | z |
| - | - | - |
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 0 |
| 5 | 5 | 0 |
I was trying to use dt[, z:= ,]
But I'm not sure how to add the conditions to the function
Use the ifelse() function:
dt[, z := ifelse(x >= 1 & x <= 3, 1, 0)]
Or, more directly, you can coerce the logical condition to integer; TRUE will be 1 and FALSE will be 0:
dt[, z := as.integer(x >= 1 & x <= 3)]
Or using data.table's helper %between%:
dt[, z := as.integer(x %between% c(1, 3))]
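For completeness, a minimal reproducible sketch (assuming the table is a data.table named dt, as in the question):
library(data.table)

dt <- data.table(x = 1:5, y = 1:5)           # rebuild the example table
dt[, z := as.integer(x %between% c(1, 3))]   # any of the three variants above gives the same result
dt
#    x y z
# 1: 1 1 1
# 2: 2 2 1
# 3: 3 3 1
# 4: 4 4 0
# 5: 5 5 0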
I have a table like this:
user_id | subscription_id
-------------------------
1 | 1
1 | 2
2 | 3
2 | 4
3 | 1
3 | 2
4 | 3
5 | 3
What I want to do is count, for each user, how many other users have exactly the same set of subscriptions:
user_id | same_subscriptions
----------------------------
1 | 1
2 | 0
3 | 1
4 | 1
5 | 1
Is this even possible? How can I achieve this...
Best I managed to do is get a table like this with group_concat:
user_id | subscriptions
-----------------------
1 | 1,2
2 | 3,4
3 | 1,2
4 | 3
5 | 3
This is how I achieved it:
SELECT A.user_id, group_concat(B.subscription_id)
FROM Subscriptions A
LEFT JOIN Subscriptions B ON A.user_id = B.user_id
GROUP BY A.user_id;
The aggregate function GROUP_CONCAT() does not help in this case because, in SQLite, it does not support an ORDER BY clause, so the concatenated lists cannot be compared safely.
But you can use the GROUP_CONCAT() window function instead:
SELECT user_id,
       COUNT(*) OVER (PARTITION BY subs) - 1 AS same_subscriptions
FROM (
  -- build each user's full, ordered subscription list; the row with rn = 1
  -- (the highest subscription_id) carries the complete list
  SELECT user_id,
         GROUP_CONCAT(subscription_id) OVER (PARTITION BY user_id ORDER BY subscription_id) AS subs,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY subscription_id DESC) AS rn
  FROM Subscriptions
)
WHERE rn = 1
ORDER BY user_id;
Results:
user_id | same_subscriptions
----------------------------
1 | 1
2 | 0
3 | 1
4 | 1
5 | 1
Suppose a respondent (id) is asked to make a binary (discrete) choice, selecting either 1 or 2, in each of five tasks (t = 1, 2, 3, 4, 5), giving a panel dataset with five observations per respondent.
If a respondent selects choice 1, the outcome is a fixed value (say, always 30). If a respondent selects choice 2, the outcome depends on which treatment the respondent is in (each respondent is randomly assigned to exactly one treatment, so there is only one treatment per respondent). Let's say there are four treatments (each a vector), and within each treatment there are five possible outcomes when choice 2 is selected.
That is,
treat1= 1,2,3,4,5
treat2= 6,7,8,9,10
treat3= 11,12,13,14,15
treat4= 16,17,18,19,20
For example, under treat1: if a respondent selects choice 2 in the first task, the outcome is 1. If the respondent selects choice 1 in the second task, the outcome is 30 (as always). If the respondent selects choice 2 in the third task, the outcome is 2 (not 3). That is, if choice 2 is selected for the first time under treat1, pick the first value from the treat1 sequence; if choice 2 is selected for the second time, pick the second value from the treat1 sequence, and so on.
The outcome looks like the table below.
+----+---+-----------+--------+---------+
| id | t | treatment | choice | outcome |
+----+---+-----------+--------+---------+
| 1 | 1 | 1 | 2 | 1 |
| 1 | 2 | 1 | 1 | 30 |
| 1 | 3 | 1 | 2 | 2 |
| 1 | 4 | 1 | 1 | 30 |
| 1 | 5 | 1 | 2 | 3 |
| 2 | 1 | 3 | 1 | 30 |
| 2 | 2 | 3 | 2 | 11 |
| 2 | 3 | 3 | 2 | 12 |
| 2 | 4 | 3 | 1 | 30 |
| 2 | 5 | 3 | 2 | 13 |
| 3 | 1 | 2 | 2 | 6 |
| 3 | 2 | 2 | 1 | 30 |
| 3 | 3 | 2 | 1 | 30 |
| 3 | 4 | 2 | 1 | 30 |
| 3 | 5 | 2 | 2 | 7 |
| 4 | 1 | 4 | 1 | 30 |
| 4 | 2 | 4 | 1 | 30 |
| 4 | 3 | 4 | 1 | 30 |
| 4 | 4 | 4 | 2 | 16 |
| 4 | 5 | 4 | 1 | 30 |
| 5 | 1 | 2 | 1 | 30 |
| 5 | 2 | 2 | 1 | 30 |
| 5 | 3 | 2 | 1 | 30 |
| 5 | 4 | 2 | 1 | 30 |
| 5 | 5 | 2 | 2 | 6 |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
+----+---+-----------+--------+---------+
Since my data has thousands of observations, I was wondering what would be an efficient way to generate the variable outcome.
The id, t, treatment, and choice variables are available in my dataset.
Any thoughts would be appreciated. Thanks.
Another possible approach is to organize the treatment outcomes into a lookup data.table, then do a join and update by reference where choice == 2:
# running count of choice==2 rows within each id
DT[choice==2, ri := rowid(id)]
# look up the treatment outcome for that position
DT[choice==2, outcome := treat[.SD, on=.(treatment, ri), val]]
# set outcome to 30 for choice==1
DT[choice==1, outcome := 30]
# drop the helper column
DT[, ri := NULL]
data:
library(data.table)
treat <- data.table(treatment = rep(1:4, each = 5),
                    ri = rep(1:5, times = 4),
                    val = 1:20)
DT <- fread("id,t,treatment,choice,outcome
1,1,1,2,1
1,2,1,1,30
1,3,1,2,2
1,4,1,1,30
1,5,1,2,3")
DT[, outcome := NULL]
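As a quick sanity check (not part of the original answer), running the four update lines from the top of this answer on the sample data above should reproduce the outcome column from the question, roughly as follows:
# after creating treat and DT as above and running the update steps:
DT
#    id t treatment choice outcome
# 1:  1 1         1      2       1
# 2:  1 2         1      1      30
# 3:  1 3         1      2       2
# 4:  1 4         1      1      30
# 5:  1 5         1      2       3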
You did not provide any sample data, so I first create some fake data.
Data
set.seed(1)
treat_lkp <- list(trt1 = 1:5, trt2 = 6:10, trt3 = 11:15, trt4 = 16:20)
d_in <- expand.grid(task = 1:5, id = 1:5)
d_in$treatment <- paste0("trt", d_in$id %% 4 + 1)
d_in$choice <- sample(2, NROW(d_in), TRUE)
tidyverse solution
I use a simple tidyverse solution.
library(purrr)
library(dplyr)
d_out <- d_in %>%
  group_by(id) %>%
  mutate(task_new = cumsum(choice == 2)) %>%
  ungroup() %>%
  mutate(outcome = {
    l <- treat_lkp[as.character(d_in$treatment)]
    pmap_dbl(list(task = task_new, choice = choice, set = l),
             function(task, choice, set)
               ifelse(choice == 1, 30, set[task]))
  })
head(d_out)
# # A tibble: 6 x 6
# task id treatment choice task_new outcome
# <int> <int> <chr> <int> <int> <dbl>
# 1 1 1 trt2 1 0 30
# 2 2 1 trt2 1 0 30
# 3 3 1 trt2 2 1 6
# 4 4 1 trt2 2 2 7
# 5 5 1 trt2 1 2 30
# 6 1 2 trt3 2 1 11
Explanation
First you create a list l with the relevant lookup values for your outcome (which depend on the treatment). Then you loop over task_new, choice, and the lookup set to return either 30 (for choice == 1) or the right lookup value from l.
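As a tiny illustration of the lookup step (using the treat_lkp list defined in the fake data above):
# the outcome for the 3rd time choice 2 is picked under treatment "trt2"
treat_lkp[["trt2"]][3]
# [1] 8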
Update
Taking the comment into account, we first need to create a task_new variable which holds the correct position: the first choice == 2 within an id should yield 1, the second 2, and so on. So we group_by id and add the counter via cumsum, and then use task_new in the mutate call after ungrouping the data.
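A one-line illustration of that counter, for a single hypothetical respondent:
choice <- c(2, 1, 2, 2, 1)   # one respondent's choices across the five tasks
cumsum(choice == 2)          # position within the treatment sequence
# [1] 1 1 2 3 3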
I have a table like this:
COL1 | COL2
------------
1 | NULL
2 | NULL
3 | NULL
4 | NULL
How can I use SQL to update COL2 so that it holds the accumulated total of all previous rows, including the current one? Like this:
COL1 | COL2
------------
1 | 1
2 | 3
3 | 6
4 | 10
Thanks.
Got the answer from my colleague (assume the table name is abc):
UPDATE abc
SET col2 = (
  SELECT temp.t
  FROM (
    SELECT abc.id, SUM(def.col1) AS t
    FROM abc
    JOIN abc AS def ON def.id <= abc.id
    GROUP BY abc.id
  ) AS temp
  WHERE abc.id = temp.id
);
Or we can use this:
REPLACE INTO abc
SELECT abc.id, r2.col1, SUM(r2.col1) AS col2
FROM abc
JOIN abc AS r2 ON r2.id <= abc.id
GROUP BY abc.id;
My data looks like this:
Item | Category
A | 1
A |
A |
A | 1
A |
A |
A | 1
B | 2
B |
B |
B | 2
B |
B |
B | 2
B |
B |
I want to impute the blank values in the "Category" column, filling each one with the value that corresponds to its "Item". The end result should look like this:
Item | Category
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
How can I do this in R?
We can use fill from tidyr (part of the tidyverse):
library(tidyverse)
df1 %>%
fill(Category)
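One caveat (my addition, not part of the original answer): fill() only fills NA values. If the blank cells in Category were read in as empty strings rather than NA, convert them first; a minimal sketch, assuming the data frame is named df1 and Category is a character column:
library(dplyr)
library(tidyr)

df1 %>%
  mutate(Category = na_if(Category, "")) %>%   # turn empty strings into NA
  fill(Category)                               # carry the last non-NA value downward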