Assigning test / control group vector using split-apply-combine strategy [duplicate] - r

This should be simple, but it's got me pulling my hair out!
Here is some data:
Clicks <- c(1,2,3,4,5,6,5,4,3,2)
Cost <- c(10,11,12,13,14,15,14,13,12,11)
Cluster <- c(1,1,1,2,2,1,1,1,1,1)
df <- data.frame(Clicks,Cost,Cluster)
I want to filter my df by cluster, add a new column that assigns each row to a "test" or "control" group at random, and then recombine the result with the original data frame.
Step 1: Filter (by cluster 1)
Clicks Cost Cluster
1 1 10 1
2 2 11 1
3 3 12 1
4 6 15 1
5 5 14 1
6 4 13 1
7 3 12 1
8 2 11 1
Step 2: Assign test and control group at random
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 6 15 1 Test
5 5 14 1 Control
6 4 13 1 Control
7 3 12 1 Test
8 2 11 1 Control
Step 3: Get back to the original data frame
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Test
7 5 14 1 Control
8 4 13 1 Control
9 3 12 1 Test
10 2 11 1 Control
Step 4: do the same for cluster 2
Thanks :)

How about:
df$Group <- 'NULL'   # placeholder for rows that have not been assigned a group yet
df1 <- df
df1[df1$Cluster == 1, ]$Group <- ifelse(runif(sum(df1$Cluster == 1)) > 0.5, 'Control', 'Test')
df1
Clicks Cost Cluster Group
1 1 10 1 Test
2 2 11 1 Test
3 3 12 1 Test
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Control
7 5 14 1 Test
8 4 13 1 Test
9 3 12 1 Control
10 2 11 1 Control
df2 <- df
df2[df2$Cluster == 2, ]$Group <- ifelse(runif(sum(df2$Cluster == 2)) > 0.5, 'Control', 'Test')
df2
Clicks Cost Cluster Group
1 1 10 1 NULL
2 2 11 1 NULL
3 3 12 1 NULL
4 4 13 2 Test
5 5 14 2 Control
6 6 15 1 NULL
7 5 14 1 NULL
8 4 13 1 NULL
9 3 12 1 NULL
10 2 11 1 NULL
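If you want to handle every cluster in one pass instead of keeping one copy of the data frame per cluster, here is a minimal sketch (assuming dplyr is available; not part of the original answer):
library(dplyr)
set.seed(42)   # only so the random assignment is reproducible
df %>%
  group_by(Cluster) %>%
  mutate(Group = sample(c('Test', 'Control'), n(), replace = TRUE)) %>%
  ungroup()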

Related

anti-join not working - giving 0 rows, why?

I am trying to use anti_join() exactly as I have done many times before, to establish which rows across two datasets do not have matches on two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy data frames containing the two columns I am trying to compare. You will see one is missing an entry (df1, SITE no. 2, PLOT no. 8), so when I use anti_join() to compare the two data frames, this entry should be returned, but I just get a result of 0 rows.
a<- seq(1:3)
SITE <- rep(a, times = c(16,15,1))
PLOT <- c(1:16,1:7,9:16,1)
df1 <- data.frame(SITE,PLOT)
SITE <- rep(a, times = c(16,16,1))
PLOT <- c(rep(1:16,2),1)
df2 <- data.frame(SITE,PLOT)
df1            df2
SITE PLOT      SITE PLOT
   1    1         1    1
   1    2         1    2
   1    3         1    3
   1    4         1    4
   1    5         1    5
   1    6         1    6
   1    7         1    7
   1    8         1    8
   1    9         1    9
   1   10         1   10
   1   11         1   11
   1   12         1   12
   1   13         1   13
   1   14         1   14
   1   15         1   15
   1   16         1   16
   2    1         2    1
   2    2         2    2
   2    3         2    3
   2    4         2    4
   2    5         2    5
   2    6         2    6
   2    7         2    7
   2    9         2    8
   2   10         2    9
   2   11         2   10
   2   12         2   11
   2   13         2   12
   2   14         2   13
   2   15         2   14
   2   16         2   15
   3    1         2   16
                  3    1
a <- anti_join(df1, df2, by=c('SITE', 'PLOT'))
a
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.
The answer can be found in the help file.
anti_join() returns all rows from x without a match in y.
So reversing the inputs, with df2 as x and df1 as y, will give you what you expect.
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
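As a quick check (a sketch using the data frames defined above, with dplyr loaded), running the join in both directions makes the asymmetry obvious:
anti_join(df1, df2, by = c('SITE', 'PLOT'))   # 0 rows: every row of df1 also appears in df2
anti_join(df2, df1, by = c('SITE', 'PLOT'))   # 1 row: SITE 2, PLOT 8, the entry missing from df1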

r recode by a splitting rule

I have a student dataset that includes student information, a question ID (5 questions), and the sequence of visits to each question. I would like to create a variable that marks exactly where a student starts reviewing questions after having gone through all of the questions.
Here is a sample dataset:
data <- data.frame(
  person   = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
  sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))
data
person question sequence
1 1 1 1
2 1 2 1
3 1 2 2
4 1 3 1
5 1 3 2
6 1 3 3
7 1 4 1
8 1 3 4
9 1 5 1
10 1 1 2
11 1 2 3
12 2 1 1
13 2 1 2
14 2 1 3
15 2 2 1
16 2 3 1
17 2 4 1
18 2 4 2
19 2 4 3
20 2 5 1
21 2 5 2
22 2 4 4
23 2 3 2
24 2 4 5
25 2 4 6
26 2 5 3
27 2 4 7
28 2 5 4
The sequence variable numbers each visit to a question. Revisits can also happen before a student has seen all of the questions; however, the attempt variable should only flag visits as reviews once the student has seen all 5 questions. With the new variable, this is the dataset I am targeting:
> data
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Any ideas?
Thanks!
What a challenging question. It took me almost two hours to find a solution.
Try this
library(dplyr)

# Rolling count of distinct values seen so far
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

data %>%
  mutate(var0 = n_distinct(question)) %>%                 # total number of distinct questions
  group_by(person) %>%
  mutate(var1 = dist_cum(question),                       # distinct questions seen so far
         var2 = cumsum(c(1, diff(question) != 0))) %>%    # run id of consecutive identical questions
  ungroup() %>%
  mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%  # 0 = first visit, or not all questions seen yet
  group_by(person, var2) %>%
  mutate(var4 = min(var3)) %>%                            # a whole run counts as initial if any row in it does
  ungroup() %>%
  mutate(attempt = if_else(var4 == 0, "initial", "review")) %>%
  select(-starts_with("var")) %>%
  as.data.frame
Result
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
dist_cum is a function that computes a rolling count of distinct values; var0 to var4 are helper columns that are dropped at the end.
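As a small illustration (a sketch, not part of the original answer), applying dist_cum to person 1's question column shows the rolling count of distinct questions reaching 5 at the ninth visit, which is exactly where the review phase can begin:
dist_cum(c(1, 2, 2, 3, 3, 3, 4, 3, 5, 1, 2))
# [1] 1 2 2 3 3 3 4 4 5 5 5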
One way to do it is to find where the reviewing starts, i.e. the entry right after the fifth question has been seen (v1) combined with the sequence being equal to 2 (v2). Then, by subsetting the data for each individual person and looping over the subsets, you can fill in the remaining entries of the attempt variable, since it is now known where the reviewing starts for each person.
v1 <- c(FALSE, (data$question == 5)[-(nrow(data))])
v2 <- data$sequence == 2
data$attempt <- ifelse(v1 * v2 == 1, "review", NA)

persons <- unique(data$person)
persons.list <- vector(mode = "list", length = length(persons))
for (i in 1:length(persons)) {
  person.i <- subset(data, person == persons[i])
  n <- which(person.i$attempt == "review")
  m <- nrow(person.i)
  person.i$attempt[(n + 1):m] <- "review"
  person.i$attempt[which(is.na(person.i$attempt))] <- "initial"
  persons.list[[i]] <- person.i
}
do.call(rbind, persons.list)
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 review
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Alternatively, you can also use lapply:
do.call(rbind,
        lapply(persons, function(x) {
          person.x <- subset(data, person == x)
          n <- which(person.x$attempt == "review")
          m <- nrow(person.x)
          person.x$attempt[(n + 1):m] <- "review"
          person.x$attempt[which(is.na(person.x$attempt))] <- "initial"
          person.x
        }))
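The same idea can also be written without an explicit loop. A sketch (reusing v1 and v2 from above, with base R's ave): cumulatively sum the review-start marker within each person, so every row from the marker onward is a review:
data$attempt <- ifelse(ave(v1 * v2, data$person, FUN = cumsum) > 0, "review", "initial")
This reproduces the output shown for this answer.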

Conditional difference between two data.frame columns

I have a tidy data.frame of experimental data with subject IDs that were measured three times (Trial) at a varying(!) number of time points (Session) in two different conditions (Direction) on a dependent continuous variable, say LC:
set.seed(5)
nSubjects <- 4
nDirections <- 2
nTrials <- 3
# Between 1 and 3 sessions per subject:
nSessions <- round(runif(nSubjects,
                         min = 1, max = 3))
mydat <- data.frame(ID = do.call(rep, args = list(1:nSubjects,
                                                  times = nSessions * nDirections * nTrials)),
                    Session = rep(sequence(nSessions),
                                  each = nDirections * nTrials),
                    Trial = rep(rep(1:nTrials,
                                    each = nDirections),
                                times = sum(nSessions)),
                    Direction = rep(c("up", "down"),
                                    times = nTrials * sum(nSessions)),
                    LC = 1:(nDirections * nTrials * sum(nSessions)))
What I would like to calculate is a vector of length nrow(mydat) that contains, for each row, the difference in LC between a given subject's, trial's and direction's current session and its first session. In other words, from each (absolute) LC score of any ID, session, trial and direction, the (absolute) LC from session == 1 of the same ID, trial and direction gets subtracted, like this (for the sake of simplicity I chose LC to be monotonically increasing):
# ID Session Trial Direction LC LC_diff
# 7 2 1 1 up 7 0
# 8 2 1 2 down 8 0
# 9 2 1 3 up 9 0
# 10 2 1 1 down 10 0
# 11 2 1 2 up 11 0
# 12 2 1 3 down 12 0
# 13 2 2 1 up 13 6
# 14 2 2 2 down 14 6
# 15 2 2 3 up 15 6
# 16 2 2 1 down 16 6
# 17 2 2 2 up 17 6
# 18 2 2 3 down 18 6
I thought the following code would yield the desired result:
library(dplyr)
ordered <- group_by(mydat, ID, Session, Trial, Direction)
mydat$LC_diff <- summarise(ordered,
                           Diff = sum(abs(LC[Trial != 1]),
                                      -abs(LC[Trial == 1])))$Diff
But, alas:
mydat[7:18, ]
# ID Session Trial Direction LC LC_diff
# 7 2 1 1 up 7 -8
# 8 2 1 2 down 8 -7
# 9 2 1 3 up 9 10
# 10 2 1 1 down 10 9
# 11 2 1 2 up 11 12
# 12 2 1 3 down 12 11
# 13 2 2 1 up 13 -14
# 14 2 2 2 down 14 -13
# 15 2 2 3 up 15 16
# 16 2 2 1 down 16 15
# 17 2 2 2 up 17 18
# 18 2 2 3 down 18 17
I am at a complete loss here and would appreciate any pointers to where my code is wrong.
I'm not sure this is what you meant, but with data.table it would be like this:
library(data.table)
setDT(mydat)[, new := abs(LC) - abs(LC[1]), by = .(ID, Trial, Direction)]
mydat[ID==2,]
ID Session Trial Direction LC new
1: 2 1 1 up 7 0
2: 2 1 1 down 8 0
3: 2 1 2 up 9 0
4: 2 1 2 down 10 0
5: 2 1 3 up 11 0
6: 2 1 3 down 12 0
7: 2 2 1 up 13 6
8: 2 2 1 down 14 6
9: 2 2 2 up 15 6
10: 2 2 2 down 16 6
11: 2 2 3 up 17 6
12: 2 2 3 down 18 6
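For comparison, here is a rough dplyr equivalent of the same grouping logic (a sketch, not part of the original answer); it picks the Session == 1 value explicitly instead of relying on the rows being ordered by session:
library(dplyr)
mydat %>%
  group_by(ID, Trial, Direction) %>%
  mutate(LC_diff = abs(LC) - abs(LC[Session == 1])) %>%
  ungroup()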

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Here, ID = 4 ends up in chunks 1 and 2. I am wondering how to code this so that all rows with ID = 4 end up in the same chunk (it doesn't matter which one). I looked at the split() function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
One option is to create the data frame and then use mutate with ntile to add the chunk assignment as a new column:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or, following Frank's comment (note that in the output above ID 4 still ends up in chunks 1 and 2), you can run ntile() on the distinct/unique values of id and then join the original table back on id, so that every row of a given id lands in the same chunk:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
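To then work with one chunk at a time, as asked, you can filter on the new column or split the joined table (a small sketch building on out from above, assuming dplyr is loaded):
chunk2 <- out %>% filter(new_column == 2)    # just the second chunk
chunks <- split(out, out$new_column)         # or a list with one data frame per chunk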

Vectorized Conditional Random Matching

I want to create conditional random pairs without using for-loops so that I can use the code with large datasets. First, I create rows with unique IDs and randomly assign one of two "types" to each row:
df <- data.frame(id = 1:10, type = NA, partner = NA)
df[sample(df$id, nrow(df)/2), "type"] <- 1        ## random 50% type 1
df[which(is.na(df$type) == TRUE), "type"] <- 2    ## other 50% type 2
df
id type partner
1 1 2 NA
2 2 1 NA
3 3 1 NA
4 4 1 NA
5 5 2 NA
6 6 1 NA
7 7 1 NA
8 8 2 NA
9 9 2 NA
10 10 2 NA
Now I want them to receive a random partner of the opposite type. So I randomize my type 1 IDs and match them to some type 2 IDs like so:
df$partner[which(df$type==2)]<-sample(df$id[which(df$type==1)],
nrow(df)/2)
df
id type partner
1 1 2 4
2 2 1 NA
3 3 1 NA
4 4 1 NA
5 5 2 2
6 6 1 NA
7 7 1 NA
8 8 2 6
9 9 2 3
10 10 2 7
And that's where I'm stuck. I can't think of a vectorized way to tell R: "take the IDs of type 1, look up where these IDs appear in df$partner, and write the id of that row into the type 1 row's partner entry instead of NA".
An example of conditional random pairing with a for-loop can be found elsewhere, but I want to avoid loops here. I'm pretty sure this is very basic and doable; any help is appreciated!
Presumably, you want the type 1 and type 2 rows that are matched together to have each other's id in their respective partner entries. Here is a fully vectorized solution.
# Define number of ids (the output shown below uses n = 30)
n <- 30
# Generate starting data frame
df <- data.frame(id = 1:n, type = NA, partner = NA)
# Generate the type column
df$type[(a <- sample(df$id, n/2))] <- 1
df$type[(b <- setdiff(1:n, a))] <- 2
# Select a random partner id from the other type for every type 1 row
df$partner[a] <- sample(df$id[b])
# Fill in the type 2 rows by looking up which type 1 row picked them
df$partner[b] <- df$id[match(df$id[b], df$partner)]
Output:
id type partner
1 2 11
2 1 13
3 2 19
4 2 10
5 1 17
6 2 28
7 2 27
8 2 21
9 1 22
10 1 4
11 1 1
12 2 20
13 2 2
14 2 25
15 2 24
16 2 30
17 2 5
18 2 29
19 1 3
20 1 12
21 1 8
22 2 9
23 2 26
24 1 15
25 1 14
26 1 23
27 1 7
28 1 6
29 1 18
30 1 16
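A quick sanity check (a sketch, not part of the original answer; it relies on id being identical to the row number, as it is here):
all(df$partner[df$partner] == df$id)    # each pair points back at itself
all(df$type[df$partner] != df$type)     # partners always have opposite types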
