I asked something very similar before, but I have a better understanding of my problem now. I will try my best to ask it as clearly as I can.
I have a sample dataset that looks like this:
id <- c(1,1,1, 2,2,2, 3,3, 4,4, 5,5,5,5, 6,6,6, 7, 8,8, 9,9, 10,10)
item.id <- c(1,1,2, 1,1,1 ,1,1, 1,2, 1,2,2,2, 1,1,1, 1, 1,2, 1,1, 1,1)
sequence <- c(1,2,1, 1,2,3, 1,2, 1,1, 1,1,2,3, 1,2,3, 1, 1,1, 1,2, 1,2)
score <- c(0,0,0, 0,0,1, 2,0, 1,1, 1,0,1,1, 0,0,0, 1, 0,2, 1,2, 2,1)
data <- data.frame("id"=id, "item.id"=item.id, "sequence"=sequence, "score"=score)
> data
id item.id sequence score
1 1 1 1 0
2 1 1 2 0
3 1 2 1 0
4 2 1 1 0
5 2 1 2 0
6 2 1 3 1
7 3 1 1 2
8 3 1 2 0
9 4 1 1 1
10 4 2 1 1
11 5 1 1 1
12 5 2 1 0
13 5 2 2 1
14 5 2 3 1
15 6 1 1 0
16 6 1 2 0
17 6 1 3 0
18 7 1 1 1
19 8 1 1 0
20 8 2 1 2
21 9 1 1 1
22 9 1 2 2
23 10 1 1 2
24 10 1 2 1
id represents each student, item.id represents the questions students take, sequence is the attempt number for each item.id, and score is the score for each attempt, taking 0, 1, or 2. Students can change their answers.
For each item.id within each id, I create a variable (status) by looking at the last two sequences (changes). Here are the recoding rules for status:
1-If there is only one attempt for each question:
a) assign "BTW" (Blank to Wrong) if the item score is 0.
b) assign "BTW" (Blank to Right) if the item score is 1.
2-If there are multiple attempts for each question:
a) assign "BTW" (Blank to Wrong) if the first item attempt score is 0.
b) assign "BTW" (Blank to Right) if the first item attempt score is 1.
c) assign "WW" for those who changed from wrong to wrong (0 to 0),
d) assign "WR" for those who changed to increasing score (0 to 1, or 1 to 2),
e) assign "RW" for those who changed to decreasing score (2 to 1, 2 to 0, or 1 to 0 ), and
f) assign "RR" for those who changed from right to right (1 to 1, 2 to 2).
A score change from 0 to 1, 0 to 2, or 1 to 2 is considered a correct (right) change, while a score change from 1 to 0, 2 to 0, or 2 to 1 is considered an incorrect (wrong) change.
If there is only one attempt for an item.id, as in id=7, then the status should be "BTR"; if the score had been 0, it should be "BTW". The logic is supposed to be: if the score increases, it should be WR; if it decreases, it should be RW. For example, changes should be coded:
a) from 1 to 2 as WR (instead, they were coded as RR),
b) from 2 to 1 as RW (instead, they were coded as WW).
I used the code below. Things did not work out for some cases; for example, for id=1 the status should be {BTW, WW}.
library(dplyr)
data %>%
  group_by(id, item.id) %>%
  mutate(diff = c(0, diff(score)),
         status = case_when(
           n() == 1 & score == 0 ~ "BTW",
           n() == 1 & score == 1 ~ "BTR",
           diff == 0 & score == 0 ~ "WW",
           diff == 0 & score > 0 ~ "RR",
           diff > 0 ~ "WR",
           diff < 0 ~ "RW",
           TRUE ~ "oops"))
> data
id item.id sequence score diff status
1 1 1 1 0 0 WW
2 1 1 2 0 0 WW
3 1 2 1 0 0 BTW
4 2 1 1 0 0 WW
5 2 1 2 0 0 WW
6 2 1 3 1 1 WR
7 3 1 1 2 0 RR
8 3 1 2 0 -2 RW
9 4 1 1 1 0 BTR
10 4 2 1 1 0 BTR
11 5 1 1 1 0 BTR
12 5 2 1 0 0 WW
13 5 2 2 1 1 WR
14 5 2 3 1 0 RR
15 6 1 1 0 0 WW
16 6 1 2 0 0 WW
17 6 1 3 0 0 WW
18 7 1 1 1 0 BTR
19 8 1 1 0 0 BTW
20 8 2 1 2 0 RR
21 9 1 1 1 0 RR
22 9 1 2 2 1 WR
23 10 1 1 2 0 RR
24 10 1 2 1 -1 RW
The desired output would be:
> desired
id item.id sequence score status
1 1 1 1 0 BTW
2 1 1 2 0 WW
3 1 2 1 0 BTW
4 2 1 1 0 BTW
5 2 1 2 0 WW
6 2 1 3 1 WR
7 3 1 1 2 BTR
8 3 1 2 0 RW
9 4 1 1 1 BTR
10 4 2 1 1 BTR
11 5 1 1 1 BTR
12 5 2 1 0 BTW
13 5 2 2 1 WR
14 5 2 3 1 RR
15 6 1 1 0 BTW
16 6 1 2 0 WW
17 6 1 3 0 WW
18 7 1 1 1 BTR
19 8 1 1 0 BTW
20 8 2 1 2 BTR
21 9 1 1 1 BTR
22 9 1 2 2 RR
23 10 1 1 2 BTR
24 10 1 2 1 RW
Any opinions?
Thanks!
In order to solve this, I broke the problem down into two steps: first identify the blank-to-answer (first attempt) rows, then, once the first tries are identified, assign the change-of-answer status to the remaining rows.
# rows that are not the first answer are assigned "NA"
test <- data %>%
  group_by(id, item.id) %>%
  mutate(status = case_when(
    sequence == 1 & score == 0 ~ "BTW",
    sequence == 1 & score > 0 ~ "BTR",
    TRUE ~ "NA"))

answer <- test %>%
  ungroup() %>%
  group_by(id, item.id) %>%
  transmute(sequence, score,
            status = case_when(
              score == 0 & score == lag(score) & status == "NA" ~ "WW",
              score >= 1 & score == lag(score) & status == "NA" ~ "RR",
              score > 0 & score > lag(score) & status == "NA" ~ "WR",
              score < lag(score) & status == "NA" ~ "RW",
              TRUE ~ status))
head(answer, 20)
tail(answer, 4)
The status column matches your sample data for all rows except row 20; please double-check the calculation there.
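The root cause of the misclassification in the original attempt is that c(0, diff(score)) gives the first attempt of every group a diff of 0, so those rows fall into the WW/RR branches instead of BTW/BTR. Below is a minimal single-pass sketch of the same idea, keyed on sequence == 1 and following the increase/decrease rules as stated in the question (this is just one reading of those rules):

library(dplyr)

data %>%
  group_by(id, item.id) %>%
  mutate(status = case_when(
    sequence == 1 & score == 0 ~ "BTW",        # first attempt, wrong
    sequence == 1 & score > 0  ~ "BTR",        # first attempt, right
    score == lag(score) & score == 0 ~ "WW",   # repeated wrong answer
    score == lag(score)              ~ "RR",   # repeated right answer
    score > lag(score)               ~ "WR",   # score increased
    TRUE                             ~ "RW")) %>%  # score decreased
  ungroup()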
I have a data recoding puzzle. Here is what my sample data looks like:
df <- data.frame(
id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 -1 6 0
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 -1 4 0
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
There are three ids in the dataset, and the rows are ordered by a position variable. For each id, in the first row where scores becomes -1, the score needs to be set to 0 and the cat variable needs to be set to 1. For example, for id=1 that is the row at position 6: in that row, the score should be 0 and the cat variable should be 1. For those ids that do not have scores = -1, I keep them as they are.
The desired output should look like below:
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 0 6 1
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 0 4 1
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
Any recommendations??
Thanks
This may be what you are after:
df %>%
  group_by(id) %>%
  mutate(i = which(scores == -1)[1]) %>%                  # find the first row == -1
  mutate(scores = case_when(position == i & scores != 0 ~ 0,
                            T ~ scores),                  # update the score using position & i
         cat = ifelse(scores == -1, 0, 1)) %>%            # then update cat
  select(-i)                                              # remove i
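For reference, the which(scores == -1)[1] step simply returns the position of the first -1 within each id; for the id = 1 values:

which(c(0, 1, 1, 0, 0, -1, -1) == -1)[1]
# [1] 6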
After trying a few things and getting ideas from @Ricky and @e.matt, I came up with a solution.
df %>%
  filter(scores == -1) %>%                       # keep cases where scores == -1
  distinct(id, .keep_all = T) %>%                # keep distinct cases based on id
  mutate(first = 1) %>%                          # create first column
  right_join(df, by = c("id", "scores", "position", "cat")) %>%  # join back the original dataset
  mutate(first = coalesce(first, 0)) %>%         # replace NAs with 0
  mutate(scores = case_when(
    first == 1 ~ 0,
    TRUE ~ scores)) %>%
  mutate(cat = case_when(
    first == 1 ~ 1,
    TRUE ~ cat))
This provides my desired output.
id scores position cat first
1 1 0 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 1 0 4 1 0
5 1 0 5 1 0
6 1 0 6 1 1
7 1 -1 7 0 0
8 2 0 1 1 0
9 2 0 2 1 0
10 2 1 3 1 0
11 2 0 4 1 1
12 2 -1 5 0 0
13 2 -1 6 0 0
14 3 0 1 1 0
15 3 1 2 1 0
16 3 0 3 1 0
17 3 1 4 1 0
18 3 1 5 1 0
19 3 0 6 1 0
20 3 1 7 1 0
Here is a data.table one-liner:
library( data.table )
setDT(df)
df[ df[, .(cumsum( scores == -1 ) == 1), by = .(id)]$V1, `:=`( scores = 0, cat = 1) ]
# id scores position cat
# 1: 1 0 1 1
# 2: 1 1 2 1
# 3: 1 1 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# 6: 1 0 6 1
# 7: 1 -1 7 0
# 8: 2 0 1 1
# 9: 2 0 2 1
# 10: 2 1 3 1
# 11: 2 0 4 1
# 12: 2 -1 5 0
# 13: 2 -1 6 0
# 14: 3 0 1 1
# 15: 3 1 2 1
# 16: 3 0 3 1
# 17: 3 1 4 1
# 18: 3 1 5 1
# 19: 3 0 6 1
# 20: 3 1 7 1
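The reason the logical index works is that cumsum(scores == -1) == 1 is TRUE only at the first -1 within each id (in this data every later row of such an id is also -1, so the cumulative sum immediately moves past 1). A quick look at the id = 1 values:

scores_id1 <- c(0, 1, 1, 0, 0, -1, -1)   # scores for id = 1
cumsum(scores_id1 == -1)                 # 0 0 0 0 0 1 2
cumsum(scores_id1 == -1) == 1            # TRUE only at position 6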
You could do something along these lines using the dplyr package:
library(dplyr)
df = mutate(df, cat = ifelse(scores == -1, 1, cat),
scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I am re-assigning the values for the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1, the value is replaced by 0, otherwise it keeps the score as is. For cat, it also checks if scores is equal to -1, but would assign a value of 1 when the condition is met, or the already existing value of cat when the condition is not met.
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
for(i in 1:nrow(df)){
  # Check if score is -1 (and that there is a next row to update)
  if(df[i, 'scores'] == -1 && i < nrow(df)){
    # Update values for the next row
    df[i+1, 'scores'] <- 0
    df[i+1, 'cat'] <- 1
  }
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!
I have a question regarding data preparation. I have the following data set (in long format; one row per measurement point, therefore several rows per person):
dd <- read.table(text=
"ID time
1 -4
1 -3
1 -2
1 -1
1 0
1 1
2 -3
2 -1
2 2
2 3
2 4
3 -3
3 -2
3 -1
4 -1
4 1
4 2
4 3
5 0
5 1
5 2
5 3
5 4", header=TRUE)
Now I would like to create a new variable that has a 1 in the row in which a sign change on the time variable happens for the first time for that person, and a 0 in all other rows. If a person has only negative values on time, there should not be any 1 on the new variable. For a person that has only non-negative values on time, the first row should have a 1 on the new variable and all other rows should be coded with 0. For my example above, the new data frame should look like this:
dd <- read.table(text=
"ID time new.var
1 -4 0
1 -3 0
1 -2 0
1 -1 0
1 0 1
1 1 0
2 -3 0
2 -1 0
2 2 1
2 3 0
2 4 0
3 -3 0
3 -2 0
3 -1 0
4 -1 0
4 1 1
4 2 0
4 3 0
5 0 1
5 1 0
5 2 0
5 3 0
5 4 0", header=TRUE)
Does anyone know how to do this? I thought about using dplyr and group_by, however I am pretty new to R and did not make it. Any help is much appreciated!
There are 2 different operations you want done to create new.var, so you need to do them in 2 steps. I'll break this into 2 separate mutate calls for simplicity, but you can put both of them into the same mutate.
First, we group by ID and then find the rows where the sign changes. We need to use time >= 0 instead of sign(), as recommended in this answer: R identifying a row prior to a change in sign, because you want a sign change to be counted only when going from -1 to 0, not from 0 to 1:
library(tidyverse)
dd2 <- dd %>%
  group_by(ID) %>%
  mutate(new.var = as.numeric((time >= 0) != (lag(time) >= 0)))
dd2
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 NA
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 NA
8 2 -1 0
9 2 2 1
10 2 3 0
# … with 13 more rows
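For a standalone look at why time >= 0 is preferred to sign(time) here: with sign(), going from -1 to 0 and then from 0 to 1 would register as two changes, while >= 0 registers a single change at the 0 (hypothetical values below):

sign(c(-1, 0, 1))    # -1  0  1        -> two consecutive "changes"
c(-1, 0, 1) >= 0     # FALSE TRUE TRUE -> one change, at the 0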
Then we use case_when to modify the first row based on your desired rules. Due to the way lag works, the first row will always have NA (since there is no previous row to look at), which makes it a good way to pick out that first row and change it based on the time values in that group:
dd3 <- dd2 %>%
  mutate(new.var = case_when(
    !is.na(new.var) ~ new.var,
    all(time >= 0) ~ 1,
    TRUE ~ 0))
print(dd3, n = 100) #n=100 because tibbles are truncated to 10 rows by print
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 0
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 0
8 2 -1 0
9 2 2 1
10 2 3 0
11 2 4 0
12 3 -3 0
13 3 -2 0
14 3 -1 0
15 4 -1 0
16 4 1 1
17 4 2 0
18 4 3 0
19 5 0 1
20 5 1 0
21 5 2 0
22 5 3 0
23 5 4 0
You can try this:
library(dplyr)
dd %>%
  left_join(dd %>% group_by(ID) %>% summarise(index = min(which(time >= 0)))) %>%
  group_by(ID) %>%
  mutate(new.var = ifelse(row_number(ID) == index, 1, 0)) %>%
  select(-index) -> DF
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 0
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 0
8 2 -1 0
9 2 2 1
10 2 3 0
The following ave instruction does what the question asks for.
dd$new.var <- with(dd, ave(time, ID, FUN = function(x){
y <- integer(length(x))
if(any(x >= 0)) y[which.max(x[1]*x <= 0)] <- 1L
y
}))
dd
# ID time new.var
#1 1 -4 0
#2 1 -3 0
#3 1 -2 0
#4 1 -1 0
#5 1 0 1
#6 1 1 0
#7 2 -3 0
#8 2 -1 0
#9 2 2 1
#10 2 3 0
#11 2 4 0
#12 3 -3 0
#13 3 -2 0
#14 3 -1 0
#15 4 -1 0
#16 4 1 1
#17 4 2 0
#18 4 3 0
#19 5 0 1
#20 5 1 0
#21 5 2 0
#22 5 3 0
#23 5 4 0
If the expected output is renamed dd2 then
identical(dd, dd2)
#[1] TRUE
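The key step is which.max(x[1] * x <= 0). For an id whose time starts negative, x[1] * x <= 0 is TRUE exactly where time is non-negative, and which.max() returns the first such position; for an id that already starts at 0 (like ID 5) every element is TRUE, so row 1 is flagged, and the any(x >= 0) guard leaves all-negative ids untouched. With the ID = 1 values:

x <- c(-4, -3, -2, -1, 0, 1)   # time values for ID = 1
x[1] * x                       # 16 12  8  4  0 -4
which.max(x[1] * x <= 0)       # 5 -> the row where time first becomes non-negative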
I have some data where one of the variables is a counter with some requirements. What I need to know is how many times that counter reaches 1 for each ID; if there are several 1's in a row, they only count once.
For example, if an ID has the counter values 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, I would say that the id has a frequency of 3.
frec_counter counts the number of non-consecutive times that a 1 appears. If there are consecutive 1's, only the last one is numbered.
My data:
id <- c(10,10,10,10,10,11,11,11,11,11,11,12,12,12,13, 13, 15, 14)
counter <- c(0,0,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,1)
DF <- data.frame(id, counter); DF
Id 10 has the counter values 0, 0, 1, 1, 0: five values, but only one non-consecutive run of 1's, so its frec_counter is 0, 0, 0, 1, 0.
My desirable output:
id <- c(10,10,10,10,10,11,11,11,11,11,11,12,12,12,13, 13, 15, 14)
counter <- c(0,0,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,1)
frec_counter <- c(0,0,0,1,0,1,0,2,0,0,3,0,0,1,0,0,1,1)
max_counter <- c(1,1,1,1,1,3,3,3,3,3,3,1,1,1,0,0,1,1)
DF <- data.frame(id, counter, frec_counter, max_counter); DF
Here is one approach using tidyverse:
library(tidyverse)
DF %>%
  group_by(id) %>%                                            # group by id
  mutate(one = ifelse(counter == lead(counter), 0, counter),  # if the next value is the same, replace the value with 0
         one = ifelse(is.na(one), counter, one),              # to handle last in group where lead results in NA
         frec_counter1 = cumsum(one),                         # get cumulative sum of 1s
         frec_counter1 = ifelse(one == 0, 0, frec_counter1),  # replace the cumsum values with 0 where appropriate
         max_counter1 = max(frec_counter1)) %>%               # get the max frec_counter1 per group
  select(-one)                                                # remove dummy variable
#output
id counter frec_counter max_counter frec_counter1 max_counter1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 0 0 1 0 1
2 10 0 0 1 0 1
3 10 1 0 1 0 1
4 10 1 1 1 1 1
5 10 0 0 1 0 1
6 11 1 1 3 1 3
7 11 0 0 3 0 3
8 11 1 2 3 2 3
9 11 0 0 3 0 3
10 11 1 0 3 0 3
11 11 1 3 3 3 3
12 12 1 0 1 0 1
13 12 1 0 1 0 1
14 12 1 1 1 1 1
15 13 0 0 0 0 0
16 13 0 0 0 0 0
17 15 1 1 1 1 1
18 14 1 1 1 1 1
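The lead() comparison is what keeps only the last 1 of each run; on the id = 11 counter values it behaves like this (the trailing NA is then restored by the is.na(one) line):

library(dplyr)
counter_id11 <- c(1, 0, 1, 0, 1, 1)
ifelse(counter_id11 == lead(counter_id11), 0, counter_id11)
# 1 0 1 0 0 NA  -> a 1 followed by another 1 is zeroed; the final NA becomes 1 again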
Your data:
id <- c(10,10,10,10,10,11,11,11,11,11,11,12,12,12,13, 13, 15, 14)
counter <- c(0,0,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,1)
DF <- data.frame(id, counter)
id counter
1 10 0
2 10 0
3 10 1
4 10 1
5 10 0
6 11 1
7 11 0
8 11 1
9 11 0
10 11 1
11 11 1
12 12 1
13 12 1
14 12 1
15 13 0
16 13 0
17 15 1
18 14 1
If all you wanted was the final counts, we could do that in base R:
counts <- with(DF, split(counter, id))                     # split counter values by id
lengths <- lapply(counts, rle)                             # run-length encode each id's counter
final <- lapply(lengths, function(x) sum(x$values == 1))   # number of runs of 1s per id
final
$`10`
[1] 1
$`11`
[1] 3
$`12`
[1] 1
$`13`
[1] 0
$`14`
[1] 1
$`15`
[1] 1
But since you specifically want a data frame with the intermediary "flags", the tidyverse set of packages works better:
library(tidyverse)
df.new <- DF %>%
group_by(id) %>%
mutate(
frec_counter = counter == 1 & (is.na(lead(counter)) | lead(counter) == 0),
frec_counter = as.numeric(frec_counter),
max_counter = sum(frec_counter)
)
# A tibble: 18 x 4
# Groups: id [6]
id counter frec_counter max_counter
<dbl> <dbl> <dbl> <dbl>
1 10 0 0 1
2 10 0 0 1
3 10 1 0 1
4 10 1 1 1
5 10 0 0 1
6 11 1 1 3
7 11 0 0 3
8 11 1 1 3
9 11 0 0 3
10 11 1 0 3
11 11 1 1 3
12 12 1 0 1
13 12 1 0 1
14 12 1 1 1
15 13 0 0 0
16 13 0 0 0
17 15 1 1 1
18 14 1 1 1
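If you prefer to stay closer to the rle idea used above for the final counts, here is a hedged sketch of how the run ends could give the per-row flag directly (the helper name is illustrative only):

flag_last_ones <- function(x) {
  r <- rle(x)                      # runs of equal values
  ends <- cumsum(r$lengths)        # last index of each run
  out <- integer(length(x))
  out[ends[r$values == 1]] <- 1L   # flag the end of every run of 1s
  out
}
flag_last_ones(c(0, 0, 1, 1, 0))   # 0 0 0 1 0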
I'm trying to calculate a running count (i.e., cumulative sum) that is conditional on other variables and that can reset for particular values on another variable. I'm working in R and would prefer a dplyr-based solution, if possible.
I'd like to create a variable for the running count, cumulative, based on the following algorithm:
Calculate the running count (cumulative) within combinations of id and age
Increment running count (cumulative) by 1 for every subsequent trial where accuracy = 0, block = 2, and condition = 1
Reset running count (cumulative) to 0 for each trial where accuracy = 1, block = 2, and condition = 1, and the next increment resumes at 1 (not the previous number)
For each trial where block != 2, or condition != 1, leave the running count (cumulative) as NA
Here's a minimal working example:
mydata <- data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1),
age = c(1,1,1,1,1,1,1,1,1,1,2),
block = c(1,1,2,2,2,2,2,2,2,2,2),
trial = c(1,2,1,2,3,4,5,6,7,8,1),
condition = c(1,1,1,1,1,2,1,1,1,1,1),
accuracy = c(0,0,0,0,0,0,0,1,0,0,0)
)
id age block trial condition accuracy
1 1 1 1 1 0
1 1 1 2 1 0
1 1 2 1 1 0
1 1 2 2 1 0
1 1 2 3 1 0
1 1 2 4 2 0
1 1 2 5 1 0
1 1 2 6 1 1
1 1 2 7 1 0
1 1 2 8 1 0
1 2 2 1 1 0
The expected output is:
id age block trial condition accuracy cumulative
1 1 1 1 1 0 NA
1 1 1 2 1 0 NA
1 1 2 1 1 0 1
1 1 2 2 1 0 2
1 1 2 3 1 0 3
1 1 2 4 2 0 NA
1 1 2 5 1 0 4
1 1 2 6 1 1 0
1 1 2 7 1 0 1
1 1 2 8 1 0 2
1 2 2 1 1 0 1
Here is an option using data.table. Create a binary column ('ind') by matching the pasted values of 'accuracy', 'block' and 'condition' against the custom values; then, grouped by the run-length id of 'ind' together with 'id' and 'age', take the cumulative sum of 'ind' and assign (:=) it to a new column ('Cumulative').
library(data.table)
setDT(mydata)[, ind := match(do.call(paste0, .SD), c("121", "021")) - 1,
.SDcols = c("accuracy", "block", "condition")
][, Cumulative := cumsum(ind), .(rleid(ind), id, age)
][, ind := NULL][]
# id age block trial condition accuracy Cumulative
# 1: 1 1 1 1 1 0 NA
# 2: 1 1 1 2 1 0 NA
# 3: 1 1 2 1 1 0 1
# 4: 1 1 2 2 1 0 2
# 5: 1 1 2 3 1 0 3
# 6: 1 1 2 4 2 0 NA
# 7: 1 1 2 5 1 1 0
# 8: 1 1 2 6 1 0 1
# 9: 1 1 2 7 1 0 2
#10: 1 2 2 1 1 0 1
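To see what the intermediate ind column looks like, here is the paste0/match step applied to a few illustrative rows; it gives NA outside block 2 / condition 1, 1 for an incorrect trial and 0 for a correct one:

acc  <- c(0, 0, 0, 1)
blk  <- c(1, 2, 2, 2)
cond <- c(1, 1, 2, 1)
match(paste0(acc, blk, cond), c("121", "021")) - 1
# NA  1 NA  0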
We can use case_when to assign the values we need based on our conditions. We then add an additional group_by condition using cumsum to switch groups whenever the temp column is 0. In the final mutate step we temporarily replace the NA values in temp with 0, take the cumsum over it, and then put the NA values back in their place to get the final output.
library(dplyr)
mydata %>%
  group_by(id, age) %>%
  mutate(temp = case_when(accuracy == 0 & block == 2 & condition == 1 ~ 1,
                          accuracy == 1 & block == 2 & condition == 1 ~ 0,
                          TRUE ~ NA_real_)) %>%
  ungroup() %>%
  group_by(id, age, group = cumsum(replace(temp == 0, is.na(temp), 0))) %>%
  mutate(cumulative = replace(cumsum(replace(temp, is.na(temp), 0)),
                              is.na(temp), NA)) %>%
  select(-temp, -group)
# group id age block trial condition accuracy cumulative
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 1 1 1 1 1 0 NA
# 2 0 1 1 1 2 1 0 NA
# 3 0 1 1 2 1 1 0 1
# 4 0 1 1 2 2 1 0 2
# 5 0 1 1 2 3 1 0 3
# 6 0 1 1 2 4 2 0 NA
# 7 0 1 1 2 5 1 0 4
# 8 1 1 1 2 6 1 1 0
# 9 1 1 1 2 7 1 0 1
#10 1 1 1 2 8 1 0 2
#11 1 1 2 2 1 1 0 1
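The extra grouping column can be inspected on its own: for id 1, age 1 the temp column and its cumulative sum look like this, and the counter group switches at the accuracy == 1 trial, starting a fresh count.

temp <- c(NA, NA, 1, 1, 1, NA, 1, 0, 1, 1)   # temp for id 1, age 1
cumsum(replace(temp == 0, is.na(temp), 0))
# 0 0 0 0 0 0 0 1 1 1  -> the group changes at the accuracy == 1 row (position 8)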
I wish to generate all possible combinations of a set of numbers, but with multiple constraints. I have found several similar questions on Stack Overflow, but none that appear to address all of my constraints:
R: sample() command subject to a constraint
R all combinations of 3 vectors with conditions
Generate all combinations given a constraint
R - generate all combinations from 2 vectors given constraints
Below is an example data set. This is a deterministic data set, in my mind anyway.
desired.data <- read.table(text = '
x1 x2 x3 x4
1 1 1 1
1 1 1 2
1 1 1 3
1 1 2 1
1 1 2 2
1 1 2 3
1 1 3 3
1 2 1 1
1 2 1 2
1 2 1 3
1 2 2 1
1 2 2 2
1 2 2 3
1 2 3 3
1 3 3 3
0 1 1 1
0 1 1 2
0 1 1 3
0 1 2 1
0 1 2 2
0 1 2 3
0 1 3 3
0 0 1 1
0 0 1 2
0 0 1 3
0 0 0 1
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
Here are the constraints:
Column 1 can only contain a 0 or 1
The last column can only contain 1, 2 or 3
All other columns can contain 0, 1, 2 or 3
Once a non-0 appears in a row the rest of that row cannot contain another 0
Once a 3 appears in a row the rest of that row must only contain 3's
The first non-0 number in a row must be a 1
The only way I know to generate this type of data set is to use nested for-loops as shown below. I have used this technique for years and finally decided to ask if there might be a better way.
I hope this is not a duplicate and I hope it is not considered too specialized. I create these types of data sets frequently and a simpler solution would be quite helpful.
my.data <- matrix(0, ncol = 4, nrow = 25)
my.data <- as.data.frame(my.data)
j <- 1
for(i1 in 0:1) {
  if(i1 == 0) i2.begin = 0
  if(i1 == 0) i2.end = 1
  if(i1 == 1) i2.begin = 1
  if(i1 == 1) i2.end = 3
  if(i1 == 2) i2.begin = 1
  if(i1 == 2) i2.end = 3
  if(i1 == 3) i2.begin = 3
  if(i1 == 3) i2.end = 3
  for(i2 in i2.begin:i2.end) {
    if(i2 == 0) i3.begin = 0
    if(i2 == 0) i3.end = 1
    if(i2 == 1) i3.begin = 1
    if(i2 == 1) i3.end = 3
    if(i2 == 2) i3.begin = 1
    if(i2 == 2) i3.end = 3
    if(i2 == 3) i3.begin = 3
    if(i2 == 3) i3.end = 3
    for(i3 in i3.begin:i3.end) {
      if(i3 == 0) i4.begin = 1 # 1 not 0 because last column
      if(i3 == 0) i4.end = 1
      if(i3 == 1) i4.begin = 1
      if(i3 == 1) i4.end = 3
      if(i3 == 2) i4.begin = 1
      if(i3 == 2) i4.end = 3
      if(i3 == 3) i4.begin = 3
      if(i3 == 3) i4.end = 3
      for(i4 in i4.begin:i4.end) {
        my.data[j,1] <- i1
        my.data[j,2] <- i2
        my.data[j,3] <- i3
        my.data[j,4] <- i4
        j <- j + 1
      }
    }
  }
}
my.data
dim(my.data)
Here is the output:
V1 V2 V3 V4
1 0 0 0 1
2 0 0 1 1
3 0 0 1 2
4 0 0 1 3
5 0 1 1 1
6 0 1 1 2
7 0 1 1 3
8 0 1 2 1
9 0 1 2 2
10 0 1 2 3
11 0 1 3 3
12 1 1 1 1
13 1 1 1 2
14 1 1 1 3
15 1 1 2 1
16 1 1 2 2
17 1 1 2 3
18 1 1 3 3
19 1 2 1 1
20 1 2 1 2
21 1 2 1 3
22 1 2 2 1
23 1 2 2 2
24 1 2 2 3
25 1 2 3 3
26 1 3 3 3
EDIT
Sorry that I initially forgot to include Constraint #6.
Here is code that creates the desired data set for this specific example. I suspect the code can be generalized. If I succeed in generalizing it I will post the result. Although the code is messy and not intuitive I am convinced there is a basic general pattern.
desired.data <- read.table(text = '
x1 x2 x3 x4
1 1 1 1
1 1 1 2
1 1 1 3
1 1 2 1
1 1 2 2
1 1 2 3
1 1 3 3
1 2 1 1
1 2 1 2
1 2 1 3
1 2 2 1
1 2 2 2
1 2 2 3
1 2 3 3
1 3 3 3
0 1 1 1
0 1 1 2
0 1 1 3
0 1 2 1
0 1 2 2
0 1 2 3
0 1 3 3
0 0 1 1
0 0 1 2
0 0 1 3
0 0 0 1
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
n <- 3 # non-zero numbers
m <- 4-2 # number of middle columns
x1 <- rep(1:0, c(((n*(n-1)) * (n-1) + n), (n*(n-1) + n + (n-1))))
x2 <- rep(c(1:n, 1:0), c(n*m+1, n*m+1, 1, n*m+1, n*1+1))
x3 <- rep(c(rep(1:n, n-1), n, 1:n, 1:0), c(rep(c(n,n,1), n-1), 1, n,n,1, n,1))
x4 <- c(rep(c(rep(1:n, (n-1)), n), (n-1)), n, rep(1:n,(n-1)), n, 1:n, 1)
my.data <- data.frame(x1, x2, x3, x4)
all.equal(desired.data, my.data)
# [1] TRUE
I would use expand.grid to generate all combinations and then subset it, one constraint at a time:
x <- expand.grid(0:1, 0:3, 0:3, 1:3)

## Once a non-0 appears in a row the rest of that row cannot contain another 0
b1 <- apply(x, 1, function(z) min(diff(z != 0)) == 0)
x <- x[b1, ]

## Once a 3 appears in a row the rest of that row must only contain 3's
b1 <- apply(x, 1, function(z) min(diff(z == 3)) == 0)
x <- x[b1, ]

## The first non-0 number in a row must be a 1
b1 <- apply(x, 1, function(z) {
  w <- which(z == 0)
  length(w) == 0 || z[tail(w, 1) + 1] == 1
})
x <- x[b1, ]
And now sort it:
x<-x[order(x[,1],x[,2],x[,3],x[,4]),]
x
Output:
Var1 Var2 Var3 Var4
1 0 0 0 1
9 0 0 1 1
41 0 0 1 2
73 0 0 1 3
11 0 1 1 1
43 0 1 1 2
75 0 1 1 3
19 0 1 2 1
51 0 1 2 2
83 0 1 2 3
91 0 1 3 3
12 1 1 1 1
44 1 1 1 2
76 1 1 1 3
20 1 1 2 1
52 1 1 2 2
84 1 1 2 3
92 1 1 3 3
14 1 2 1 1
46 1 2 1 2
78 1 2 1 3
22 1 2 2 1
54 1 2 2 2
86 1 2 2 3
94 1 2 3 3
96 1 3 3 3
Similar to @mrip, start from expand.grid, which can handle the first 3 constraints since they don't interact with the other columns:
step1 <- expand.grid(0:1, 0:3, 0:3, 1:3)
Next I would filter it. The difference between this approach and mrip's is that my filtering is done in one apply instead of 3, so it should be around 3 times faster to filter.
filtered <- step1[apply(step1, 1, function(x) all(
  if(length(which(x == 0)) > 0) {max(which(x == 0)) == length(which(x == 0))} else {TRUE},
  if(length(which(x == 3)) > 0) {min(which(x == 3)) == length(x) - length(which(x == 3)) + 1} else {TRUE},
  x[!x %in% 0][1] == 1)), ]
That should be it. If you want to inspect each element inside the apply, here it is:
if(length(which(x==0))>0) {max(which(x==0))==length(which(x==0))} else {TRUE}
If there are any zeros, this makes sure that they all come at the start of the row (no non-zero value appears before a zero).
if(length(which(x==3))>0) {min(which(x==3))==length(x)-length(which(x==3))+1} else {TRUE}
If there are any 3s, this makes sure that they all come at the end of the row (nothing appears after the first 3 except more 3s).
x[!x %in% 0][1] == 1: this first filters the zeros out of the row, then takes the first remaining element and only allows it to be a one.
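A tiny check of that last test on two illustrative rows:

x <- c(0, 0, 1, 2); x[!x %in% 0][1] == 1   # TRUE  -> first non-zero is 1
x <- c(0, 2, 2, 3); x[!x %in% 0][1] == 1   # FALSE -> first non-zero is 2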