Randomly Assign Integers in R within groups without replacement

I am running a study with two experiments: experiment_1 and experiment_2. Each experiment has 5 different treatments (i.e. 1, 2, 3, 4, 5). We are trying to randomly assign the treatments within groups.
We would like to do this by iteratively sampling without replacement within each group. We want to do this to ensure that we get as balanced a sample as possible across treatments (e.g. we don't want to end up with 4 subjects in group 1 assigned to treatment 2 and no one assigned to treatment 1). So if a group has 23 subjects, we want to split the respondents into 4 subgroups of 5 and 1 subgroup of 3. We then want to randomly sample without replacement across the first subgroup of 5, so that everyone gets assigned 1 of the treatments, do the same for the second, third and fourth subgroups of 5, and for the final subgroup of 3 randomly sample 3 of the 5 treatments without replacement. This would guarantee that, within this group, every treatment is assigned to at least 4 subjects, and 3 of the treatments are assigned to 5 subjects. We would like to do this for all the groups in the experiment and for both experiments. The resulting output would look something like this...
group experiment_1 experiment_2
[1,] 1 5 3
[2,] 1 3 2
[3,] 1 4 4
[4,] 1 1 5
[5,] 1 2 1
[6,] 1 2 3
[7,] 1 4 1
[8,] 1 3 2
[9,] 2 5 5
[10,] 2 1 4
[11,] 2 3 4
[12,] 2 1 5
[13,] 2 2 1
. . . .
. . . .
. . . .
I know how to use the sample function, but I am unsure how to sample without replacement within each group so that the output corresponds to the procedure described above. Any help would be appreciated.

I think we just need to shuffle sample IDs, see this example:
set.seed(124)
#prepare groups and samples(shuffled)
df <- data.frame(group=sort(rep(1:3,9)),
sampleID=sample(1:27,27))
#treatments repeated nrow of df
df$ex1 <- rep(c(1,2,3,4,5),ceiling(nrow(df)/5))[1:nrow(df)]
df$ex2 <- rep(c(2,3,4,5,1),ceiling(nrow(df)/5))[1:nrow(df)]
df <- df[ order(df$group,df$sampleID),]
#check treatment distribution
with(df,table(group,ex1))
# ex1
# group 1 2 3 4 5
# 1 2 2 2 2 1
# 2 2 2 2 1 2
# 3 2 2 1 2 2
with(df,table(group,ex2))
# ex2
# group 1 2 3 4 5
# 1 1 2 2 2 2
# 2 2 2 2 2 1
# 3 2 2 2 1 2

How about this function:
f <- function(n,m) {sample( c( rep(1:m,n%/%m), sample(1:m,n%%m) ), n )}
"n" is the group size, "m" the number of treatments.
Each treatment must be contained at least "n %/% m" times in the group.
The treatment numbers of the remaining "n %% m" group members are
assigned arbitrarily without repetition.
The vector "c( rep(1:m,n%/%m), sample(1:m,n%%m) )" contains these treatment numbers. Finally the "sample" function
permutes these numbers.
> f(8,5)
[1] 5 3 1 5 4 2 2 1
> f(8,5)
[1] 4 5 3 4 2 2 1 1
> f(8,5)
[1] 4 2 1 5 3 5 2 3
Here is a function that creates a dataframe, using the above function:
Plan <- function( groupSizes, numExp=2, numTreatment=5 )
{
numGroups <- length(groupSizes)
df <- data.frame( group = rep(1:numGroups,groupSizes) )
for ( e in 1:numExp )
{
df <- cbind(df,unlist(lapply(groupSizes,function(n){f(n,numTreatment)})))
colnames(df)[e+1] <- sprintf("Exp_%i", e)
}
return(df)
}
Example:
> P <- Plan(c(8,23,13,19))
> P
group Exp_1 Exp_2
1 1 4 1
2 1 1 4
3 1 2 2
4 1 2 1
5 1 3 5
6 1 5 5
7 1 1 2
8 1 3 3
9 2 5 1
10 2 2 1
11 2 5 2
12 2 1 2
13 2 2 1
14 2 1 4
15 2 3 5
16 2 5 3
17 2 2 4
18 2 5 4
19 2 2 5
20 2 1 1
21 2 4 2
22 2 3 3
23 2 4 3
24 2 2 5
25 2 3 3
26 2 5 2
27 2 1 5
28 2 3 4
29 2 4 4
30 2 4 2
31 2 4 3
32 3 2 5
33 3 5 3
34 3 5 1
35 3 5 1
36 3 2 5
37 3 4 4
38 3 1 4
39 3 3 2
40 3 3 2
41 3 3 3
42 3 1 1
43 3 4 2
44 3 4 4
45 4 5 1
46 4 3 1
47 4 1 2
48 4 1 5
49 4 3 3
50 4 3 1
51 4 4 5
52 4 2 4
53 4 5 3
54 4 2 1
55 4 4 2
56 4 2 5
57 4 4 4
58 4 5 3
59 4 5 4
60 4 1 2
61 4 2 5
62 4 3 2
63 4 4 4
Check the distribution:
> with(P,table(group,Exp_1))
Exp_1
group 1 2 3 4 5
1 2 2 2 1 1
2 4 5 4 5 5
3 2 2 3 3 3
4 3 4 4 4 4
> with(P,table(group,Exp_2))
Exp_2
group 1 2 3 4 5
1 2 2 1 1 2
2 4 5 5 5 4
3 3 3 2 3 2
4 4 4 3 4 4
>

The design of efficient experiments is a science in its own right, and there are a few R packages dealing with this issue:
https://cran.r-project.org/web/views/ExperimentalDesign.html
I am afraid your approach is not optimal regarding the resources, no matter how you create the samples...
However this might help:
n <- 23
group <- sort(rep(1:5, ceiling(n/5)))[1:n]
exp1 <- rep(NA, length(group))
for(i in 1:max(group)) {
exp1[which(group == i)] <- sample(1:5)[1:sum(group == i)]
}
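The loop above covers a single set of 23 subjects. A minimal sketch of how balanced assignment might be applied per group to a data frame with a group column follows; `balanced_assign` and `subjects` are hypothetical names and the group sizes are invented. Every treatment is used `n %/% m` times, the remaining `n %% m` slots get distinct treatments, and the pool is shuffled by indexing with `sample.int(n)` (calling `sample()` directly on a length-one numeric vector `k` would draw from `1:k`).

```r
# Hypothetical helper: balanced random assignment of m treatments to n subjects
balanced_assign <- function(n, m = 5) {
  base  <- rep(seq_len(m), n %/% m)   # each treatment n %/% m times
  extra <- sample.int(m, n %% m)      # distinct treatments for the remainder
  pool  <- c(base, extra)
  pool[sample.int(n)]                 # shuffle by index, safe for n == 1
}

set.seed(42)
# Invented example: three groups of unequal size
subjects <- data.frame(group = rep(1:3, times = c(8, 23, 13)))

# Assign treatments independently for each experiment, within each group
for (col in c("experiment_1", "experiment_2")) {
  subjects[[col]] <- ave(seq_len(nrow(subjects)), subjects$group,
                         FUN = function(idx) balanced_assign(length(idx)))
}

# Within each group (row), treatment counts should differ by at most 1
counts <- table(subjects$group, subjects$experiment_1)
counts
```

Because the assignment is drawn separately for each column, the two experiments are balanced individually but not jointly.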

Not exactly sure if this meets all your constraints, but you could use the randomizr package:
library(randomizr)
experiment_1 <- complete_ra(N = 23, num_arms = 5)
experiment_2 <- block_ra(experiment_1, num_arms = 5)
table(experiment_1)
table(experiment_2)
table(experiment_1, experiment_2)
Produces output like this:
> table(experiment_1)
experiment_1
T1 T2 T3 T4 T5
4 5 5 4 5
> table(experiment_2)
experiment_2
T1 T2 T3 T4 T5
6 3 6 4 4
> table(experiment_1, experiment_2)
experiment_2
experiment_1 T1 T2 T3 T4 T5
T1 2 0 1 1 0
T2 1 1 1 1 1
T3 1 1 1 1 1
T4 1 0 2 0 1
T5 1 1 1 1 1

Related

dplyr: comparing values within a variable dependent on another variable

How can I compare values within a variable dependent on another variable with dplyr?
The df is based on choice data (long format) from a survey. It has one variable that indicates a participant's ID, another that indicates the choice instance, and one that indicates which alternative was chosen.
In my data I have the feeling that a lot of people get bored with the task and therefore stick to one alternative for every instance. I would therefore like to identify people who always selected the same option from a certain instance onwards until the end.
Here is an example df:
set.seed(0)
df <- tibble(
id = rep(1:5,each=12),
inst = rep(1:12,5),
alt = sample(1:3, size =60, replace=T),
)
That looks like the following:
id inst alt
1 1 1 3
2 1 2 1
3 1 3 2
4 1 4 2
5 1 5 3
6 1 6 1
7 1 7 3
8 1 8 3
9 1 9 2
10 1 10 2
11 1 11 1 <-
12 1 12 1 <-
13 2 1 1
14 2 2 3
...
I would like to create two new variables, count and count_alt. The new variable count should indicate how many times in a row the same value appeared in alt at the end of each id's sequence of instances. So for participant 1 (id == 1) the count variable should be 2, since alternative 1 was chosen in the last two instances (11 & 12). count_alt would take the value 1 (always the same as alt at inst == 12).
The new df should look like the following:
id inst alt count count_alt
1 1 1 3 2 1
2 1 2 1 2 1
3 1 3 2 2 1
4 1 4 2 2 1
5 1 5 3 2 1
6 1 6 1 2 1
7 1 7 3 2 1
8 1 8 3 2 1
9 1 9 2 2 1
10 1 10 2 2 1
11 1 11 1 2 1
12 1 12 1 2 1
...
I would prefer to solve this with dplyr and not with a loop, since I want to incorporate it into further data wrangling steps.
See if that solves it:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
count = cumsum(alt != lag(alt, default = first(alt))),  # run id; default has the same type as alt
count = sum(count == max(count)),
count_alt = alt[n()]
)
Output:
id inst alt count count_alt
1 1 1 3 2 1
2 1 2 1 2 1
3 1 3 2 2 1
4 1 4 2 2 1
5 1 5 3 2 1
6 1 6 1 2 1
7 1 7 3 2 1
8 1 8 3 2 1
9 1 9 2 2 1
10 1 10 2 2 1
11 1 11 1 2 1
12 1 12 1 2 1
13 2 1 1 1 2
14 2 2 3 1 2
15 2 3 2 1 2
16 2 4 3 1 2
17 2 5 2 1 2
18 2 6 3 1 2
19 2 7 3 1 2
20 2 8 2 1 2
21 2 9 3 1 2
22 2 10 3 1 2
23 2 11 1 1 2
24 2 12 2 1 2
25 3 1 1 1 3
26 3 2 1 1 3
27 3 3 2 1 3
28 3 4 1 1 3
29 3 5 2 1 3
30 3 6 3 1 3
31 3 7 2 1 3
32 3 8 2 1 3
33 3 9 2 1 3
34 3 10 2 1 3
35 3 11 1 1 3
36 3 12 3 1 3
37 4 1 3 1 1
38 4 2 3 1 1
39 4 3 1 1 1
40 4 4 3 1 1
41 4 5 2 1 1
42 4 6 3 1 1
43 4 7 2 1 1
44 4 8 3 1 1
45 4 9 2 1 1
46 4 10 2 1 1
47 4 11 3 1 1
48 4 12 1 1 1
49 5 1 2 2 2
50 5 2 3 2 2
51 5 3 3 2 2
52 5 4 2 2 2
53 5 5 3 2 2
54 5 6 2 2 2
55 5 7 1 2 2
56 5 8 1 2 2
57 5 9 1 2 2
58 5 10 1 2 2
59 5 11 2 2 2
60 5 12 2 2 2
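Another way to express "length of the final run" is `rle()`. The sketch below is an illustration under assumptions, not part of the answer above: `trailing_run` is a hypothetical helper, the small data frame is invented, and rows are assumed to be sorted by inst within id. It is written in base R with `ave()`; the same helper could presumably be called inside `mutate()` after `group_by(id)`.

```r
# Hypothetical helper: length of the last run of identical values in x
trailing_run <- function(x) {
  runs <- rle(x)
  runs$lengths[length(runs$lengths)]
}

# Invented example data: two participants, five choice instances each
choices <- data.frame(
  id   = rep(1:2, each = 5),
  inst = rep(1:5, times = 2),
  alt  = c(3, 1, 2, 1, 1,   2, 2, 2, 2, 2)
)

# Length of the trailing run, repeated for every row of the participant
choices$count <- ave(choices$alt, choices$id,
                     FUN = function(a) rep(trailing_run(a), length(a)))

# Alternative chosen in the participant's last instance
choices$count_alt <- ave(choices$alt, choices$id,
                         FUN = function(a) rep(a[length(a)], length(a)))
choices
```

Participant 1 ends with two identical choices, participant 2 never changes, so their count values are 2 and 5 respectively.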

Group by each increasing sequence in data frame

If I have a data frame with a column made up of consecutive increasing sequences such as:
x
1
2
3
4
1
2
3
1
2
3
4
5
6
1
2
How do I add a column to group each increasing sequence that results in:
x y
1 1
2 1
3 1
4 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
5 3
6 3
1 4
2 4
I can only think of using a loop which will be slow.
You can use the cumsum function to do it.
> x <- c(1,2,3,4,1,2,3,1,2,4,5,1,2)
> cumsum(x==1)
[1] 1 1 1 1 2 2 2 3 3 3 3 4 4
I would use diff and compute the cumulative sum:
df$y <- c(1, cumsum(diff(df$x) < 0 ) + 1)
> df
x y
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 1 3
9 2 3
10 3 3
11 4 3
12 5 3
13 6 3
14 1 4
15 2 4

Relabel samples in kmean results considering the order of centers

I am using kmeans to cluster my data, and I want to post-process the result.
I want to relabel the samples based on the ordered centres. Consider the following example:
a = c("a","b","c","d","e","F","i","j","k","l","m","n")
b = c(1,2,3,20,21,21,40,41,42,4,23,50)
mydata = data.frame(id=a,amount=b)
result = kmeans(mydata$amount,3,nstart=10)
Here is the result :
result$cluster
2 2 2 3 3 3 1 1 1 2 3 1
result$centers
1 43.25
2 2.50
3 21.25
mydata = data.frame(mydata,label =result$cluster)
mydata
id amount label
1 a 1 2
2 b 2 2
3 c 3 2
4 d 20 3
5 e 21 3
6 F 21 3
7 i 40 1
8 j 41 1
9 k 42 1
10 l 4 2
11 m 23 3
12 n 50 1
What I am looking for is sorting the centres and producing the labels accordingly:
1 2.50
2 21.25
3 43.25
and label the samples going to:
1 1 1 2 2 2 3 3 3 1 2 3
and the result should be :
id amount label
1 a 1 1
2 b 2 1
3 c 3 1
4 d 20 2
5 e 21 2
6 F 21 2
7 i 40 3
8 j 41 3
9 k 42 3
10 l 4 1
11 m 23 2
12 n 50 3
I think it is possible to do this by ordering the centres and, for each sample, taking the index of the centre at minimum distance as its label.
Is there another way for R to do this automatically?
One idea is to create a named vector by matching your centers with the sorted centers. Then match the vector with mydata$label and replace with the names of the vector, i.e.
i1 <- setNames(match(sort(result$centers), result$centers), rownames(result$centers))
as.numeric(names(i1)[match(mydata$label, i1)])
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
You can use a for loop, if you don't mind loops:
cls <- result$cluster
for (i in 1 : length(result$cluster))
result$cluster[cls == order(result$centers)[i]] <- i
result$cluster
#[1] 1 1 1 2 2 2 3 3 3 1 2 3
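For one-dimensional data, both answers can be condensed into a single `match()` against the ordering of the centres. This is a sketch under that assumption (with several columns you would first have to decide which column, or which summary of the centres, defines the order); `new_label` is a name introduced here.

```r
set.seed(1)
mydata <- data.frame(
  id     = c("a", "b", "c", "d", "e", "F", "i", "j", "k", "l", "m", "n"),
  amount = c(1, 2, 3, 20, 21, 21, 40, 41, 42, 4, 23, 50)
)
result <- kmeans(mydata$amount, 3, nstart = 10)

# order(centers)[k] is the old label of the k-th smallest centre, so the
# position of each old label in that vector is its new label
new_label <- match(result$cluster, order(result$centers))
mydata$label <- new_label

# Cluster means by new label are now in increasing order
tapply(mydata$amount, mydata$label, mean)
```

The relabelling does not depend on which arbitrary numbers kmeans happened to assign, so it gives the same ordering across runs that find the same partition.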

clogit function in CE design does not converge

I designed a CE experiment using the package support.CEs. I generated a CE design with 3 attributes and 4 levels per attribute. The questionnaire had 4 alternatives and 4 blocks:
des1 <- rotation.design(attribute.names = list(
Qualitat = c("Aigua potable", "Cosetes.blanques.flotant", "Aigua.pou", "Aigua.marro"),
Disponibilitat.acces = c("Aixeta.24h", "Aixeta.10h", "Diposit.comunitari", "Pou.a.20"),
Preu = c("No.problemes.€", "Esforç.economic", "No.pagues.acces", "No.pagues.no.acces")),
nalternatives = 4, nblocks = 4, row.renames = FALSE,
randomize = TRUE, seed = 987)
The questionnaire was answered by 15 people (ID 1-15), so there are 60 outputs (15 people responding to each of 4 blocks):
ID BLOCK q1 q2 q3 q4
1 1 1 1 2 3 3
2 1 2 1 3 3 4
3 1 3 5 1 3 5
4 1 4 5 2 2 5
5 2 1 1 2 4 3
6 2 2 1 4 3 4
7 2 3 3 1 3 2
8 2 4 1 2 2 2
9 3 1 1 2 2 2
10 3 2 1 4 3 4
11 3 3 3 1 3 4
12 3 4 3 2 1 4
13 4 1 1 5 4 3
14 4 2 1 4 5 4
15 4 3 5 5 3 2
16 4 4 5 2 5 5
17 5 1 1 2 4 2
18 5 2 3 2 3 2
19 5 3 3 1 3 4
20 5 4 3 2 1 4
21 6 1 1 5 5 5
22 6 2 1 3 3 4
23 6 3 3 1 3 4
24 6 4 1 2 2 2
25 7 1 1 2 4 3
26 7 2 4 2 3 4
27 7 3 3 1 3 3
28 7 4 3 4 5 5
29 8 1 1 3 2 3
30 8 2 1 4 3 4
31 8 3 3 1 3 4
32 8 4 1 2 2 1
33 9 1 1 2 3 3
34 9 2 1 3 3 4
35 9 3 5 1 3 5
36 9 4 5 2 2 5
37 15 1 1 5 5 5
38 15 2 4 4 5 4
39 15 3 5 5 3 5
40 15 4 4 3 5 5
41 11 1 1 5 5 5
42 11 2 4 4 5 4
43 11 3 5 5 3 5
44 11 4 5 3 5 5
45 12 1 1 2 4 3
46 12 2 4 2 3 4
47 12 3 3 1 3 3
48 12 4 3 4 5 5
49 13 1 1 2 2 2
50 13 2 1 4 3 4
51 13 3 3 1 3 2
52 13 4 1 2 2 2
53 14 1 1 1 3 3
54 14 2 1 4 1 4
55 14 3 4 1 3 2
56 14 4 3 2 1 2
57 15 1 1 1 3 2
58 15 2 5 2 1 4
59 15 3 4 4 3 1
60 15 4 3 4 1 4
The problem is that when I merge the questions and answers matrix with the call
dataset1 <- make.dataset(respondent.dataset = res1,
choice.indicators = c("q1","q2","q3","q4"),
design.matrix = desmat1)
R shows a warning message: In fitter(X, Y, strats, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
I would expect the generated matrix desmat1 to have 4800 observations (80 possible combinations and 60 outputs). Instead I have only 1200 observations. The matrix dataset1 only shows the combination of 1 set of alternatives instead of the 4.
For example, for ID 1, Block 1, Question 1, only alternative 1 appears. It matches the answer selected by that person, but in other cases it does not match, and that information is lost in R, so the results when clogit is applied are wrong.
I hope the problem is clear.
Regards,
Edit:
I found my problem. When I make the dataset from the respondent.dataset that I generated in .csv format, R detects only the q1 response instead of q1-q4. The call
dataset1 <- make.dataset(respondent.dataset = res1,
choice.indicators = c("q1","q2","q3","q4"),
design.matrix = desmat1)
detects q1-q4 as new columns. But the key is that q1-q4 have to fill the QES column in dataset1. I did another CE before with 1 block, and the dataset was built correctly on reading the respondent.dataset. So the key point is that I am now using 4 blocks, but I do not know how to make R interpret q1-q4 as the QES column for each block.
The res1 matrix (respondent.dataset): the complete matrix has 60 rows = 15 respondents (ID 1-15) * 4 questions (the QES column in make.dataset).
Kind regards,

Getting all combinations in R, repetition allowed

The built-in combn only gives half the combinations:
> t(combn(1:5, 2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 1 5
[5,] 2 3
[6,] 2 4
[7,] 2 5
[8,] 3 4
[9,] 3 5
[10,] 4 5
For example there is no (1,1) nor (2,1).
How can I get all combinations?
As @akrun said, it looks like expand.grid will do it.
> expand.grid(rep(list(1:5), 2))
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 4 2
10 5 2
11 1 3
12 2 3
13 3 3
14 4 3
15 5 3
16 1 4
17 2 4
18 3 4
19 4 4
20 5 4
21 1 5
22 2 5
23 3 5
24 4 5
25 5 5
You could get the Cartesian product using merge:
merge(1:5, 1:5)
Output:
x y
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 4 2
10 5 2
11 1 3
12 2 3
13 3 3
14 4 3
15 5 3
16 1 4
17 2 4
18 3 4
19 4 4
20 5 4
21 1 5
22 2 5
23 3 5
24 4 5
25 5 5
Using sqldf:
library(sqldf)
df1 <- data.frame(a = 1:5)
df2 <- df1
sqldf("SELECT df1.a, df2.a FROM df1
CROSS JOIN df2")
This is actually called permutations with repetition. Besides the suggestions above, you can use the gtools::permutations function:
gtools::permutations(5, 2, 1:5, repeats.allowed=TRUE)
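The answers above return ordered pairs, so (1,2) and (2,1) both appear. If what is wanted is unordered combinations with repetition (so (1,1) is included but (2,1) is treated as a duplicate of (1,2)), one option is to filter the `expand.grid` result. A minimal base R sketch; the variable names are illustrative:

```r
n <- 5  # number of distinct elements
k <- 2  # elements per combination

# All ordered pairs, repetition allowed: n^k rows
all_pairs <- expand.grid(rep(list(seq_len(n)), k))
nrow(all_pairs)
# [1] 25

# Keep one representative per unordered pair: choose(n + k - 1, k) rows
unordered <- all_pairs[all_pairs$Var1 <= all_pairs$Var2, ]
nrow(unordered)
# [1] 15
```

The filter shown only covers k = 2; for larger k each row would need to be checked for being non-decreasing.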
