Calculate score based on responses to multiple-response questions - r

I have user data from between 1 and 20 multiple-response questions. In other words, users did not all answer the same number of questions; they could choose how many of the 20 questions to answer. All questions have the same number of response options (44 in the full data), and users could select as few or as many response options as they wanted for each question.
To give an example, here's a subset of the data (representing 3 multiple-response questions with 5 response options on each):
mydata <- structure(list(id = 1:5, q1.response1 = c(1L, NA, 1L, NA, 1L),
q1.response2 = c(NA, 1L, 1L, NA, NA), q1.response3 = c(NA,
1L, 1L, 1L, NA), q1.response4 = c(1L, 1L, 1L, NA, 1L), q1.response5 = c(NA,
1L, 1L, NA, NA), q2.response1 = c(NA, 1L, NA, NA, NA), q2.response2 = c(1L,
NA, 1L, 1L, 1L), q2.response3 = c(NA, 1L, NA, 1L, NA), q2.response4 = c(1L,
NA, NA, NA, 1L), q2.response5 = c(NA, 1L, NA, 1L, NA), q3.response1 = c(1L,
1L, NA, 1L, NA), q3.response2 = c(NA, 1L, NA, NA, NA), q3.response3 = c(1L,
NA, NA, 1L, NA), q3.response4 = c(1L, 1L, NA, NA, NA), q3.response5 = c(1L,
NA, NA, NA, NA)), .Names = c("id", "q1.response1", "q1.response2",
"q1.response3", "q1.response4", "q1.response5", "q2.response1",
"q2.response2", "q2.response3", "q2.response4", "q2.response5",
"q3.response1", "q3.response2", "q3.response3", "q3.response4",
"q3.response5"), class = "data.frame", row.names = c(NA, -5L))
A "1" indicates that they checked off that option; NA indicates that they did not.
What I want to do is calculate the following for each of the 5 users: sum(n_i * log2(n_i)), where n_i is the number of responses that appear in a particular combination of questions, and i = 1, ..., 2^k, where k is the number of multiple-response questions.
For example, if a person responds to 3 multiple-response questions (as in the above example), each of the 5 response options will fall into ONLY 1 of 2^3 = 8 possible group combinations:
1) Response options selected only in question 1
2) Response options selected only in question 2
3) Response options selected only in question 3
4) Response options selected in both question 1 and 2
5) Response options selected in both question 1 and 3
6) Response options selected in both question 2 and 3
7) Response options selected in question 1, 2, and 3
8) Response options that were not selected at all
As an example, for respondent #1 in the sample data:
1) Response options selected only in question 1: none = 0 responses
2) Response options selected only in question 2: response2 = 1 response
3) Response options selected only in question 3: response3, response5 = 2 responses
4) Response options selected in both question 1 and 2: none = 0 responses
5) Response options selected in both question 1 and 3: response1 = 1 response
6) Response options selected in both question 2 and 3: none = 0 responses
7) Response options selected in question 1, 2, and 3: response4 = 1 response
8) Response options that were not selected at all: none = 0 responses
So the score for this respondent would be:
(0*log2(0)) + (1*log2(1)) + (2*log2(2)) + (0*log2(0)) + (1*log2(1)) + (0*log2(0)) + (1*log2(1)) + (0*log2(0)) = 2, taking 0*log2(0) to be 0.
Any idea how to code this in R?

The first thing I would do here is transform your data into long format. There are various ways of doing this, such as the base R reshape() function and the reshape2 package, but in this case I decided to do it manually by constructing a new data.frame with data.frame() and a few carefully written calls to rep(). This approach also depends on flattening the original data.frame (minus the initial id column, which I instantiate separately) to a vector via as.matrix() and then c(), which runs down the rows of each column and then proceeds column by column. The rep() calls had to be designed to align with that order.
mydata;
## id q1.response1 q1.response2 q1.response3 q1.response4 q1.response5 q2.response1 q2.response2 q2.response3 q2.response4 q2.response5 q3.response1 q3.response2 q3.response3 q3.response4 q3.response5
## 1 1 1 NA NA 1 NA NA 1 NA 1 NA 1 NA 1 1 1
## 2 2 NA 1 1 1 1 1 NA 1 NA 1 1 1 NA 1 NA
## 3 3 1 1 1 1 1 NA 1 NA NA NA NA NA NA NA NA
## 4 4 NA NA 1 NA NA NA 1 1 NA 1 1 NA 1 NA NA
## 5 5 1 NA NA 1 NA NA 1 NA 1 NA NA NA NA NA NA
NU <- nrow(mydata);
NQ <- 3;
NO <- 5;
long <- data.frame(id=rep(mydata$id,NQ*NO),question=rep(1:NQ,each=NO*NU),option=rep(1:NO,each=NU,NQ),response=c(as.matrix(mydata[-1])));
long;
## id question option response
## 1 1 1 1 1
## 2 2 1 1 NA
## 3 3 1 1 1
## 4 4 1 1 NA
## 5 5 1 1 1
## 6 1 1 2 NA
## 7 2 1 2 1
## 8 3 1 2 1
## 9 4 1 2 NA
## 10 5 1 2 NA
## 11 1 1 3 NA
## 12 2 1 3 1
## 13 3 1 3 1
## 14 4 1 3 1
## 15 5 1 3 NA
## 16 1 1 4 1
## 17 2 1 4 1
## 18 3 1 4 1
## 19 4 1 4 NA
## 20 5 1 4 1
## 21 1 1 5 NA
## 22 2 1 5 1
## 23 3 1 5 1
## 24 4 1 5 NA
## 25 5 1 5 NA
## 26 1 2 1 NA
## 27 2 2 1 1
## 28 3 2 1 NA
## 29 4 2 1 NA
## 30 5 2 1 NA
## 31 1 2 2 1
## 32 2 2 2 NA
## 33 3 2 2 1
## 34 4 2 2 1
## 35 5 2 2 1
## 36 1 2 3 NA
## 37 2 2 3 1
## 38 3 2 3 NA
## 39 4 2 3 1
## 40 5 2 3 NA
## 41 1 2 4 1
## 42 2 2 4 NA
## 43 3 2 4 NA
## 44 4 2 4 NA
## 45 5 2 4 1
## 46 1 2 5 NA
## 47 2 2 5 1
## 48 3 2 5 NA
## 49 4 2 5 1
## 50 5 2 5 NA
## 51 1 3 1 1
## 52 2 3 1 1
## 53 3 3 1 NA
## 54 4 3 1 1
## 55 5 3 1 NA
## 56 1 3 2 NA
## 57 2 3 2 1
## 58 3 3 2 NA
## 59 4 3 2 NA
## 60 5 3 2 NA
## 61 1 3 3 1
## 62 2 3 3 NA
## 63 3 3 3 NA
## 64 4 3 3 1
## 65 5 3 3 NA
## 66 1 3 4 1
## 67 2 3 4 1
## 68 3 3 4 NA
## 69 4 3 4 NA
## 70 5 3 4 NA
## 71 1 3 5 1
## 72 2 3 5 NA
## 73 3 3 5 NA
## 74 4 3 5 NA
## 75 5 3 5 NA
Here's a demo of how to use reshape() to accomplish the same thing. As you can see, this requires two consecutive calls to reshape(), because we need to lengthen both the option variable and the question variable. The order of these two columns ends up reversed from what I created above, but that's incidental. Note that this approach spares us from having to manually store (or derive, which could theoretically be done) NQ and NO in advance of the transformation, at the expense of working around the quirks of the reshape() function.
long1 <- transform(structure(reshape(mydata,dir='l',varying=2:ncol(mydata),timevar='option'),reshapeLong=NULL),option=as.integer(sub('^response','',option,perl=T)));
long2 <- transform(structure(reshape(long1,dir='l',idvar=c('id','option'),varying=3:ncol(long1),timevar='question',sep=''),reshapeLong=NULL),question=as.integer(question),response=q,q=NULL);
rownames(long2) <- NULL;
identical(long,long2[names(long)]);
## [1] TRUE
The next step is figuring out which options for each user fell into which category. By "category" I'm referring to the combination of questions for which that user selected that particular option. Your formula requires first summing up the number of options that fall into each category.
Initially, I got the idea to normalize each user's selections for a particular option to a single number by treating each question as a binary digit and summing up the place-value-weighted numerical value of each selection. So, for example, if a user selected a particular option on questions 1 and 3, but not on question 2, then that would be binary 101 which would normalize to 5. This was the result, using aggregate() to group by id and option:
combo <- aggregate(cbind(category=response)~id+option,long,function(x) sum(x*2^(length(x):1-1),na.rm=T),na.action=na.pass);
combo;
## id option category
## 1 1 1 5
## 2 2 1 3
## 3 3 1 4
## 4 4 1 1
## 5 5 1 4
## 6 1 2 2
## 7 2 2 5
## 8 3 2 6
## 9 4 2 2
## 10 5 2 2
## 11 1 3 1
## 12 2 3 6
## 13 3 3 4
## 14 4 3 7
## 15 5 3 0
## 16 1 4 7
## 17 2 4 5
## 18 3 4 4
## 19 4 4 0
## 20 5 4 6
## 21 1 5 1
## 22 2 5 6
## 23 3 5 4
## 24 4 5 2
## 25 5 5 0
However, I then realized that this approach could easily lead to a problem. The issue is that it requires multiplying by place values that extend up to 2^(k-1). For your particular case k is 20, so that's only 2^19 = 524288, which is perfectly manageable, but imagine if you had 100 questions; the largest place value would be 2^99 = 633825300114114700748351602688! That doesn't fit into a 32-bit integer, so it would be converted to a double (roughly 6.338e+29), and that would break the entire aggregation we're going to have to do next (stay tuned), since nearby categories would be "rounded" together, due to the limited absolute precision of doubles.
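A quick base R demonstration of that precision cliff (nothing here is specific to the survey data):

```r
# Doubles carry 53 bits of mantissa, so integers at or above 2^53 are no
# longer exact; place values near 2^99 would collapse distinct categories:
2^52 == 2^52 + 1   # FALSE -- still exact below 2^53
2^53 == 2^53 + 1   # TRUE  -- first power of two where adding 1 is lost
2^99 == 2^99 + 1   # TRUE  -- two different category codes compare equal
```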
I thought about how to solve this problem, and realized that it makes most sense to just switch to a string representation of the category. This will allow us to handle large numbers of questions, while still providing a simple and easily comparable representation of the category. I also manually set it as a factor rather than a character vector, which will be useful later on for the tabulate() call. So, here's the improved solution, again using aggregate() to group by id and option:
combo <- aggregate(cbind(category=response)~id+option,long,function(x) factor(paste(replace(x,is.na(x),0),collapse='')),na.action=na.pass);
combo;
## id option category
## 1 1 1 101
## 2 2 1 011
## 3 3 1 100
## 4 4 1 001
## 5 5 1 100
## 6 1 2 010
## 7 2 2 101
## 8 3 2 110
## 9 4 2 010
## 10 5 2 010
## 11 1 3 001
## 12 2 3 110
## 13 3 3 100
## 14 4 3 111
## 15 5 3 000
## 16 1 4 111
## 17 2 4 101
## 18 3 4 100
## 19 4 4 000
## 20 5 4 110
## 21 1 5 001
## 22 2 5 110
## 23 3 5 100
## 24 4 5 010
## 25 5 5 000
As a slight alternative, to save on characters, we could use a more compact encoding than the binary strings above. Here's a rather intricate line of code that builds hex strings:
combo <- aggregate(cbind(category=response)~id+option,long,function(x) factor(paste(c(0:9,letters[1:6])[colSums(matrix(c(rep(0,ceiling(length(x)/4)*4-length(x)),x)*2^(3:0),4),na.rm=T)+1],collapse='')),na.action=na.pass);
combo;
## id option category
## 1 1 1 5
## 2 2 1 3
## 3 3 1 4
## 4 4 1 1
## 5 5 1 4
## 6 1 2 2
## 7 2 2 5
## 8 3 2 6
## 9 4 2 2
## 10 5 2 2
## 11 1 3 1
## 12 2 3 6
## 13 3 3 4
## 14 4 3 7
## 15 5 3 0
## 16 1 4 7
## 17 2 4 5
## 18 3 4 4
## 19 4 4 0
## 20 5 4 6
## 21 1 5 1
## 22 2 5 6
## 23 3 5 4
## 24 4 5 2
## 25 5 5 0
Note how this result looks identical to the place-value solution given earlier. That's just because this sample data has only 3 questions and thus only 8 categories, which doesn't reach the hexadecimal letter range. On the other hand, the identical output is a nice demonstration that both solutions use a kind of numerical representation of the category, the place-value solution using actual integers and this solution using hexadecimal strings.
The next step is to aggregate on the category, summing up n_i * log2(n_i) over all categories.
Now, since the addend is zero for n_i = 0, we don't actually have to add up the value for every possible category; we can ignore the ones that are not present. This is fortunate, since there are 2^k categories, which would become huge for large k. In other words, all we have to do is sum the expression for each category that is represented in the data. Furthermore, since the addend is also zero for n_i = 1 (because log2(1) = 0), we can excise every category that has fewer than 2 constituents. Thus we have:
res <- aggregate(cbind(score=category)~id,combo,function(x) { nc <- tabulate(x); nc <- nc[nc>1]; sum(nc*log2(nc)); });
res;
## id score
## 1 1 2
## 2 2 4
## 3 3 8
## 4 4 2
## 5 5 2
This was a very complex question, and I might have made a mistake somewhere, so please check my work!
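One cheap check: recompute respondent #1's score directly from the category counts listed in the question (just the formula, independent of the pipeline above):

```r
# Respondent #1's non-empty categories, from the worked example:
# only-q2: 1 option, only-q3: 2 options, q1&q3: 1 option, q1&q2&q3: 1 option
n <- c(1, 2, 1, 1)
sum(n * log2(n))   # terms with n == 1 contribute 0, so the score is 2
```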

Related

Grouping rows with multiple conditions across columns, incl. a sorting, in R/dplyr

In the following dataframe, I have 24 points in the 3D space (2 horizontal locations along X and Y, each with 12 vertical values along Z).
I would like to group together the points vertically if:
they have the same val value and
they follow each other along the Z axis (so two 1 separated by another value would not have the same ID).
And this should be done only for the values beyond the first 3 Z values (which automatically get ID = 1, 2, and 3 respectively; the following ones start at 4).
set.seed(50)
library(dplyr)
mydf = data.frame(X = rep(1, 24), Y = rep(1:2, each = 12),
Z = c(sample(1:12,12,replace=F), sample(4:16,12,replace=F)),
val = c(rep(1:3, 8)))
mydf = mydf %>% group_by(X,Y) %>% arrange(X,Y,Z) %>% data.frame()
# X Y Z val
# 1 1 1 1 3 # In this X-Y location, Z starts at 1
# 2 1 1 2 3
# 3 1 1 3 3
# 4 1 1 4 2
# 5 1 1 5 2
# 6 1 1 6 1
# 7 1 1 7 1
# 8 1 1 8 1
# 9 1 1 9 1
# 10 1 1 10 2
# 11 1 1 11 2
# 12 1 1 12 3
# 13 1 2 4 2 # In this X-Y location, Z starts at 4
# [etc (see below)]
Desired output (note for example that lines 4-5 and 10-11 get a different ID):
rle1 = rle(mydf[4:12,]$val)
# Run Length Encoding
# lengths: int [1:4] 2 4 2 1
# values : int [1:4] 2 1 2 3
rle2 = rle(mydf[4:12 + 12,]$val)
# Run Length Encoding
# lengths: int [1:7] 2 1 1 2 1 1 1
# values : int [1:7] 3 1 2 1 3 1 2
mydf$ID = c(1:3, rep(4:(3+length(rle1$lengths)), rle1$lengths),
1:3, rep(4:(3+length(rle2$lengths)), rle2$lengths))
# X Y Z val ID
# 1 1 1 1 3 1
# 2 1 1 2 3 2
# 3 1 1 3 3 3
# 4 1 1 4 2 4
# 5 1 1 5 2 4
# 6 1 1 6 1 5
# 7 1 1 7 1 5
# 8 1 1 8 1 5
# 9 1 1 9 1 5
# 10 1 1 10 2 6
# 11 1 1 11 2 6
# 12 1 1 12 3 7 # In this X-Y location, I have 7 groups in the end
# 13 1 2 4 2 1
# 14 1 2 5 2 2
# 15 1 2 6 3 3
# 16 1 2 7 3 4
# 17 1 2 9 3 4
# 18 1 2 10 1 5
# 19 1 2 11 2 6
# 20 1 2 12 1 7
# 21 1 2 13 1 7
# 22 1 2 14 3 8
# 23 1 2 15 1 9
# 24 1 2 16 2 10 # In this X-Y location, I have 10 groups in the end
How could I perform this more efficiently, ideally in one line with dplyr, supposing this applies to many (X,Y) locations, where the first 3 Z values (which start at a different value in each location) are always followed by a location-dependent number of ID groups?
I started by trying to work with a vector from a conditional subset in dplyr, which is wrong:
mydf %>% group_by(X,Y) %>% arrange(X,Y,Z) %>%
mutate(dummy = mean(rle(val)$values))
Error: error in evaluating the argument 'x' in selecting a method for function 'mean': Error in rle(c(1L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L))$function (x, :
invalid subscript type 'closure'
Thanks!
You can use data.table::rleid on val starting from the 4th element and then add an offset of 3, which simplifies the rle calculation:
library(dplyr); library(data.table)
mydf %>%
group_by(X, Y) %>%
mutate(ID = c(1:3, rleid(val[-(1:3)]) + 3)) %>%
as.data.frame() # for print purpose only
# X Y Z val ID
#1 1 1 1 3 1
#2 1 1 2 3 2
#3 1 1 3 3 3
#4 1 1 4 2 4
#5 1 1 5 2 4
#6 1 1 6 1 5
#7 1 1 7 1 5
#8 1 1 8 1 5
#9 1 1 9 1 5
#10 1 1 10 2 6
#11 1 1 11 2 6
#12 1 1 12 3 7
#13 1 2 4 2 1
#14 1 2 5 2 2
#15 1 2 6 3 3
#16 1 2 7 3 4
#17 1 2 9 3 4
#18 1 2 10 1 5
#19 1 2 11 2 6
#20 1 2 12 1 7
#21 1 2 13 1 7
#22 1 2 14 3 8
#23 1 2 15 1 9
#24 1 2 16 2 10
Or without rleid, use cumsum + diff:
mydf %>% group_by(X, Y) %>% mutate(ID = c(1:3, cumsum(c(4, diff(val[-(1:3)]) != 0))))
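For reference, the same run-length logic can be written in base R without dplyr. This is a sketch; the commented split()/unlist() line is one assumed way to apply it to the full sorted mydf:

```r
# IDs for one sorted X-Y block: the first 3 rows get 1:3, the rest get
# run-length group numbers offset by 3.
block_id <- function(val) {
  runs <- rle(val[-(1:3)])
  c(1:3, rep(seq_along(runs$lengths) + 3L, runs$lengths))
}
# e.g. the val column of the first X-Y location in the question:
block_id(c(3, 3, 3, 2, 2, 1, 1, 1, 1, 2, 2, 3))
# [1] 1 2 3 4 4 5 5 5 5 6 6 7
# Applied per location (assuming mydf is already sorted by X, Y, Z):
# mydf$ID <- unlist(lapply(split(mydf$val, interaction(mydf$X, mydf$Y), drop = TRUE), block_id), use.names = FALSE)
```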

Create edgelist for all interactions from data.frame

I am trying to do network analysis in igraph but am having some issues with transforming my dataset into an edge list (with weights), given the differing number of columns.
The data set looks as follows (df1) (much larger, of course). The first column is the main operator id (a main operator can also be a partner and vice versa, so the ids stay the same in the edge list). The challenge is that the number of partners varies (from 0 to 40) and every interaction has to be considered (not just "IdMain to IdPartnerX").
IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4 .....
1 4 3 7 6
2 3 1 NA NA
3 1 4 2 NA
4 9 6 3 NA
.
.
I already got the helpful tip to use reshape to do this, like:
data_melt <- reshape2::melt(data, id.vars = "IdMain")
edgelist <- data_melt[!is.na(data_melt$value), c("IdMain", "value")]
However, this only creates a 'directed' edgelist (from Main to Partners). What I need is something like below, where every interaction is recorded.
Id1 Id2
1 4
1 3
1 7
1 6
4 3
4 7
4 6
3 7
etc
Does anyone have a tip on the best way to go? I also looked into the igraph library and couldn't find a function to do this.
There is no need for reshape(2), melting, etc. You just need to grab every combination of column pairs and then bind them together.
x <- read.table(text="IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 7 6
2 3 1 NA NA
3 1 4 2 NA
4 9 6 3 NA", header=TRUE)
idx <- t(combn(seq_along(x), 2))
edgelist <- lapply(1:nrow(idx), function(i) x[, c(idx[i, 1], idx[i, 2])])
edgelist <- lapply(edgelist, setNames, c("ID1","ID2"))
edgelist <- do.call(rbind, edgelist)
edgelist <- edgelist[rowSums(is.na(edgelist))==0, ]
edgelist
# ID1 ID2
# 1 1 4
# 2 2 3
# 3 3 1
# 4 4 9
# 5 1 3
# 6 2 1
# 7 3 4
# 8 4 6
# 9 1 7
# 11 3 2
# 12 4 3
# 13 1 6
# 17 4 3
# 18 3 1
# 19 1 4
# 20 9 6
# 21 4 7
# 23 1 2
# 24 9 3
# 25 4 6
# 29 3 7 <--
# 31 4 2
# 32 6 3
# 33 3 6 <--
# 37 7 6 <--
Using the data below, you can achieve what looks to be your goal with apply and combn. This returns a list of matrices with the pairwise combinations of the row elements of your data.frame:
myPairs <- apply(t(dat), 2, function(x) t(combn(x[!is.na(x)], 2)))
Note that the output of apply can be finicky: it is necessary here to have at least one row with an NA so that apply returns a list rather than a matrix.
If you want a data.frame at the end, use do.call and rbind to put the matrices together, then data.frame and setNames for the coercion and to add names.
setNames(data.frame(do.call(rbind, myPairs)), c("Id1", "Id2"))
Id1 Id2
1 1 4
2 1 3
3 1 7
4 1 6
5 4 3
6 4 7
7 4 6
8 3 7
9 3 6
10 7 6
11 2 3
12 2 1
13 3 1
14 3 1
15 3 4
16 3 2
17 1 4
18 1 2
19 4 2
20 4 9
21 4 6
22 4 3
23 9 6
24 9 3
25 6 3
data
dat <-
structure(list(IdMain = 1:4, IdPartner1 = c(4L, 3L, 1L, 9L),
IdPartner2 = c(3L, 1L, 4L, 6L), IdPartner3 = c(7L, NA, 2L,
3L), IdPartner4 = c(6L, NA, NA, NA)), .Names = c("IdMain",
"IdPartner1", "IdPartner2", "IdPartner3", "IdPartner4"),
class = "data.frame", row.names = c(NA, -4L))

Vectorized Conditional Random Matching

I want to create conditional random pairs without using for-loops so I can use the code with large datasets. At first, I create rows with unique IDs and randomly assign two different "types" to my rows:
df<-data.frame(id=1:10,type=NA,partner=NA)
df[sample(df$id,nrow(df)/2),"type"]<-1 ##random 50% type 1
df[which(is.na(df$type)==TRUE),"type"]<-2 ##other 50% type 2
df
id type partner
1 1 2 NA
2 2 1 NA
3 3 1 NA
4 4 1 NA
5 5 2 NA
6 6 1 NA
7 7 1 NA
8 8 2 NA
9 9 2 NA
10 10 2 NA
Now I want them to receive a random partner of the opposite type. So I randomize my type 1 IDs and match them to some type 2 IDs like so:
df$partner[which(df$type==2)]<-sample(df$id[which(df$type==1)],
nrow(df)/2)
df
id type partner
1 1 2 4
2 2 1 NA
3 3 1 NA
4 4 1 NA
5 5 2 2
6 6 1 NA
7 7 1 NA
8 8 2 6
9 9 2 3
10 10 2 7
And that's where I'm stuck. For some reason I can't think of a vectorized way to tell R: "take the IDs of type 1, look up where these IDs appear in df$partner, and return the corresponding row's ID as that row's df$partner instead of NA".
One example for a for-loop for conditional random pairing can be found here: click
I'm pretty sure that that's very basic and doable, however, any help appreciated!
Presumably, you want the type 1 and type 2 users who are matched together to have each other's ids in their respective partner entries. Here is a fully vectorized solution.
# Define number of ids
n = 100
# Generate starting data frame
df = data.frame(id = 1:n, type = NA, partner = NA)
# Generate the type column
df$type[(a<-sample(df$id, n/2))] = 1
df$type[(b<-setdiff(1:n, a))] = 2
# Select a random partner id from the other type
df$partner[a] = sample(df$id[b])
# Fill in partner values based on previous line
df$partner[b] = df$id[match(df$id[b], df$partner)]
Output:
id type partner
1 2 11
2 1 13
3 2 19
4 2 10
5 1 17
6 2 28
7 2 27
8 2 21
9 1 22
10 1 4
11 1 1
12 2 20
13 2 2
14 2 25
15 2 24
16 2 30
17 2 5
18 2 29
19 1 3
20 1 12
21 1 8
22 2 9
23 2 26
24 1 15
25 1 14
26 1 23
27 1 7
28 1 6
29 1 18
30 1 16
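Two invariants hold for this construction whatever the seed, and they make a handy sanity check: pairings are mutual, and partners always have opposite types. A self-contained sketch (re-running the code above with n = 10):

```r
set.seed(1)                      # any seed; the checks below hold regardless
n <- 10
df <- data.frame(id = 1:n, type = NA, partner = NA)
df$type[(a <- sample(df$id, n/2))] <- 1
df$type[(b <- setdiff(1:n, a))] <- 2
df$partner[a] <- sample(df$id[b])
df$partner[b] <- df$id[match(df$id[b], df$partner)]
# My partner's partner is me:
stopifnot(df$partner[df$partner] == df$id)
# No one is paired with their own type:
stopifnot(df$type[df$partner] != df$type)
```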

Randomly Assign Integers in R within groups without replacement

I am running a study with two experiments: experiment_1 and experiment_2. Each experiment has 5 different treatments (i.e. 1, 2, 3, 4, 5). We are trying to randomly assign the treatments within groups.
We would like to do this via sampling without replacement, iteratively within each group, to ensure that the treatment assignment is as balanced as possible (e.g. we don't want 4 subjects in group 1 assigned to treatment 2 and no one assigned to treatment 1). So if a group has 23 subjects, we want to split the respondents into 4 subgroups of 5 and 1 subgroup of 3. We then want to sample without replacement within the first subgroup of 5, so each subject gets 1 of the 5 treatments; do the same for the second, third, and fourth subgroups of 5; and sample without replacement within the final subgroup of 3. This guarantees that within this group every treatment is assigned to at least 4 subjects, and 3 treatments are assigned to 5 subjects. We would like to do this for all the groups in the study and for both experiments. The resultant output would look something like this...
group experiment_1 experiment_2
[1,] 1 5 3
[2,] 1 3 2
[3,] 1 4 4
[4,] 1 1 5
[5,] 1 2 1
[6,] 1 2 3
[7,] 1 4 1
[8,] 1 3 2
[9,] 2 5 5
[10,] 2 1 4
[11,] 2 3 4
[12,] 2 1 5
[13,] 2 2 1
. . . .
. . . .
. . . .
I know how to use the sample function, but I am unsure how to sample without replacement within each group so that the output corresponds to the procedure described above. Any help would be appreciated.
I think we just need to shuffle sample IDs, see this example:
set.seed(124)
#prepare groups and samples(shuffled)
df <- data.frame(group=sort(rep(1:3,9)),
sampleID=sample(1:27,27))
#treatments repeated nrow of df
df$ex1 <- rep(c(1,2,3,4,5),ceiling(nrow(df)/5))[1:nrow(df)]
df$ex2 <- rep(c(2,3,4,5,1),ceiling(nrow(df)/5))[1:nrow(df)]
df <- df[ order(df$group,df$sampleID),]
#check treatment distribution
with(df,table(group,ex1))
# ex1
# group 1 2 3 4 5
# 1 2 2 2 2 1
# 2 2 2 2 1 2
# 3 2 2 1 2 2
with(df,table(group,ex2))
# ex2
# group 1 2 3 4 5
# 1 1 2 2 2 2
# 2 2 2 2 2 1
# 3 2 2 2 1 2
How about this function:
f <- function(n,m) {sample( c( rep(1:m,n%/%m), sample(1:m,n%%m) ), n )}
"n" is the group size, "m" the number of treatments.
Each treatment must be containt at least "n %/% m" times in the group.
The treatment numbers of the remaining "n %% m" group members are
assigned arbitrarily without repetition.
The vector "c( rep(1:m,n%/%m), sample(1:m,n%%m) )" contains these treatment numbers. Finally the "sample" function
perturbes these numbers.
> f(8,5)
[1] 5 3 1 5 4 2 2 1
> f(8,5)
[1] 4 5 3 4 2 2 1 1
> f(8,5)
[1] 4 2 1 5 3 5 2 3
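The balance guarantee is easy to verify empirically: for n = 23 and m = 5, every treatment must occur either 23 %/% 5 = 4 or 5 times, and exactly 23 %% 5 = 3 treatments get the extra subject, whatever the random draw:

```r
f <- function(n, m) sample(c(rep(1:m, n %/% m), sample(1:m, n %% m)), n)
# Tabulate one random assignment for a group of 23 and 5 treatments
counts <- table(factor(f(23, 5), levels = 1:5))
counts              # every treatment occurs 4 or 5 times
sum(counts == 5)    # exactly 3 treatments get the extra subject
```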
Here is a function that creates a dataframe, using the above function:
Plan <- function( groupSizes, numExp=2, numTreatment=5 )
{
numGroups <- length(groupSizes)
df <- data.frame( group = rep(1:numGroups,groupSizes) )
for ( e in 1:numExp )
{
df <- cbind(df,unlist(lapply(groupSizes,function(n){f(n,numTreatment)})))
colnames(df)[e+1] <- sprintf("Exp_%i", e)
}
return(df)
}
Example:
> P <- Plan(c(8,23,13,19))
> P
group Exp_1 Exp_2
1 1 4 1
2 1 1 4
3 1 2 2
4 1 2 1
5 1 3 5
6 1 5 5
7 1 1 2
8 1 3 3
9 2 5 1
10 2 2 1
11 2 5 2
12 2 1 2
13 2 2 1
14 2 1 4
15 2 3 5
16 2 5 3
17 2 2 4
18 2 5 4
19 2 2 5
20 2 1 1
21 2 4 2
22 2 3 3
23 2 4 3
24 2 2 5
25 2 3 3
26 2 5 2
27 2 1 5
28 2 3 4
29 2 4 4
30 2 4 2
31 2 4 3
32 3 2 5
33 3 5 3
34 3 5 1
35 3 5 1
36 3 2 5
37 3 4 4
38 3 1 4
39 3 3 2
40 3 3 2
41 3 3 3
42 3 1 1
43 3 4 2
44 3 4 4
45 4 5 1
46 4 3 1
47 4 1 2
48 4 1 5
49 4 3 3
50 4 3 1
51 4 4 5
52 4 2 4
53 4 5 3
54 4 2 1
55 4 4 2
56 4 2 5
57 4 4 4
58 4 5 3
59 4 5 4
60 4 1 2
61 4 2 5
62 4 3 2
63 4 4 4
Check the distribution:
> with(P,table(group,Exp_1))
Exp_1
group 1 2 3 4 5
1 2 2 2 1 1
2 4 5 4 5 5
3 2 2 3 3 3
4 3 4 4 4 4
> with(P,table(group,Exp_2))
Exp_2
group 1 2 3 4 5
1 2 2 1 1 2
2 4 5 5 5 4
3 3 3 2 3 2
4 4 4 3 4 4
>
The design of efficient experiments is a science in its own right, and there are a few R packages dealing with this issue:
https://cran.r-project.org/web/views/ExperimentalDesign.html
I am afraid your approach is not optimal in terms of resources, no matter how you create the samples...
However, this might help:
n <- 23
group <- sort(rep(1:5, ceiling(n/5)))[1:n]
exp1 <- rep(NA, length(group))
for(i in 1:max(group)) {
exp1[which(group == i)] <- sample(1:5)[1:sum(group == i)]
}
Not exactly sure if this meets all your constraints, but you could use the randomizr package:
library(randomizr)
experiment_1 <- complete_ra(N = 23, num_arms = 5)
experiment_2 <- block_ra(experiment_1, num_arms = 5)
table(experiment_1)
table(experiment_2)
table(experiment_1, experiment_2)
Produces output like this:
> table(experiment_1)
experiment_1
T1 T2 T3 T4 T5
4 5 5 4 5
> table(experiment_2)
experiment_2
T1 T2 T3 T4 T5
6 3 6 4 4
> table(experiment_1, experiment_2)
experiment_2
experiment_1 T1 T2 T3 T4 T5
T1 2 0 1 1 0
T2 1 1 1 1 1
T3 1 1 1 1 1
T4 1 0 2 0 1
T5 1 1 1 1 1

Sequentially numbering within many row blocks of unequal length [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
My actual dataset is composed of repeated measurements for each id, where the number of measurements can vary across individuals. A simplified example is:
dat <- data.frame(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L))
dat
## id
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 2
## 8 2
## 9 3
## 10 3
## 11 3
I am trying to sequentially number the dat rows by the id variable. The result should be:
dat
## id s
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 2 1
## 8 2 2
## 9 3 1
## 10 3 2
## 11 3 3
How would you do that? I tried to select the last row of each id by using duplicated(), but this is probably not the way, since it works with the entire column.
Use ave(). The first argument is the vector you're going to apply the function to; the other arguments are your grouping variables; and FUN is the function you want to apply. See ?ave for more details.
transform(dat, s = ave(id, id, FUN = seq_along))
# id s
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 6
# 7 2 1
# 8 2 2
# 9 3 1
# 10 3 2
# 11 3 3
If you have a large dataset or are using the data.table package, you can make use of ".N" as follows:
library(data.table)
DT <- data.table(dat)
DT[, s := 1:.N, by = "id"]
## Or
## DT[, s := sequence(.N), id][]
Or, you can use rowid, like this:
library(data.table)
setDT(dat)[, s := rowid(id)][]
# id s
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 1 5
# 6: 1 6
# 7: 2 1
# 8: 2 2
# 9: 3 1
# 10: 3 2
# 11: 3 3
For completeness, here's the "tidyverse" approach:
library(tidyverse)
dat %>%
group_by(id) %>%
mutate(s = row_number(id))
## # A tibble: 11 x 2
## # Groups: id [3]
## id s
## <int> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 2 1
## 8 2 2
## 9 3 1
## 10 3 2
## 11 3 3
dat <- read.table(text = "
id
1
1
1
1
1
1
2
2
3
3
3",
header=TRUE)
data.frame(
id = dat$id,
s = sequence(rle(dat$id)$lengths)
)
Gives:
id s
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 2 1
8 2 2
9 3 1
10 3 2
11 3 3
Using tapply, though not as elegant as ave:
cbind(dat$id,unlist(tapply(dat$id,dat$id,seq_along)))
[,1] [,2]
11 1 1
12 1 2
13 1 3
14 1 4
15 1 5
16 1 6
21 2 1
22 2 2
31 3 1
32 3 2
33 3 3
