I need to assign people randomly into groups and a category. Unfortunately, I don’t really know where to start with this. I have tried to explain my problem using the example below. Any help with this would be very much appreciated.
I have 207 ‘Home Type A’ and 408 ‘Home Type B’ categories. In total there are 1524 people that need to be assigned to either the 207 Home Type A’s or the 408 Home Type B’s categories. However, the 1524 people also need to be grouped together in either groups of 2 to 7 for Home Type A or of 2 to 6 for Home Type B .
The end result should be 1524 people assigned to 207 groups (containing between 2 to 7 people) and 408 groups (containing between 2 to 6 people).
The assignment to groups has to be random and can use any combination of group sizes required as it does not matter if a group category is not used (for example, it would be fine if the 207 groups for Home Type A only contained 2, 3 or 4 people one time, or only 5 and 7 people another).
I imagine an output that would look something like this:
GroupSize <- c(2:7)
Num.Groups <- 0
Num.People <- 0
HouseTypeA <- data.frame(GroupSize, Num.Groups, Num.People)
GroupSize <- c(2:6)
HouseTypeB <- data.frame(GroupSize, Num.Groups, Num.People)
With the 'Num.Groups' column summing to either 207 or 408 and the sum of 'Num.People' between the two data frames being 1524.
I tried this in 3 steps:
Make List of house type
Distribute each person to houses, checking
that there is space (max=7 for type a and max=6 for type b)
Check that each house has the minimum of 2 people. If not, grab a person
from another house without allowing their number of people to drop
below 2.
homeType=rep(c("a", "b"), times=c(207, 408))
H <- vector(mode="list", length(homeType))
for(i in seq(H)){
H[[i]]$type <- homeType[i]
H[[i]]$n <- 0
}
H
# Place people in houses up to max number of people
npeople <- 1524
for(i in seq(npeople)){
placed_in_house <- FALSE
while(!placed_in_house){
house_num <- sample(length(H), 1)
if(H[[house_num]]$type == "a"){
if(H[[house_num]]$n < 7){
H[[house_num]]$n <- H[[house_num]]$n + 1
placed_in_house <- TRUE
}
}
if(H[[house_num]]$type == "b"){
if(H[[house_num]]$n < 6){
H[[house_num]]$n <- H[[house_num]]$n + 1
placed_in_house <- TRUE
}
}
}
}
H
hist(unlist(lapply(H, function(x)x$n)))
# move people around to get up to min number of people
for(i in seq(H)){
while(H[[i]]$n < 2){
knock_on_door <- sample(length(H), 1)
if( H[[knock_on_door]]$n > 2){
H[[i]]$n <- H[[i]]$n + 1 # house i takes 1 person
H[[knock_on_door]]$n <- H[[knock_on_door]]$n - 1 # house knock_on_door loses 1 person
}
}
}
H
Ha <- H[which(lapply(H, function(x){x$type}) == "a")]
Hb <- H[which(lapply(H, function(x){x$type}) == "b")]
op <- par(mfcol=c(1,2))
breaks=2:7
hist(unlist(lapply(Ha, function(x)x$n)), breaks=breaks, col=8, xlab="people per house", main="type A") # 2:7
hist(unlist(lapply(Hb, function(x)x$n)), breaks=breaks, col=8, xlab="people per house", main="type B") # 2:6
par(op)
sum(unlist(lapply(Ha, function(x)x$n))) + sum(unlist(lapply(Hb, function(x)x$n)))
# [1] 1524
Houses <- data.frame(
Num.Groups = seq(H),
type=unlist(lapply(H, function(x){x$type})),
Num.People=unlist(lapply(H, function(x){x$n}))
)
head(Houses)
As you can see, the total numbers and distribution of people per house works out.
There are two outer categories, A and B. Outer category A always has 207 inner categories each of size in 2:7, outer category B always has 408 inner categories each of size in 2:6.
This means:
The minimum number of people needed in outer category A must be 207*2 == 414.
The maximum number of people possible in outer category A is 207*7 == 1449.
The minimum number of people needed in outer category B must be 408*2 == 816.
The maximum number of people possible in outer category B is 408*6 == 2448.
We can further conclude:
The minimum number of people needed in total is 414+816 == 1230.
The maximum number of people possible in total is 1449+2448 == 3897.
I've captured these values as constants assigned at the start of my solution, copied here for reference:
Ainner <- 207L;
Binner <- 408L;
Amin <- Ainner*2L;
Bmin <- Binner*2L;
Amax <- Ainner*7L;
Bmax <- Binner*6L;
NPmin <- Amin+Bmin;
NPmax <- Amax+Bmax;
I designed my solution around the idea that we want to minimize the amount of looping that we need to do in order to find a suitable allocation of people to categories. We can get closer to that goal by initially computing a random division of the total number of people into the two outer categories A and B.
Once that is done, we must select random sizes for each inner category such that the outer category totals hit our division numbers exactly. This is not an easy task; I have not been able to think of a way of using the PRNG primitives to instantaneously produce values that satisfy the constraints. I think we need to iterate.
My solution to minimize looping was to derive normal distributions whose distribution parameters (that is, the mean and standard deviation) are parameterized on the selected outer category sizes (Asize and Bsize in the code), contrived in such a way that the distributions are likely to produce values whose aggregation will be very close to the required total in each outer category, while still providing significant randomness in the inner category size selections. We can then run a loop, making one increment or decrement every iteration as necessary, until we arrive at the required total. The loop is still necessary, but the number of iterations is minimized.
The form of the mean equations follows this pattern:
{extended-min} + {extended-range}*({size}-{min})/({max}-{min})
{extended-min} is the bottom end of a range that extends just outside the inner category size range. For example, for outer category A, the inner category size range is 2:7, and my extended range is 1.5:7.5. I used an extended range because I wanted to permit some variance in the random variates produced at the extreme ends of the function, even though they should theoretically approach a limit of the end of the inner category size range. It made the derivation and manipulation of the formula easier than it would have been had I tried to make it reach the theoretically ideal limit at each end, and it's aesthetically more pleasing to plot it this way. Note that the possibility (in fact, certainty) that the normal distribution deviates will fall outside the inner category size range is not a problem, because I cap the values at the endpoints with pmin() and pmax().
{extended-range} is the full extended range, e.g. 6 for outer category A.
{size} is the size that was randomly chosen for the outer category.
{min} and {max} are the minimum and maximum permissible values of the outer category size.
These are the actual equations I derived:
Amean <- function(Asize) 1.5 + 6*(Asize-Amin)/(Amax-Amin);
Bmean <- function(Bsize) 1.5 + 5*(Bsize-Bmin)/(Bmax-Bmin);
The form of the standard deviation equations follows this pattern:
{mult1}*exp(-({mult2}*(2*{size}-{min}-{max})/({max}-{min}))^2)
{mult1} and {mult2} are simply multipliers that I fiddled with to get an intuitive behavior of the final normal distributions, based on the goal of mapping the distributions to the required inner category allocations as closely as possible.
{size} is the same as before.
{min} and {max} are the same as before.
The rationale for the standard deviation form is that the standard deviation will be symmetric about the midpoint of the valid size range, producing a wide normal distribution for middleground sizes, and becoming more narrow towards the extreme ends. Note that the quotient in the exponential is equivalent to this:
({size}-({min}+{max})/2)/(({max}-{min})/2)
Hence it's the deviation of the size from the midpoint of the inner category size range, divided by half the range. That provides a domain of [-1,1], which is then multiplied by {mult2} and squared. The resulting negative exponent gets very large for extreme values, causing the entire exponential to become very small. This small standard deviation is what gives the normal distribution its narrowness towards the ends of the inner category size range.
These are the actual equations:
Asd <- function(Asize) 1.3*exp(-(1.22*(2*Asize-Amin-Amax)/(Amax-Amin))^2);
Bsd <- function(Bsize) 1.3*exp(-(1.22*(2*Bsize-Bmin-Bmax)/(Bmax-Bmin))^2);
Here's some code I wrote to nicely visualize the normal distributions:
Outer Category A
xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='Inner Category Size',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Asize.col <- data.frame(Asize=trunc(seq(Amin,Amax,len=7L)),col=c('red','green','blue','brown','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(Asize.col))) {
Asize <- Asize.col$Asize[ri];
col <- Asize.col$col[ri];
lines(x,dnorm(x,Amean(Asize),Asd(Asize)),col=col,lwd=2);
};
with(Asize.col,legend(-2.5,1.65,Asize,col,col,title=expression(bold(Asize))),cex=0.7);
subEnv <- as.environment(mget(c('Amin','Amax')));
text(0.5,1.6,parse(text=paste0('mu == ',deparse(do.call(substitute,c(list(body(Amean)),subEnv))))),pos=4L);
text(0.5,1.53,parse(text=paste0('sigma == ',deparse(do.call(substitute,c(list(body(Asd)),subEnv))))),pos=4L);
Outer Category B
xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='Inner Category Size',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Bsize.col <- data.frame(Bsize=trunc(seq(Bmin,Bmax,len=7L)),col=c('red','green','blue','brown','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(Bsize.col))) {
Bsize <- Bsize.col$Bsize[ri];
col <- Bsize.col$col[ri];
lines(x,dnorm(x,Bmean(Bsize),Bsd(Bsize)),col=col,lwd=2);
};
with(Bsize.col,legend(-2.5,1.65,Bsize,col,col,title=expression(bold(Bsize))),cex=0.7);
subEnv <- as.environment(mget(c('Bmin','Bmax')));
text(0.3,1.6,parse(text=paste0('mu == ',deparse(do.call(substitute,c(list(body(Bmean)),subEnv))))),pos=4L);
text(0.3,1.53,parse(text=paste0('sigma == ',deparse(do.call(substitute,c(list(body(Bsd)),subEnv))))),pos=4L);
Solution
## fixed constants
Ainner <- 207L;
Binner <- 408L;
Amin <- Ainner*2L;
Bmin <- Binner*2L;
Amax <- Ainner*7L;
Bmax <- Binner*6L;
NPmin <- Amin+Bmin;
NPmax <- Amax+Bmax;
## normal mean and sd functions
Amean <- function(Asize) 1.5 + 6*(Asize-Amin)/(Amax-Amin);
Asd <- function(Asize) 1.3*exp(-(1.22*(2*Asize-Amin-Amax)/(Amax-Amin))^2);
Bmean <- function(Bsize) 1.5 + 5*(Bsize-Bmin)/(Bmax-Bmin);
Bsd <- function(Bsize) 1.3*exp(-(1.22*(2*Bsize-Bmin-Bmax)/(Bmax-Bmin))^2);
## primary implementation function
bgoldst <- function(NP,seed=NULL,check=F) {
if (!is.null(seed)) set.seed(seed);
## in order to parameterize the total number of ppl, must consider exactly which constraints impose which limits
## the A min 414 and max 1449 are fixed based on the 207 and 408 inner categories
## the B min 816 and max 2448 are also fixed for the same reason
## the mins cannot be changed by the parameterized total number of ppl
## moreover, we should validate that the total number of ppl is sufficient for all inner categories
## this requires 414+816 == 1230
if (NP<NPmin) stop(paste0('insufficient NP=',NP,'.'));
## additionally we should validate that the total number of ppl does not exceed the maximum possible that can be handled by the inner categories
## this is 1449+2448 == 3897
if (NP>NPmax) stop(paste0('excessive NP=',NP,'.'));
## the A max varies from 1449 down to 414, depending on NP
## the B max varies from 2448 down to 816, depending on NP
## so what we can do as the first step is calculate the maxes based on NP
AminCur <- max(Amin,NP-Bmax);
BminCur <- max(Bmin,NP-Amax);
AmaxCur <- min(Amax,NP-Bmin);
BmaxCur <- min(Bmax,NP-Amin);
## now we can select a random division from the available space
Asize <- if (AminCur==AmaxCur) AminCur else sample(AminCur:AmaxCur,1L);
Bsize <- NP-Asize;
## will use carefully designed sliding normal distributions to couple the probability distribution to the constraints
## see global functions for formulae
## randomly choose inner category sizes for A
## we know the exact number of inner categories we need, so choose that many inner category sizes using the normal dist
AG <- pmin(7L,pmax(2L,as.integer(rnorm(Ainner,Amean(Asize),Asd(Asize)))));
## iterate adding/removing one member at a time to get to the required size
AGsum <- sum(AG);
if (AGsum>Asize) {
while (AGsum>Asize) {
i <- which(AG>2L);
if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
AG[i] <- AG[i]-1L;
AGsum <- AGsum-1L;
}; ## end while
} else if (AGsum<Asize) {
while (AGsum<Asize) {
i <- which(AG<7L);
if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
AG[i] <- AG[i]+1L;
AGsum <- AGsum+1L;
}; ## end while
}; ## end if
## randomly choose inner category sizes for B
BG <- pmin(6L,pmax(2L,as.integer(rnorm(Binner,Bmean(Bsize),Bsd(Bsize)))));
## iterate adding/removing one member at a time to get to the required size
BGsum <- sum(BG);
if (BGsum>Bsize) {
while (BGsum>Bsize) {
i <- which(BG>2L);
if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
BG[i] <- BG[i]-1L;
BGsum <- BGsum-1L;
}; ## end while
} else if (BGsum<Bsize) {
while (BGsum<Bsize) {
i <- which(BG<6L);
if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
BG[i] <- BG[i]+1L;
BGsum <- BGsum+1L;
}; ## end while
}; ## end if
## combine into data.frame, randomly distributing the inner categories across inner category ids
res <- data.frame(
outer=rep(c('A','B'),c(Ainner,Binner)),
inner=c(1:Ainner,1:Binner),
num=c(sample(AG),sample(BG))
);
if (check) bgoldst.check(NP,res,seed);
res;
}; ## end bgoldst()
## validation check helper function
bgoldst.check <- function(NP,res,seed=NULL) {
seedStr <- if (is.null(seed)) 'NULL' else as.character(seed);
## A
with(res[res$outer=='A',],{
if (length(outer)!=Ainner) stop(paste0('outer category A has wrong number of inner categories ',length(outer),'!=',Ainner,' [',seedStr,'].'));
x <- num>=2L & num<=7L;
if (!all(x)) stop(paste0('outer category A has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
x <- sum(num);
if (!(x>=Amin && x<=Amax)) stop(paste0('outer category A has invalid size ',x,' [',seedStr,'].'));
});
## B
with(res[res$outer=='B',],{
if (length(outer)!=Binner) stop(paste0('outer category B has wrong number of inner categories ',length(outer),'!=',Binner,' [',seedStr,'].'));
x <- num>=2L & num<=6L;
if (!all(x)) stop(paste0('outer category B has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
x <- sum(num);
if (!(x>=Bmin && x<=Bmax)) stop(paste0('outer category B has invalid size ',x,' [',seedStr,'].'));
});
## all
with(res,{
x <- sum(num);
if (x!=NP) stop(paste0('result has invalid total size ',x,' [',seedStr,'].'));
});
}; ## end bgoldst.check()
## one-off demo
res <- bgoldst(1524L,1L,T);
head(res,10L); tail(res,10L);
## outer inner num
## 1 A 1 2
## 2 A 2 3
## 3 A 3 3
## 4 A 4 2
## 5 A 5 2
## 6 A 6 2
## 7 A 7 2
## 8 A 8 4
## 9 A 9 2
## 10 A 10 2
## outer inner num
## 606 B 399 3
## 607 B 400 2
## 608 B 401 4
## 609 B 402 2
## 610 B 403 2
## 611 B 404 2
## 612 B 405 6
## 613 B 406 2
## 614 B 407 2
## 615 B 408 5
table(res$outer,res$num);
##
## 2 3 4 5 6
## A 158 28 13 8 0
## B 282 68 33 18 7
## extensive testing
for (seed in seq_len(1e5L)) {
print(seed);
set.seed(seed);
bgoldst(sample(NPmin:NPmax,1L),NULL,T);
}; ## end for
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
##
## ... snip ... (all succeed, all fast)
##
## [1] 99996
## [1] 99997
## [1] 99998
## [1] 99999
## [1] 100000
First Attempt
(Original intro: Ok, unfortunately, there was some ambiguity in the OP's wording, specifically where he said "it does not matter if a group category is not used". I assumed that meant that inner categories could have zero members. My solution below was based on that premise. My assumption was incorrect, and that changes everything. I will leave my answer the way I wrote it as I work on a new solution.)
## primary implementation function
bgoldst <- function(seed=NULL,check=F) {
if (!is.null(seed)) set.seed(seed);
## divide 1524 into two outer categories -- sample the acceptable divisions
## notably, cannot allow only 1 person into either outer category
## also, cannot take more than 1449 ppl into A; most it can hold is 7*207 == 1449
## B can hold any number from zero to 1524
NHA <- sample(c(0L,2:1449),1L);
NHB <- 1524L-NHA;
## also, since 1449 would *require* 7 ppl in every category, must prep normal dist params
## specifically, will slide mean from 4.5 over towards (and past) 7, sd smaller the closer we are to 1449
Amean <- 4.5 + 3*NHA/1449;
Asd <- 1.5*exp(-(NHA/1e3)^1.6);
## divide A into 207 inner categories of 2:7 -- iterative sampling until valid
## should be very few iterations, since we over-append with high likelihood
AG <- integer();
if (NHA>0L) {
repeat {
AG <- c(AG,pmin(7L,pmax(2L,as.integer(rnorm(max(1,NHA/3),Amean,Asd)))));
## find last inner category
AGcum <- cumsum(AG);
AGLastIndex <- which(AGcum>=NHA)[1L];
if (!is.na(AGLastIndex)) { ## sufficient coverage
## also must guard against too lightly allocated to fit within inner category num limit
if (AGLastIndex>207L) {
AG <- integer(); ## hard reset
} else {
break; ## done
}; ## end if
}; ## end if
}; ## end repeat
## remove surplus inner categories and surplus in last inner category
length(AG) <- AGLastIndex;
AG[AGLastIndex] <- AG[AGLastIndex] - (AGcum[AGLastIndex]-NHA);
if (AG[AGLastIndex]==1L) { ## special case for last inner category remnant of 1; must even out against previous inner category
## also, can't join max inner category size since it would overflow the last group
## also, can't take 1 less than previous inner category size since then *it* would be left with 1
takeOpt <- setdiff(1:if (AG[AGLastIndex-1L]==7L) 5L else AG[AGLastIndex-1L],AG[AGLastIndex-1L]-1L);
take <- if (length(takeOpt)==1L) takeOpt else sample(takeOpt,1L); ## don't let sample()'s inconsistent behavior screw us
AG[AGLastIndex-1L] <- AG[AGLastIndex-1L]-take;
AG[AGLastIndex] <- AG[AGLastIndex]+take;
}; ## end if
}; ## end if
## divide Bs into 408 inner categories of 2:6 -- iterative sampling until valid
BG <- integer();
if (NHB>0L) {
repeat {
BG <- c(BG,sample(2:6,max(1,NHB/3),replace=T));
## find last inner category
BGcum <- cumsum(BG);
BGLastIndex <- which(BGcum>=NHB)[1L];
if (!is.na(BGLastIndex)) { ## sufficient coverage
## also must guard against too lightly allocated to fit within inner category num limit
if (BGLastIndex>408L) {
BG <- integer(); ## hard reset
} else {
break; ## done
}; ## end if
}; ## end if
}; ## end repeat
## remove surplus inner categories and surplus in last inner category
length(BG) <- BGLastIndex;
BG[BGLastIndex] <- BG[BGLastIndex] - (BGcum[BGLastIndex]-NHB);
if (BG[BGLastIndex]==1L) { ## special case for last inner category remnant of 1; must even out against previous inner category
## also, can't join max inner category size since it would overflow the last group
## also, can't take 1 less than previous inner category size since then *it* would be left with 1
takeOpt <- setdiff(1:if (BG[BGLastIndex-1L]==6L) 4L else BG[BGLastIndex-1L],BG[BGLastIndex-1L]-1L);
take <- if (length(takeOpt)==1L) takeOpt else sample(takeOpt,1L); ## don't let sample()'s inconsistent behavior screw us
BG[BGLastIndex-1L] <- BG[BGLastIndex-1L]-take;
BG[BGLastIndex] <- BG[BGLastIndex]+take;
}; ## end if
}; ## end if
## combine into data.frame, randomly distributing the inner categories across inner category ids
res <- data.frame(
outer=rep(c('A','B'),c(207L,408L)),
inner=c(1:207,1:408),
num=c(sample(c(AG,rep(0L,207L-length(AG)))),sample(c(BG,rep(0L,408L-length(BG)))))
);
if (check) bgoldst.check(res,seed);
res;
}; ## end bgoldst()
## validation check helper function
bgoldst.check <- function(res,seed=NULL) {
seedStr <- if (is.null(seed)) 'NULL' else as.character(seed);
## A
with(res[res$outer=='A',],{
if (length(outer)!=207L) stop(paste0('outer category A has wrong number of inner categories ',length(outer),'!=207 [',seedStr,'].'));
x <- num>=2L & num<=7L | num==0L;
if (!all(x)) stop(paste0('outer category A has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
x <- sum(num);
if (!(x>=0L && x<=1524L)) stop(paste0('outer category A has invalid size ',x,' [',seedStr,'].'));
});
## B
with(res[res$outer=='B',],{
if (length(outer)!=408L) stop(paste0('outer category B has wrong number of inner categories ',length(outer),'!=408 [',seedStr,'].'));
x <- num>=2L & num<=6L | num==0L;
if (!all(x)) stop(paste0('outer category B has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
x <- sum(num);
if (!(x>=0L && x<=1524L)) stop(paste0('outer category B has invalid size ',x,' [',seedStr,'].'));
});
## all
with(res,{
x <- sum(num);
if (x!=1524L) stop(paste0('result has invalid total size ',x,' [',seedStr,'].'));
});
}; ## end bgoldst.check()
## one-off demo
res <- bgoldst(1L,T);
head(res,10L); tail(res,10L);
## outer inner num
## 1 A 1 5
## 2 A 2 4
## 3 A 3 0
## 4 A 4 0
## 5 A 5 0
## 6 A 6 5
## 7 A 7 0
## 8 A 8 5
## 9 A 9 0
## 10 A 10 4
## outer inner num
## 606 B 399 3
## 607 B 400 5
## 608 B 401 5
## 609 B 402 0
## 610 B 403 6
## 611 B 404 0
## 612 B 405 5
## 613 B 406 2
## 614 B 407 0
## 615 B 408 0
table(res$outer,res$num);
##
## 0 2 3 4 5 6 7
## A 125 1 9 25 29 15 3
## B 116 71 57 54 50 60 0
## extensive testing
for (seed in seq_len(1e5L)) {
print(seed);
bgoldst(seed,T);
}; ## end for
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
##
## ... snip ... (all succeed, all fast)
##
## [1] 99996
## [1] 99997
## [1] 99998
## [1] 99999
## [1] 100000
Normal distribution behavior:
xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='AG',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Amean <- function(NHA) 4.5 + 3*NHA/1449;
Asd <- function(NHA) 1.5*exp(-(NHA/1e3)^1.6);
NHA.col <- data.frame(NHA=c(0,300,600,900,1200,1449),col=c('red','green','blue','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(NHA.col))) {
NHA <- NHA.col$NHA[ri];
col <- NHA.col$col[ri];
lines(x,dnorm(x,Amean(NHA),Asd(NHA)),col=col,lwd=2);
};
with(NHA.col,legend(-2.5,1.65,NHA,col,col,title=expression(bold(NHA))),cex=0.7);
text(-2.5,0.92,parse(text=paste0('mu == ',deparse(body(Amean)))),pos=4L);
text(-2.5,0.87,parse(text=paste0('sigma == ',deparse(body(Asd)))),pos=4L);
we can assign 2 people to each category in type A (207) and type B (408) and sample categories for the rest of the 1524 - 2*207 - 2*408 = 294 people
rm(list=ls())
options(stringsAsFactors=FALSE)
numPeeps <- 1524
aSize <- 207
bSize <- 408
aRange <- c(2,7)
bRange <- c(2,6)
vacancies <- c(paste(rep("A", diff(aRange)*aSize), rep(1:aSize, each=diff(aRange))),
paste(rep("B", diff(bRange)*bSize), rep(1:bSize, each=diff(bRange))))
#perform allocation by assigning 2 people to each category in type A (207) and type B (408) and sample categories for the rest of the 1524-2*207-2*408=294 people
allocation <- c(paste(rep("A", aRange[1]*aSize), rep(1:aSize, each=aRange[1])),
paste(rep("B", bRange[1]*bSize), rep(1:bSize, each=bRange[1])),
sample(vacancies, numPeeps - aRange[1]*aSize - bRange[1]*bSize))
#spit out results
library(dplyr)
library(stringi)
df <- setNames(data.frame(stri_split_fixed(allocation, " ", simplify=T)),
c("AB","Category"))
df$Category <- as.integer(df$Category)
houseDensity <- group_by(df, AB, Category) %>%
summarise(GroupSize = n()) %>%
ungroup() %>%
select(-Category) %>%
group_by(AB, GroupSize) %>%
summarise(Num.Groups = n()) %>%
mutate(Num.People = Num.Groups * GroupSize)
# Source: local data frame [9 x 4]
# Groups: AB
#
# AB GroupSize Num.Groups Num.People
# 1 A 2 113 226
# 2 A 3 68 204
# 3 A 4 21 84
# 4 A 5 4 20
# 5 A 6 1 6
# 6 B 2 266 532
# 7 B 3 118 354
# 8 B 4 22 88
# 9 B 5 2 10
sum(houseDensity[houseDensity$AB=="A", "Num.Groups"])
# [1] 207
sum(houseDensity[houseDensity$AB=="B", "Num.Groups"])
# [1] 408
sum(houseDensity$Num.People)
# [1] 1524
I need to analyze some simulated data with the following structure:
h c x1 y1 x1c10
1 0 37.607056431 104.83097593 5
1 1 27.615251557 140.85532974 10
1 0 34.68915314 114.59312842 2
1 1 30.090387454 131.60485642 9
1 1 39.274429397 106.76042522 10
1 0 33.839385007 122.73681319 2
...
where h ranges from 1 to 2500, and indexes the Monte Carlo sample, each sample with 1000 observations. I'm analysing these data with the following code that gives me two objects (fnN1, fdQB101):
mc<-2500 ##create loop index
fdN1<-matrix(0,mc,1000)
fnQB101 <- matrix(0,mc,1000) ##create 2500x1000 storage matrices, elements zero
for(j in 1:mc){
fdN1[j,] <- dnorm(residuals(lm(x1 ~ c,data=s[s$h==j,])),
mean(residuals(lm(x1 ~ c,data=s[s$h==j,]))),
sd(residuals(lm(x1 ~ c,data=s[s$h==j,]))))
x1c10<-as.matrix(subset(s,s$h==j,select=x1c10))
fdQB100 <- as.matrix(predict(polr(as.factor(x1c10) ~ c ,
method="logistic", data=s[s$h==j,]),
type="probs"))
indx10<- as.matrix(cbind(as.vector(seq(1:nrow(fdQB100))),x1c10))
fdQB101[j,] <- fdQB100[indx10]
}
The objects fdN1 and fdQB101 are 2500x1000 matrices with predicted probabilities as elements. I need to create a function out of this loop that I can call with lapply() or mclapply(). When I wrap this in the following function command:
ndMC <- function(mc){
for(j in 1:mc){
...
}
return(list(fdN1,fdQB101))
}
lapply(mc,ndMC)
the objects fdN1 and fdQB101 are each returned as 2500x1000 matrices of zeros, instead of the predicted probabilities. What am I doing wrong?
You should be able to do this with the data.table package. Here is an example:
library(data.table)
dt<-data.table(h=rep(1L,6), c=c(0L,1L,0L,1L,1L,0L),
X1=c(37.607056431,27.615251557,34.68915314,30.090387454,39.274429397,33.839385007),
y1=c(104.83097593,140.85532974,114.59312842,131.60485642,106.76042522,122.73681319),
x1c10=c(5L,10L,2L,9L,10L,2L))
## Create a linear model for every grouping of variable h:
fdN1.partial<-dt[,list(lm=list(lm(X1~c))),by="h"]
## Retrieve the linear model for h==1:
fdN1.partial[h==1,lm]
## [[1]]
##
## Call:
## lm(formula = X1 ~ c)
##
## Coefficients:
## (Intercept) c
## 35.379 -3.052
You could also write a function to generalize this solution:
f.dnorm<-function(y,x) {
f<-lm(y ~ x)
out<-list(dnorm(residuals(f), mean(residuals(f)), sd(residuals(f))))
return(out)
}
## Generate two dnorm lists for every grouping of variable h:
dt.lm<-dt[,list(dnormX11=list(f.dnorm(X1,rep(1,length(X1)))), dnormX1c=list(f.dnorm(X1,c))),by="h"]
## Retrieve one of the dnorm lists for h==1:
unlist(dt.lm[h==1,dnormX11])
## 1 2 3 4 5 6
## 0.06296194 0.03327407 0.08884549 0.06286739 0.04248756 0.09045784