Automatically try different initial values in optim

I use optim(.) to try to find the best fitting parameters for some function fn(dat, par, out=FALSE) where par must be a vector of two elements and out determines the output format. I use
optim(par=c(1,1), fn, dat=dat)
to identify the best-fitting values of par. Depending on the data in dat, this either works or throws an error that
function cannot be evaluated at initial parameters
which I understand means optim(.) needs different starting values. My problem is that I apply the function to many data sets in parallel, and I wonder whether I really need to try different values by hand or whether there is some way of automating this along the lines of
if no error then great
if error try par=c(0.5,1)
if no error then great
if error try par=c(0.5,0.5)
...

You could run a grid search before you start and discard the parameter combinations that evaluate to NA. Here is an example.
A test function:
fn <- function(x) {
    if (x[1] < 0)
        NA
    else
        prod(x)
}
Now run a grid search.
library("NMOF")
res <- gridSearch(fn,
                  npar = 2,   ## length of x
                  lower = -1, ## lower bound for x
                  upper = 3,  ## upper bound for x
                  n = 5)      ## number of levels per element in x
## 2 variables with 5, 5 levels: 25 function evaluations required.
The function shows you all the parameter combinations it tried.
res$levels
## [[1]]
## [1] -1 -1
##
## [[2]]
## [1] 0 -1
##
## [[3]]
## [1] 1 -1
##
## ....
And it provides the objective function values associated with these combinations.
res$values
## [1] NA 0 -1 -2 -3 NA 0 0 0 0 NA 0 1 2 3
## [16] NA 0 2 4 6 NA 0 3 6 9
## => many objective functions values are NA
The best (non-NA) solution:
res$minlevels
## [1] 3 -1
## => your starting value for optim:
##
## optim(gridSearch(fn, npar = 2,
## lower = -1, upper = 3, n = 5)$minlevels,
## fn, dat = dat)
Of course, this is no guarantee that at least one non-NA vector is found, but the chances should improve.
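Alternatively, if you want to automate the literal retry loop from the question, here is a minimal sketch (my addition; it assumes fn signals an error at bad starting values, as in the question) that wraps each optim() call in tryCatch() and stops at the first start that works:
start_values <- list(c(1, 1), c(0.5, 1), c(0.5, 0.5)) ## candidate starts to try in order
fit <- NULL
for (par0 in start_values) {
    fit <- tryCatch(optim(par = par0, fn, dat = dat),
                    error = function(e) NULL) ## NULL means "try the next start"
    if (!is.null(fit)) break
}
if (is.null(fit)) stop("no starting value worked")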

R compiler options for optimization

Prompted by some code I found on the internet, I began to wonder whether there is an option one can set to make the bytecode compiler optimize more. The documentation of compiler::cmpfun says there is an optimization level that can be set via the options argument (but how?), and that it already defaults to 2 on a scale from 0 to 3, which seems rather high.
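(For reference, going by my reading of ?compiler::cmpfun rather than anything in the original post, the level is passed as the optimize entry of the options list, or set globally:)
g <- function(x) x + 1
g3 <- compiler::cmpfun(g, options = list(optimize = 3L)) ## per-function level
compiler::setCompilerOptions(optimize = 3L)              ## global default from here on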
But in my example, optimization seems very poor. The display below is just some post-processing around the output of the compiler::disassemble function, to make things prettier. As the disassembly shows, the compiled code tries to do z <- z, which doesn't make much sense. It could make sense if z were an active binding, but that is obviously not the case here.
f <- function (x) {z <- 0; z <- if (x==42) 0 else z; x}
f <- compiler::cmpfun(f)
purrr::walk(dasm(f),function(x) message(paste(x,collapse=' ')))
## LDCONST.OP 1 # the constant 0
## SETVAR.OP 3 # z bound to it (3 must be some kind of location for z)
## POP.OP # ignore result of <-
## GETVAR.OP 5 # the value of x
## LDCONST.OP 7 # the constant 42
## EQ.OP 8 # test for equality
## BRIFNOT.OP 9 19 # if not eq, goto step (2)
## LDCONST.OP 10 # else (eq), the constant 0
## GOTO.OP 21 # and goto step (3)
## GETVAR.OP 3 # step 2 (not eq): get the value of z
## SETVAR.OP 3 # and set z to be a reference
## POP.OP # and ignore the result of <-
## GETVAR.OP 5 # step 3: get the value of x
## RETURN.OP # and return it as the result of f
I saw other related questions on SO, but none with even the beginning of an answer to mine, probably because those answers are a bit old, from an era when compiling was not so common...

Randomly assign people into different size groups and category

I need to assign people randomly into groups and a category. Unfortunately, I don’t really know where to start with this. I have tried to explain my problem using the example below. Any help with this would be very much appreciated.
I have 207 ‘Home Type A’ and 408 ‘Home Type B’ categories. In total there are 1524 people who need to be assigned to either the 207 Home Type A categories or the 408 Home Type B categories. However, the 1524 people also need to be grouped together, in groups of 2 to 7 for Home Type A or groups of 2 to 6 for Home Type B.
The end result should be 1524 people assigned to 207 groups (containing between 2 and 7 people) and 408 groups (containing between 2 and 6 people).
The assignment to groups has to be random and can use any combination of group sizes, as it does not matter if a group category is not used (for example, it would be fine if the 207 groups for Home Type A contained only 2, 3 or 4 people one time, or only 5 and 7 people another).
I imagine an output that would look something like this:
GroupSize <- c(2:7)
Num.Groups <- 0
Num.People <- 0
HouseTypeA <- data.frame(GroupSize, Num.Groups, Num.People)
GroupSize <- c(2:6)
HouseTypeB <- data.frame(GroupSize, Num.Groups, Num.People)
With the 'Num.Groups' column summing to 207 or 408 respectively, and the sum of 'Num.People' across the two data frames being 1524.
I tried this in 3 steps:
1. Make a list of house types.
2. Distribute each person to a house, checking that there is space (max = 7 for type a and max = 6 for type b).
3. Check that each house has the minimum of 2 people. If not, grab a person from another house without allowing that house's number of people to drop below 2.
homeType=rep(c("a", "b"), times=c(207, 408))
H <- vector(mode="list", length(homeType))
for(i in seq(H)){
    H[[i]]$type <- homeType[i]
    H[[i]]$n <- 0
}
H
# Place people in houses up to max number of people
npeople <- 1524
for(i in seq(npeople)){
    placed_in_house <- FALSE
    while(!placed_in_house){
        house_num <- sample(length(H), 1)
        if(H[[house_num]]$type == "a"){
            if(H[[house_num]]$n < 7){
                H[[house_num]]$n <- H[[house_num]]$n + 1
                placed_in_house <- TRUE
            }
        }
        if(H[[house_num]]$type == "b"){
            if(H[[house_num]]$n < 6){
                H[[house_num]]$n <- H[[house_num]]$n + 1
                placed_in_house <- TRUE
            }
        }
    }
}
H
hist(unlist(lapply(H, function(x)x$n)))
# move people around to get up to min number of people
for(i in seq(H)){
    while(H[[i]]$n < 2){
        knock_on_door <- sample(length(H), 1)
        if(H[[knock_on_door]]$n > 2){
            H[[i]]$n <- H[[i]]$n + 1 # house i takes 1 person
            H[[knock_on_door]]$n <- H[[knock_on_door]]$n - 1 # house knock_on_door loses 1 person
        }
    }
}
H
Ha <- H[which(lapply(H, function(x){x$type}) == "a")]
Hb <- H[which(lapply(H, function(x){x$type}) == "b")]
op <- par(mfcol=c(1,2))
breaks=2:7
hist(unlist(lapply(Ha, function(x)x$n)), breaks=breaks, col=8, xlab="people per house", main="type A") # 2:7
hist(unlist(lapply(Hb, function(x)x$n)), breaks=breaks, col=8, xlab="people per house", main="type B") # 2:6
par(op)
sum(unlist(lapply(Ha, function(x)x$n))) + sum(unlist(lapply(Hb, function(x)x$n)))
# [1] 1524
Houses <- data.frame(
    Num.Groups = seq(H),
    type = unlist(lapply(H, function(x){x$type})),
    Num.People = unlist(lapply(H, function(x){x$n}))
)
head(Houses)
As you can see, the total numbers and distribution of people per house works out.
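A quick tabulation (my addition, not part of the original answer) makes the same point numerically; every house should land within its allowed range:
table(sapply(Ha, function(x) x$n)) ## all type "a" counts fall in 2:7
table(sapply(Hb, function(x) x$n)) ## all type "b" counts fall in 2:6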
There are two outer categories, A and B. Outer category A always has 207 inner categories each of size in 2:7, outer category B always has 408 inner categories each of size in 2:6.
This means:
The minimum number of people needed in outer category A must be 207*2 == 414.
The maximum number of people possible in outer category A is 207*7 == 1449.
The minimum number of people needed in outer category B must be 408*2 == 816.
The maximum number of people possible in outer category B is 408*6 == 2448.
We can further conclude:
The minimum number of people needed in total is 414+816 == 1230.
The maximum number of people possible in total is 1449+2448 == 3897.
I've captured these values as constants assigned at the start of my solution, copied here for reference:
Ainner <- 207L;
Binner <- 408L;
Amin <- Ainner*2L;
Bmin <- Binner*2L;
Amax <- Ainner*7L;
Bmax <- Binner*6L;
NPmin <- Amin+Bmin;
NPmax <- Amax+Bmax;
I designed my solution around the idea that we want to minimize the amount of looping that we need to do in order to find a suitable allocation of people to categories. We can get closer to that goal by initially computing a random division of the total number of people into the two outer categories A and B.
Once that is done, we must select random sizes for each inner category such that the outer category totals hit our division numbers exactly. This is not an easy task; I have not been able to think of a way of using the PRNG primitives to instantaneously produce values that satisfy the constraints. I think we need to iterate.
My solution for minimizing looping was to derive normal distributions whose parameters (mean and standard deviation) are themselves functions of the selected outer category sizes (Asize and Bsize in the code). The distributions are contrived so that they are likely to produce values whose sum is very close to the required total for each outer category, while still providing significant randomness in the inner category size selections. We can then run a loop, making one increment or decrement per iteration as necessary, until we arrive at the required total. The loop is still necessary, but the number of iterations is minimized.
The form of the mean equations follows this pattern:
{extended-min} + {extended-range}*({size}-{min})/({max}-{min})
{extended-min} is the bottom end of a range that extends just outside the inner category size range. For example, for outer category A the inner category size range is 2:7, and my extended range is 1.5:7.5. I used an extended range because I wanted to permit some variance in the random variates produced at the extreme ends of the function, even though they should theoretically approach a limit at the ends of the inner category size range. It made the derivation and manipulation of the formula easier than trying to make it reach the theoretically ideal limit at each end, and it's aesthetically more pleasing to plot it this way. Note that the possibility (in fact, certainty) that some of the normal deviates will fall outside the inner category size range is not a problem, because I cap the values at the endpoints with pmin() and pmax().
{extended-range} is the full extended range, e.g. 6 for outer category A.
{size} is the size that was randomly chosen for the outer category.
{min} and {max} are the minimum and maximum permissible values of the outer category size.
These are the actual equations I derived:
Amean <- function(Asize) 1.5 + 6*(Asize-Amin)/(Amax-Amin);
Bmean <- function(Bsize) 1.5 + 5*(Bsize-Bmin)/(Bmax-Bmin);
The form of the standard deviation equations follows this pattern:
{mult1}*exp(-({mult2}*(2*{size}-{min}-{max})/({max}-{min}))^2)
{mult1} and {mult2} are simply multipliers that I fiddled with to get an intuitive behavior of the final normal distributions, based on the goal of mapping the distributions to the required inner category allocations as closely as possible.
{size} is the same as before.
{min} and {max} are the same as before.
The rationale for the standard deviation form is that the standard deviation will be symmetric about the midpoint of the valid size range, producing a wide normal distribution for middleground sizes, and becoming more narrow towards the extreme ends. Note that the quotient in the exponential is equivalent to this:
({size}-({min}+{max})/2)/(({max}-{min})/2)
Hence it's the deviation of the size from the midpoint of the inner category size range, divided by half the range. That provides a domain of [-1,1], which is then multiplied by {mult2} and squared. The resulting negative exponent gets very large for extreme values, causing the entire exponential to become very small. This small standard deviation is what gives the normal distribution its narrowness towards the ends of the inner category size range.
These are the actual equations:
Asd <- function(Asize) 1.3*exp(-(1.22*(2*Asize-Amin-Amax)/(Amax-Amin))^2);
Bsd <- function(Bsize) 1.3*exp(-(1.22*(2*Bsize-Bmin-Bmax)/(Bmax-Bmin))^2);
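As a quick sanity check (my addition, not part of the original answer), the mean sweeps across the extended range while the sd collapses towards the endpoints:
Amean(Amin); Amean(Amax)                  ## 1.5 and 7.5, the extended endpoints
Asd(Amin); Asd((Amin+Amax)/2); Asd(Amax)  ## roughly 0.29 at the ends, 1.3 at the midpoint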
Here's some code I wrote to nicely visualize the normal distributions:
Outer Category A
xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='Inner Category Size',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Asize.col <- data.frame(Asize=trunc(seq(Amin,Amax,len=7L)),col=c('red','green','blue','brown','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(Asize.col))) {
    Asize <- Asize.col$Asize[ri];
    col <- Asize.col$col[ri];
    lines(x,dnorm(x,Amean(Asize),Asd(Asize)),col=col,lwd=2);
};
with(Asize.col,legend(-2.5,1.65,Asize,col,col,title=expression(bold(Asize)),cex=0.7));
subEnv <- as.environment(mget(c('Amin','Amax')));
text(0.5,1.6,parse(text=paste0('mu == ',deparse(do.call(substitute,c(list(body(Amean)),subEnv))))),pos=4L);
text(0.5,1.53,parse(text=paste0('sigma == ',deparse(do.call(substitute,c(list(body(Asd)),subEnv))))),pos=4L);
Outer Category B
xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='Inner Category Size',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Bsize.col <- data.frame(Bsize=trunc(seq(Bmin,Bmax,len=7L)),col=c('red','green','blue','brown','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(Bsize.col))) {
    Bsize <- Bsize.col$Bsize[ri];
    col <- Bsize.col$col[ri];
    lines(x,dnorm(x,Bmean(Bsize),Bsd(Bsize)),col=col,lwd=2);
};
with(Bsize.col,legend(-2.5,1.65,Bsize,col,col,title=expression(bold(Bsize)),cex=0.7));
subEnv <- as.environment(mget(c('Bmin','Bmax')));
text(0.3,1.6,parse(text=paste0('mu == ',deparse(do.call(substitute,c(list(body(Bmean)),subEnv))))),pos=4L);
text(0.3,1.53,parse(text=paste0('sigma == ',deparse(do.call(substitute,c(list(body(Bsd)),subEnv))))),pos=4L);
Solution
## fixed constants
Ainner <- 207L;
Binner <- 408L;
Amin <- Ainner*2L;
Bmin <- Binner*2L;
Amax <- Ainner*7L;
Bmax <- Binner*6L;
NPmin <- Amin+Bmin;
NPmax <- Amax+Bmax;
## normal mean and sd functions
Amean <- function(Asize) 1.5 + 6*(Asize-Amin)/(Amax-Amin);
Asd <- function(Asize) 1.3*exp(-(1.22*(2*Asize-Amin-Amax)/(Amax-Amin))^2);
Bmean <- function(Bsize) 1.5 + 5*(Bsize-Bmin)/(Bmax-Bmin);
Bsd <- function(Bsize) 1.3*exp(-(1.22*(2*Bsize-Bmin-Bmax)/(Bmax-Bmin))^2);
## primary implementation function
bgoldst <- function(NP,seed=NULL,check=F) {
    if (!is.null(seed)) set.seed(seed);
    ## in order to parameterize the total number of ppl, must consider exactly which constraints impose which limits
    ## the A min 414 and max 1449 are fixed based on the 207 and 408 inner categories
    ## the B min 816 and max 2448 are also fixed for the same reason
    ## the mins cannot be changed by the parameterized total number of ppl
    ## moreover, we should validate that the total number of ppl is sufficient for all inner categories
    ## this requires 414+816 == 1230
    if (NP<NPmin) stop(paste0('insufficient NP=',NP,'.'));
    ## additionally we should validate that the total number of ppl does not exceed the maximum possible that can be handled by the inner categories
    ## this is 1449+2448 == 3897
    if (NP>NPmax) stop(paste0('excessive NP=',NP,'.'));
    ## the A max varies from 1449 down to 414, depending on NP
    ## the B max varies from 2448 down to 816, depending on NP
    ## so what we can do as the first step is calculate the maxes based on NP
    AminCur <- max(Amin,NP-Bmax);
    BminCur <- max(Bmin,NP-Amax);
    AmaxCur <- min(Amax,NP-Bmin);
    BmaxCur <- min(Bmax,NP-Amin);
    ## now we can select a random division from the available space
    Asize <- if (AminCur==AmaxCur) AminCur else sample(AminCur:AmaxCur,1L);
    Bsize <- NP-Asize;
    ## will use carefully designed sliding normal distributions to couple the probability distribution to the constraints
    ## see global functions for formulae
    ## randomly choose inner category sizes for A
    ## we know the exact number of inner categories we need, so choose that many inner category sizes using the normal dist
    AG <- pmin(7L,pmax(2L,as.integer(rnorm(Ainner,Amean(Asize),Asd(Asize)))));
    ## iterate adding/removing one member at a time to get to the required size
    AGsum <- sum(AG);
    if (AGsum>Asize) {
        while (AGsum>Asize) {
            i <- which(AG>2L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            AG[i] <- AG[i]-1L;
            AGsum <- AGsum-1L;
        }; ## end while
    } else if (AGsum<Asize) {
        while (AGsum<Asize) {
            i <- which(AG<7L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            AG[i] <- AG[i]+1L;
            AGsum <- AGsum+1L;
        }; ## end while
    }; ## end if
    ## randomly choose inner category sizes for B
    BG <- pmin(6L,pmax(2L,as.integer(rnorm(Binner,Bmean(Bsize),Bsd(Bsize)))));
    ## iterate adding/removing one member at a time to get to the required size
    BGsum <- sum(BG);
    if (BGsum>Bsize) {
        while (BGsum>Bsize) {
            i <- which(BG>2L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            BG[i] <- BG[i]-1L;
            BGsum <- BGsum-1L;
        }; ## end while
    } else if (BGsum<Bsize) {
        while (BGsum<Bsize) {
            i <- which(BG<6L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            BG[i] <- BG[i]+1L;
            BGsum <- BGsum+1L;
        }; ## end while
    }; ## end if
    ## combine into data.frame, randomly distributing the inner categories across inner category ids
    res <- data.frame(
        outer=rep(c('A','B'),c(Ainner,Binner)),
        inner=c(1:Ainner,1:Binner),
        num=c(sample(AG),sample(BG))
    );
    if (check) bgoldst.check(NP,res,seed);
    res;
}; ## end bgoldst()
## validation check helper function
bgoldst.check <- function(NP,res,seed=NULL) {
    seedStr <- if (is.null(seed)) 'NULL' else as.character(seed);
    ## A
    with(res[res$outer=='A',],{
        if (length(outer)!=Ainner) stop(paste0('outer category A has wrong number of inner categories ',length(outer),'!=',Ainner,' [',seedStr,'].'));
        x <- num>=2L & num<=7L;
        if (!all(x)) stop(paste0('outer category A has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=Amin && x<=Amax)) stop(paste0('outer category A has invalid size ',x,' [',seedStr,'].'));
    });
    ## B
    with(res[res$outer=='B',],{
        if (length(outer)!=Binner) stop(paste0('outer category B has wrong number of inner categories ',length(outer),'!=',Binner,' [',seedStr,'].'));
        x <- num>=2L & num<=6L;
        if (!all(x)) stop(paste0('outer category B has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=Bmin && x<=Bmax)) stop(paste0('outer category B has invalid size ',x,' [',seedStr,'].'));
    });
    ## all
    with(res,{
        x <- sum(num);
        if (x!=NP) stop(paste0('result has invalid total size ',x,' [',seedStr,'].'));
    });
}; ## end bgoldst.check()
## one-off demo
res <- bgoldst(1524L,1L,T);
head(res,10L); tail(res,10L);
## outer inner num
## 1 A 1 2
## 2 A 2 3
## 3 A 3 3
## 4 A 4 2
## 5 A 5 2
## 6 A 6 2
## 7 A 7 2
## 8 A 8 4
## 9 A 9 2
## 10 A 10 2
## outer inner num
## 606 B 399 3
## 607 B 400 2
## 608 B 401 4
## 609 B 402 2
## 610 B 403 2
## 611 B 404 2
## 612 B 405 6
## 613 B 406 2
## 614 B 407 2
## 615 B 408 5
table(res$outer,res$num);
##
## 2 3 4 5 6
## A 158 28 13 8 0
## B 282 68 33 18 7
## extensive testing
for (seed in seq_len(1e5L)) {
    print(seed);
    set.seed(seed);
    bgoldst(sample(NPmin:NPmax,1L),NULL,T);
}; ## end for
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
##
## ... snip ... (all succeed, all fast)
##
## [1] 99996
## [1] 99997
## [1] 99998
## [1] 99999
## [1] 100000
First Attempt
(Original intro: Ok, unfortunately, there was some ambiguity in the OP's wording, specifically where he said "it does not matter if a group category is not used". I assumed that meant that inner categories could have zero members, and my solution below was based on that premise. My assumption was incorrect, and that changes everything. I have left this answer the way I wrote it; the corrected solution is given above.)
## primary implementation function
bgoldst <- function(seed=NULL,check=F) {
    if (!is.null(seed)) set.seed(seed);
    ## divide 1524 into two outer categories -- sample the acceptable divisions
    ## notably, cannot allow only 1 person into either outer category
    ## also, cannot take more than 1449 ppl into A; most it can hold is 7*207 == 1449
    ## B can hold any number from zero to 1524
    NHA <- sample(c(0L,2:1449),1L);
    NHB <- 1524L-NHA;
    ## also, since 1449 would *require* 7 ppl in every category, must prep normal dist params
    ## specifically, will slide mean from 4.5 over towards (and past) 7, sd smaller the closer we are to 1449
    Amean <- 4.5 + 3*NHA/1449;
    Asd <- 1.5*exp(-(NHA/1e3)^1.6);
    ## divide A into 207 inner categories of 2:7 -- iterative sampling until valid
    ## should be very few iterations, since we over-append with high likelihood
    AG <- integer();
    if (NHA>0L) {
        repeat {
            AG <- c(AG,pmin(7L,pmax(2L,as.integer(rnorm(max(1,NHA/3),Amean,Asd)))));
            ## find last inner category
            AGcum <- cumsum(AG);
            AGLastIndex <- which(AGcum>=NHA)[1L];
            if (!is.na(AGLastIndex)) { ## sufficient coverage
                ## also must guard against too lightly allocated to fit within inner category num limit
                if (AGLastIndex>207L) {
                    AG <- integer(); ## hard reset
                } else {
                    break; ## done
                }; ## end if
            }; ## end if
        }; ## end repeat
        ## remove surplus inner categories and surplus in last inner category
        length(AG) <- AGLastIndex;
        AG[AGLastIndex] <- AG[AGLastIndex] - (AGcum[AGLastIndex]-NHA);
        if (AG[AGLastIndex]==1L) { ## special case for last inner category remnant of 1; must even out against previous inner category
            ## also, can't join max inner category size since it would overflow the last group
            ## also, can't take 1 less than previous inner category size since then *it* would be left with 1
            takeOpt <- setdiff(1:if (AG[AGLastIndex-1L]==7L) 5L else AG[AGLastIndex-1L],AG[AGLastIndex-1L]-1L);
            take <- if (length(takeOpt)==1L) takeOpt else sample(takeOpt,1L); ## don't let sample()'s inconsistent behavior screw us
            AG[AGLastIndex-1L] <- AG[AGLastIndex-1L]-take;
            AG[AGLastIndex] <- AG[AGLastIndex]+take;
        }; ## end if
    }; ## end if
    ## divide Bs into 408 inner categories of 2:6 -- iterative sampling until valid
    BG <- integer();
    if (NHB>0L) {
        repeat {
            BG <- c(BG,sample(2:6,max(1,NHB/3),replace=T));
            ## find last inner category
            BGcum <- cumsum(BG);
            BGLastIndex <- which(BGcum>=NHB)[1L];
            if (!is.na(BGLastIndex)) { ## sufficient coverage
                ## also must guard against too lightly allocated to fit within inner category num limit
                if (BGLastIndex>408L) {
                    BG <- integer(); ## hard reset
                } else {
                    break; ## done
                }; ## end if
            }; ## end if
        }; ## end repeat
        ## remove surplus inner categories and surplus in last inner category
        length(BG) <- BGLastIndex;
        BG[BGLastIndex] <- BG[BGLastIndex] - (BGcum[BGLastIndex]-NHB);
        if (BG[BGLastIndex]==1L) { ## special case for last inner category remnant of 1; must even out against previous inner category
            ## also, can't join max inner category size since it would overflow the last group
            ## also, can't take 1 less than previous inner category size since then *it* would be left with 1
            takeOpt <- setdiff(1:if (BG[BGLastIndex-1L]==6L) 4L else BG[BGLastIndex-1L],BG[BGLastIndex-1L]-1L);
            take <- if (length(takeOpt)==1L) takeOpt else sample(takeOpt,1L); ## don't let sample()'s inconsistent behavior screw us
            BG[BGLastIndex-1L] <- BG[BGLastIndex-1L]-take;
            BG[BGLastIndex] <- BG[BGLastIndex]+take;
        }; ## end if
    }; ## end if
    ## combine into data.frame, randomly distributing the inner categories across inner category ids
    res <- data.frame(
        outer=rep(c('A','B'),c(207L,408L)),
        inner=c(1:207,1:408),
        num=c(sample(c(AG,rep(0L,207L-length(AG)))),sample(c(BG,rep(0L,408L-length(BG)))))
    );
    if (check) bgoldst.check(res,seed);
    res;
}; ## end bgoldst()
## validation check helper function
bgoldst.check <- function(res,seed=NULL) {
    seedStr <- if (is.null(seed)) 'NULL' else as.character(seed);
    ## A
    with(res[res$outer=='A',],{
        if (length(outer)!=207L) stop(paste0('outer category A has wrong number of inner categories ',length(outer),'!=207 [',seedStr,'].'));
        x <- num>=2L & num<=7L | num==0L;
        if (!all(x)) stop(paste0('outer category A has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=0L && x<=1524L)) stop(paste0('outer category A has invalid size ',x,' [',seedStr,'].'));
    });
    ## B
    with(res[res$outer=='B',],{
        if (length(outer)!=408L) stop(paste0('outer category B has wrong number of inner categories ',length(outer),'!=408 [',seedStr,'].'));
        x <- num>=2L & num<=6L | num==0L;
        if (!all(x)) stop(paste0('outer category B has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=0L && x<=1524L)) stop(paste0('outer category B has invalid size ',x,' [',seedStr,'].'));
    });
    ## all
    with(res,{
        x <- sum(num);
        if (x!=1524L) stop(paste0('result has invalid total size ',x,' [',seedStr,'].'));
    });
}; ## end bgoldst.check()
## one-off demo
res <- bgoldst(1L,T);
head(res,10L); tail(res,10L);
## outer inner num
## 1 A 1 5
## 2 A 2 4
## 3 A 3 0
## 4 A 4 0
## 5 A 5 0
## 6 A 6 5
## 7 A 7 0
## 8 A 8 5
## 9 A 9 0
## 10 A 10 4
## outer inner num
## 606 B 399 3
## 607 B 400 5
## 608 B 401 5
## 609 B 402 0
## 610 B 403 6
## 611 B 404 0
## 612 B 405 5
## 613 B 406 2
## 614 B 407 0
## 615 B 408 0
table(res$outer,res$num);
##
## 0 2 3 4 5 6 7
## A 125 1 9 25 29 15 3
## B 116 71 57 54 50 60 0
## extensive testing
for (seed in seq_len(1e5L)) {
    print(seed);
    bgoldst(seed,T);
}; ## end for
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
##
## ... snip ... (all succeed, all fast)
##
## [1] 99996
## [1] 99997
## [1] 99998
## [1] 99999
## [1] 100000
Normal distribution behavior:
xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='AG',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Amean <- function(NHA) 4.5 + 3*NHA/1449;
Asd <- function(NHA) 1.5*exp(-(NHA/1e3)^1.6);
NHA.col <- data.frame(NHA=c(0,300,600,900,1200,1449),col=c('red','green','blue','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(NHA.col))) {
    NHA <- NHA.col$NHA[ri];
    col <- NHA.col$col[ri];
    lines(x,dnorm(x,Amean(NHA),Asd(NHA)),col=col,lwd=2);
};
with(NHA.col,legend(-2.5,1.65,NHA,col,col,title=expression(bold(NHA)),cex=0.7));
text(-2.5,0.92,parse(text=paste0('mu == ',deparse(body(Amean)))),pos=4L);
text(-2.5,0.87,parse(text=paste0('sigma == ',deparse(body(Asd)))),pos=4L);
We can assign 2 people to each category in type A (207) and type B (408), then sample categories for the remaining 1524 - 2*207 - 2*408 = 294 people:
rm(list=ls())
options(stringsAsFactors=FALSE)
numPeeps <- 1524
aSize <- 207
bSize <- 408
aRange <- c(2,7)
bRange <- c(2,6)
vacancies <- c(paste(rep("A", diff(aRange)*aSize), rep(1:aSize, each=diff(aRange))),
paste(rep("B", diff(bRange)*bSize), rep(1:bSize, each=diff(bRange))))
#perform allocation by assigning 2 people to each category in type A (207) and type B (408) and sample categories for the rest of the 1524-2*207-2*408=294 people
allocation <- c(paste(rep("A", aRange[1]*aSize), rep(1:aSize, each=aRange[1])),
paste(rep("B", bRange[1]*bSize), rep(1:bSize, each=bRange[1])),
sample(vacancies, numPeeps - aRange[1]*aSize - bRange[1]*bSize))
#spit out results
library(dplyr)
library(stringi)
df <- setNames(data.frame(stri_split_fixed(allocation, " ", simplify=T)),
c("AB","Category"))
df$Category <- as.integer(df$Category)
houseDensity <- group_by(df, AB, Category) %>%
  summarise(GroupSize = n()) %>%
  ungroup() %>%
  select(-Category) %>%
  group_by(AB, GroupSize) %>%
  summarise(Num.Groups = n()) %>%
  mutate(Num.People = Num.Groups * GroupSize)
# Source: local data frame [9 x 4]
# Groups: AB
#
# AB GroupSize Num.Groups Num.People
# 1 A 2 113 226
# 2 A 3 68 204
# 3 A 4 21 84
# 4 A 5 4 20
# 5 A 6 1 6
# 6 B 2 266 532
# 7 B 3 118 354
# 8 B 4 22 88
# 9 B 5 2 10
sum(houseDensity[houseDensity$AB=="A", "Num.Groups"])
# [1] 207
sum(houseDensity[houseDensity$AB=="B", "Num.Groups"])
# [1] 408
sum(houseDensity$Num.People)
# [1] 1524
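Note (my addition): the size bounds hold by construction. Each type A category starts with its 2 base people and has only diff(aRange) = 5 vacancy slots, so it can reach at most 7; each type B category has diff(bRange) = 4 slots, so at most 6. Two quick checks:
min(table(allocation)) >= 2 ## TRUE: every category keeps its base 2 people
max(table(allocation)) <= 7 ## TRUE: no category exceeds its capacity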

Find the parameters provided to an unknown vectorized function

I am trying to create a scoring function (called evalFunc). To get a score, I calculate the R-squared value of a generated model. How do I know how the values are being passed to evalFunc from within the rbga.bin function?
library(genalg)
library(ggplot2)
set.seed(1)
df_factored<-data.frame(colA=runif(30),colB=runif(30),colC=runif(30),colD=runif(30))
dataset <- colnames(df_factored)[2:length(df_factored)]
chromosome = sample(c(0,1),length(dataset),replace=T)
#dataset[chromosome == 1]
evalFunc <- function(x) {
    # My end goal is to have the values passed into evalFunc be evaluated as the IV's in a linear model
    res <- summary(lm(as.numeric(colA) ~ x,
                      data=df_factored))$r.squared
    return(res)
}
iter = 10
GAmodel <- rbga.bin(size = 2, popSize = 200, iters = iter, mutationChance = 0.01, elitism = T, evalFunc = evalFunc)
cat(summary(GAmodel))
You can view the source by typing rbga.bin, but better than that, you can run debug(rbga.bin); the next time you call the function, R lets you step through it line by line. In this case, the first place your evalFunc is reached is this line (approximately line 82 of the function):
evalVals[object] = evalFunc(population[object, ])
At this point, population is a 200x2 matrix consisting of 0s and 1s:
head(population)
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 0
# [5,] 0 1
# [6,] 1 0
And object is the number 1, so population[object,] is the vector c(0,1).
When you've finished with debug you can undebug(rbga.bin) and it won't go into debug mode every time you call rbga.bin.
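A small convenience (my addition, not part of the original answer): debugonce() flags a function for a single debugging session, so there is nothing to remember to undo afterwards.
debugonce(rbga.bin) ## debug only the next call; no undebug() needed
GAmodel <- rbga.bin(size = 2, popSize = 200, iters = 10,
                    mutationChance = 0.01, elitism = T, evalFunc = evalFunc)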

Calculate autocorrelation with lag u in R

Hi, I am trying to calculate the autocorrelation at lags u, u = 1...9.
I expect nine autocorrelation values, one per lag. However, this code always gives me ten values, with the first term equal to 1. I am not sure how to proceed.
# initialize a vector to store autocovariance
maxlag <- 9
varstore <- rep(NA,maxlag)
# Calculate Variance
varstore[1] <- sd(as.vector(sample1),na.rm=T)^2
# Estimate autocovariances for all residuals
for (lag in 1:maxlag)
    varstore[lag+1] <- mean(sample1[,1:(10-lag)] *
                            sample1[,(lag+1):10], na.rm=T)
print(round(varstore,3))
# calculate autocorrelations
corrstore <- varstore/varstore[1]
print(corrstore)
And this is what I get:
[1] 1.0000000 0.6578243 0.5670389 0.5292314 0.5090411 0.4743944 0.4841038 0.4756297
[9] 0.4275208 0.4048436
You get a vector of length 10 because assigning past the end of a vector grows it. For lag = maxlag (the last step of your for loop),
varstore[lag+1]
writes to index 10, one past the end of your length-9 vector, so a new entry is created. (And since corrstore is varstore divided by varstore[1], its first element is the lag-0 autocorrelation, which is always 1.) To see the auto-extension clearly, try this example:
v <- NA ## a vector of length 1
v[10] <- 2
v
[1] NA NA NA NA NA NA NA NA NA 2 ## you get a vector of length 10!!
That said, why do you want a vector of length 9? Why not use the acf function? Here is the output of the acf function:
length(acf(1:10)$lag)
[1] 10
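If you do want exactly nine values, the simplest fix (my addition, not part of the original answer) is to drop the lag-0 term:
corrstore[-1] ## your nine lag-1..9 autocorrelations, without the leading 1
## or, for a single series x, let acf do the work:
## acf(x, lag.max = 9, plot = FALSE)$acf[-1]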

Creating an R function to use mclapply from the multicore package

I need to analyze some simulated data with the following structure:
h c x1 y1 x1c10
1 0 37.607056431 104.83097593 5
1 1 27.615251557 140.85532974 10
1 0 34.68915314 114.59312842 2
1 1 30.090387454 131.60485642 9
1 1 39.274429397 106.76042522 10
1 0 33.839385007 122.73681319 2
...
where h ranges from 1 to 2500 and indexes the Monte Carlo sample, each sample with 1000 observations. I'm analysing these data with the following code, which gives me two objects (fdN1 and fdQB101):
mc <- 2500 ## create loop index
fdN1 <- matrix(0,mc,1000)
fdQB101 <- matrix(0,mc,1000) ## create 2500x1000 storage matrices, elements zero
library(MASS) ## needed for polr()
for(j in 1:mc){
    fdN1[j,] <- dnorm(residuals(lm(x1 ~ c,data=s[s$h==j,])),
                      mean(residuals(lm(x1 ~ c,data=s[s$h==j,]))),
                      sd(residuals(lm(x1 ~ c,data=s[s$h==j,]))))
    x1c10 <- as.matrix(subset(s,s$h==j,select=x1c10))
    fdQB100 <- as.matrix(predict(polr(as.factor(x1c10) ~ c,
                                      method="logistic", data=s[s$h==j,]),
                                 type="probs"))
    indx10 <- as.matrix(cbind(as.vector(seq(1:nrow(fdQB100))),x1c10))
    fdQB101[j,] <- fdQB100[indx10]
}
The objects fdN1 and fdQB101 are 2500x1000 matrices with predicted probabilities as elements. I need to create a function out of this loop that I can call with lapply() or mclapply(). When I wrap it in a function like this:
ndMC <- function(mc){
    for(j in 1:mc){
        ...
    }
    return(list(fdN1,fdQB101))
}
lapply(mc,ndMC)
the objects fdN1 and fdQB101 are each returned as 2500x1000 matrices of zeros, instead of the predicted probabilities. What am I doing wrong?
You should be able to do this with the data.table package. Here is an example:
library(data.table)
dt <- data.table(h=rep(1L,6), c=c(0L,1L,0L,1L,1L,0L),
                 X1=c(37.607056431,27.615251557,34.68915314,30.090387454,39.274429397,33.839385007),
                 y1=c(104.83097593,140.85532974,114.59312842,131.60485642,106.76042522,122.73681319),
                 x1c10=c(5L,10L,2L,9L,10L,2L))
## Create a linear model for every grouping of variable h:
fdN1.partial<-dt[,list(lm=list(lm(X1~c))),by="h"]
## Retrieve the linear model for h==1:
fdN1.partial[h==1,lm]
## [[1]]
##
## Call:
## lm(formula = X1 ~ c)
##
## Coefficients:
## (Intercept) c
## 35.379 -3.052
You could also write a function to generalize this solution:
f.dnorm <- function(y,x) {
    f <- lm(y ~ x)
    out <- list(dnorm(residuals(f), mean(residuals(f)), sd(residuals(f))))
    return(out)
}
## Generate two dnorm lists for every grouping of variable h:
dt.lm<-dt[,list(dnormX11=list(f.dnorm(X1,rep(1,length(X1)))), dnormX1c=list(f.dnorm(X1,c))),by="h"]
## Retrieve one of the dnorm lists for h==1:
unlist(dt.lm[h==1,dnormX11])
## 1 2 3 4 5 6
## 0.06296194 0.03327407 0.08884549 0.06286739 0.04248756 0.09045784
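For completeness, here is a rough sketch (my addition, not part of the original answer) of the mclapply route the title asks about: write a function of the single Monte Carlo index j that returns its results, rather than filling global matrices by side effect, then rbind the pieces. It assumes the simulated data frame s from the question, and it reuses the question's trick of indexing the probability matrix columns by the x1c10 values.
library(parallel) ## mclapply now lives here; the old multicore package was absorbed into it
library(MASS)     ## for polr()
fit_one <- function(j, s) {
    d <- s[s$h == j, ]
    r <- residuals(lm(x1 ~ c, data = d))
    fdN1 <- dnorm(r, mean(r), sd(r))
    pr <- as.matrix(predict(polr(as.factor(x1c10) ~ c, method = "logistic", data = d),
                            type = "probs"))
    fdQB101 <- pr[cbind(seq_len(nrow(pr)), d$x1c10)] ## same indexing trick as indx10 above
    list(fdN1 = fdN1, fdQB101 = fdQB101)
}
res <- mclapply(1:2500, fit_one, s = s, mc.cores = 4)
fdN1 <- do.call(rbind, lapply(res, "[[", "fdN1"))
fdQB101 <- do.call(rbind, lapply(res, "[[", "fdQB101"))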
