Genetic algorithm for permutations without repetition in R

I am writing a genetic algorithm (GA) to find a certain permutation in my social network data (an igraph object). I am using the R library GA, but the permutations it generates contain repetitions and vary in length, while I want permutations without repetitions and of a fixed length.
I understand that the mutation and crossover functions cause this phenomenon, but I cannot find a way around it. I have tried to implement a fitness function that gives a low score to the "bad" permutations, but this caused an error (see below).
cp_GA <- function(g, ratio = 0.2, maxiter = 1000, run = 40, pop = 200) {
  library("igraph")
  library("GA")
  # ratio  : ratio of the number of core/all vertices;
  #          this describes the desired size of the core group
  # maxiter: max number of iterations for the GA
  # run    : max number of runs with the same fitness
  # pop    : population size for the GA

  # desired core size:
  coresize <- round(vcount(g) * ratio, 0)

  fitness_vertex_perm <- function(permutation) {
    # fitness function for the GA:
    # it calculates the density of the core under the current permutation;
    # the if-else structure is meant to prevent permutations with repetitions
    if (sort(permutation) == c(1:vcount(g))) {
      dens <- edge_density(
        induced_subgraph(permute(g, as.numeric(permutation)), 1:coresize,
                         impl = "auto"))
    } else {
      dens <- 0
    }
    return(dens)
  }

  lowerlimit <- 1:vcount(g)
  upperlimit <- vcount(g):1
  hint <- order(degree(g), decreasing = TRUE)
  maxfitness <- 1

  GA <- ga(type = "permutation",
           fitness = fitness_vertex_perm,
           lower = lowerlimit,
           upper = upperlimit,
           maxiter = maxiter,
           run = run,
           maxFitness = maxfitness,
           suggestions = hint,
           popSize = pop)
  return(GA)
}
In the fitness function above, the if-else statement checks whether a permutation is valid, but this throws an error:
testresult <- cp_GA(g, ratio = 0.13, maxiter = 1000, run = 60, pop = 400)
Error in getComplete(control) :
argument "control" is missing, with no default
In addition: Warning message:
In if (sort(permutation) == c(1:vcount(g))) { :
Without the if-else it runs, but produces permutations that are not useful for me.
How can I set up GA to generate the right type of permutations?

So now you have two things to implement:
1. A GA selection mechanism.
2. A without-replacement strategy.
The theory of GA selection is that when you do parent selection you can do it randomly, or you can apply some technique; you seem to have already done what is required there.
The theory of sampling without replacement is that each draw reduces the remaining population by one relative to the previous draw, so if the old selection probability was 1/n, the new one is 1/(n-1):
Probability(new) = 1 / (1/Probability(old) - 1)
So if you adjust the upper limit of the population in the looping section to one less each time, you can achieve your result.
Hope this is the hint in the right direction that you need.
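As a concrete follow-up to that hint: the GA package's permutation encoding can handle the without-replacement requirement natively. A minimal sketch, assuming that type = "permutation" with scalar lower/upper bounds makes the package evolve permutations of 1:vcount(g) without repetitions (so the fitness function no longer has to reject invalid candidates), and that suggestions is passed as a matrix with one suggestion per row; treat it as an untested rework of the question's function, not the definitive answer:
library(igraph)
library(GA)

cp_GA2 <- function(g, ratio = 0.2, maxiter = 1000, run = 40, pop = 200) {
  coresize <- round(vcount(g) * ratio)
  fitness_vertex_perm <- function(permutation) {
    # density of the core under the candidate vertex permutation
    edge_density(
      induced_subgraph(permute(g, as.numeric(permutation)), 1:coresize))
  }
  hint <- order(degree(g), decreasing = TRUE)
  ga(type = "permutation",
     fitness = fitness_vertex_perm,
     lower = 1,                  # scalar bounds: candidates are
     upper = vcount(g),          # permutations of 1:vcount(g)
     suggestions = rbind(hint),  # suggestions go in as rows of a matrix
     maxiter = maxiter, run = run, popSize = pop,
     maxFitness = 1)
}
With scalar bounds, the permutation-specific crossover and mutation operators keep every candidate a valid permutation, which should remove the need for the if-else penalty entirely.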

Related

Struggling to run moveHMM using lognormal function in parallelised routines

I am attempting to run a two-state HMM using a lognormal distribution. I have read Michelot and Langrock (2019) on choosing starting parameters by inspecting the data in a histogram and then running iterations in parallel, which has worked for my gamma distribution. Identifying the starting parameters for the lognormal distribution is troubling me, however. Do I plot the log of my step-length distribution and then attempt to extract starting parameters, or use the same starting parameters as my gamma distribution and rely on stepDist="lnorm"?
My code for the lognormal attempt currently looks like this:
ncores <- detectCores() - 1
cl <- makeCluster(getOption("cl.cores", ncores))
clusterExport(cl, list("data", "fitHMM"))

niter <- 20
allPar0 <- lapply(as.list(1:niter), function(x) {
  stepMean0 <- runif(2,
                     min = c(x, y),
                     max = c(y, z))
  stepSD0 <- runif(2,
                   min = c(x, y),
                   max = c(y, z))
  angleMean0 <- c(0, 0)
  angleCon0 <- runif(2,
                     min = c(a, b),
                     max = c(a, b))
  stepPar0 <- c(stepMean0, stepSD0)
  anglePar0 <- c(angleMean0, angleCon0)
  return(list(step = stepPar0, angle = anglePar0))
})

# Fit the niter models in parallel
logP <- parLapply(cl = cl, X = allPar0, fun = function(par0) {
  m <- fitHMM(data = data, nbStates = 2, stepDist = "lnorm",
              stepPar0 = par0$step, anglePar0 = par0$angle)
  return(m)
})
# Extract likelihoods of fitted models
likelihoodL <- unlist(lapply(logP, function(m) m$mod$minimum))
likelihoodL
# Index of best fitting model (smallest negative log-likelihood)
whichbestpL <- which.min(likelihoodL)
bestL <- logP[[whichbestpL]]
bestL
If I use negative values from plotting the log of the step length of the data then I get the error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: Check the step parameters bounds (the initial parameters should be strictly between the bounds of their parameter space).
If I use the same starting parameter values that I used for my gamma distribution, then I get the error:
Error in unserialize(node$con) :
embedded nul in string: 'X\n\0\0\0\003\0\004\002\0\0\003\005\0\0\0'
Please could someone shed some light on how I'm failing at this?
Thank you!
Unfortunately, I can't tell for sure what the problem is from the code you included. If you don't get an error when you run fitHMM outside of parLapply, then it suggests that the problem is in how you choose the values of x, y, and z in your code.
The first parameter of the log-normal distribution can be negative or positive, and it is actually the mean of the logarithm of the step length. So, to find good starting values for this, you should look at a histogram of the log step lengths (e.g., following the dedicated moveHMM vignette). The second parameter is the standard deviation of the log step lengths, and this should be strictly positive (but could also be chosen based on the spread of the histogram of log step lengths).
To summarise, you should choose all the initial values based on plots of the log step lengths (rather than the step lengths themselves), and you should not use the same ranges of values for stepMean0 and stepSD0 (because the former can be negative or positive, whereas the latter is positive). Hopefully, this should help you choose x, y, and z.
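To make that concrete, a minimal sketch of drawing the starting ranges from the log step lengths (the data$step column is the moveHMM convention, and the quantile/sd-based ranges are illustrative assumptions, not values from the answer):
# Sketch: derive ranges for the lognormal starting parameters from the
# observed log step lengths (assumes a moveHMM-style data$step column).
logStep <- log(na.omit(data$step[data$step > 0]))
hist(logStep)  # inspect location and spread of the log step lengths

# location parameter: may be negative or positive
stepMean0 <- runif(2,
                   min = quantile(logStep, c(0.1, 0.5)),
                   max = quantile(logStep, c(0.5, 0.9)))
# scale parameter: must be strictly positive
stepSD0 <- runif(2, min = 0.5 * sd(logStep), max = 1.5 * sd(logStep))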

Finding minimum by optimising a vector in R

I need to find the minimum of an objective function by optimising a vector. The problem is finance-related, if that helps: the function RC (provided below) computes the sum of squared differences of the risk contributions of different assets, where a risk contribution is the product of an input risk measure (RM, given) and a weight.
The goal is to find weights such that the sum is zero, i.e. all assets have equal risk contributions.
RC <- function(RM, w) {
  w <- w / sum(w)  # normalize weights so they sum to 1
  nAssets <- length(RM)
  rc_matrix <- matrix(nrow = 1, ncol = nAssets)
  rc_matrix <- RM * w  # risk contributions: RM (risk measure) multiplied
                       # by the asset's weight in the portfolio
  rc_sum_squares <- numeric(length = 1)  # placeholder
  rc_sum_squares <- sum(combn(
    seq_along(RM),
    2,
    FUN = function(x)
      (rc_matrix[, x[1]] - rc_matrix[, x[2]])^2
  ))  # sums the squared differences of the risk contributions
  return(rc_sum_squares)
}
I searched and the solution seems to lie in the "optim" function, so I tried:
out <- optim(
  par = rep(1 / length(RM), length(RM)),  # initial guess
  fn = RC,
  RM = RM,
  method = "L-BFGS-B",
  lower = 0.00001,
  upper = 1)
However, this returns an error message: "Error in rc_matrix[, x[1]] : incorrect number of dimensions"
I don't know how the optimization algorithm works, so I can't really wrap my head around it. The RC function itself works though; here is a sample for replicability:
RM <- c(0.06006928, 0.06823795, 0.05716360, 0.08363529, 0.06491009, 0.06673174, 0.03103578, 0.05741140)
w <- matrix(0.125, nrow = 1, ncol = 8)
I also looked at the CVXR package, which crashes my RStudio for some reason, and at nlm(), which is a little more complicated and for which I can't write the function properly.
A solution might be not to do the funky summation of the squared differences, but to find weights such that the risk contributions (RM * weight) are all equal. I will be very glad for your help.
Note: the vector of weights has to sum to 1 and the values have to lie between 0 and 1.
Cheers,
Daniel
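Two observations, offered as hedged sketches rather than a definitive answer. First, the "incorrect number of dimensions" error arises because optim passes par as a plain vector, so RM * w inside RC has no dim attribute and rc_matrix[, x[1]] fails. Second, because each risk contribution here is simply RM[i] * w[i], equal contributions have a closed form (weights proportional to 1/RM[i]), so no optimizer is needed at all:
# RM from the question's sample:
RM <- c(0.06006928, 0.06823795, 0.05716360, 0.08363529,
        0.06491009, 0.06673174, 0.03103578, 0.05741140)

# Why optim fails: it passes a plain vector, which has no dimensions.
rc <- RM * rep(1/8, 8)
dim(rc)        # NULL, so rc[, 1] throws "incorrect number of dimensions"

# Closed form: RM[i] * w[i] = c for all i  =>  w[i] proportional to 1/RM[i].
w_star <- (1 / RM) / sum(1 / RM)
round(RM * w_star, 6)  # all risk contributions equal
sum(w_star)            # weights sum to 1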

Finding optimal parameter for each input combination in the objective function in an optimization

I am calibrating a model, and for that I have to estimate a parameter for each input combination I pass to the objective function. I have a bit more than 10k input combinations and want to find the optimal parameter for each combination; all other variables in the model are known. I managed to estimate one minimal value for the whole set, but that doesn't help me, and when I tried my approach for each combination I got the error: Error in mP[, logik] <- mPv[, logik, drop = FALSE] : NAs are not allowed in subscripted assignments.
My objective function looks like this:
x_vol <- vector(mode = "double", length = 10776)

objective_function_vol <- function(x_vol) {
  S   <- calibration_set$index_level
  K   <- calibration_set$strike
  tau <- calibration_set$tau
  r   <- calibration_set$riskfree_rate
  q   <- calibration_set$q
  model_prices_vol <- vector(mode = "double", length = 10776)
  for (i in 1:10776) {
    model_prices_vol[i] <- hestonCallcf(S = S[i], K = K[i], t = tau[i],
                                        r = r[i], q = 0,
                                        v0 = x_vol[i],
                                        vbar = 0.1064688, rho = -0.9914710,
                                        a = 1.6240300, vvol = 0.98839192)
    print(i)
  }
  diff_sq <- (market_price - model_prices_vol)^2
  wdiff <- diff_sq / market_price
  error <- sum(wdiff) / 10776
  return(error)
}
I am using NMOF::DEopt for the optimization. Would it be possible to write a second loop that stores the optimal values of x_vol? I think using the subscript i for both the known inputs and the unknowns is somehow wrong.
The error means that some objective-function calls resulted in NA.
If you only wish to minimize a single parameter (i.e. a scalar), Differential Evolution is probably not the method you want. A grid search along one dimension, possibly with refinements, would likely work better.
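A minimal sketch of that idea: fit v0 separately for each input row with a one-dimensional search via stats::optimize. The pricing call, fixed Heston parameters, and data columns are taken from the question's code; the search interval is an assumed plausible range for a variance parameter, not a tested value:
# Sketch: one scalar v0 per input combination, each found by a 1-D search.
S   <- calibration_set$index_level
K   <- calibration_set$strike
tau <- calibration_set$tau
r   <- calibration_set$riskfree_rate

x_vol_opt <- numeric(10776)
for (i in 1:10776) {
  obj_i <- function(v0) {
    p <- hestonCallcf(S = S[i], K = K[i], t = tau[i], r = r[i], q = 0,
                      v0 = v0,
                      vbar = 0.1064688, rho = -0.9914710,
                      a = 1.6240300, vvol = 0.98839192)
    (market_price[i] - p)^2 / market_price[i]  # weighted squared error
  }
  # the interval below is an assumed range for the initial variance
  x_vol_opt[i] <- optimize(obj_i, interval = c(1e-6, 2))$minimum
}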

R function logLik() returning -Inf?

I am simulating an SIR model in R and have a data set I am trying to fit accurately with the model. I am currently using the particle filter function, and would then like to use the corresponding logLik method on the result. When I do this, I get "[1] -Inf" as a result. I can't find in the documentation why this happens or how I can avoid it. Are my parameters for the model not accurate enough? Is there something else wrong?
My function looks like this:
SIRsim %>%
  pfilter(Np = 5000) -> pf
logLik(pf)
The code below is the R script from an online course lesson entitled Likelihood for POMPs, https://kingaa.github.io/sbied/pfilter/. However, the code works there; I'm not sure how to reproduce my specific problem with it, and unfortunately I cannot share the dataset or code I am using because it is for academic research.
library(tidyverse)
library(pomp)
options(stringsAsFactors = FALSE)
stopifnot(packageVersion("pomp") >= "3.0")
set.seed(1350254336)

sir_step <- Csnippet("
  double dN_SI = rbinom(S, 1 - exp(-Beta * I / N * dt));
  double dN_IR = rbinom(I, 1 - exp(-mu_IR * dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

sir_init <- Csnippet("
  S = nearbyint(eta * N);
  I = 1;
  R = nearbyint((1 - eta) * N);
  H = 0;
")

dmeas <- Csnippet("
  lik = dbinom(reports, H, rho, give_log);
")

rmeas <- Csnippet("
  reports = rbinom(H, rho);
")

read_csv("https://kingaa.github.io/sbied/pfilter/Measles_Consett_1948.csv") %>%
  select(week, reports = cases) %>%
  filter(week <= 42) %>%
  pomp(
    times = "week", t0 = 0,
    rprocess = euler(sir_step, delta.t = 1/7),
    rinit = sir_init,
    rmeasure = rmeas,
    dmeasure = dmeas,
    accumvars = "H",
    statenames = c("S", "I", "R", "H"),
    paramnames = c("Beta", "mu_IR", "eta", "rho", "N"),
    params = c(Beta = 15, mu_IR = 0.5, rho = 0.5, eta = 0.06, N = 38000)
  ) -> measSIR

measSIR %>%
  pfilter(Np = 5000) -> pf
logLik(pf)

library(doParallel)
library(doRNG)
registerDoParallel()
registerDoRNG(652643293)
foreach (i = 1:10, .combine = c) %dopar% {
  measSIR %>% pfilter(Np = 5000)
} -> pf
logLik(pf) -> ll
logmeanexp(ll, se = TRUE)
If I set Beta=100 in the code above I can get a negative-infinite log-likelihood.
Replacing the measurement-error snippet with this:
dmeas <- Csnippet("
  double ll = dbinom(reports, H, rho, give_log);
  lik = (!isfinite(ll) ? -1000 : ll);
")
appears to 'solve' the problem, although you should be a little bit careful; papering over numerical cracks like this is sometimes OK, but could conceivably come back to bite you in some way later on. If you just need to avoid non-finite values long enough to get into a reasonable parameter range this might be OK ...
Some guesses as to why this is happening:
- You are somehow getting an "impossible" situation, like a positive number of reported cases when the underlying true number of infections is zero.
- Sometimes non-finite log-likelihoods occur when a very small positive probability underflows to zero. The equivalent here is likely that the probability of infection 1-exp(-Beta*I/N*dt) goes to 1.0; then any observed outcome where less than 100% of the population is infected is impossible.
You can try to diagnose the situation by seeing what the filtered trajectory actually looks like and comparing it with the data, or by adding debugging statements to the code. If there's a way to run just the deterministic simulation with your parameter values that might tell you pretty quickly what's going wrong.
An easier/more direct way to debug would be to replace the Csnippet you're using for dmeas with an R function: this will be slower but easier to work with (especially if you're not familiar with C coding). If you uncomment the browser() statement below, the code will drop into debug mode when you encounter the bad situation ...
dmeas <- function(reports, H, rho, log, ...) {
  lik <- dbinom(reports, size = H, prob = rho, log = log)
  if (!is.finite(lik)) {
    lik <- -1000
    ## browser()
  }
  return(lik)
}
For example:
(t = 3, reports = 2, S = 2280, I = 0, R = 35721, H = 0, Beta = 100,
mu_IR = 0.5, rho = 0.5, eta = 0.06, N = 38000, log = TRUE)
Browse[1]> debug at /tmp/SO65554258.R!ZlSILG#7: return(lik)
Browse[2]> reports
[1] 2
Browse[2]> H
[1] 0
Browse[2]> rho
[1] 0.5
This shows that the problem is indeed that you have a positive number of reported cases when there have been zero infections ... R is trying to compute the binomial probability of observing reports cases when there are H infections that are potentially reportable, each reported with probability rho. When the number of trials N in a binomial distribution Binom(N, p) is zero, the only possible outcome is zero 'successes' (reported cases), which has probability 1. All other outcomes have probability 0 (and log-probability -Inf).
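That last point is easy to demonstrate directly in base R:
dbinom(2, size = 0, prob = 0.5, log = TRUE)
## [1] -Inf   # 2 successes out of 0 trials is impossible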

R: Convergence problems with numerical integration

Not sure if this numerical-methods problem should really be here or on Cross Validated, but since I have a nice reproducible example I thought I would start here.
I am going to be estimating and fitting a bunch of distributions, both to some large data sets and to data sets generated randomly from similar distributions. As part of this process I will be generating estimates for the conditional mean of various value ranges, including truncated and non-truncated values of the right tail.
The function cr_moment below, given a pdf function for dfun and parameters for that function in params, calculates the unconditional mean of that distribution. Given the upper, lower, or both bounds, it calculates the conditional mean for the range specified by those bounds, using the singly or doubly truncated distribution for those bounds. The function beneath it, cr_gb2, specializes cr_moment to the generalized beta distribution of the second kind. Finally, the parameter values supplied beneath that approximate the unadjusted current-dollar household income distribution from the US Census/BLS Current Population Survey for the year 2000 (McDonald & Ransom 2008). (Also, kudos to Mikko Marttila on this list for help with coding this function.)
This function gives me a failure-to-converge error, copied below, for various lower bounds and an upper bound equal to 4.55e8 or higher, but not at 4.54e8. The kth moment of the GB2 exists for k < shape1 * shape3, here about 2.51. This is a nice smooth unimodal function being integrated over a finite interval, and I don't know why it is failing to converge or what to do about it. For other parameter values, but not these, I have also seen convergence problems at the low end, for lower bounds ranging from 6 to a couple of hundred.
Error in integrate(f = prob_interval, lower = lb, upper = ub, subdivisions = 100L):
the integral is probably divergent
The upper bound of 4.55e8, i.e. 455 million, will be above the highest observable income level by one or two orders of magnitude, but given a wider range of parameter values and using hill-climbing algorithms to fit real and simulated data, I think I will hit this wall many times. I know very little about numerical methods in a case like this and don't really know where to start. Help and suggestions greatly appreciated.
cr_moment <- function(lb = -Inf, ub = Inf, dfun, params, v = 1, ...) {
  x_pdf <- function(X) {
    X^v * do.call(what = dfun, args = c(list(x = X), params))
  }
  prob_interval <- function(X) {
    do.call(what = dfun, args = c(list(x = X), params))
  }
  integral_val  <- integrate(f = x_pdf, lower = lb, upper = ub)
  integral_prob <- integrate(f = prob_interval, lower = lb, upper = ub)
  crm <- integral_val[[1]] / integral_prob[[1]]
  out <- list(value = integral_val[[1]], probability = integral_prob[[1]],
              cond_moment = crm)
  out
}
library(GB2)
cr_gb2 <- function(lb = -Inf, ub = Inf, v = 1, params) {
  cr_moment(lb, ub, dfun = dgb2, params = params, v = v)
}

GB2_params <- list(shape1 = 2.2474, scale = 58441.5, shape2 = 0.6186,
                   shape3 = 1.118)
cr_gb2(lb = 1, ub = 4.55e8, params = GB2_params)
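One standard remedy worth sketching (an illustrative workaround, not a fix from the thread): integrate on the log scale via the substitution x = e^u, so that the integral of f(x) dx becomes the integral of f(e^u) e^u du. This compresses the huge interval [1, 4.55e8] into roughly [0, 20], which tends to help quadrature on heavy-tailed densities:
# Sketch: integrate a heavy-tailed pdf over [lb, ub] on the log scale.
# Substitution x = exp(u) gives the integrand f(exp(u)) * exp(u)
# on [log(lb), log(ub)]; requires lb > 0.
log_scale_integrate <- function(f, lb, ub, ...) {
  integrate(function(u) f(exp(u)) * exp(u),
            lower = log(lb), upper = log(ub), ...)
}

library(GB2)
prob_interval <- function(x) dgb2(x, shape1 = 2.2474, scale = 58441.5,
                                  shape2 = 0.6186, shape3 = 1.118)
log_scale_integrate(prob_interval, lb = 1, ub = 4.55e8)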
