Genetic algorithms under R, adding suggestions - r

In the genalg package, the rbga.bin function accepts a list of suggestions. However, I can't find any example of this feature actually working. Could anyone give me some help?
library(genalg)
evaluation<-function(x){
n<- 2
if (sum(x)!= n){
return(100) }
if (sum(x)== n){
sequen<- which(x>0)
l=sum(sequen)
return(-l) } }
vect1<-rep(0,times=40)
vect1[c(1,2)]<-c(1,1)
sug<-list(vect1)
for (i in 2:100){
vect1<-sample(vect1)
sug[[i]]<-vect1
}
GAmodel <- rbga.bin(size=40,popSize =100, iters =100, suggestions=sug,mutationChance = 0.01,elitism =T, evalFunc=evaluation,verbose=T)

Although documentation for rbga.bin function says:
suggestions: optional list of suggested chromosomes
rbga.bin apparently wants a data.frame or matrix:
# taken from the rbga.bin source code
suggestionCount = dim(suggestions)[1]
for (i in 1:suggestionCount) {
population[i, ] = suggestions[i, ]
}
When given a matrix, it seems to work fine:
sug2 <- t(replicate(sample(vect1),n = 10)) # needs to be rotated. check your solution n = 99 and it will fail
GAmodel <- rbga.bin(size=40,popSize =100, iters =100, suggestions=sug2,mutationChance = 0.01,elitism =T, evalFunc=evaluation,verbose=T)
Output:
Testing the sanity of parameters...
Not showing GA settings...
Adding suggestions to first population...
Filling others with random values in the given domains...
Starting iteration 1
Calucating evaluation values... .................................................................................................... done.
Creating next generation...
sorting results...
applying elitism...
applying crossover...
applying mutations... 40 mutations applied
Starting iteration 2
Calucating evaluation values... .................................................................................................. done.
Creating next generation...
<...>
Starting iteration 100
Calucating evaluation values... .................................................................................................. done.
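If the suggestions already exist as a list of vectors, as in the question, one way to get the matrix shape that the quoted source code indexes is to bind the list row-wise. This is a minimal base-R sketch and does not call genalg itself; the name sug_mat and the seed are made up for illustration, and the count of 10 rows simply mirrors the answer above.

```r
# Build a list of suggested chromosomes, as in the question.
vect1 <- rep(0, times = 40)
vect1[c(1, 2)] <- 1

set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
sug <- replicate(10, sample(vect1), simplify = FALSE)

# Bind the list row-wise: one chromosome per row, one gene per column,
# so that suggestions[i, ] is the i-th suggested chromosome.
sug_mat <- do.call(rbind, sug)
dim(sug_mat)
```

The resulting sug_mat could then be passed as suggestions=sug_mat in place of the list.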

Related

R function loglik() returning -inf?

Simulating an SIR model in R. I have a data set I am trying to plot accurately with the model. I am right now using the particle filter function, then would like to use the corresponding logLik method on the result. When I do this, I get "[1] -Inf" as a result. I can't find in the documentation why this is and how I can avoid it. Are my parameters for the model not accurate enough? Is there something else wrong?
My function looks like this:
SIRsim %>%
pfilter(Np=5000) -> pf
logLik(pf)
From an online course lesson entitled Likelihood for POMPS https://kingaa.github.io/sbied/pfilter/ , this is the R script for the lesson. However, the code works here... I'm not sure how to reproduce my specific problem with it and unfortunately cannot share the dataset or code I am using because it is for academic research.
library(tidyverse)
library(pomp)
options(stringsAsFactors=FALSE)
stopifnot(packageVersion("pomp")>="3.0")
set.seed(1350254336)
sir_step <- Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
")
sir_init <- Csnippet("
S = nearbyint(eta*N);
I = 1;
R = nearbyint((1-eta)*N);
H = 0;
")
dmeas <- Csnippet("
lik = dbinom(reports,H,rho,give_log);
")
rmeas <- Csnippet("
reports = rbinom(H,rho);
")
read_csv("https://kingaa.github.io/sbied/pfilter/Measles_Consett_1948.csv") %>%
select(week,reports=cases) %>%
filter(week<=42) %>%
pomp(
times="week",t0=0,
rprocess=euler(sir_step,delta.t=1/7),
rinit=sir_init,
rmeasure=rmeas,
dmeasure=dmeas,
accumvars="H",
statenames=c("S","I","R","H"),
paramnames=c("Beta","mu_IR","eta","rho","N"),
params=c(Beta=15,mu_IR=0.5,rho=0.5,eta=0.06,N=38000)
) -> measSIR
measSIR %>%
pfilter(Np=5000) -> pf
logLik(pf)
library(doParallel)
library(doRNG)
registerDoParallel()
registerDoRNG(652643293)
foreach (i=1:10, .combine=c) %dopar% {
measSIR %>% pfilter(Np=5000)
} -> pf
logLik(pf) -> ll
logmeanexp(ll,se=TRUE)
If I set Beta=100 in the code above I can get a negative-infinite log-likelihood.
Replacing the measurement-error snippet with this:
dmeas <- Csnippet("
double ll = dbinom(reports,H,rho,give_log);
lik = (!isfinite(ll) ? -1000 : ll );
")
appears to 'solve' the problem, although you should be a little bit careful; papering over numerical cracks like this is sometimes OK, but could conceivably come back to bite you in some way later on. If you just need to avoid non-finite values long enough to get into a reasonable parameter range this might be OK ...
Some guesses as to why this is happening:
You are somehow getting an "impossible" situation like a positive number of reported cases when the underlying true number of infections is zero.
Sometimes non-finite log-likelihoods occur when a very small positive probability underflows to zero. The equivalent here is likely that the probability of infection 1-exp(-Beta*I/N*dt) goes to 1.0; then any observed outcome where less than 100% of the population is infected is impossible.
You can try to diagnose the situation by seeing what the filtered trajectory actually looks like and comparing it with the data, or by adding debugging statements to the code. If there's a way to run just the deterministic simulation with your parameter values that might tell you pretty quickly what's going wrong.
An easier/more direct way to debug would be to replace the Csnippet you're using for dmeas with an R function: this will be slower but easier to work with (especially if you're not familiar with C coding). If you uncomment the browser() statement below, the code will drop into debug mode when you encounter the bad situation ...
dmeas <- function(reports,H,rho,log, ...) {
lik <- dbinom(reports,size=H,prob=rho,log=log)
if (!is.finite(lik)) {
lik <- -1000
## browser()
}
return(lik)
}
For example:
(t = 3, reports = 2, S = 2280, I = 0, R = 35721, H = 0, Beta = 100,
mu_IR = 0.5, rho = 0.5, eta = 0.06, N = 38000, log = TRUE)
Browse[1]> debug at /tmp/SO65554258.R!ZlSILG#7: return(lik)
Browse[2]> reports
[1] 2
Browse[2]> H
[1] 0
Browse[2]> rho
[1] 0.5
This shows that the problem is indeed that you have a positive number of reported cases when there have been zero infections ... R is trying to compute the binomial probability of observing reports cases out of H infections that are potentially reportable, each reported with probability rho. When the number of trials N in a binomial probability Binom(N,p) is zero, the only possible outcome is zero 'successes' (reported cases), with probability 1. All other outcomes have probability 0 (and log-probability -Inf).
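The zero-trials case can be checked directly with base R's dbinom, outside of pomp. This is only an illustration using the values seen in the debugger session (reports = 2, H = 0, rho = 0.5); the variable names are made up for the sketch.

```r
# Binomial log-probabilities when the number of trials (H) is zero.
rho <- 0.5

# Zero reports out of zero infections: the only possible outcome.
ll_zero_reports <- dbinom(0, size = 0, prob = rho, log = TRUE)

# Two reports out of zero infections: impossible.
ll_two_reports <- dbinom(2, size = 0, prob = rho, log = TRUE)

ll_zero_reports  # 0, i.e. probability 1
ll_two_reports   # -Inf, i.e. probability 0
```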

Vectorize Hoeffding's distance

I am trying to vectorise the following function, which calculates the Hoeffding's distance between two random variables on [0,1]^2 in a discretised way.
The hoeffd function from the Hmisc package provides a Fortran implementation (which you can find here: https://github.com/harrelfe/Hmisc/blob/master/src/hoeffd.f ), but it only gives back the maximum of the matrix I'm trying to analyse here. I'm interested in the location of the maximum, and hence I need to compute the whole matrix.
Here is my current implementation :
hoeffding_D <- function(x,y){
n = length(x)
indep <- outer(0:n,0:n)/(n)^2
bp = list(
c(0,sort(x)) + (c(sort(x),1) - c(0,sort(x)))/2,
c(0,sort(y)) + (c(sort(y),1) - c(0,sort(y)))/2
)
pre_calc <- t(outer(rep(1,n+1),x)<=bp[[1]])
# This is the problematic part :
dep <- t(sapply(bp[[2]],function(bpy){
colMeans(pre_calc*(y<=bpy))
}))
rez <- abs(dep-indep)
return(rez)
}
To use it, consider the following example:
library(copula)
# for 10 values, it's fast enough, but for 1000 it takes a lot of time..
x = pobs(rnorm(10),ties.method = "max")
y = pobs(rnorm(10),ties.method = "max")
hoeffding_D(x,y)
I already removed a first sapply by using the outer function, but I can't get rid of the other. The issue is that the comparison x<=bpx must be done for all x and all bpx, and the same for y; altogether this adds a lot of dimensions to the problem...
Do you have an idea on how to speed it up?
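One possible way to drop the remaining sapply, offered as a sketch rather than a tested replacement: each entry of dep is the mean over observations of a product of two indicators, so the whole matrix can be written as a single matrix cross-product. The function name hoeffding_D_vec and the intermediate names X, Y, bpx, bpy are made up here; the breakpoints are computed exactly as in the question.

```r
hoeffding_D_vec <- function(x, y) {
  n <- length(x)
  indep <- outer(0:n, 0:n) / n^2

  # Same breakpoints as in the original function.
  sx <- sort(x)
  sy <- sort(y)
  bpx <- c(0, sx) + (c(sx, 1) - c(0, sx)) / 2
  bpy <- c(0, sy) + (c(sy, 1) - c(0, sy)) / 2

  # Indicator matrices (n rows, n + 1 columns), stored as 0/1 numerics:
  # X[k, i] = 1 if x[k] <= bpx[i], Y[k, j] = 1 if y[k] <= bpy[j].
  X <- outer(x, bpx, "<=") + 0
  Y <- outer(y, bpy, "<=") + 0

  # dep[j, i] = mean over k of X[k, i] * Y[k, j], done as one cross-product.
  dep <- crossprod(Y, X) / n
  abs(dep - indep)
}

# Small usage example with rank-based pseudo-observations (no copula needed).
set.seed(1)
x <- rank(rnorm(10), ties.method = "max") / 11
y <- rank(rnorm(10), ties.method = "max") / 11
D <- hoeffding_D_vec(x, y)
which(D == max(D), arr.ind = TRUE)  # location(s) of the maximum
```

The cross-product still builds two n x (n+1) matrices, so memory grows quadratically with n, but the work moves into compiled matrix code instead of an R-level loop.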

Function to Produce Repeating Spikes

I asked a similar question on CrossValidated, but did not get a response. I went ahead anyway, and built out a function but am having a problem with replication...
The original question, posted here is as such:
I am seeking a function (or short algorithm, ideally implemented in R) that produces something similar to the following:
See, I would like to be able to generate a vector of n items that follows this sort of pattern, mapped to a set of inputs (say, seq(1:n)). Ideally, I would be able to tell the algorithm to "spike" to a maximum height h on every kth time period, and decay at rate r. However, I would be sufficiently happy with simply being able to generate a spike pattern that occurs periodically.
I wrote some code in R, which is included here, that works fairly well...
## Neural Networks / Deep Learning ##
# first, must install Python from:
# https://www.anaconda.com/download/#windows
# https://www.python.org/downloads/
if (!require(keras)) devtools::install_github("rstudio/keras") ; library(keras)
# install_tensorflow()
spikes_model <- function(maxiter, total_spikes = 10, max_height = 0.001, min_height = 0.000005, decay_rate = 1) {
value_at_iteration <- rep(0, maxiter)
spike_at <- maxiter / total_spikes
current_rate <- min_height
holder_timeval <- 0
for(i in 1:maxiter) {
spike_indicator <- i / spike_at
if (is.integer(spike_indicator)) {
current_rate <- max_height
value_at_iteration[i] <- current_rate
holder_timeval <- spike_indicator
} else if (i < spike_at) {
current_rate <- min_height
value_at_iteration[i] <- current_rate
} else {
timeval <- i - (holder_timeval*spike_at)
current_rate <- max_height*exp(-decay_rate*timeval) + min_height
value_at_iteration[i] <- current_rate
}
}
return(value_at_iteration)
}
asdf <- spikes_model(maxiter = 100)
plot(asdf, type="l")
... which results in the following plot:
This is exactly what I want, except there is only one spike. I know there is a code or logic error somewhere, but I can not find where I am going wrong. Please help me replicate this spike procedure across time.
The code this scheduler is used in:
eps <- 1000
sch <- spikes_model(eps)
lr_schedule <- function(epoch, lr) {
lrn <- sch[as.integer(epoch)]
lrn <- k_cast_to_floatx(lrn)
return(lrn)
}
## Add callback to automatically adjust learning rate downward when training reaches plateau ##
reduce_lr <- callback_learning_rate_scheduler(lr_schedule)
## Fit model using training data, validate with validation data ##
mod1.hst <- mod1 %>% fit(
x=X.train, y=Y.train,
epochs=eps, batch_size=nrow(X.train),
validation_data = list(X.val, Y.val),
shuffle=TRUE, callbacks = list(checkpoint, reduce_lr)
)
Wow, I just figured out my own error. I was using the is.integer() function, which does not work how I wanted. I needed to use the is.whole.number() function from mosaic.
Fixing that single error, I find the following chart, which is exactly what I wanted.
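For anyone hitting the same thing: is.integer() tests the storage type of a value, not whether the value is a whole number, and the result of a division is stored as a double. Below is a minimal base-R sketch of a whole-number check that avoids an extra package; the helper name is_whole and its tolerance are assumptions made for this example.

```r
# Division returns a double, so the type test fails even for whole values.
is.integer(20 / 10)   # FALSE
is.integer(2L)        # TRUE: only values stored with integer type pass

# Hypothetical helper: checks the value, with a small tolerance for
# floating-point error.
is_whole <- function(x, tol = sqrt(.Machine$double.eps)) {
  abs(x - round(x)) < tol
}

is_whole(20 / 10)     # TRUE
is_whole(25 / 10)     # FALSE
```

In the loop above, a check along the lines of is_whole(i / spike_at) would then fire on every spike period rather than never.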

Find the best value of k for KNN with a for statement

I am trying to write a for statement to find the best value of k in KNN. I have tried my code snippet several times, but it does not seem to calculate the correct value. Do you have an idea what is wrong with the statement?
# Tune the value of K using K-Fold Cross Validation
bestaccuracy = 0
bestaccuracy
n.folds <- 100
for (k in 1:n.folds) {
set.seed(1)
knn.cvac <- knn.cv(train= x.australian.stan, cl= y.australian, k=k)
knn.cvac.table <- table (knn.cvac, y.australian)
knn.cvac.accuracy <- sum(diag(knn.cvac.table))/sum(knn.cvac.table)
if(bestaccuracy< knn.cvac.accuracy) bestk=k
if(bestaccuracy< knn.cvac.accuracy) bestaccuracy = knn.cvac.accuracy}
print(bestk)
print(bestaccuracy)
I tested it on a few simulated data sets and it works just fine! The only thing to note is that several values of K may reach the highest accuracy, and you print only one of them: the smallest, because the strict < comparison keeps the first K that reaches the maximum.
Perhaps you can change the line of your code to this:
if(bestaccuracy< knn.cvac.accuracy) bestk=c(bestk, k)
So you can see all the optimal Ks when you print bestk.
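Note that appending inside the strict < comparison records every K at which the accuracy improved, and bestk would need to be initialised before the loop. If the goal is to list all K that tie for the best accuracy, one option is to store the accuracy for each K and pick the maxima afterwards. A minimal sketch, assuming the accuracies have already been collected in a vector; the numbers below are made up for illustration and do not come from the Australian data set.

```r
# Hypothetical cross-validated accuracies for k = 1..6 (illustrative values).
acc <- c(0.80, 0.85, 0.85, 0.82, 0.85, 0.79)

best_acc <- max(acc)
best_ks  <- which(acc == best_acc)  # every k that ties for the best accuracy

best_acc
best_ks
```

In the loop from the question, this would correspond to filling acc[k] with knn.cvac.accuracy on each iteration and doing the comparison once after the loop.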

How to solve errors in the frbs package of R using the GFS.GCCL method?

I'm using the frbs package in R on my data set with 5-fold stratified cross-validation, which I've implemented myself. In each fold I use the GFS.GCCL method in frbs.learn and predict on the test data. I get this error, along with 30 identical warning messages:
Error: object 'temp.rule.degree' not found
Warning: In max(MF.temp[m, ], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
My code is below:
library(frbs)
data<-read.csv(file.address)
data[,30] <- unclass(data[,30]) #column 30 has the class of samples
data <- data[,c(1,14,20,26,27, 30)] # I choose to have 5 attr. since
#my data is high dimensional
k <- 5 # 5-fold
seed <- 1
folds <- strf.cv(data, k, seed) #stratification function for CV
range.data.inp <- matrix(apply(data[,-ncol(data)], 2, range), nrow=2)
data<-norm.data(as.matrix(data[,-ncol(data)]),range.data.inp,
min.scale = 0.1, max.scale = 1)
ctrl <- list(popu.size = 30, num.class = 2, num.labels= 3,
persen_cross = 0.9, max.gen = 200, persen_mutant = 0.3,
name="sim-1")
for(i in 1:k){
str <- paste("fold",i)
print(str)
test.ind <- folds[[str]]
test.data <- data[test.ind,]
train.data <- data[-test.ind,]
obj <- frbs.learn(train.data , method.type="GFS.GCCL",
range.data.inp , ctrl)
pred <- predict(obj, test.data)
print("Predicted classes:")
print(pred)
}
I don't have any idea what causes the error and warnings. Please let me know what I should do.
I've had a similar problem (and others) trying to reproduce SLAVE learning starting from the iris example data. I had two format issues to solve before being able to run this with my artificial data:
my dataframe import was giving me integer, where the learn needs at least numeric.
my distribution of criteria was not flat. When I flattened the distribution (3 values so n/3 samples per value) everything went fine.
That's all I know.
Hope it helps.
I encountered the same issue when running SLAVE and GFS.GCCL. Looking at the source code of the library, I found that in frbs.learn() each method has its own code to calculate the range of the input data. So I think it might be a problem with the range of the input data. For example, the GFS.GCCL source sets its parameters like this:
range.data.input <- range.data
data.train.ori <- data.train
popu.size <- control$popu.size
persen_cross <- control$persen_cross
persen_mutant <- control$persen_mutant
max.gen <- control$max.gen
name <- control$name
n.labels <- control$num.labels
n.class <- control$num.class
num.labels <- matrix(rep(n.labels, ncol(range.data)), nrow = 1)
num.labels <- cbind(num.labels, n.class)
## normalize range of data and data training
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
range.data.input.ori <- range.data.input
data.tra.norm <- norm.data(data.train[, 1 : ncol(data.train) - 1], range.data.input, min.scale = 0, max.scale = 1)
data.train <- cbind(data.tra.norm, matrix(data.train[, ncol(data.train)], ncol = 1))
In the first line, range.data comes either from your specification or from the default setting of frbs.learn(). For the default setting, it gets the max and min for each row. In the source code:
range.data <- rbind(dt.min, dt.max)
After that, the range of data taken by the GFS.GCCL is
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
which is between 0 and 1. GFS.GCCL also takes range.data.input as a parameter, so it uses both range.data.norm and range.data.input.
Therefore, I think the error arises if some internal calculation relies on range.data.input holding the min and max for each row, while the supplied matrix does not actually hold them.
In summary, after I removed "range.data" from the frbs.learn() call, both GFS.GCCL and SLAVE worked for me.
You can download the source code from here:
https://cran.r-project.org/web/packages/frbs/index.html
You can find the code for GFS.GCCL and SLAVE in:
FRBS.MainFunction.R
GFS.Methods.R
In addition to @Pilip38's good advice, I have three other ideas that have fixed similar errors for me while working with the frbs package.
Most important: Make sure your output variable is never equal to 0. It looks like you have a binary output variable so I am hoping just adding 1 to it so it is 1/2 instead of 0/1 will work.
Try setting your range.data.inp matrix to be all 0's in the first row and all 1's in the second. Naturally it's better to have a tighter range but it may be causing your bug.
Try decreasing the number of labels to 2.
It can be a brittle procedure.
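The first two ideas, plus the integer-to-numeric point from the earlier answer, can be sketched in base R before anything is passed to frbs.learn(). This is only an illustration with made-up values: n_inputs, labels01, labels12 and the toy data frame df are hypothetical and are not taken from the question's data set.

```r
# Hypothetical number of input attributes (the question keeps 5).
n_inputs <- 5

# Range matrix with all 0's in the first row and all 1's in the second.
range.data.inp <- rbind(rep(0, n_inputs), rep(1, n_inputs))

# Shift a binary 0/1 class column to 1/2 so the output is never 0.
labels01 <- c(0, 1, 1, 0, 1)
labels12 <- labels01 + 1

# Convert integer columns of a toy data frame to numeric.
df <- data.frame(a = 1:3, b = c(2L, 5L, 7L))
df[] <- lapply(df, as.numeric)

range.data.inp
labels12
sapply(df, class)
```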
