Combinatorial optimization with discrete options in R - r

I have a function with five variables that I want to maximize using only an specific set of parameters for each variable.
Are there any methods in R that can do this, other than by brutal force? (e.g. Particle Swarm Optimization, Genetic Algorithm, Greedy, etc.). I have read a few packages but they seem to create their own set of parameters from within a given range. I am only interested in optimizing the set of options provided.
Here is a simplified version of the problem:
#Example of 5 variable function to optimize
Fn<-function(x){
a=x[1]
b=x[2]
c=x[3]
d=x[4]
e=x[5]
SUM=a+b+c+d+e
return(SUM)
}
#Parameters for variables to optimize
Vars=list(
As=c(seq(1.5,3, by = 0.3)), #float
Bs=c(1,2), #Binary
Cs=c(seq(1,60, by=10)), #Integer
Ds=c(seq(60,-60, length.out=5)), #Negtive
Es=c(1,2,3)
)
#Full combination
FullCombn= expand.grid(Vars)
Results=data.frame(I=as.numeric(), Sum=as.numeric())
for (i in 1:nrow(FullCombn)){
ParsI=FullCombn[i,]
ResultI=Fn(ParsI)
Results=rbind(Results,c(I=i,Sum=ResultI))
}
#Best iteration (Largest result)
Best=Results[Results[, 2] == max(Results[, 2]),]
#Best parameters
FullCombn[Best$I,]

Two more possibilities. Both minimize by default, so I flip the sign in your objective function (i.e. return -SUM).
#Example of 5 variable function to optimize
Fn<-function(x, ...){
a=x[1]
b=x[2]
c=x[3]
d=x[4]
e=x[5]
SUM=a+b+c+d+e
return(-SUM)
}
#Parameters for variables to optimize
Vars=list(
As=c(seq(1.5,3, by = 0.3)), #float
Bs=c(1,2), #Binary
Cs=c(seq(1,60, by=10)), #Integer
Ds=c(seq(60,-60, length.out=5)), #Negtive
Es=c(1,2,3)
)
First, a grid search. Exactly what you did, just convenient. And the implementation allows you to distribute the evaluations of the objective function.
library("NMOF")
gridSearch(fun = Fn,
levels = Vars)[c("minfun", "minlevels")]
## 5 variables with 6, 2, 6, 5, ... levels: 1080 function evaluations required.
## $minfun
## [1] -119
##
## $minlevels
## [1] 3 2 51 60 3
An alternative: a simple Local Search. You start with a valid initial guess, and then move randomly through possible feasible solutions. The key ingredient is the neighbourhood function. It picks one element randomly and then, again randomly, sets this element to one allowed value.
nb <- function(x, levels, ...) {
i <- sample(length(levels), 1)
x[i] <- sample(levels[[i]], 1)
x
}
(There would be better algorithms for neighbourhood functions; but this one is simple and so demonstrates the idea well.)
LSopt(Fn, list(x0 = c(1.8, 2, 11, 30, 2), ## a feasible initial solution
neighbour = nb,
nI = 200 ## iterations
),
levels = Vars)$xbest
## Local Search.
## ##...
## Best solution overall: -119
## [1] 3 2 51 60 3
(Disclosure: I am the maintainer of package NMOF, which provides functions gridSearch and LSopt.)
In response to the comment, a few remarks on Local Search and the neighbourhood function above (nb). Local Search, as implemented in
LSopt, will start with an arbitrary solution, and
then change that solution slightly. This new solution,
called a neighbour, will be compared (by its
objective-function value) to the old solution. If the new solution is
better, it becomes the current solution; otherwise it
is rejected and the old solution remains the current one.
Then the algorithm repeats, for a number of iterations.
So, in short, Local Search is not random sampling, but
a guided random-walk through the search space. It's
guided because only better solutions get accepted, worse one's get rejected. In this sense, LSopt will narrow down on good parameter values.
The implementation of the neighbourhood is not ideal
for two reasons. The first is that a solution may not
be changed at all, since I sample from feasible
values. But for a small set of possible values as here,
it might often happen that the same element is selected
again. However, for larger search spaces, this
inefficiency is typically negligible, since the
probability of sampling the same value becomes
smaller. Often so small, that the additional code for
testing if the solution has changed becomes more
expensive that the occasionally-wasted iteration.
A second thing could be improved, albeit through a more
complicated function. And again, for this small problem it does not matter. In the current neighbourhood, an
element is picked and then set to any feasible value.
But that means that changes from one solution to the
next might be large. Instead of picking any feasible values of the As,
in realistic problems it will often be better to pick a
value close to the current value. For example, when you are at 2.1, either move to 1.8 or 2.4, but not to 3.0. (This reasoning is only relevant, of course, if the variable in question is on a numeric or at least ordinal scale.)
Ultimately, what implementation works well can be
tested only empirically. Many more details are in this tutorial.
Here is one alternative implementation. A solution is now a vector of positions for the original values, e.g. if x[1] is 2, it "points" to 1.8, if x[2] is 2, it points to 1, and so on.
## precompute lengths of vectors in Vars
lens <- lengths(Vars)
nb2 <- function(x, lens, ...) {
i <- sample(length(lens), 1)
if (x[i] == 1L) {
x[i] <- 2
} else if (x[i] == lens[i]) {
x[i] <- lens[i] - 1
} else
x[i] <- x[i] + sample(c(1, -1), 1)
x
}
## the objective function now needs to map the
## indices in x back to the levels in Vars
Fn2 <- function(x, levels, ...){
y <- mapply(`[`, levels, x)
## => same as
## y <- numeric(length(x))
## y[1] <- Vars[[1]][x[1]]
## y[2] <- Vars[[2]][x[2]]
## ....
SUM <- sum(y)
return(-SUM)
}
xbest <- LSopt(Fn2,
list(x0 = c(1, 1, 1, 1, 1), ## an initial solution
neighbour = nb2,
nI = 200 ## iterations
),
levels = Vars,
lens = lens)$xbest
## Local Search.
## ....
## Best solution overall: -119
## map the solution back to the values
mapply(`[`, Vars, xbest)
## As Bs Cs Ds Es
## 3 2 51 60 3

Here is a genetic algorithm solution with package GA.
The key is to write a function decode enforcing the constraints, see the package vignette.
library(GA)
#> Loading required package: foreach
#> Loading required package: iterators
#> Package 'GA' version 3.2.2
#> Type 'citation("GA")' for citing this R package in publications.
#>
#> Attaching package: 'GA'
#> The following object is masked from 'package:utils':
#>
#> de
decode <- function(x) {
As <- Vars$As
Bs <- Vars$Bs
Cs <- Vars$Cs
Ds <- rev(Vars$Ds)
# fix real variable As
i <- findInterval(x[1], As)
if(x[1L] - As[i] < As[i + 1L] - x[1L])
x[1L] <- As[i]
else x[1L] <- As[i + 1L]
# fix binary variable Bs
if(x[2L] - Bs[1L] < Bs[2L] - x[2L])
x[2L] <- Bs[1L]
else x[2L] <- Bs[2L]
# fix integer variable Cs
i <- findInterval(x[3L], Cs)
if(x[3L] - Cs[i] < Cs[i + 1L] - x[3L])
x[3L] <- Cs[i]
else x[3L] <- Cs[i + 1L]
# fix integer variable Ds
i <- findInterval(x[4L], Ds)
if(x[4L] - Ds[i] < Ds[i + 1L] - x[4L])
x[4L] <- Ds[i]
else x[4L] <- Ds[i + 1L]
# fix the other, integer variable
x[5L] <- round(x[5L])
setNames(x , c("As", "Bs", "Cs", "Ds", "Es"))
}
Fn <- function(x){
x <- decode(x)
# a <- x[1]
# b <- x[2]
# c <- x[3]
# d <- x[4]
# e <- x[5]
# SUM <- a + b + c + d + e
SUM <- sum(x, na.rm = TRUE)
return(SUM)
}
#Parameters for variables to optimize
Vars <- list(
As = seq(1.5, 3, by = 0.3), # Float
Bs = c(1, 2), # Binary
Cs = seq(1, 60, by = 10), # Integer
Ds = seq(60, -60, length.out = 5), # Negative
Es = c(1, 2, 3)
)
res <- ga(type = "real-valued",
fitness = Fn,
lower = c(1.5, 1, 1, -60, 1),
upper = c(3, 2, 51, 60, 3),
popSize = 1000,
seed = 123)
summary(res)
#> ── Genetic Algorithm ───────────────────
#>
#> GA settings:
#> Type = real-valued
#> Population size = 1000
#> Number of generations = 100
#> Elitism = 50
#> Crossover probability = 0.8
#> Mutation probability = 0.1
#> Search domain =
#> x1 x2 x3 x4 x5
#> lower 1.5 1 1 -60 1
#> upper 3.0 2 51 60 3
#>
#> GA results:
#> Iterations = 100
#> Fitness function value = 119
#> Solutions =
#> x1 x2 x3 x4 x5
#> [1,] 2.854089 1.556080 46.11389 49.31045 2.532682
#> [2,] 2.869408 1.638266 46.12966 48.71106 2.559620
#> [3,] 2.865254 1.665405 46.21684 49.04667 2.528606
#> [4,] 2.866494 1.630416 46.12736 48.78017 2.530454
#> [5,] 2.860940 1.650015 46.31773 48.92642 2.521276
#> [6,] 2.851644 1.660358 46.09504 48.81425 2.525504
#> [7,] 2.855078 1.611837 46.13855 48.62022 2.575492
#> [8,] 2.857066 1.588893 46.15918 48.60505 2.588992
#> [9,] 2.862644 1.637806 46.20663 48.92781 2.579260
#> [10,] 2.861573 1.630762 46.23494 48.90927 2.555612
#> ...
#> [59,] 2.853788 1.640810 46.35649 48.87381 2.536682
#> [60,] 2.859090 1.658127 46.15508 48.85404 2.590679
apply(res#solution, 1, decode) |> t() |> unique()
#> As Bs Cs Ds Es
#> [1,] 3 2 51 60 3
Created on 2022-10-24 with reprex v2.0.2

Related

fill NA raster cells using focal defined by boundary

I have a raster and a shapefile. The raster contains NA and I am filling the NAs using the focal function
library(terra)
v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
r[45:60, 45:60] <- NA
r_fill <- terra::focal(r, 5, mean, na.policy="only", na.rm=TRUE)
However, there are some NA still left. So I do this:
na_count <- terra::freq(r_fill, value = NA)
while(na_count$count != 0){
r_fill <- terra::focal(r_fill, 5, mean, na.policy="only", na.rm=TRUE)
na_count <- terra::freq(r_fill, value = NA)
}
Once all NA's are filled, I clip the raster again using the shapefile
r_fill <- terra::crop(r_fill, v, mask = T, touches = T)
This is what my before and after looks like:
I wondered if the while loop is an efficient way to fill the NAs or basically determine how many times I have to run focal to fill all the NAs in the raster.
Perhaps we can, or want to, dispense with the while( altogether by making a better estimate of focal('s w= arg in a world where r, as ground truth, isn't available. Were it available, we could readily derive direct value of w
r <- rast(system.file("ex/elev.tif", package="terra"))
# and it's variants
r2 <- r
r2[45:60, 45:60] <- NA
freq(r2, value=NA) - freq(r, value=NA)
layer value count
1 0 NA 256
sqrt((freq(r2, value=NA) - freq(r, value=NA))$count)
[1] 16
which might be a good value for w=, and introducing another variant
r3 <- r
r3[40:47, 40:47] <- NA
r3[60:67, 60:67] <- NA
r3[30:37, 30:37] <- NA
r3[70:77, 40:47] <- NA
rm(r)
We no longer have our ground truth. How might we estimate an edge of w=? Turning to boundaries( default values (inner)
r2_bi <- boundaries(r2)
r3_bi <- boundaries(r3)
# examining some properties of r2_bi, r3_bi
freq(r2_bi, value=1)$count
[1] 503
freq(r3_bi, value=1)$count
[1] 579
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
sum(freq(r2_bi, value=1)$count,freq(r2_bi, value = 0)$count)
[1] 4352
sum(freq(r3_bi, value=1)$count,freq(r3_bi, value = 0)$count)
[1] 4352
Taken in reverse order, sum[s] and freq[s] suggest that while the total area of (let's call them holes) are the same, they differ in number and r2 is generally larger than r3. This is also clear from the first pair of freq[s].
Now we drift into some voodoo, hocus pocus in pursuit of a better edge estimate
sum(freq(r2)$count) - sum(freq(r2, value = NA)$count)
[1] 154
sum(freq(r3)$count) - sum(freq(r3, value = NA)$count)
[1] 154
(sum(freq(r3)$count) - sum(freq(r3, value = NA)$count))
[1] 12.40967
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r2_bi, value=0)$count/freq(r2_bi, value = 1)$count
[1] 7.652087
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
taking the larger, i.e. freq(r2_bi 7.052087
7.652087/0.1306833
[1] 58.55444
154+58
[1] 212
sqrt(212)
[1] 14.56022
round(sqrt(212)+1)
[1] 16
Well, except for that +1 part, maybe still a decent estimate for w=, to be used on both r2 and r3 if called upon to find a better w, and perhaps obviate the need for while(.
Another approach to looking for squares and their edges:
wtf3 <- values(r3_bi$elevation)
wtf2 <- values(r2_bi$elevation)
wtf2_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf2)))$lengths))
wtf3_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf3)))$lengths))
names(wtf2_tbl_df2)
[1] "Var1" "Freq"
wtf2_tbl_df2[which(wtf2_tbl_df2$Var1 == wtf2_tbl_df2$Freq), ]
Var1 Freq
14 16 16
wtf3_tbl_df2[which(wtf3_tbl_df2$Freq == max(wtf3_tbl_df2$Freq)), ]
Var1 Freq
7 8 35
35/8
[1] 4.375 # 4 squares of 8 with 3 8 length vectors
bringing in v finally and filling
v <- vect(system.file("ex/lux.shp", package="terra"))
r2_fill_17 <- focal(r2, 16 + 1 , mean, na.policy='only', na.rm = TRUE)
r3_fill_9 <- focal(r3, 8 + 1 , mean, na.policy='only', na.rm = TRUE)
r2_fill_17_cropv <- crop(r2_fill_17, v, mask = TRUE, touches = TRUE)
r3_fill_9_cropv <- crop(r3_fill_9, v, mask = TRUE, touches = TRUE)
And I now appreciate your while( approach as your r2 looks better, more naturally transitioned, though the r3 looks fine. In my few, brief experiments with smaller than 'hole', i.e. focal(r2, 9, I got the sense it would take 2 passes to fill, that suggests focal(r2, 5 would take 4.
I guess further determining the proportion of fill:hole:rast for when to deploy a while would be worthwhile.

Pi Estimator in R

The code below estimates pi in R, now I am trying to find the minimum number of terms N_Min
you would have to include in your estimate of pie to make it accurate to three decimal places.
pi_Est<- function(NTerms){
NTerms = 5 # start with an estimate of just five terms
pi_Est = 0 # initialise the value of pi to zero
Sum_i = NA # initialise the summation variable to null
for(ii in 1:NTerms)
{
Sum_i[ii] = (-1)^(ii+1)/(2*ii - 1) # this is the series equation for calculating pi
}
Sum_i = 4*Sum_i # multiply by four as required in the formula (see lecture notes)
pi_Est = sum(Sum_i)
cat('\nThe estimate of pi with terms = ', NTerms ,' is ',pi_Est)
}
First of all, I would change some things about your function. Instead of getting it to print out a message, get it to return a value. Otherwise it becomes very difficult to do anything with its output, including testing it for convergence to pi.
Also, no matter what the value of NTerms is you feed this function, you are immediately over-writing NTerms inside the function.
You could rewrite the function like this:
pi_Est <- function(NTerms) {
pi_Est <- 0
Sum_i <- numeric()
for(ii in seq(NTerms))
{
Sum_i[ii] <- (-1)^(ii+1)/(2*ii - 1)
}
return(sum(4 * Sum_i))
}
And to show it converges to pi, let's test it with 50,000 terms:
pi_Est(50000)
#> [1] 3.141573
Now, if we want to find the first value of NTerms that is correct to 3 decimal places, we are going to need to be able to call this function on a vector of NTerms - at the moment it is only working on a single number. So let's define the function f that vectorizes pi_Est:
f <- Vectorize(pi_Est)
Now, let's create the estimate for all values of NTerms between 1 and 2,000 and store them in a vector:
estimates <- f(1:2000)
We can see that the values of estimates seem to oscillate round and converge to pi if we plot the first 100 values:
plot(estimates[1:100], type = 'l')
abline(h = pi)
Our answer is just the first value which, when rounded to three decimal places, is the same as pi rounded to three decimal places:
result <- which(round(estimates, 3) == round(pi, 3))[1]
result
#> [1] 1103
And we can check this is correct by feeding 1103 into our original function:
pi_Est(result)
#> [1] 3.142499
You will see that this gives us 3.142, which is the same as pi rounded to 3 decimal places.
Created on 2022-01-31 by the reprex package (v2.0.1)
1000 terms are required to make the estimate accurate to within 0.001:
pi_Est1 <- function(n) {
if (n == 0) return(0)
neg <- 1/seq(3, 2*n + 1, 4)
if (n%%2) neg[length(neg)] <- 0
4*sum(1/seq(1, 2*n, 4) - neg)
}
pi_Est2 <- function(tol) {
for (i in ceiling(1/tol + 0.5):0) {
est <- pi_Est1(i)
if (abs(est - pi) > tol) break
est1 <- est
}
list(NTerms = i + 1, Estimate = est1)
}
tol <- 1e-3
pi_Est2(tol)
#> $NTerms
#> [1] 1000
#>
#> $Estimate
#> [1] 3.140593
tol - abs(pi - pi_Est2(tol)$Estimate)
#> [1] 2.500001e-10
tol - abs(pi - pi_Est1(pi_Est2(tol)$NTerms - 1))
#> [1] -1.00075e-06
Created on 2022-01-31 by the reprex package (v2.0.1)
Perhaps we can try the code below
pi_Est <- function(digits = 3) {
s <- 0
ii <- 1
repeat {
s <- s + 4 * (-1)^(ii + 1) / (2 * ii - 1)
if (round(s, digits) == round(pi, digits)) break
ii <- ii + 1
}
list(est = s, iter = ii)
}
and you will see
> pi_Est()
$est
[1] 3.142499
$iter
[1] 1103
> pi_Est(5)
$est
[1] 3.141585
$iter
[1] 130658
Why not use a single line of code for the calculation?
Pi <- tail(cumsum(4*(1/seq(1,4*50000000,2))*rep(c(1,-1), 50000000)),1)

optimize R code for min() and sample() by group

I generate a network with npeople(=80), ncomp(=4) components and I want each component to have density equal to dens(=0.2).
I want to optimize 2 lines of the code which take most of the time (especially if I want to have 5k people in the network).
the 2 lines are:
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
I have tried using the lapply() function, but the execution time increased - see below the line of code:
nodes[,lapply(.SD, function(p) min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
rm(list=ls())
library(data.table)
library(intergraph)
library(igraph)
library(Matrix)
library(profvis)
library(ggplot2)
draw.var <- function(n, var1, rho, mean){
C <- matrix(rho, nrow = 2, ncol = 2)
diag(C) <- 1
C <- chol(C)
S <- rnorm(n, mean = mean)
S <- cbind(scale(var1)[1:n],S)
ZS <- S %*% C
return(ZS[,2])
}
set.seed(1123)
profvis({
# create empty list to store data
dt.list <- list()
npeople <- 500
dens <- .2
OC.impact <- FALSE
cor_iv_si <- .6
cor_iv_uc <- 0
cor_uc_oc <- 0.6
ncomp <- 4
beta_oc <- 2 # observed characteristics
beta_uc <- 2 # unobserved characteristics
beta_si <- 1
# create data.table
dt.people <- data.table(ego = 1:npeople)
# draw observed characteristics
dt.people[, OC := abs(rt(npeople,2))]
# draw unobserved variable
dt.people[, UC := draw.var(npeople, dt.people$OC, rho = cor_uc_oc,mean = 5)]
# set component idientifier
dt.people$group <- cut_number(dt.people$UC, ncomp,labels = F)
for(q in 1:ncomp){
# subset comp
dt.sub <- dt.people[group == q]
# create undirected graph
nodes <- as.data.table(t(combn(dt.sub$ego, 2)))
setnames(nodes,c("ID","ALTERID"))
# add attributes
nodes <- merge(nodes,dt.people[,list(ID = ego, ID.UC = UC, ID.OC = OC)], by = "ID")
nodes <- merge(nodes,dt.people[,list(ALTERID = ego, ALTERID.UC = UC, ALTERID.OC = OC)], by = "ALTERID")
# calculate distance
nodes[,d := abs(ID.UC - ALTERID.UC)]
# estimate the appropiate density per component
n.edges <- (dens * (npeople * (npeople - 1)))/ncomp
n.nodes <- npeople/ncomp
c.dens <- n.edges/(n.nodes * (n.nodes - 1))
# estimate initial probability of tie based on distance
coefficient <- log(c.dens / (1 - c.dens))
alpha <- coefficient / mean(nodes$d)
nodes[,p := exp(alpha * d) / (1 + exp(alpha * d))]
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
# keep the edges
nodes <- nodes[edge == 1,list(ID,ALTERID)]
# bind the networks
if(q == 1){
net <- copy(nodes)
} else{
net <- rbind(net,nodes)
}
}
# create opposide direction
net <- rbind(net,net[,list(ID = ALTERID, ALTERID = ID)])
})
This incorporates #BenBolker and # DavidArenburg's suggestions and also incorporates some of data.table's tools.
Non-Equi joins
The OP code loops through each group. One part of the code also uses combn and multiple joins to get the data in the right format. Using non-equi joins, we can combine all of those steps in one data.table call
dt_non_sub <- dt.people[dt.people,
on = .(ego < ego, group = group),
allow.cartesian = T,
nomatch = 0L,
.(group,
ALTERID = i.ego, ID = x.ego,
ID.UC = UC, ID.OC = OC,
ALTERID.OC = i.OC, ALTERID.UC = i.UC,
d = abs(UC - i.UC)) #added to be more efficient
]
# dt_non_sub[, d:= abs(ID.UC - ALTERID.UC)]
Vectorization
The original code was mostly slow because of two calls with by groupings. Since each call split the dataframe in around 8,000 individual groups, there were 8,000 functions calls each time. This eliminates those by using pmin as suggested by #DavidArenburg and then uses runif(N)<p as suggested by #BenBolker. My addition was that since your final result don't seem to care about p, I only assigned the edge by using {} to only return the last thing calculated in the call.
# alpha <- coefficient / mean(nodes$d)
dt_non_sub[,
edge := {
alpha = coefficient / mean(d)
p = exp(alpha * d) / (1 + exp(alpha * d))
p_mean = mean(p)
p = pmin(1, p * (1/(p_mean / c.dens)))
as.numeric(runif(.N)<p)
}
, by = .(group)]
net2 <- rbindlist(dt_non_sub[edge == 1, .(group, ALTERID, ID)],
dt_non_sub[edge == 1, .(group, ID = ALTERID, ALTERID = ID)]
One thing to note is that the vectorization is not 100% identical. Your code was recursive, each split updated the mean(node$p) for the next ID, ALTERID group. If you need that recursive part of the call, there's not much help to make it faster.
In the end, the modified code runs in 20 ms vs. the 810 ms of your original function. The results, while different, are somewhat similar in the total number of results:
Original:
net
ID ALTERID
1: 5 10
2: 10 14
3: 5 25
4: 10 25
5: 14 25
---
48646: 498 458
48647: 498 477
48648: 498 486
48649: 498 487
48650: 498 493
Modified
net2
group ALTERID ID
1: 2 4 3
2: 2 6 4
3: 4 7 1
4: 4 8 7
5: 2 9 4
---
49512: 3 460 500
49513: 3 465 500
49514: 3 478 500
49515: 3 482 500
49516: 3 497 500

Optimizing K-means clustering using Genetic Algorithm

I have the following dataset (obtained here):
----------item survivalpoints weight
1 pocketknife 10 1
2 beans 20 5
3 potatoes 15 10
4 unions 2 1
5 sleeping bag 30 7
6 rope 10 5
7 compass 30 1
I can cluster this dataset into three clusters with kmeans() using a binary string as my initial choice of centers. For eg:
## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude first column (kmeans only support continous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$clusters
# [1] 1 3 3 1 2 1 2
Using this fundamental concept, I tried it out with GA package to conduct the search where I am trying to optimize(minimize) Davies-Bouldin (DB) Index.
library(GA) ## for ga() function
library(clusterSim) ## for index.DB() function
## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
## converting matrix to vector to access each row
binary_rep <- split(x, row(x))
## evaluate the fitness of each chromsome
for(each in 1:nrow(x){
cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
## minimizing db
return(-dbi)
}
}
g<- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Of course (I have no idea what's happening), I received error message of
Warning messages:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
How can correctly use the GA package to solve my problem?
How can I make sure the randomly generated chromosomes contains the same number of 1s which corresponds to k number of clusters (eg. if k=3 then the chromosome must contain exactly three 1s)?
I can't comment on the sense of combining k-means with ga, but I can point out that you had issue in your fitness function. Also, errors are produced when all genes are on or off, so fitness is only calculated when that is not the case:
DBI <- function(x) {
if(sum(x)==nrow(dataset) | sum(x)==0){
score <- 0
} else {
cl <- kmeans(dataset[, -1], dataset[x==1, -1])
dbi <- index.DB(dataset[,-1], cl=cl$cluster, centrotypes = "centroids")
score <- dbi$DB
}
return(score)
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g#solution
g#fitnessValue
Looks like several gene combinations produced the same "best" fitness value

view values used by function boot to bootstrap estimates

I have written the code below to obtain a bootstrap estimate of a mean. My objective is to view the numbers selected from the data set, ideally in the order they are selected, by the function boot in the boot package.
The data set only contains three numbers: 1, 10, and 100 and I am only using two bootstrap samples.
The estimated mean is 23.5 and the R code below indicates that the six numbers included one '1', four '10' and one '100'. However, there are 30 possible combinations of those numbers that would have resulted in a mean of 23.5.
Is there a way for me to determine which of those 30 possible combinations is the combination that actually appeared in the two bootstrap samples?
library(boot)
set.seed(1234)
dat <- c(1, 10, 100)
av <- function(dat, i) { sum(dat[i])/length(dat[i]) }
av.boot <- boot(dat, av, R = 2)
av.boot
#
# ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
# Call:
# boot(data = dat, statistic = av, R = 2)
#
#
# Bootstrap Statistics :
# original bias std. error
# t1* 37 -13.5 19.09188
#
mean(dat) + -13.5
# [1] 23.5
# The two samples must have contained one '1', four '10' and one '100',
# but there are 30 possibilities.
# Which of these 30 possible sequences actual occurred?
# This code shows there must have been one '1', four '10' and one '100'
# and shows the 30 possible combinations
my.combos <- expand.grid(V1 = c(1, 10, 100),
V2 = c(1, 10, 100),
V3 = c(1, 10, 100),
V4 = c(1, 10, 100),
V5 = c(1, 10, 100),
V6 = c(1, 10, 100))
my.means <- apply(my.combos, 1, function(x) {( (x[1] + x[2] + x[3])/3 + (x[4] + x[5] + x[6])/3 ) / 2 })
possible.samples <- my.combos[my.means == 23.5,]
dim(possible.samples)
n.1 <- rowSums(possible.samples == 1)
n.10 <- rowSums(possible.samples == 10)
n.100 <- rowSums(possible.samples == 100)
n.1[1]
n.10[1]
n.100[1]
length(unique(n.1)) == 1
length(unique(n.10)) == 1
length(unique(n.100)) == 1
I think you can determine the numbers sampled and the order in which they are sampled with the code below. You have to extract the function ordinary.array from the boot package and paste that function into your R code. Then specify the values for n, R and strata, where n is the number of observations in the data set and R is the number of replicate samples you want.
I do not know how general this approach is, but it worked with a couple of simple examples I tried, including the example below.
library(boot)
set.seed(1234)
dat <- c(1, 10, 100, 1000)
av <- function(dat, i) { sum(dat[i])/length(dat[i]) }
av.boot <- boot(dat, av, R = 3)
av.boot
#
# ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
# Call:
# boot(data = dat, statistic = av, R = 3)
#
#
# Bootstrap Statistics :
# original bias std. error
# t1* 277.75 -127.5 132.2405
#
#
mean(dat) + -127.5
# [1] 150.25
# boot:::ordinary.array
ordinary.array <- function (n, R, strata)
{
inds <- as.integer(names(table(strata)))
if (length(inds) == 1L) {
output <- sample.int(n, n * R, replace = TRUE)
dim(output) <- c(R, n)
}
else {
output <- matrix(as.integer(0L), R, n)
for (is in inds) {
gp <- seq_len(n)[strata == is]
output[, gp] <- if (length(gp) == 1)
rep(gp, R)
else bsample(gp, R * length(gp))
}
}
output
}
# I think the function ordinary.array determines which elements
# of the data are sampled in each of the R samples
set.seed(1234)
ordinary.array(n=4,R=3,1)
# [,1] [,2] [,3] [,4]
# [1,] 1 3 1 3
# [2,] 3 4 1 3
# [3,] 3 3 3 3
#
# which equals:
((1+100+1+100) / 4 + (100+1000+1+100) / 4 + (100+100+100+100) / 4) / 3
# [1] 150.25

Resources