Finding a subset in a large dataset (>100) optimizing a parameter - r

I have a large matrix having this structure :
   A B C
A' 9 2 0
B' 2 8 0
C' 0 1 7
The diagonal terms represent the interaction of an individual (A) with his/her brother/sister (A'). Off diagonal elements represent the interaction of an individual with individuals not part of the family.
From a large set of individuals (say a few hundreds), I would like to find subsets (say 10) of individuals having minimal interactions with individuals not part of the family.
I was thinking of using a genetic algorithm (to optimize a parameter that I could calculate from the matrix) but could not find any algorithm that deals with subsets.
Is there a package in R (preferable) doing this ?
Thanks

I'll outline a binary genetic algorithm approach using the GA package interface, which you can use as a baseline for your own implementation. Because you want a subset of a specific length, I will create mutation and crossover operators that preserve that constraint (otherwise those operators would mostly create infeasible individuals).
mutation
mutation <- function(obj, parent){
vec <- as.vector(obj@population[parent,])
ind1 <- which(vec == 1)
ind2 <- setdiff(seq_along(vec), ind1)
ind1 <- sample(ind1, 1)
ind2 <- sample(ind2, 1)
vec[c(sample(ind1, 1), sample(ind2, 1))] <- c(0, 1)
vec
}
Mutation picks one position holding a 1 and one position holding a 0 and changes their values to 0 and 1 respectively, so the number of ones stays the same.
crossover
sizeCrossover <- function(size){
function(obj, parents){
vec1 <- obj@population[parents[1],] == 1
vec2 <- obj@population[parents[2],] == 1
c1 <- vec1 & vec2
c1[sample(which(!c1), size - sum(c1))] <- 1
c2 <- vec1 | vec2
c2[sample(which(c2), sum(c2) - size)] <- 0
list(children = rbind(c1, c2), fitness = c(NA, NA))
}
}
Crossover is a variation of arithmetic crossover. The & child has at most size ones, so it needs some additional 0s changed to 1; the | child has at least size ones, so it needs some 1s changed to 0.
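To illustrate why both children keep the required subset length, here is a minimal standalone sketch of the same idea on two plain 0/1 vectors. The names size_crossover and safe_sample are hypothetical helpers introduced only for this illustration; they are not part of the GA package.

```r
# Hypothetical helper: sample from x by position, safe when length(x) == 1.
safe_sample <- function(x, size) x[sample.int(length(x), size)]

# Hypothetical helper: size-preserving crossover on two plain 0/1 vectors.
size_crossover <- function(p1, p2, size) {
  a <- p1 == 1
  b <- p2 == 1
  child_and <- a & b                      # bits shared by both parents
  add <- safe_sample(which(!child_and), size - sum(child_and))
  child_and[add] <- TRUE                  # top up to `size` ones
  child_or <- a | b                       # bits present in either parent
  drop <- safe_sample(which(child_or), sum(child_or) - size)
  child_or[drop] <- FALSE                 # trim down to `size` ones
  rbind(as.integer(child_and), as.integer(child_or))
}

set.seed(1)
p1 <- c(1, 1, 1, 0, 0, 0, 0, 0)
p2 <- c(0, 1, 1, 1, 0, 0, 0, 0)
children <- size_crossover(p1, p2, size = 3)
rowSums(children)  # each child has exactly 3 ones
```

The sketch assumes both parents already contain exactly size ones, which is what the custom operators guarantee.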
initial population
initialPopulation <- function(popSize, N, size){
indices <- do.call(
rbind,
mapply(
function(x, y) cbind(x, y),
seq_len(popSize),
replicate(popSize, sample(seq_len(N), size), simplify = FALSE),
SIMPLIFY = FALSE
)
)
mm <- matrix(0, nrow = popSize, ncol = N)
mm[indices] <- 1
mm
}
If you create a completely random initial population, most (if not all) individuals will be infeasible, so you need to create a feasible initial population yourself.
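As a quick check of the feasibility requirement, the following sketch builds a small population in a slightly simpler way and confirms that every row contains exactly size ones. The function name init_pop is a hypothetical helper used only here.

```r
# Hypothetical helper: each row is a 0/1 vector with exactly `size` ones.
init_pop <- function(pop_size, n_bits, size) {
  t(replicate(pop_size, {
    v <- integer(n_bits)
    v[sample.int(n_bits, size)] <- 1L   # choose `size` distinct positions
    v
  }))
}

set.seed(123)
pop <- init_pop(pop_size = 6, n_bits = 12, size = 4)
dim(pop)      # 6 rows (individuals), 12 columns (bits)
rowSums(pop)  # every individual has 4 ones
```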
fitness
fitness <- function(vec, m, size){
indices <- which(vec == 1)
-sum(m[indices,indices])
}
GA::ga performs maximization, hence the minus sign.
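For a small matrix the optimum can be found by exhaustive enumeration, which gives a reference value for sanity-checking the genetic algorithm. This is only a sketch under the assumption that the cost is the sum of the sub-matrix with a zeroed diagonal, as in the fitness function above; subset_cost is a hypothetical name.

```r
set.seed(42)
N <- 8      # small enough to enumerate all subsets
size <- 3
m <- matrix(sample(0:10, N^2, TRUE), nrow = N)
diag(m) <- 0

# Cost of a subset: total interaction among its members (to be minimized).
subset_cost <- function(idx, m) sum(m[idx, idx])

cands <- combn(N, size)                          # one candidate subset per column
costs <- apply(cands, 2, subset_cost, m = m)
best <- cands[, which.min(costs)]
best
min(costs)
```

This does not scale to the sizes in the question (choose(200, 10) is far too many subsets), which is why a heuristic search is used there.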
random data and parameters
N <- 200 # matrix dimensions
size <- 10 # subset length
popSize <- 100 # population size for genetic algorithm
m <- matrix(sample(0:10, N^2, TRUE), nrow = N)
diag(m) <- 0
ga
obj <- GA::ga(
type = "binary",
nBits = N,
run = 500,
maxiter = 10000,
popSize = popSize,
fitness = fitness,
m = m,
size = 10,
suggestions = initialPopulation(popSize, N, size),
mutation = mutation,
crossover = sizeCrossover(size)
)
subset <- which(obj@solution[1,] == 1)
note
I'm using sample as modified in Advanced R (the %||% operator comes from rlang, or from base R as of version 4.4.0):
sample <- function(x, size = NULL, replace = FALSE, prob = NULL) {
size <- size %||% length(x)
x[sample.int(length(x), size, replace = replace, prob = prob)]
}

The problem in the question is not fully specified, so assume it is as follows. Let v be the vector of column sums of m - diag(diag(m)) and let k be the required number of individuals in the subset. We then want the column names corresponding to the k smallest values in v or, in the language of integer linear programming, the 0/1 vector x that minimizes v'x subject to sum(x) = k.
We use the inputs and v defined in the Note at the end.
1) sort To minimize this we simply take the column names corresponding to the k smallest values of v.
names(head(sort(v), k))
## [1] "C" "A"
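As a cross-check on the sort-based answer, here is a small base R sketch that enumerates every subset of size k for the example matrix and compares the result. It re-creates the inputs from the Note so that it runs on its own; by_sort and by_enum are names introduced only for this illustration.

```r
k <- 2
m <- matrix(c(9, 2, 0, 2, 8, 1, 0, 0, 7), nrow = 3,
            dimnames = list(c("A'", "B'", "C'"), c("A", "B", "C")))
v <- colSums(m - diag(diag(m)))   # off-diagonal column sums

by_sort <- names(head(sort(v), k))

# Exhaustive check: total of v over every subset of size k.
cands <- combn(length(v), k)
totals <- apply(cands, 2, function(idx) sum(v[idx]))
by_enum <- names(v)[cands[, which.min(totals)]]

setequal(by_sort, by_enum)  # TRUE: both pick the same individuals
```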
2) knapsack This can also be expressed as a knapsack problem. knapsack maximizes so we use max(v)-v to get the effect of minimization.
library(adagio)
res.knap <- knapsack(rep(1, length(v)), max(v) - v, k)
names(v)[res.knap$indices]
## [1] "A" "C"
3) linear programming We can also use integer linear programming with the following solution.
library(lpSolve)
res.lp <- lp("min", v, t(rep(1, length(v))), "=", k, all.bin = TRUE)
res.lp
## Success: the objective function is 2
names(v)[res.lp$solution == 1]
## [1] "A" "C"
4) genalg We use the rbga.bin genetic algorithm.
library(genalg)
set.seed(13)
f <- function(x) if (sum(x) == k) sum(x * v) else Inf
res.ga <- rbga.bin(size = length(v), evalFunc = f, popSize = 200,
mutation = .01, iters = 400)
cat(summary(res.ga))
## ...snip...
## GA Results
## Best Solution : 1 0 1
Note
The inputs are assumed to be:
k <- 2
m <- structure(c(9L, 2L, 0L, 2L, 8L, 1L, 0L, 0L, 7L), dim = c(3L,
3L), dimnames = list(c("A'", "B'", "C'"), c("A", "B", "C")))
v <- colSums(m - diag(diag(m)))


Area Under the Curve using Simpson's rule in R

I would like to compute the Area Under the Curve defined by a set of experimental values. I created a function to calculate an approximation of the AUC using Simpson's rule, as I saw in this post. However, the function only works when it receives a vector of odd length. How can I modify the code to add the area of the last trapezoid when the input vector has an even length?
AUC <- function(x, h=1){
# AUC function computes the Area Under the Curve of a time serie using
# the Simpson's Rule (numerical method).
# https://link.springer.com/chapter/10.1007/978-1-4612-4974-0_26
# Arguments
# x: (vector) time serie values
# h: (int) temporal resolution of the time serie. default h=1
n = length(x)-1
xValues = seq(from=1, to=n, by=2)
sum <- list()
for(i in 1:length(xValues)){
n_sub <- xValues[[i]]-1
n <- xValues[[i]]
n_add <- xValues[[i]]+1
v1 <- x[[n_sub+1]]
v2 <- x[[n+1]]
v3 <- x[[n_add+1]]
s <- (h/3)*(v1+4*v2+v3)
sum <- append(sum, s)
}
sum <- unlist(sum)
auc <- sum(sum)
return(auc)
}
Here a data example:
smoothed = c(0.3,0.317,0.379,0.452,0.519,0.573,0.61,0.629,0.628,0.613,0.587,0.556,0.521,
0.485,0.448,0.411,0.363,0.317,0.273,0.227,0.185,0.148,0.12,0.103,0.093,0.086,
0.082,0.079,0.076,0.071,0.066,0.059,0.053,0.051,0.052,0.057,0.067,0.081,0.103,
0.129,0.165,0.209,0.252,0.292,0.328,0.363,0.398,0.431,0.459,0.479,0.491,0.494,
0.488,0.475,0.457,0.43,0.397,0.357,0.316,0.285,0.254,0.227,0.206,0.189,0.181,
0.171,0.157,0.151,0.162,0.192,0.239)
One recommended way to handle an even number of points and still achieve precision is to combine Simpson's 1/3 rule with Simpson's 3/8 rule, which can handle an even number of points. Such approaches can be found in (at least one or perhaps more) engineering textbooks on numerical methods.
However, as a practical matter, you can write a code chunk to check the data length and add a single trapezoid at the end, as was suggested in the last comment of the post to which you linked. I wouldn't assume that it is necessarily as precise as combining Simpson's 1/3 and 3/8 rules, but it is probably reasonable for many applications.
I would double-check my code edits below, but this is the basic idea.
AUC <- function(x, h=1){
# AUC function computes the Area Under the Curve of a time serie using
# the Simpson's Rule (numerical method).
# https://link.springer.com/chapter/10.1007/978-1-4612-4974-0_26
# Arguments
# x: (vector) time serie values
# h: (int) temporal resolution of the time serie. default h=1
#jh edit: check for even data length
#and chop off last data point if even
nn = length(x)
if(length(x) %% 2 == 0){
xlast = x[length(x)]
x = x[-length(x)]
}
n = length(x)-1
xValues = seq(from=1, to=n, by=2)
sum <- list()
for(i in 1:length(xValues)){
n_sub <- xValues[[i]]-1
n <- xValues[[i]]
n_add <- xValues[[i]]+1
v1 <- x[[n_sub+1]]
v2 <- x[[n+1]]
v3 <- x[[n_add+1]]
s <- (h/3)*(v1+4*v2+v3)
sum <- append(sum, s)
}
sum <- unlist(sum)
auc <- sum(sum)
##jh edit: add trapezoid for last two data points to result
if(nn %% 2 == 0){
auc <- auc + (x[length(x)] + xlast)/2 * h
}
return(auc)
}
sm = smoothed[-length(smoothed)]
length(sm)
[1] 70
#even data as an example
AUC(sm)
[1] 20.17633
#original odd data
AUC(smoothed)
[1] 20.389
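The same idea can be written without the loop. The following is a hedged, vectorized sketch (auc_simpson is a hypothetical name, not the code above): it applies composite Simpson's 1/3 weights 1, 4, 2, ..., 4, 1 to an odd-length series and, for even-length input, handles the last interval with a single trapezoid.

```r
# Hypothetical helper: composite Simpson's 1/3 rule on equally spaced data,
# with one trapezoid for the final interval when length(x) is even.
auc_simpson <- function(x, h = 1) {
  n <- length(x)
  stopifnot(n >= 3)
  tail_area <- 0
  if (n %% 2 == 0) {
    tail_area <- (x[n - 1] + x[n]) / 2 * h   # trapezoid over the last interval
    x <- x[-n]
    n <- n - 1
  }
  inner <- x[-c(1, n)]
  w <- rep(c(4, 2), length.out = length(inner))  # weights 4, 2, 4, ..., 4
  h / 3 * (x[1] + sum(w * inner) + x[n]) + tail_area
}

# Simpson's rule is exact for t^2: the integral over [0, 4] is 64/3.
auc_simpson((0:4)^2)
# Even length: Simpson on [0, 4] plus a trapezoid on [4, 5].
auc_simpson((0:5)^2)
```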
There may be a good reason for you to prefer using Simpson's rule, but if you're just looking for a quick and efficient estimate of AUC, the trapezoid rule is far easier to implement, and does not require an even number of breaks:
AUC <- function(x, h = 1) sum((x[-1] + x[-length(x)]) / 2 * h)
AUC(smoothed)
#> [1] 20.3945
Here, I show example code that uses the Simpson's 1/3 and 3/8 rules in tandem for the numerical integration of data. As always, the usual caveats about the possibility of coding errors or compatibility issues apply.
The output at the end compares the numerical estimates of this algorithm with the trapezoidal rule using R's "integrate" function.
#Algorithm adapted from:
#Numerical Methods for Engineers, Seventh Edition,
#By Chapra and Canale, page 623
#Modified to accept data instead of functional values
#Modified by: Jeffrey Harkness, M.S.
##Begin Simpson's rule function code
simp13 <- function(dat, h = 1){
ans = 2*h*(dat[1] + 4*dat[2] + dat[3])/6
return(ans)}
simp13m <- function(dat, h = 1){
summ <- dat[1]
n <- length(dat)
nseq <- seq(2,(n-2),2)
for(i in nseq){
summ <- summ + 4*dat[i] + 2*dat[i+1]}
summ <- summ + 4*dat[n-1] + dat[n]
result <- (h*summ)/3
return(result)}
simp38 <- function(dat, h = 1){
ans <- 3*h*(dat[1] + 3*sum(dat[2:3]) + dat[4])/8
return(ans)}
simpson = function(dat, h = 1){
hin = h
len = length(dat)
comp <- len %% 2
##number of segments
if(len == 2){
ans = sum(dat)/2*h} ##n = 2 is the trapezoidal rule
if(len == 3){
ans = simp13(dat, h = hin)}
if(len == 4){
ans = simp38(dat,h = hin)}
if(len == 6){
ans <- simp38(dat[1:4],h = hin) + simp13(dat[4:len],h = hin)}
if(len > 6 & comp == 0){
ans = simp38(dat[1:4],h = hin) + simp13m(dat[4:len],h = hin)}
if(len >= 5 & comp == 1){
ans = simp13m(dat,h = hin)}
return(ans)}
##End Simpson's rule function code
This next section of code shows the performance comparison. This code can easily be altered for different test functions and cases.
The precision difference tends to change with the sample size and test function used; this example is not intended to imply that the difference is always this pronounced.
#other algorithm for comparison purposes, from Allan Cameron above
oa <- function(x, h = 1) sum((x[-1] + x[-length(x)]) / 2 * h)
#Testing and algorithm comparison code
simans = NULL; oaans = NULL; simerr = NULL; oaerr = NULL; mp = NULL
for( j in 1:10){
n = j
#f = function(x) cos(x) + 2 ##Test functions
f = function(x) 0.2 + 25*x - 200*x^2 + 675*x^3 - 900*x^4 + 400*x^5
a = 0;b = 10
h = (b-a)/n
datain = seq(a,b,by = h)
preans = integrate(f,a,b)$value #precise numerical estimate of test function
simans[j] = simpson(f(datain), h = h)
oaans[j] = oa(f(datain), h = h)
(simerr[j] = abs(simans[j] - preans)/preans * 100)
(oaerr[j] = abs(oaans[j] - preans)/preans * 100)
mp[j] = simerr[j] < oaerr[j]
}
(outframe = data.frame("simpsons percent diff" = simerr,"trapezoidal percent diff" = oaerr, "more precise?" = mp, check.names = F))
simpsons percent diff trapezoidal percent diff more precise?
1 214.73489738 214.734897 FALSE
2 15.07958148 64.993410 TRUE
3 6.70203621 29.816799 TRUE
4 0.94247384 16.955208 TRUE
5 0.54830021 10.905620 TRUE
6 0.18616767 7.593825 TRUE
7 0.12051767 5.588209 TRUE
8 0.05890462 4.282980 TRUE
9 0.04087107 3.386525 TRUE
10 0.02412733 2.744500 TRUE

Optimization under constraint under a list of possibilities in R

I'm trying to optimize a function of two variables in R. My concern is that these 2 variables can only take specific values. I found a solution with lower/upper limits using nloptr, but I'm not able to "force" the values taken by both variables. An example using the constrOptim function will be easier to understand:
g <- function(x,y) 100*x+150*y
gb <- function(x) g(x[1], x[2])
A <- matrix(c(1,0,0,1,100,150),3,2,byrow=T)
b <- c(0,0,350)
constrOptim(theta=c(2,2), f=gb, grad=NULL, ui=A, ci=b)
Thus, I want x & y to take the values 0, 1 or 2. In my example, the constraints are further written as x>=0,y>=0 and 100x+150y>=350.
My goal is to minimize 100*x+150*y respecting 100x+150y>=350 where x and y are taking values in c(0,1,2) only!
Depending on what features of the example apply to your actual problem you may be able to use brute force (if problem is not too large), integer linear programming (if objective and constraints are linear) or integer convex programming (if objective and constraints are convex). All of these hold for the example in the question.
# brute force
list(grid = expand.grid(x = 0:2, y = 0:2)) |>
with(cbind(grid, g = apply(grid, 1, gb))) |>
subset(g >= 350) |>
subset(g == min(g))
## x y g
## 6 2 1 350
# integer linear programming
library(lpSolve)
res <- lp("min", c(100, 150), A, c("<=", "<=", ">="), c(2, 2, 350), all.int = TRUE)
res
## Success: the objective function is 350
res$solution
## [1] 2 1
# integer convex programming
library(CVXR)
X <- Variable(2, integer = TRUE)
v <- c(100, 150)
objective <- Minimize(sum(v * X))
constraints <- list(X >= 0, X <= 2, sum(v * X) >= 350)
prob <- Problem(objective, constraints)
CVXR_result <- solve(prob)
CVXR_result$status
## [1] "optimal"
CVXR_result$getValue(X)
## [,1]
## [1,] 2.0000228
## [2,] 0.9999743
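The CVXR values above are only integral up to solver tolerance. A minimal sketch of a post-processing step, assuming the values reported above, is to round them and re-check the constraints (x_raw and x_int are names introduced here for illustration):

```r
x_raw <- c(2.0000228, 0.9999743)   # values reported by the solver above
x_int <- round(x_raw)              # snap to the nearest integers
v <- c(100, 150)

# Re-check feasibility after rounding.
all(x_int %in% 0:2)                # allowed values only
sum(v * x_int) >= 350              # linear constraint still holds
sum(v * x_int)                     # objective value
```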
Since your objective function and constraint are both linear, your problem is a standard Mixed Integer Linear Programming (MIP) problem. There is a collection of solvers to solve those problems. Here is a formulation using the ompr package as the model manager and the glpk solver:
library(magrittr) # provides %>%
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
g <- c(100, 150)
rhs <- 350
model <- MIPModel() %>%
add_variable(x[i], i = 1:2, type = "integer", lb = 0, ub = 2) %>%
set_objective(sum_over(g[i] * x[i], i = 1:2), "min") %>%
add_constraint(sum_over(g[i] * x[i], i = 1:2) >= rhs)
result <- solve_model(model, with_ROI(solver = "glpk"))
result
Status: success
Objective value: 350
solution <- get_solution(result, x[i])
solution
variable i value
1 x 1 2
2 x 2 1
ompr uses simple algebraic notation and is easy to learn:
https://dirkschumacher.github.io/ompr/index.html

Optimize the weight of vectors given the similarity matrix of mean vectors

I want to solve an optimization problem to find the best weights for groups of vectors. Could you suggest how to solve it in R? Thanks very much.
The problem is as follows.
Suppose there are N groups and we know the similarity matrix S among these N groups. The dimension of S is N*N.
In each group, there are K vectors. Each vector has M elements whose values are 0 or 1.
We can fit an average vector based on these K vectors. For example, average vector
Based on these average vectors in each group, we can calculate the correlation among them.
The objective is to minimize the difference between the correlation matrix C and the known similarity matrix S.
Because you didn't provide any data, I will generate random data and demonstrate one way to approach your problem.
Similarity matrix:
N <- 6
S <- matrix(runif(N^2, -1, 1), ncol = N, nrow = N)
similarity_matrix <- (S + t(S)) / 2
N is the number of groups. Each value of the similarity matrix is between -1 and 1 and the matrix is symmetric (because you want to compare it to a covariance matrix, this makes sense).
group vectors:
M <- 10
K <- 8
group_vectors <- replicate(N, replicate(K, sample(c(0, 1), M, TRUE)), FALSE)
M is dimension of vector and K is number of binary vectors in each group.
fitness function
fitness <- function(W, group_vectors, similarity_matrix){
W <- as.data.frame(matrix(W, nrow = K, ncol = N))
SS <- cov(
mapply(function(x,y) rowSums(sweep(x, 2, y, "*")), group_vectors, W)
)
sum(abs(SS - similarity_matrix))
}
For given weights, fitness calculates the described covariance matrix and its distance from similarity_matrix.
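To make the shapes explicit, here is a small standalone sketch of the intermediate step inside the fitness function: one weighted combination per group, then the covariance across groups. It uses small made-up dimensions and random weights purely for illustration; weighted and C are names introduced here.

```r
set.seed(1)
M <- 5; K <- 3; N <- 4
# One M x K binary matrix per group (columns are the K vectors).
group_vectors <- replicate(N, replicate(K, sample(c(0, 1), M, TRUE)),
                           simplify = FALSE)
W <- matrix(runif(K * N, -1, 1), nrow = K, ncol = N)  # one weight column per group

# Weighted combination of the K vectors in each group: result is M x N.
weighted <- sapply(seq_len(N), function(g) drop(group_vectors[[g]] %*% W[, g]))

C <- cov(weighted)   # N x N matrix compared against the similarity matrix
dim(weighted)
dim(C)
```

The matrix product gives the same numbers as the sweep and rowSums combination used in the fitness function.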
differential evolution approach
res <- DEoptim::DEoptim(
fn = fitness,
lower = rep(-1, K*N),
upper = rep(1, K*N),
group_vectors = group_vectors,
similarity_matrix = similarity_matrix,
control = DEoptim::DEoptim.control(VTR = 0, itermax = 1000, trace = 50, NP = 100)
)
W <- matrix(res$optim$bestmem, nrow = K, ncol = N)
genetic algorithm approach
res <- GA::ga(
type = "real-valued",
fitness = function(W, ...) -fitness(W, ...),
lower = rep(-1, K*N),
upper = rep(1, K*N),
group_vectors = group_vectors,
similarity_matrix = similarity_matrix,
maxiter = 10000,
run = 200
)
W <- matrix(res@solution[1,], nrow = K, ncol = N)

How to create the sampling matrixes for Sobol sensitivity analysis in R (package "sensitivity")

I would like to perform a Sobol sensitivity analysis in R
The package "sensitivity" should allow me to do so, but I don't understand how to generate the sampling matrixes (X1, X2). I have a model that runs outside of R. I have 6 parameters with uniform distribution.
In my text book: N = (2k+2)*M ; M = 2^b ; b=[8,12] (New sampling method : Wu et al. 2012)
I had the feeling that I should create two sampling matrix and feed the two to the sobol function X1_{M,k} X2_{M,k}.
The final sampling matrix x$X then has (k+2)*M rows, because:
X <- rbind(X1, X2)
for (i in 1:k) {
Xb <- X1
Xb[, i] <- X2[, i]
X <- rbind(X, Xb)
}
How should I conduct my sampling to get the right number of runs as (2*k+2)*M ?
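The row count can be checked directly with a small sketch that reproduces the stacking loop quoted above on placeholder uniform samples. It does not call the sensitivity package; the runif matrices simply stand in for X1 and X2 with M rows each.

```r
set.seed(1)
k <- 6        # number of parameters
M <- 2^8      # rows in each base sample
X1 <- matrix(runif(M * k), ncol = k)
X2 <- matrix(runif(M * k), ncol = k)

# Stack X1, X2, then k copies of X1 with column i taken from X2.
X <- rbind(X1, X2)
for (i in seq_len(k)) {
  Xb <- X1
  Xb[, i] <- X2[, i]
  X <- rbind(X, Xb)
}

nrow(X)                  # total model runs for this design
nrow(X) == (k + 2) * M   # TRUE
```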
This script is for the old method but does someone know if the new method is already implemented yet in the sensitivity package? Feel free to comment this procedure
name = c("a" , "b" , "c" , "d" , "e", "f")
vals <- list(list(var="a",dist="unif",params=list(min=0.1,max=1.5)),
list(var="b",dist="unif",params=list(min=-0.3,max=0.4)),
list(var="c",dist="unif",params=list(min=-0.3,max=0.3)),
list(var="d",dist="unif",params=list(min=0,max=0.5)),
list(var="e",dist="unif",params=list(min=2.4E-5,max=2.4E-3)),
list(var="f",dist="unif",params=list(min=3E-5,max=3E-3)))
k = 6
b = 8
M = 2^b
n <- 2*M
X1 <- makeMCSample(n,vals, p = 1)
X2 <- makeMCSample(n,vals, p = 2)
x <- sobol2007(model = NULL, X1, X2, nboot = 200)
If I understand correctly, I should provide a y for each x$X sampling combination.
Then I can use the function "tell", which will generate the Sobol' first-order indices as well as the total indices.
tell(x,y)
ggplot(x)
Supplemental R function SobolR
makeMCSample <- function(n, vals, p = 1) { # p selects the seed: p = 2 gives the second sample
# Packages to generate quasi-random sequences
# and rearrange the data
require(randtoolbox)
require(plyr)
# Generate a Sobol' sequence
if (p == 2){ sob <- sobol(n, length(vals), seed = 4321, scrambling = 1)
}else{sob <- sobol(n, length(vals), seed = 1234, scrambling = 1)}
# Fill a matrix with the values
# inverted from uniform values to
# distributions of choice
samp <- matrix(rep(0,n*(length(vals)+1)), nrow=n)
samp[,1] <- 1:n
for (i in 1:length(vals)) {
# i=1
l <- vals[[i]]
dist <- l$dist
params <- l$params
fname <- paste("q",dist,sep="")
samp[,i+1] <- do.call(fname,c(list(p=sob[,i]),params))
}
# Convert matrix to data frame and add labels
samp <- as.data.frame(samp)
names(samp) <- c("n",laply(vals, function(l) l$var))
return(samp)
}
ref: Qiong-Li Wu, Paul-Henry Cournède, Amélie Mathieu, 2012, Efficient computational method for global sensitivity analysis and its application to tree growth modelling

Genetic algorithm (GA) to select the optimal n values of a vector

I have to choose 10 elements of a vector to maximize a function. Since the vector is pretty long, there are too many possibilities (~1000 choose 10) to compute them all. So I started to look into the GA package to use a genetic algorithm.
I came up with this MWE:
values <- 1:1000
# Fitness function which I want to maximise
f <- function(x){
# Choose values
y <- values[x]
# From the first 10 sum up the odd values.
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
# Maximum value of f for this example
y <- ifelse(values %% 2 != 0, values, 0)
sum(sort(y, decreasing = TRUE)[1:10])
# [1] 9900
# Genetic algorithm
GA <- ga(type = "permutation", fitness = f, lower = rep(1, 10), upper = rep(1000, 10), maxiter = 100)
summary(GA)
The results are a bit underwhelming. From summary(GA), I get the feeling that the algorithm always permutes all 1000 values (the solution goes from x1 to x1000), which leads to an inefficient optimization. How can I tell the algorithm that it should only use 10 values (so the solution is x1 .. x10)?
You should read https://www.jstatsoft.org/article/view/v053i04. You don't have a permutation problem but a selection problem, so you should use the binary type of genetic algorithm. Because you want to select exactly 10 values (10 ones and 990 zeroes), you should probably write your own genetic operators, because that constraint will hardly ever be satisfied by the default operators (even if the fitness function returns -Inf whenever the number of ones is not 10). One approach:
Population (k tells how many ones you want):
myInit <- function(k){
function(GA){
m <- matrix(0, ncol = GA@nBits, nrow = GA@popSize)
for(i in seq_len(GA@popSize))
m[i, sample(GA@nBits, k)] <- 1
m
}
}
Crossover
myCrossover <- function(GA, parents){
# %>% comes from magrittr: library(magrittr)
parents <- GA@population[parents,] %>%
apply(1, function(x) which(x == 1)) %>%
t()
parents_diff <- vector("list", 2)
parents_diff[[1]] <- setdiff(parents[2,], parents[1,])
parents_diff[[2]] <- setdiff(parents[1,], parents[2,])
children_ind <- vector("list", 2)
for(i in 1:2){
k <- length(parents_diff[[i]])
change_k <- sample(k, sample(ceiling(k/2), 1))
children_ind[[i]] <- if(length(change_k) > 0){
c(parents[i, -change_k], parents_diff[[i]][change_k])
} else {
parents[i,]
}
}
children <- matrix(0, nrow = 2, ncol = GA@nBits)
for(i in 1:2)
children[i, children_ind[[i]]] <- 1
list(children = children, fitness = c(NA, NA))
}
Mutation
myMutation <- function(GA, parent){
ind <- which(GA@population[parent,] == 1)
n_change <- sample(3, 1)
ind[sample(length(ind), n_change)] <- sample(setdiff(seq_len(GA@nBits), ind), n_change)
parent <- integer(GA@nBits)
parent[ind] <- 1
parent
}
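To see that this kind of mutation keeps the number of selected values fixed, here is a minimal standalone sketch on a plain 0/1 vector. The name swap_mutation and its max_swaps argument are hypothetical and used only for this illustration; the sketch does not touch the GA object.

```r
# Hypothetical helper: swap up to `max_swaps` ones with the same number of zeros.
swap_mutation <- function(bits, max_swaps = 3) {
  ones <- which(bits == 1)
  zeros <- which(bits == 0)
  n_change <- min(sample.int(max_swaps, 1), length(ones), length(zeros))
  out_idx <- ones[sample.int(length(ones), n_change)]    # positions switched off
  in_idx <- zeros[sample.int(length(zeros), n_change)]   # positions switched on
  bits[out_idx] <- 0
  bits[in_idx] <- 1
  bits
}

set.seed(7)
parent <- integer(20)
parent[1:5] <- 1L
child <- swap_mutation(parent)
sum(child)            # still 5 ones
sum(child != parent)  # an even number of changed positions (2, 4 or 6)
```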
Fitness (your function adapted for binary GA):
f <- function(x, values){
ind <- which(x == 1)
y <- values[ind]
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
GA:
GA <- ga(
type = "binary",
fitness = f,
values = values,
nBits = length(values),
population = myInit(10),
crossover = myCrossover,
mutation = myMutation,
run = 300,
pmutation = 0.3,
maxiter = 10000,
popSize = 100
)
Chosen values
values[which(GA@solution[1,] == 1)]
