Knapsack problem in R for multiple persons using a genetic algorithm

I am trying to implement a genetic algorithm in R. I found that R has the 'GA' and 'genalg' packages for genetic algorithms. I came across the example at http://www.r-bloggers.com/genetic-algorithms-a-simple-r-example/, which solves the knapsack problem. The problem can be briefly explained as:
"You are going to spend a month in the wilderness. You’re taking a backpack with you, however, the maximum weight it can carry is 20 kilograms. You have a number of survival items available, each with its own number of 'survival points'. Your objective is to maximize the number of survival points."
The problem is easily solved with the 'genalg' package for a single person, and the output is a binary string. Now suppose that instead of one person there are two or more, and we need to distribute the survival points among them. The weight constraint applies to each person. How can we solve this problem? Can we use the 'genalg' or 'GA' package? If so, how can we apply them? Are there any examples of this that are solved in R or other software?
Thanks

The R package adagio (https://cran.r-project.org/web/packages/adagio/index.html) comes with two functions, knapsack and mknapsack, which solve this type of problem more efficiently using dynamic programming.
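To make the dynamic-programming idea concrete, here is a minimal base-R sketch for the single-person case with integer weights. It is an illustration only, not the adagio implementation; `knapsack_dp` is a hypothetical helper name, and the item data are taken from the question.

```r
# Item data from the question
weights <- c(1, 5, 10, 1, 7, 5, 1)
values  <- c(10, 20, 15, 2, 30, 10, 30)

# 0/1 knapsack by dynamic programming (assumes non-negative integer weights)
knapsack_dp <- function(weights, values, capacity) {
  n <- length(weights)
  # best[i + 1, cap + 1] = best value using the first i items with capacity cap
  best <- matrix(0, nrow = n + 1, ncol = capacity + 1)
  for (i in seq_len(n)) {
    for (cap in 0:capacity) {
      skip <- best[i, cap + 1]
      take <- if (weights[i] <= cap) {
        best[i, cap - weights[i] + 1] + values[i]
      } else {
        -Inf
      }
      best[i + 1, cap + 1] <- max(skip, take)
    }
  }
  # Walk back through the table to recover which items were taken
  chosen <- logical(n)
  cap <- capacity
  for (i in n:1) {
    if (best[i + 1, cap + 1] != best[i, cap + 1]) {
      chosen[i] <- TRUE
      cap <- cap - weights[i]
    }
  }
  list(value = best[n + 1, capacity + 1], chosen = chosen)
}

res <- knapsack_dp(weights, values, capacity = 20)
res$value
res$chosen
```

The table has one row per item and one column per capacity, so the cost grows with the number of items times the weight limit rather than with the number of possible subsets.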

A simple approach is to use one chromosome that contains all individuals in the group and to have the evaluation function split this chromosome into one part per individual, then evaluate each part. In the example below (based on the example in the question) I have assumed that each individual has the same weight limit and that multiple individuals can bring the same item.
library(genalg)
#Set up the problem parameters
#how many people in the group
individual_count <-3
#The weight limit for one individual
weightlimit <- 20
#The items with their survivalpoints
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "onions",
"sleeping bag", "rope", "compass"), survivalpoints = c(10, 20, 15, 2, 30,
10, 30), weight = c(1, 5, 10, 1, 7, 5, 1))
#Next, we choose the number of iterations, design and run the model.
iter <- 100
#Our chromosome has to be large enough to contain a bit for all individuals and for all items in the dataset
chromosomesize <- individual_count * nrow(dataset)
#Function definitions
#A function to split vector X in N equal parts
split_vector <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
#Evaluate an individual (a part of the chromosome)
#rbga.bin minimises, so the survival points are returned with a negative sign
evalIndividual <- function(x) {
  current_solution_survivalpoints <- sum(x * dataset$survivalpoints)
  current_solution_weight <- sum(x * dataset$weight)
  #An overweight pack scores 0, the worst value a pack can get
  if (current_solution_weight > weightlimit) return(0)
  -current_solution_survivalpoints
}
#Evaluate a chromosome
evalFunc <- function(x) {
#First split the chromosome in a list of individuals, then we can evaluate all individuals
individuals<-split_vector(x,individual_count)
#now we need to sapply the evalIndividual function to each element of individuals
return(sum(sapply(individuals,evalIndividual)))
}
#Run the Genetic Algorithm
#Note: elitism is the number of chromosomes carried over unchanged; TRUE is coerced to 1
GAmodel <- rbga.bin(size = chromosomesize, popSize = 200, iters = iter, mutationChance = 0.01,
elitism = TRUE, evalFunc = evalFunc)
#First show a summary
summary(GAmodel,echo=TRUE)
#Then extract the best solution from the GAmodel, copy/paste from the source code of the summary function
filter = GAmodel$evaluations == min(GAmodel$evaluations)
bestSolution = GAmodel$population[filter, , drop= FALSE][1,]
#Now split the solution in the individuals.
split_vector(bestSolution,individual_count)
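The example above assumes that several individuals can bring the same item. If each item exists only once, one possible variation is to reject chromosomes that assign an item to more than one person. The sketch below is an assumption about how that could be done, not part of the original answer; `evalFuncUnique` is a hypothetical name, and unlike the original evaluation it scores the whole chromosome as 0 as soon as one person is overloaded.

```r
# Item data and limits from the example above
survivalpoints <- c(10, 20, 15, 2, 30, 10, 30)
weight <- c(1, 5, 10, 1, 7, 5, 1)
individual_count <- 3
weightlimit <- 20

# Hypothetical alternative evaluation function: each item exists only once
evalFuncUnique <- function(x) {
  # One row per person, one column per item; row i holds the i-th consecutive
  # chunk of the chromosome, matching what split_vector() produces
  assignment <- matrix(x, nrow = individual_count, byrow = TRUE)
  if (any(colSums(assignment) > 1)) return(0)               # item packed twice
  if (any(assignment %*% weight > weightlimit)) return(0)   # someone is overloaded
  -sum(assignment %*% survivalpoints)                       # rbga.bin minimises
}

# Person 1 takes the pocketknife and the compass, the others take nothing
x_ok <- c(1, 0, 0, 0, 0, 0, 1, rep(0, 14))
evalFuncUnique(x_ok)

# Persons 1 and 2 both take the pocketknife: rejected
x_dup <- c(1, rep(0, 6), 1, rep(0, 6), rep(0, 7))
evalFuncUnique(x_dup)
```

It could be passed to rbga.bin as evalFunc in place of the original function, with the same chromosome size.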

Related

Mclust() - NAs in model selection

I recently tried to fit a Gaussian mixture model (GMM) in R on a multivariate matrix (400 observations of 196 variables) whose rows belong to known categories. The Mclust() function (from the mclust package) gave very poor results (around 30% of individuals were correctly classified, whereas k-means reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", header = TRUE)
y <- read.csv("y.csv", sep = ",")[[1]] # labels as a vector (assumes a single label column)
nclusters <- 5 # I want 5 clusters
gmm <- Mclust(X, G = nclusters)
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
  ii <- which(cl_gmm == k) # individuals of group k
  counts <- table(y[ii]) # number of occurrences for each label
  imax <- which.max(counts) # majority label
  maj_lab <- names(counts)[imax]
  print(paste("Group ",k,", majority label = ",maj_lab))
  cl_gmm_lab[ii] <- maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
So far I have not found any solution to this problem. Are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every covariance model when too many explanatory variables are involved. The right thing to do is to reduce the dimensionality first, choosing a number of dimensions that makes it possible to fit the GMMs properly while preserving as much information about the data as possible.
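As a rough sketch of that workflow, the code below reduces simulated data with principal component analysis before fitting the mixture. The data are a made-up stand-in for X, PCA is only one possible reduction method, and the choice of 10 components is arbitrary; in practice several values would be compared.

```r
set.seed(42)
# Simulated stand-in for X: 400 observations, 196 variables, 5 groups
n_obs <- 400
n_var <- 196
n_groups <- 5
group <- rep(seq_len(n_groups), length.out = n_obs)
centers <- matrix(rnorm(n_groups * n_var, sd = 2), nrow = n_groups)
X <- centers[group, ] + matrix(rnorm(n_obs * n_var), nrow = n_obs)

# Principal component analysis on centred, scaled variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the first few components (10 is an arbitrary choice for illustration)
n_comp <- 10
X_red <- pca$x[, seq_len(n_comp)]

# Share of the total variance retained by the kept components
var_kept <- sum(pca$sdev[seq_len(n_comp)]^2) / sum(pca$sdev^2)
print(round(var_kept, 2))

# Fit the mixture on the reduced data (only if mclust is installed)
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  gmm <- Mclust(X_red, G = n_groups)
  print(gmm$BIC)  # check which covariance models now have a non-NA BIC
}
```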

Which statistical test in R to use to detect differential expression on a simulated dataset when there are three replicates

I was asked to begin this exercise in bioinformatics (https://uclouvain-cbio.github.io/WSBIM1322/sec-testing.html) by simulating a dataset of log2 fold-changes measured in triplicate for 1000 genes (abs function used to avoid negative logs)
sim <- abs(rnorm(3000, mean = 0, sd = 1))
simlog <- log2(sim)
simlog_mat <- matrix(simlog, ncol = 3,
dimnames = list(paste0("gene", 1:1000), paste0("repl", 1:3)))
What statistical test should I use to test for 'differential expression'? The way the question is phrased, it seems I need to compare each replicate against the others. As there are three replicates I don't think I can use t.test, although the course material I'm using has only covered the t-test and FDR in this chapter.
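One possible reading of the exercise, offered as an assumption rather than the course's official solution, is that each gene's three log2 fold-changes are tested against zero with a one-sample t-test, followed by an FDR adjustment. The sketch below simulates the log2 fold-changes directly with rnorm (assuming the simulated values are already on the log2 scale) instead of taking log2 of absolute values.

```r
set.seed(1)
# Simulated log2 fold-changes: 1000 genes in triplicate
logfc <- matrix(rnorm(3000, mean = 0, sd = 1), ncol = 3,
                dimnames = list(paste0("gene", 1:1000), paste0("repl", 1:3)))

# One-sample t-test per gene: does the mean log2 fold-change differ from 0?
pvals <- apply(logfc, 1, function(x) t.test(x, mu = 0)$p.value)

# Benjamini-Hochberg adjustment to control the false discovery rate
padj <- p.adjust(pvals, method = "BH")

# Number of genes called significant at an FDR of 5%
sum(padj < 0.05)
```

With only three replicates each test has two degrees of freedom, so the power is low; that is a property of the design, not of the code.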

How can I automate creation of a list of vectors containing simulated data from a known distribution, using a "for loop" in R?

First Stack Exchange post, so please bear with me. I'm trying to automate the creation of a list made up of many empty vectors of various known lengths. The empty vectors will then be filled with simulated data. How can I automate the creation of this list using a for loop in R?
In this simplified example, fish have been caught by casting a net 4 times, and their abundance is given in the vector "abundance" (from counting the total number of fish in each net). We don't have individual fish weights, just the mean weight of all fish in each net, so I need to simulate their weights from a lognormal distribution. I'm looking to fill the empty vector for each net, which has a length equal to the number of fish caught in that net, with weight data simulated from a lognormal distribution with a known mean and standard deviation.
A simplified example of my code:
abundance <- c(5, 10, 9, 20)
net1 <- rep(NA, abundance[1])
net2 <- rep(NA, abundance[2])
net3 <- rep(NA, abundance[3])
net4 <- rep(NA, abundance[4])
simulated_weights <- list(net1, net2, net3, net4)
#meanlog vector for each net
weight_per_net
#sdlog vector for each net
sd_per_net
for (i in seq_along(abundance)) {
  simulated_weights[[i]] <- rlnorm(n = abundance[i], meanlog = weight_per_net[i], sdlog = sd_per_net[i])
}
print(simulated_weights)
Could anyone please help me automate this so that I don't have to write out each net vector (e.g. net1) by hand, and then also write out all the net names in the list() function? There are far more nets than 4 so it would be extremely time consuming and inefficient to do it this way. I've tried several things from other posts like paste0(), other for loops, as.list(c()), all to no avail.
Thanks!
HM
Turns out you don't need the net1, net2, etc variables at all. You can just do
abundance <- c(5, 10, 9, 20)
simulated_weights <- lapply(abundance, function(x) rep(NA, x))
The lapply function returns the list you need by calling the function once for each value of abundance.
Alternatively, we could create 'simulated_weights' with split and rep:
simulated_weights <- split(rep(rep(NA, length(abundance)), abundance),
rep(seq_along(abundance), abundance))
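Since the empty vectors are only placeholders, the simulated weights can also be generated directly, one vector per net. The sketch below uses Map for this; the log-scale parameters are hypothetical values chosen for illustration and do not come from the question. Note that meanlog and sdlog are the mean and standard deviation of the logarithm of the weights, not of the weights themselves.

```r
set.seed(123)
abundance <- c(5, 10, 9, 20)

# Hypothetical log-scale parameters, one per net (not from the question)
meanlog_per_net <- c(0.5, 0.8, 0.6, 1.0)
sdlog_per_net <- c(0.2, 0.3, 0.25, 0.4)

# One call to rlnorm per net, each with its own sample size and parameters
simulated_weights <- Map(function(n, m, s) rlnorm(n, meanlog = m, sdlog = s),
                         abundance, meanlog_per_net, sdlog_per_net)
names(simulated_weights) <- paste0("net", seq_along(abundance))

lengths(simulated_weights)
```

This scales to any number of nets because nothing is written out by hand; the list is as long as the abundance vector.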

K-means iterated for same data for 10 times

I am new to R. I am trying to see whether I can improve a k-means result (using R) by calling the k-means routine repeatedly, 10 to 15 times, on the same dataset with the same value of k (k = 3 in my case). The clustering changes at every call, and even the total sum of squares and withinss start changing, but I am not sure how to stop at the best result.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
for (x in 1:xtimes)
{
kmeans_results <- kmeans(filtered_data, 3)
print(kmeans_results["totss"])
print(kmeans_results["tot.withinss"])
}
return(kmeans_results)
}
kmeans_results = run_kmeans(10)
I am not sure I understood your question, because this is not the usual way of selecting the best partition (the usual ways are the elbow method, the silhouette method, etc.).
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write this to run kmeans repeatedly:
xtimes <- 10
kmeans_runs <- lapply(seq_len(xtimes), function(i){
  kmeans(x, 3)
})
Here lapply is a convenient alternative to a for loop because it collects the results in a list. To extract tot.withinss and see which run is minimal:
perf <- sapply(kmeans_runs, function(d) d$tot.withinss)
best <- which.min(perf)
kmeans_runs[[best]]
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced with the same sample data and the same number of clusters.
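For repeated runs on the same data with the same number of clusters, kmeans() also has an nstart argument that tries several random starting configurations and returns the best one. A minimal sketch, reusing the simulated data from ?kmeans:

```r
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# 10 random starts; the returned fit is the one with the lowest tot.withinss
fit <- kmeans(x, centers = 3, nstart = 10)
fit$tot.withinss
fit$size
```

This avoids writing the loop by hand and keeps only the best of the 10 partitions.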
Edit from your comment
OK, so you want to find the combination of columns that gives you the best performance. Below is an example in which every pair of three variables is tested. You could generalize this, but the number of possible combinations with 8 variables grows quickly, so you would need a routine to reduce the number of combinations tested.
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3)
)
colnames(x) <- c("x", "y","z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
kmeans_runs <- lapply(combinations, function(i){
  kmeans(x[,i], 3)
})
perf <- sapply(kmeans_runs, function(d) d$tot.withinss)
which.min(perf)

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling and need to find the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) on the other nine batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set.
I found this code in one of CV's discussions, but I don't understand several lines of it. I have a document-term matrix, dtm, built from the holdout set (20 documents), but I don't know how to calculate the perplexity or log-likelihood of this holdout set.
Questions:
What does seq(2, 100, by =1) mean here? What does AssociatedPress[21:30] mean? What is function(k) doing here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate the perplexity or log-likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I'm new I cannot figure out how to use them with my holdout matrix, dtm.
How can I do ten-fold cross-validation with my corpus of 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but I cannot figure that out either.
The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.
Using perplexity for simple validation
Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to compute with the perplexity function, which takes as arguments a previously fitted topic model and a new set of data, and returns a single number. The lower the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)
data("AssociatedPress", package = "topicmodels")
burnin = 1000
iter = 1000
keep = 50
full_data <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.
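To show what the number means, here is the usual textbook definition of per-word perplexity, sketched with a single word distribution instead of a topic model: the exponential of the negative log-likelihood divided by the number of tokens. This is not the code used inside topicmodels, and `perplexity_unigram` is a hypothetical helper name.

```r
# Per-word perplexity of held-out word counts under a word distribution
perplexity_unigram <- function(counts, probs) {
  stopifnot(length(counts) == length(probs), all(probs > 0))
  log_lik <- sum(counts * log(probs))  # log-likelihood of the held-out words
  exp(-log_lik / sum(counts))          # normalise by the number of tokens
}

vocab_size <- 50
uniform <- rep(1 / vocab_size, vocab_size)
held_out <- rep(2, vocab_size)         # hypothetical held-out counts

# A uniform model is as "perplexed" as choosing among all words at random
perplexity_unigram(held_out, uniform)
```

For a uniform model the perplexity equals the vocabulary size, which is why lower values indicate a model that concentrates probability on the words that actually occur.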
Using perplexity and cross-validation to determine a good number of topics
The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.
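A minimal base-R sketch of the fold assignment, assuming 200 documents as in the question. It shuffles a repeated label so that the folds have equal size; drawing labels with sample(..., replace = TRUE) also works but gives folds of slightly different sizes.

```r
set.seed(2024)
n <- 200       # number of documents, as in the question
folds <- 5

# Shuffle a repeated 1..folds label so every fold has (almost) the same size
fold_id <- sample(rep(seq_len(folds), length.out = n))
table(fold_id)

for (i in seq_len(folds)) {
  train_idx <- which(fold_id != i)  # rows used to fit the model
  valid_idx <- which(fold_id == i)  # rows held out for evaluation
  # fit on rows train_idx and evaluate on rows valid_idx here
  stopifnot(length(intersect(train_idx, valid_idx)) == 0)
}
```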
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.
The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)
# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
library(topicmodels)
})
folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics
# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))
# we parallelize by the different number of topics. A processor is allocated a value
# of k, and does the cross-validation serially. This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise over k than over folds
system.time({
results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar%{
k <- candidate_k[j]
results_1k <- matrix(0, nrow = folds, ncol = 2)
colnames(results_1k) <- c("k", "perplexity")
for(i in 1:folds){
train_set <- full_data[splitfolds != i , ]
valid_set <- full_data[splitfolds == i, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
}
return(results_1k)
}
})
stopCluster(cluster)
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
"(ie five different models fit for each candidate number of topics)") +
labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
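The average perplexity per candidate can also be computed directly rather than read off the plot. The sketch below uses a small made-up data frame in the same layout as results_df (columns k and perplexity); the numbers are invented for illustration and are not the results of the run described above.

```r
# Hypothetical results in the same layout as results_df; values are made up
results_df <- data.frame(
  k = rep(c(50, 100, 200), each = 5),
  perplexity = c(2510, 2495, 2520, 2505, 2500,
                 2400, 2390, 2410, 2405, 2395,
                 2450, 2440, 2460, 2455, 2445)
)

# Mean hold-out perplexity for each candidate number of topics
mean_perp <- aggregate(perplexity ~ k, data = results_df, FUN = mean)
mean_perp

# Candidate with the lowest average perplexity
best_k <- mean_perp$k[which.min(mean_perp$perplexity)]
best_k
```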
I wrote the answer on CV that you refer to, here's a bit more detail:
seq(2, 100, by =1) simply creates a number sequence from 2 to 100 by ones, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models. One model with 2 topics, another with 3 topics, another with 4 topics and so on to 100 topics.
AssociatedPress[21:30,] selects rows 21 to 30 (ten documents) of the built-in data in the topicmodels package. I used a small subset in that example so that it would run faster.
Regarding the general question of optimal topic numbers, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (4.3.3 in his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")
harmonicMean <- function(logLikelihoods, precision=2000L) {
library("Rmpfr")
llMed <- median(logLikelihoods)
as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
prec = precision) + llMed))))
}
# The log-likelihood values are then determined by first fitting the model using for example
k = 20
burnin = 1000
iter = 1000
keep = 50
fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
# where keep indicates that the log-likelihood is evaluated and stored every keep-th iteration. This returns all log-likelihood values including burnin, i.e., these need to be omitted before calculating the harmonic mean:
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
# assuming that burnin is a multiple of keep
harmonicMean(logLiks)
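The quantity being computed is the log of the harmonic mean of the likelihoods, that is, minus the log of the mean of exp(-logLik). As a sketch, the same value can be obtained in base R by shifting the exponents before exponentiating; `harmonic_mean_loglik` is a hypothetical name, and unlike the Rmpfr version it works in ordinary double precision, so terms far below the largest one underflow to zero.

```r
# Log of the harmonic mean of likelihoods, given their logarithms
harmonic_mean_loglik <- function(log_liks) {
  neg <- -log_liks
  m <- max(neg)                    # shift so the largest exponent is 0
  -(m + log(mean(exp(neg - m))))   # -log(mean(exp(-log_liks)))
}

# Check on a small case: the harmonic mean of 0.5 and 0.25 is 1/3
harmonic_mean_loglik(log(c(0.5, 0.25)))
log(1 / 3)

# Log-likelihoods of realistic magnitude do not overflow after the shift
harmonic_mean_loglik(c(-100000, -100001, -100002))
```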
So to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones.
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
# extract logliks from each topic model
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l")
# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.
Cross-validation of topic models is pretty well documented in the docs that come with the package, see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf Give that a try and then come back with a more specific question about coding CV with topic models.

Resources