Genetic Algorithms Introduction - r

Starting off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer mine.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to use genetic algorithms in R. I DO NOT have the time to go through huge books on genetic algorithms. Heck, I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need resources that will help me quickly understand what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage, and also a couple of IEEE papers on the subject.
Any working code in R (or even in C), or pointers to which R modules (if any) to use, would be helpful.

A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.

I assume from your question you have some function F(metabolites) which yields a spectrum but you do not have the inverse function F'(spectrum) to get back metabolites. The search space of metabolites is large so rather than brute force it you wish to try an approximate method (such as a genetic algorithm) which will make a more efficient random search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and the trial spectrum. The smoother this function is the better the search will work. If it can only yield true/false it will be a purely random search and you'd be better off with brute force.
Given F and your score (aka fitness) function, all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectra, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain-specific, because you can speed the process greatly by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small but will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
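The population, score, crossover and mutation loop described above can be sketched in a few lines of R. This is only an illustration under loud assumptions: the reference spectra in `ref`, the additive stand-in `F_model`, the population size, the mutation rate and the tournament selection are all made up for the example, and a real F for NMR data will look different. The genome is the boolean string over all possible metabolites.

```r
set.seed(42)

n_metab  <- 20   # hypothetical number of candidate metabolites
n_points <- 50   # hypothetical number of points in a spectrum

# Hypothetical reference spectra: one column per metabolite.
ref <- matrix(runif(n_points * n_metab), nrow = n_points)

# Stand-in for F (an assumption): a mixture's spectrum is the sum of its members' spectra.
F_model <- function(genome) as.vector(ref %*% genome)

# Target spectrum produced by a hidden combination of metabolites.
truth  <- as.numeric(seq_len(n_metab) %in% c(2, 5, 11, 17))
target <- F_model(truth)

# Smooth score (fitness): negative squared distance, 0 means a perfect match.
score <- function(genome) -sum((F_model(genome) - target)^2)

pop_size <- 60
mut_rate <- 0.02
pop <- matrix(rbinom(pop_size * n_metab, 1, 0.5), nrow = pop_size)

best_history <- numeric(0)
for (gen in 1:100) {
  fit   <- apply(pop, 1, score)
  elite <- pop[which.max(fit), ]
  best_history <- c(best_history, max(fit))

  # Binary tournament: pick two candidates at random, keep the fitter one.
  pick <- function() {
    i <- sample(pop_size, 2)
    pop[i[which.max(fit[i])], ]
  }

  new_pop <- t(sapply(seq_len(pop_size), function(k) {
    a <- pick(); b <- pick()
    mask  <- runif(n_metab) < 0.5         # uniform crossover
    child <- ifelse(mask, a, b)
    flip  <- runif(n_metab) < mut_rate    # bit-flip mutation
    child[flip] <- 1 - child[flip]
    child
  }))
  new_pop[1, ] <- elite                   # elitism: never lose the best candidate
  pop <- new_pop
}

fit  <- apply(pop, 1, score)
best <- pop[which.max(fit), ]
which(best == 1)   # indices of the metabolites in the best genome found
```

Because the best candidate is copied into each new generation, `best_history` can never get worse. Whether the search actually recovers the hidden combination depends on the tuning parameters and on how smooth the real score function is.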

I've found the article "The science of computing: genetic algorithms", by Peter J. Denning (American Scientist, vol 80, 1, pp 12-14). That article is simple and useful if you want to understand what genetic algorithms do, and is only 3 pages to read!!

Related

Handling Constraints in Genetic Algorithms: implementing the death penalty

I would like to compare the "death penalty method" with other penalty methods proposed in the genetic algorithms literature.
I'm using R, so I need to write the code for these penalty methods. I'm finding it difficult because there is one thing I have not understood about the death penalty function: how do I handle infeasible offspring, given that the population size is usually fixed in genetic algorithms?
I mean, I understand that, in order to use the death penalty appropriately, I have to initialize the genetic algorithm with all feasible solutions. But even if every solution in the first population (t=0) is feasible, I could have infeasible solutions in the next generation, since crossover and mutation are "blind" operators.
So, since the death penalty rejects all the infeasible solutions, what happens then?
Will the next generation have a smaller population size (original size minus the number of infeasible solutions)? Or do I have to select more parents for the mating pool until the next generation is made up of "original size" feasible offspring? Or do I have to reapply the genetic operators until all the individuals in t+1 are feasible?
I do not know R, but the theory of the death penalty implies that you should generate more offspring.
I would do something like the following pseudo-code (translate to R):
n=<desired_population_size>;
while (n>0) {
generate n offspring;
eliminate the non feasible ones
add the feasible ones to the new generation
n=<desired population size> - <current new generation population size>
}
The only problem with this loop is the risk that it may go on forever (if we never generate feasible solutions). Even though that risk is quite small, if you want to protect yourself from it, you can limit the number of iterations allowed in the while loop with a simple counter.
There is a pretty interesting article on this by Michalewicz. Have a look.
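Since the pseudo-code above is meant to be translated to R, here is one possible translation, including the iteration counter. It is a sketch, not a reference implementation: `is_feasible`, `make_offspring`, the constraint itself and `max_iter` are hypothetical placeholders for whatever your problem defines.

```r
set.seed(1)

# Hypothetical constraint: a solution is feasible if its coordinates sum to at most 10.
is_feasible <- function(x) sum(x) <= 10

# Hypothetical variation operator: uniform crossover of two random parents
# followed by Gaussian mutation.
make_offspring <- function(parents, n) {
  t(replicate(n, {
    p <- parents[sample(nrow(parents), 2), , drop = FALSE]
    child <- ifelse(runif(ncol(parents)) < 0.5, p[1, ], p[2, ])
    child + rnorm(ncol(parents), sd = 0.5)
  }))
}

next_generation <- function(parents, pop_size, max_iter = 100) {
  new_gen <- parents[0, , drop = FALSE]   # empty matrix with the right columns
  iter <- 0
  while (nrow(new_gen) < pop_size && iter < max_iter) {
    needed  <- pop_size - nrow(new_gen)              # only generate what is missing
    kids    <- make_offspring(parents, needed)
    ok      <- apply(kids, 1, is_feasible)           # death penalty: drop infeasible ones
    new_gen <- rbind(new_gen, kids[ok, , drop = FALSE])
    iter    <- iter + 1
  }
  if (nrow(new_gen) < pop_size) warning("iteration cap reached; generation is short")
  new_gen
}

parents <- matrix(runif(20 * 4, 0, 2), nrow = 20)  # all feasible: each row sums to at most 8
gen1 <- next_generation(parents, pop_size = 20)
```

The population size stays fixed because each pass only asks for the offspring still missing, and the cap on `iter` keeps the loop from running forever when feasible offspring are rare.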

Graph partitioning optimization

The problem
I have a set of locations on a plane (actually they are pins in a KML file) and I want to partition this graph into subgraphs. Connectivity is pretty good - as with all real world road networks - so I assume that if two locations are close they have some kind of connection. The resulting set of subgraphs should adhere to these constraints:
Every node has to be covered by a subgraph
Every node should be in exactly 1 subgraph
Every node within a subgraph should be close to each other (L2 norm distances)
Every subgraph should contain at least 5 locations
The number of subgraphs should be minimal
Right now the number of locations is no more than 100, so I thought about brute forcing through every possibility, but this obviously won't scale well.
I thought about using some k-Nearest-Neighbors algorithm (e.g. using QuickGraph) but I can't get my head around where to start and how to extend/shrink the subgraphs on the way. Maybe it's possible to map this problem to another problem that can easily be solved with some numerical procedure (e.g. Simplex) ...
Maybe someone has experience in this kind of optimization problems and is willing to help me find a solution? I don't have access to Mathematica/Matlab or the like ... but sufficient .NET programming skills and hmm Excel :-)
Thanks a lot!
As soon as multiple criteria need to be satisfied in the best possible way simultaneously, things usually start to get difficult.
A numerical solution could work as follows: You could define yourself a utility function, that maps partitionings of your locations to positive real values, describing how "good" a partition is by assigning it a "rating" (good could be high "bad" could be near zero).
Once you have such a function assigning partitions their according "values", you simply need to optimize it and then you hopefully obtain a good solution if you defined your utility function reasonably. Evolutionary algorithms are good at that task since your utility function is probably analytically too complex to solve due to its discrete nature.
The problem is then only how you assign "values" to partitions via this utility function. This is then your task. It can be done for example by weighing each criterion with a factor and summing the results up, or even more complex functions (least squares etc.). The factors you use in the definition of the utility function are tuning parameters and can be varied until the result seems to be good.
Some CA software would help a lot for testing if you can get your hands on one, but I guess that to obtain a black box solver for your partitioning problem, you need to implement the complete procedure yourself using a language of your choice.
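To make the idea of a utility function concrete, here is a minimal sketch in R (chosen only to match the other code on this page; it ports directly to .NET). Everything in it is an assumption for illustration: the name `partition_utility`, the three weights, and the particular way the criteria are combined. A partition is encoded as one label per location, so "every node is covered" and "every node is in exactly one subgraph" hold by construction.

```r
# coords: n x 2 matrix of locations; groups: integer vector of subgraph labels.
# The weights w_* and min_size are tuning parameters (assumptions).
partition_utility <- function(coords, groups, min_size = 5,
                              w_spread = 1, w_count = 1, w_small = 10) {
  labels <- unique(groups)
  # Mean pairwise Euclidean (L2) distance inside each subgraph.
  spread <- sapply(labels, function(g) {
    pts <- coords[groups == g, , drop = FALSE]
    if (nrow(pts) < 2) 0 else mean(dist(pts))
  })
  sizes <- as.vector(table(groups))
  cost <- w_spread * mean(spread) +                  # members should be close together
          w_count  * length(labels) +                # fewer subgraphs is better
          w_small  * sum(pmax(0, min_size - sizes))  # penalty for undersized subgraphs
  1 / (1 + cost)                                     # positive rating; higher is better
}

set.seed(7)
coords <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
                matrix(rnorm(20, mean = 10), ncol = 2))
good <- rep(1:2, each  = 10)   # splits along the two clumps
bad  <- rep(1:2, times = 10)   # mixes the two clumps
partition_utility(coords, good) > partition_utility(coords, bad)
```

An optimizer (evolutionary or otherwise) would then search over label vectors for the one with the highest rating, and the weights are the knobs to vary until the result looks reasonable.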

Can any existing Machine Learning structures perfectly emulate recursive functions like the Fibonacci sequence?

To be clear I don't mean, provided the last two numbers in the sequence provide the next one:
(2, 3, -> 5)
But rather given any index provide the Fibonacci number:
(0 -> 1) or (7 -> 21) or (11 -> 144)
Adding two numbers is a very simple task for any machine learning structure, and by extension counting by ones, twos or any fixed number is a simple addition rule. Recursive calculations however...
To my understanding, most learning networks rely on forwards only evaluation, whereas most programming languages have loops, jumps, or circular flow patterns (all of which are usually ASM jumps of some kind), thus allowing recursion.
Sure, some networks aren't forwards only, but can processing weights using the hyperbolic tangent or sigmoid function enter any computationally complete state?
i.e. conditional statements, conditional jumps, forced jumps, simple loops, complex loops with multiple conditions, providing sort order, actual reordering of elements, assignments, allocating extra registers, etc?
It would seem that even a non-forwards only network would only find a polynomial of best fit, reducing errors across the expanse of the training set and no further.
Am I missing something obvious, or did most of Machine Learning just look at recursion and pretend like those problems don't exist?
Update
Technically any programming language can be considered the DNA of a genetic algorithm, where the compiler (and possibly console out measurement) would be the fitness function.
The issue is that programming (so far) cannot be expressed in a hill climbing way - literally, the fitness is 0, until the fitness is 1. Things don't half work in programming, and if they do, there is no way of measuring how 'working' a program is for unknown situations. Even an off by one error could appear to be a totally different and chaotic system with no output. This is exactly the reason learning to code in the first place is so difficult, the learning curve is almost vertical.
Some might argue that you just need to provide stronger foundation rules for the system to exploit - but that just leads to attempting to generalize all programming problems, which circles right back to designing a programming language and loses all notion of some learning machine at all. Following this road brings you to a close variant of LISP with mutate-able code and virtually meaningless fitness functions that brute force the 'nice' and 'simple' looking code-space in attempt to follow human coding best practices.
Others might argue that we simply aren't using enough population or momentum to gain footing on the error surface, or make a meaningful step towards a solution. But as your population approaches the number of DNA permutations, you are really just brute forcing (and very inefficiently at that). Brute forcing code permutations is nothing new, and definitely not machine learning - it's actually quite common in regex golf, I think there's even an xkcd about it...
The real problem isn't finding a solution that works for some specific recursive function, but finding a solution space that can encompass the recursive domain in some useful way.
So other than Neural Networks trained using Backpropagation hypothetically finding the closed form of a recursive function (if a closed form even exists, and they don't in most real cases where recursion is useful), or a non-forwards only network acting like a pseudo-programming language with awful fitness prospects in the best case scenario, plus the virtually impossible task of tuning exit constraints to prevent infinite recursion... That's really it so far for machine learning and recursion?
According to Kolmogorov's "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition", a three-layer neural network can model an arbitrary continuous function using linear and logistic functions, including f(n) = ((1+sqrt(5))^n - (1-sqrt(5))^n) / (2^n * sqrt(5)), which is the closed-form solution of the Fibonacci sequence.
If you would like to treat the problem as a recursive sequence without a closed-form solution, I would view it as a special sliding window approach (I called it special because your window size seems fixed as 2). There are more general studies on the proper window size for your interest. See these two posts:
Time Series Prediction via Neural Networks
Proper way of using recurrent neural network for time series analysis
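As a quick sanity check of the closed form quoted above, the sketch below compares it with the plain recurrence in R. The function names are made up for the example, and the formula uses the convention fib(0) = 0, fib(1) = 1, so the question's indexing (0 -> 1, 7 -> 21) corresponds to fib(n + 1) here. Rounding absorbs floating-point error, which limits this check to moderate n.

```r
# Closed form, with the convention fib(0) = 0, fib(1) = 1.
fib_closed <- function(n) {
  s5 <- sqrt(5)
  round(((1 + s5)^n - (1 - s5)^n) / (2^n * s5))
}

# Reference values computed from the recurrence itself.
fib_iter <- function(n) {
  a <- 0; b <- 1
  for (i in seq_len(n)) { tmp <- a + b; a <- b; b <- tmp }
  a
}

n <- 0:30
all(fib_closed(n) == sapply(n, fib_iter))

# The question's examples, shifted by one index.
fib_closed(7 + 1)    # the question's (7 -> 21)
fib_closed(11 + 1)   # the question's (11 -> 144)
```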
Ok, where to start...
Firstly, you talk about 'machine learning' and 'perfectly emulate'. This is not generally the purpose of machine learning algorithms. They make informed guesses given some evidence and some general notions about structures that exist in the world. That typically means an approximate answer is better than an 'exact' one that is wrong. So, no, most existing machine learning approaches aren't the right tools to answer your question.
Second, you talk of 'recursive structures' as some sort of magic bullet. Yet they are merely convenient ways to represent functions, somewhat analogous to higher order differential equations. Because of the feedbacks they tend to introduce, the functions tend to be non-linear. Some machine learning approaches will have trouble with this, but many (neural networks for example) should be able to approximate your function quite well, given sufficient evidence.
As an aside, having or not having closed form solutions is somewhat irrelevant here. What matters is how well the function at hand fits with the assumptions embodied in the machine learning algorithm. That relationship may be complex (e.g. try approximating Fibonacci with a support vector machine), but that's the essence.
Now, if you want a machine learning algorithm tailored to the search for exact representations of recursive structures, you could set up some assumptions and have your algorithm produce the most likely 'exact' recursive structure that fits your data. There are probably real world problems in which such a thing would be useful. Indeed the field of optimisation approaches similar problems.
The genetic algorithms mentioned in other answers could be an example of this, especially if you provided a 'genome' that matches the sort of recursive function you think you may be dealing with. Closed form primitives could form part of that space too, if you believe they are more likely to be 'exact' than more complex genetically generated algorithms.
Regarding your assertion that programming cannot be expressed in a hill climbing way, that doesn't prevent a learning algorithm from scoring possible solutions by how much of your evidence they are able to reproduce and how complex they are. In many cases (most? though counting cases here isn't really possible) such an approach will find a correct answer. Sure, you can come up with pathological cases, but with those, there's little hope anyway.
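A small sketch of what such a score could look like, in R. The candidate programs, their hand-assigned `size` values and the penalty weight `lambda` are all hypothetical; the point is only that the score is graded rather than 0 or 1, so partially right candidates can be ranked against each other.

```r
# Evidence: the first few (index, value) pairs, using the question's indexing.
idx <- 0:9
obs <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)

# Hypothetical candidates; 'size' is a hand-assigned complexity measure.
candidates <- list(
  constant  = list(f = function(n) rep(1, length(n)), size = 1),
  linear    = list(f = function(n) n + 1,             size = 2),
  doubling  = list(f = function(n) 2^n,               size = 2),
  recursive = list(f = function(n) sapply(n, function(k) {
                     a <- 1; b <- 1
                     for (i in seq_len(k)) { tmp <- a + b; a <- b; b <- tmp }
                     a
                   }),                                size = 5)
)

# Score = fraction of the evidence reproduced, minus a small complexity penalty.
score <- function(cand, lambda = 0.01) {
  mean(cand$f(idx) == obs) - lambda * cand$size
}

scores <- sapply(candidates, score)
scores
names(which.max(scores))
```

Here the wrong candidates still reproduce one or two data points each, so they receive different non-zero scores instead of all collapsing to zero.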
Summing up, machine learning algorithms are not usually designed to tackle finding 'exact' solutions, so aren't the right tools as they stand. But, by embedding some prior assumptions that exact solutions are best, and perhaps the sort of exact solution you're after, you'll probably do pretty well with genetic algorithms, and likely also with algorithms like support vector machines.
I think you also sum things up nicely with this:
The real problem isn't finding a solution that works for some specific recursive function, but finding a solution space that can encompass the recursive domain in some useful way.
The other answers go a long way to telling you where the state of the art is. If you want more, a bright new research path lies ahead!
See this article:
Turing Machines are Recurrent Neural Networks
http://lipas.uwasa.fi/stes/step96/step96/hyotyniemi1/
The paper describes how a recurrent neural network can simulate a register machine, which is known to be a universal computational model equivalent to a Turing machine. The result is "academic" in the sense that the neurons have to be capable of computing with unbounded numbers. This works mathematically, but would have problems pragmatically.
Because the Fibonacci function is just one of many computable functions (in fact, it is primitive recursive), it could be computed by such a network.
Genetic algorithms should be able to do the trick. The important thing is (as always with GAs) the representation.
If you define the search space to be syntax trees representing arithmetic formulas and provide enough training data (as you would with any machine learning algorithm), it probably will converge to the closed-form solution for the Fibonacci numbers, which is:
Fib(n) = ( (1+sqrt(5))^n - (1-sqrt(5))^n ) / ( 2^n * sqrt(5) )
[Source]
If you were asking for a machine learning algorithm to come up with the recursive formula to the Fibonacci numbers, then this should also be possible using the same method, but with individuals being syntax trees of a small program representing a function.
Of course, you also have to define good cross-over and mutation operators as well as a good evaluation function. And I have no idea how well it would converge, but it should at some point.
Edit: I'd also like to point out that in certain cases there is always a closed-form solution to a recursive function:
Like every sequence defined by a linear recurrence with constant coefficients, the Fibonacci numbers have a closed-form solution.
The Fibonacci sequence, where a specific index of the sequence must be returned, is often used as a benchmark problem in Genetic Programming research. In most cases recursive structures are generated, although my own research focused on imperative programs so used an iterative approach.
There's a brief review of other GP research that uses the Fibonacci problem in Section 3.4.2 of my PhD thesis, available here: http://kar.kent.ac.uk/34799/. The rest of the thesis also describes my own approach, which is covered a bit more succinctly in this paper: http://www.cs.kent.ac.uk/pubs/2012/3202/
Other notable research which used the Fibonacci problem is Simon Harding's work with Self-Modifying Cartesian GP (http://www.cartesiangp.co.uk/papers/eurogp2009-harding.pdf).

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that Levenshtein distance is good to be used as a distance function for strings. Also, since I do not know in advance the number of clusters, hierarchical clustering is the way to go and not k-means.
Although I get the problem in its abstract form, I do not know what the easiest way to actually do it is. For example, is MATLAB or R a better choice for the actual implementation of hierarchical clustering with a custom function (Levenshtein distance)?
For both, one may easily find a Levenshtein distance implementation. The clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom function?
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n,k){   # vector of n random char(k) strings
  sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=TRUE)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=3)
df <- data.frame(str,cutree(hc,k=3))
In this example, we artificially create a set of 30 random char(5) strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
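The example above fixes k = 3, whereas the question says the number of clusters is not known in advance. One option, sketched here, is to cut the dendrogram at a distance threshold instead, using the `h` argument of cutree(...). The word list, the threshold `h` and the choice of average linkage are assumptions for illustration.

```r
# Small hypothetical example: three obvious families of short strings.
words <- c("aaxyz", "aaxyw", "aaqyz",
           "bbklm", "bbkln", "bbjlm",
           "ccrst", "ccrsu", "cctst")

d  <- adist(words)                            # Levenshtein distance matrix
hc <- hclust(as.dist(d), method = "average")

# Cut by height instead of by k: strings that merge below h share a cluster.
h <- 3                                        # assumed threshold, in edit operations
clusters <- cutree(hc, h = h)
split(words, clusters)
```

The number of clusters then falls out of the threshold, which is often easier to reason about for short strings ("at most a couple of edits apart") than a cluster count.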
ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. Word segmentation through cross-lingual word-to-phoneme alignment. Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
We would of course appreciate additional contributions.
While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).
Most often the OMA is carried out in three steps. First, you define your sequences. From your description I can assume that each letter is a separate "state", the building block in a sequence. Second, you will employ one of the several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you will feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to gain popularity due to the additional information on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in the sequence analysis.
In R the most convenient package with a great number of functions is TraMineR, the website can be found here. Its user guide is very accessible, and developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, except for the decision on the number of clusters. The guide for TraMineR shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA; cluster membership is contained in the clusterward1 object, with which you can do whatever you want: plotting, recoding as variables, etc. The diss=TRUE option indicates that the data object is a dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is picking the right distance algorithm for your particular application. Once you have that, and can justify the choice, the rest is quite easy. Good luck!
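For the PAM route mentioned above, here is a minimal sketch of how the cluster-quality information can guide the choice of the number of clusters. It uses pam(...) from the cluster package with a plain Levenshtein matrix from adist(...) standing in for an OMA distance matrix; the word list and the candidate range `ks` are assumptions for illustration.

```r
library(cluster)   # provides pam(); a recommended package in standard R installs

# Small hypothetical example: three obvious families of short strings.
words <- c("aaxyz", "aaxyw", "aaqyz",
           "bbklm", "bbkln", "bbjlm",
           "ccrst", "ccrsu", "cctst")
d <- as.dist(adist(words))   # stand-in for the distance matrix from OMA

# Average silhouette width for each candidate number of clusters.
ks  <- 2:5
sil <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)

best_k <- ks[which.max(sil)]
fit    <- pam(d, k = best_k, diss = TRUE)
split(words, fit$clustering)
```

A higher average silhouette width means tighter, better separated clusters, so comparing it across values of k takes some of the subjectivity out of that step.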
If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

When and why is crossover beneficial in differential evolution?

I implemented a differential evolution algorithm for a side project I was doing. Because the crossover step seemed to involve a lot of parameter choices (e.g. crossover probabilities), I decided to skip it and just use mutation. The method seemed to work ok, but I am unsure whether I would get better performance if I introduced crossover.
Main Question: What is the motivation behind introducing crossover to differential evolution? Can you provide a toy example where introducing crossover out-performs pure mutation?
My intuition is that crossover will produce something like the following in 2 dimensions. Say we have two parent vectors (red). Uniform crossover could produce a new trial vector at one of the blue points.
I am not sure why this kind of exploration would be expected to be beneficial. In fact, it seems like this could make performance worse if high-fitness solutions follow some linear trend. In the figure below, let's say the red points are the current population, and the optimal solution is towards the lower right corner. The population is traveling down a valley such that the upper right and lower left corners produce bad solutions. The upper left corner produces "okay" but suboptimal solutions. Notice how uniform crossover produces trials (in blue) that are orthogonal to the direction of improvement. I've used a crossover probability of 1 and neglected mutation to illustrate my point (see code). I imagine this situation could arise quite frequently in optimization problems, but I could be misunderstanding something.
Note: In the above example, I am implicitly assuming that the population was randomly initialized (uniformly) across this space, and has begun to converge to the correct solution down the central valley (top left to bottom right).
This toy example is convex, and thus differential evolution wouldn't even be the appropriate technique. However, if this motif was embedded in a multi-modal fitness landscape, it seems like crossover might be detrimental. While crossover does support exploration, which could be beneficial, I am not sure why one would choose to explore in this particular direction.
R code for the example above:
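N <- 50
x1 <- rnorm(N, mean = 2, sd = 0.5)
x2 <- -x1 + 4 + rnorm(N, mean = 0, sd = 0.1)
plot(x1, x2, pch = 21, col = 'red', bg = 'red', ylim = c(0, 4), xlim = c(0, 4))
# Uniform crossover with probability 1: each trial keeps its own x1
# and takes x2 from a randomly chosen member of the population.
x1_cx <- numeric(N)
x2_cx <- numeric(N)
for (i in 1:N) {   # R vectors are 1-indexed
  x1_cx[i] <- x1[i]
  x2_cx[i] <- x2[sample(1:N, 1)]
}
points(x1_cx, x2_cx, pch = 4, col = 'blue', lwd = 4)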
Follow-up Question: If crossover is beneficial in certain situations, is there a sensible approach to a) determining if your specific problem would benefit from crossover, and b) how to tune the crossover parameters to optimize the algorithm?
A related stackoverflow question (I am looking for something more specific, with a toy example for instance): what is the importance of crossing over in Differential Evolution Algorithm?
A similar question, but not specific to differential evolution: Efficiency of crossover in genetic algorithms
I am not particularly familiar with the specifics of the DE algorithm, but in general the point of crossover is that if you have two very different individuals with high fitness, it will produce an offspring that is intermediate between them without being particularly similar to either. Mutation only explores the local neighbourhood of each individual without taking the rest of the population into account. If you think of genomes as points in some high dimensional vector space, then a mutation is a shift in a random direction. Therefore mutation needs to take small steps: if you are starting from a significantly better than random position, a long step in a random direction is almost certain to make things worse, because it is essentially just introducing entropy into an evolved genome. You can think of a crossover as a step from one parent towards the other. Since the other parent is also better than random, it is more promising to take a longer step in that direction. This allows for faster exploration of the promising parts of the fitness landscape.
In real biological organisms the genome is often organized in such a way that genes that depend on each other are close together on the same chromosome. This means that crossover is unlikely to break synergetic gene combinations. Real evolution actually moves genes around to achieve this (though this is much slower than the evolution of individual genes) and sometimes the higher order structure of the genome (the 3 dimensional shape of the DNA) evolves to prevent cross-overs in particularly sensitive areas. These mechanisms are rarely modeled in evolutionary algorithms, but you will get more out of crossovers if you order your genome in a way that puts genes that are likely to interact close to each other.
No. Crossover is not useful. There I said it. :P
I've never found a need for crossover. People seem to think it does some kind of magic. But it doesn't (and can't) do anything more useful than simple mutation. Large mutations can be used to explore the entire problem space and small mutations can be used to exploit niches.
And all the explanations I've read are (to put it mildly) unsatisfactory. Crossover only complicates your algorithms. Drop it asap. Your life will be simpler. .... IMHO.
As Daniel says, crossover is a way to take larger steps across the problem landscape, allowing you to escape local maxima that a single mutation could not.
Whether it is appropriate or not will depend on the complexity of the problem space, how the genotype -> phenotype expression works (will related genes be close together), etc.
More formally, this is the concept of 'connectivity' in local search algorithms: providing operators strong enough that the local search neighbourhood is sufficiently large to escape local minima.
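For anyone who wants to experiment with the question directly, here is a minimal sketch of the common DE/rand/1/bin scheme in R, showing exactly where the crossover probability enters. The `sphere` objective, the parameter values and the in-place population update are assumptions made for brevity. Setting CR = 1 copies the whole mutant vector, which corresponds to the "mutation only" variant described in the question.

```r
set.seed(123)

sphere <- function(x) sum(x^2)   # toy objective to minimise (an assumption)

de <- function(f, d = 5, np = 20, Fw = 0.8, CR = 0.9, gens = 100) {
  pop  <- matrix(runif(np * d, -5, 5), nrow = np)
  cost <- apply(pop, 1, f)
  for (g in seq_len(gens)) {
    for (i in seq_len(np)) {
      # Differential mutation from three distinct members other than i.
      r <- sample(setdiff(seq_len(np), i), 3)
      mutant <- pop[r[1], ] + Fw * (pop[r[2], ] - pop[r[3], ])
      # Binomial crossover: take each coordinate from the mutant with
      # probability CR; jrand forces at least one mutant coordinate.
      jrand <- sample(d, 1)
      take  <- runif(d) < CR
      take[jrand] <- TRUE
      trial <- ifelse(take, mutant, pop[i, ])
      # Greedy selection: the trial replaces the target only if it is no worse.
      tc <- f(trial)
      if (tc <= cost[i]) { pop[i, ] <- trial; cost[i] <- tc }
    }
  }
  min(cost)
}

with_cx    <- de(sphere, CR = 0.9)
without_cx <- de(sphere, CR = 1)   # CR = 1: the trial is the full mutant, no crossover
c(with_cx = with_cx, without_cx = without_cx)
```

A single run on one toy function proves nothing either way. To answer the question for your own problem, repeat both settings over many seeds on your actual objective and compare the distributions of results.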

Resources