R Constained Combination generation - r

Say I have an set of string:
x=c("a1","b2","c3","d4")
If I have a set of rules that must be met:
if "a1" and "b2" are together in group, then "c3" cannot be in that group.
if "d4" and "a1" are together in a group, then "b2" cannot be in that group.
I was wondering what sort of efficient algorithm are suitable for generating all combinations that meet those rules? What research or papers or anything talk about these type of constrained combination generation problems?
In the above problem, assume its combn(x,3)

I don't know anything about R, so I'll just address the theoretical aspect of this question.
First, the constraints are really boolean predicates of the form "a1 ^ b2 -> ¬c3" and so on. That means that all valid combinations can be represented by one binary decision diagram, which can be created by taking each of the constraints and ANDing them together. In theory you might make an exponentially large BDD that way (that usually doesn't happen, but depends on the structure of the problem), but that would mean that you can't really list all combinations anyway, so it's probably not too bad.
For example the BDD generated for those two constraints would be (I think - not tested - just to give an idea)
But since this is really about a family of sets, a ZDD probably works even better. The difference, roughly, between a BDD and a ZDD is that a BDD compresses nodes that have equal sub-trees (in the total tree of all possibilities), while the ZDD compresses nodes where the solid edge (ie "set this variable to 1") goes to False. Both re-use equal sub-trees and thus form a DAG.
The ZDD of the example would be (again not tested)
I find ZDDs a bit easier to manipulate in code, because any time a variable can be set, it will appear in the ZDD. In contrast, in a BDD, "skipped" nodes have to be detected, including "between the last node and the leaf", so for a BDD you have to keep track of your universe. For a ZDD, most operations are independent of the universe (except complement, which is rarely needed in the family-of-sets scenario). A downside is that you have to be aware of the universe when constructing the constraints, because they have to contain "don't care" paths for all the variables not mentioned in the constraint.
You can find more information about both BDDs and ZDDs in The Art of Computer Programming volume 4A, chapter 7.1.4, there is an old version available for free here.
These methods are in particular nice to represent large numbers of such combinations, and to manipulate them somehow before generating all the possibilities. So this will also work when there are many items and many constraints (such that the final count of combinations is not too large), (usually) without creating intermediate results of exponential size.

Related

Possibilities of dividing a class in groups with several criteria

I have to divide a class of 50 students writing a dissertation in 10 different discussion groups of 5 members each. In theory, there are 1.35363x10^37 possible ways of doing this, which is just the result of {50!}/{(5!^10)*10!)}, if it is already decided that the groups will consist of 5.
However, each group is to be led by a facilitator. This reduces the number of possible combinations considerably, because each facilitaror has one field of expertise among 5 possible ones, which should be matched to the topics the students are writing about as much as possible. If there are three facilitators with competence A, three with competence B, two with competence C, one with competence D and one with competence E, and 15 students are assigned to A, 15 to B, 10 to C, 5 to D and 5 to E, the number of possible combinations comes down to 252 505.
But both students and facilitators keep advocating for the use of more criteria, instead of just focusing on field of expertise. For example, wanting to be in a group of students that know each other, or being in a group with a facilitator that has particular knowledge of a specific research method.
I am trying to illustrate my intuitive reasoning, which tells me that each new criteria increases the complexity/impossibility of the task, if the objective is a completely efficient solution. But I can't get my head around expressing this analytically in a satisfactory manner.
Is my reasoning correct, that adding criteria would reduce the amount of possibilities that can be discarded following the inclusion-exclusion principle, thus making the task more complex, adding possible combinations? I also think that if the criteria are not compatible (for example if students that know each other are writing about different topics, and there aren't enough competent facilitators), certain constraints become inviable.
You need to distinguish between computational complexity and human complexity. Adding constraints almost automatically increases the human complexity of the problem in the sense that it means that there is more to wrap your mind around. But -- it isn't true that the computational complexity increases. At least sometimes it decreases.
For example, say you have a set of 200 items and you want to determine if there is a subset of them which satisfy some constraint. Depending on the constraint, There might be no feasible way to do it. After all, 2^200 is much too large to brute-force. Now add the constraint that the subset needs to have exactly 3 elements. Now all of a sudden it is possible to brute force (just run through all 1,313,400 3-element subsets until you either find a solution or determine that none exist). This is enough to show that it isn't true that adding a constraint always makes a problem intrinsically more difficult. In the discrete case a new constraint can cut down on the size of the search space in a way that can be exploited. In the continuous cases it can reduce degrees of freedom and thus lower the dimension of the problem. This isn't to say that it always makes it easier. Probably as a rule of thumb, additional constraints tend to make a problem more difficult.
Your actual problem isn't spelled out enough to give concrete advice. One possibility (and one way to handle a proliferation of somewhat extraneous constraints) is to divide the constraints into hard constraints which need to be satisfied and soft constraints which are merely desired but not strictly needed. Turn it into an optimization problem: find the solution which maximizes the number of soft-constraints that are satisfied, subject to the condition that it satisfies the hard constraints. Perhaps you can formulate it as an integer programming problem and hopefully find an exact solution. Or, if it is easy to generate solutions that satisfy the hard constraints and it is easy to mutate one such solution to obtain another (e.g. swap two students who are in different groups), then an evolutionary algorithm would be a reasonable heuristic.

How to quantitatively measure how simplified a mathematical expression is

I am looking for a simple method to assign a number to a mathematical expression, say between 0 and 1, that conveys how simplified that expression is (being 1 as fully simplified). For example:
eval('x+1') should return 1.
eval('1+x+1+x+x-5') should returns some value less than 1, because it is far from being simple (i.e., it can be further simplified).
The parameter of eval() could be either a string or an abstract syntax tree (AST).
A simple idea that occurred to me was to count the number of operators (?)
EDIT: Let simplified be equivalent to how close a system is to the solution of a problem. E.g., given an algebra problem (i.e. limit, derivative, integral, etc), it should assign a number to tell how close it is to the solution.
The closest metaphor I can come up with it how a maths professor would look at an incomplete problem and mentally assess it in order to tell how close the student is to the solution. Like in a math exam, were the student didn't finished a problem worth 20 points, but the professor assigns 8 out of 20. Why would he come up with 8/20, and can we program such thing?
I'm going to break a stack-overflow rule and post this as an answer instead of a comment, because not only I'm pretty sure the answer is you can't (at least, not the way you imagine), but also because I believe it can be educational up to a certain degree.
Let's assume that a criteria of simplicity can be established (akin to a normal form). It seems to me that you are very close to trying to solve an analogous to entscheidungsproblem or the halting problem. I doubt that in a complex rule system required for typical algebra, you can find a method that gives a correct and definitive answer to the number of steps of a series of term reductions (ipso facto an arbitrary-length computation) without actually performing it. Such answer would imply knowing in advance if such computation could terminate, and so contradict the fact that automatic theorem proving is, for any sufficiently powerful logic capable of representing arithmetic, an undecidable problem.
In the given example, the teacher is actually either performing that computation mentally (going step by step, applying his own sequence of rules), or gives an estimation based on his experience. But, there's no generic algorithm that guarantees his sequence of steps are the simplest possible, nor that his resulting expression is the simplest one (except for trivial expressions), and hence any quantification of "distance" to a solution is meaningless.
Wouldn't all this be true, your problem would be simple: you know the number of steps, you know how many steps you've taken so far, you divide the latter by the former ;-)
Now, returning to the criteria of simplicity, I also advice you to take a look on Hilbert's 24th problem, that specifically looked for a "Criteria of simplicity, or proof of the greatest simplicity of certain proofs.", and the slightly related proof compression. If you are philosophically inclined to further understand these subjects, I would suggest reading the classic Gödel, Escher, Bach.
Further notes: To understand why, consider a well-known mathematical artefact called the Mandelbrot fractal set. Each pixel color is calculated by determining if the solution to the equation z(n+1) = z(n)^2 + c for any specific c is bounded, that is, "a complex number c is part of the Mandelbrot set if, when starting with z(0) = 0 and applying the iteration repeatedly, the absolute value of z(n) remains bounded however large n gets." Despite the equation being extremely simple (you know, square a number and sum a constant), there's absolutely no way to know if it will remain bounded or not without actually performing an infinite number of iterations or until a cycle is found (disregarding complex heuristics). In this sense, every fractal out there is a rough approximation that typically usages an escape time algorithm as an heuristic to provide an educated guess whether the solution will be bounded or not.

Recursive hypothesis-building with ambiguites - what's it called?

There's a problem I've encountered a lot (in the broad fields of data analyis or AI). However I can't name it, probably because I don't have a formal CS background. Please bear with me, I'll give two examples:
Imagine natural language parsing:
The flower eats the cow.
You have a program that takes each word, and determines its type and the relations between them. There are two ways to interpret this sentence:
1) flower (substantive) -- eats (verb) --> cow (object)
using the usual SVO word order, or
2) cow (substantive) -- eats (verb) --> flower (object)
using a more poetic world order. The program would rule out other possibilities, e.g. "flower" as a verb, since it follows "the". It would then rank the remaining possibilites: 1) has a more natural word order than 2), so it gets more points. But including the world knowledge that flowers can't eat cows, 2) still wins. So it might return both hypotheses, and give 1) a score of 30, and 2) a score of 70.
Then, it remembers both hypotheses and continues parsing the text, branching off. One branch assumes 1), one 2). If a branch reaches a contradiction, or a ranking of ~0, it is discarded. In the end it presents ranked hypotheses again, but for the whole text.
For a different example, imagine optical character recognition:
** **
** ** *****
** *******
******* **
* ** **
** **
I could look at the strokes and say, sure this is an "H". After identifying the H, I notice there are smudges around it, and give it a slightly poorer score.
Alternatively, I could run my smudge recognition first, and notice that the horizontal line looks like an artifact. After removal, I recognize that this is ll or Il, and give it some ranking.
After processing the whole image, it can be Hlumination, lllumination or Illumination. Using a dictionary and the total ranking, I decide that it's the last one.
The general problem is always some kind of parsing / understanding. Examples:
Natural languages or ambiguous languages
OCR
Path finding
Dealing with ambiguous or incomplete user imput - which interpretations make sense, which is the most plausible?
I'ts recursive.
It can bail out early (when a branch / interpretation doesn't make sense, or will certainly end up with a score of 0). So it's probably some kind of backtracking.
It keeps all options in mind in light of ambiguities.
It's based off simple rules at the bottom can_eat(cow, flower) = true.
It keeps a plausibility ranking of interpretations.
It's recursive on a meta level: It can fork / branch off into different 'worlds' where it assumes different hypotheses when dealing with the next part of data.
It'll forward the individual rankings, probably using bayesian probability, to dependent hypotheses.
In practice, there will be methods to train this thing, determine ranking coefficients, and there will be cutoffs if the tree becomes too big.
I have no clue what this is called. One might guess 'decision tree' or 'recursive descent', but I know those terms mean different things.
I know Prolog can solve simple cases of this, like genealogies and finding out who is whom's uncle. But you have to give all the data in code, and it doesn't seem convienent or powerful enough to do this for my real life cases.
I'd like to know, what is this problem called, are there common strategies for dealing with this? Is there good literature on the topic? Are there libraries for ideally C(++), Python, were you can just define a bunch of rules, and it works out all the rankings and hypotheses?
I don't think there is one answer that fits all the bullet points you have. But I hope my links will lead you closer to an answer or might give you a different question.
I think the closest answer is Bayesian network since you have probabilities affecting each other as I understand it, it is also related to Conditional probability and Fuzzy Logic
You also describe a bit of genetic programming as well as Artificial Neural Networks
I can name drop some more topics which might be related:
http://en.wikipedia.org/wiki/Rule-based_programming
http://en.wikipedia.org/wiki/Expert_system
http://en.wikipedia.org/wiki/Knowledge_engineering
http://en.wikipedia.org/wiki/Fuzzy_system
http://en.wikipedia.org/wiki/Bayesian_inference

pattern matching

Suppose I have a set of tuples like this (each tuple will have 1,2 or 3 items):
Master Set:
{(A) (A,C) (B,C,E)}
and suppose I have another set of tuples like this:
Real Set: {(BOB) (TOM) (ERIC,SALLY,CHARLIE) (TOM,SALLY) (DANNY) (DANNY,TOM) (SALLY) (SALLY,TOM,ERIC) (BOB,SALLY) }
What I want to do is to extract all subsets of Tuples from the Real Set where the tuple members can be substituted to become the same as the Master Set.
In the example above, two sets would be returned:
{(BOB) (BOB,SALLY) (ERIC,SALLY,CHARLIE)}
(let BOB=A,ERIC=B,SALLY=C,CHARLIE=E)
and
{(DANNY) (DANNY,TOM) (SALLY,TOM,ERIC)}
(let DANNY=A,SALLY=B,TOM=C,ERIC=E)
Its sort of pattern matching, sort of combinatorics I guess. I really don't know how to classify this problem and what common plans of attack there are for it. What would the stackoverflow experts suggest?
Seperate your tuples into sets by size. Within each set, create a data structure that allows you to efficiently query for tuples containing a given element. The first part of this structure is your tuples as an array (so that each tuple has a cannonical index). The second set is: Map String (Set Int). This is somewhat space intensive but hopefully not prohibative.
Then, you, essentially, brute force it. For all assignments to the first master set, restrict all assignments to other master sets. For all remaining assignments to the second, restrict all assignments to the third and beyond, etc. The algorithm is basically inductive.
I should add that I don't think the problem is NP-complete so much as just flat worst-case exponential. It's not a decision problem, but an enumeration problem. And it's fairly easy to imagine scenarios of inputs that blow up exponentially.
It will be difficult to do efficiently since your problem is probably NP-complete (it includes subgraph isomorphism as a special case). That assumes the patterns and database both vary in size, though. How much data are you searching? How complicated will your patterns be? I would recommend the brute force solution first, then test if that is too slow and you need something fancier.

The difference between MapReduce and the map-reduce combination in functional programming

I read the mapreduce at http://en.wikipedia.org/wiki/MapReduce ,understood the example of how to get the count of a "word" in many "documents". However I did not understand the following line:
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
Can someone elaborate on the difference again(MapReduce framework VS map and reduce combination)? Especially, what does the reduce functional programming do?
Thanks a great deal.
The main difference would be that MapReduce is apparently patentable. (Couldn't help myself, sorry...)
On a more serious note, the MapReduce paper, as I remember it, describes a methodology of performing calculations in a massively parallelised fashion. This methodology builds upon the map / reduce construct which was well known for years before, but goes beyond into such matters as distributing the data etc. Also, some constraints are imposed on the structure of data being operated upon and returned by the functions used in the map-like and reduce-like parts of the computation (the thing about data coming in lists of key/value pairs), so you could say that MapReduce is a massive-parallelism-friendly specialisation of the map & reduce combination.
As for the Wikipedia comment on the function being mapped in the functional programming's map / reduce construct producing one value per input... Well, sure it does, but here there are no constraints at all on the type of said value. In particular, it could be a complex data structure like perhaps a list of things to which you would again apply a map / reduce transformation. Going back to the "counting words" example, you could very well have a function which, for a given portion of text, produces a data structure mapping words to occurrence counts, map that over your documents (or chunks of documents, as the case may be) and reduce the results.
In fact, that's exactly what happens in this article by Phil Hagelberg. It's a fun and supremely short example of a MapReduce-word-counting-like computation implemented in Clojure with map and something equivalent to reduce (the (apply + (merge-with ...)) bit -- merge-with is implemented in terms of reduce in clojure.core). The only difference between this and the Wikipedia example is that the objects being counted are URLs instead of arbitrary words -- other than that, you've got a counting words algorithm implemented with map and reduce, MapReduce-style, right there. The reason why it might not fully qualify as being an instance of MapReduce is that there's no complex distribution of workloads involved. It's all happening on a single box... albeit on all the CPUs the box provides.
For in-depth treatment of the reduce function -- also known as fold -- see Graham Hutton's A tutorial on the universality and expressiveness of fold. It's Haskell based, but should be readable even if you don't know the language, as long as you're willing to look up a Haskell thing or two as you go... Things like ++ = list concatenation, no deep Haskell magic.
Using the word count example, the original functional map() would take a set of documents, optionally distribute subsets of that set, and for each document emit a single value representing the number of words (or a particular word's occurrences) in the document. A functional reduce() would then add up the global counts for all documents, one for each document. So you get a total count (either of all words or a particular word).
In MapReduce, the map would emit a (word, count) pair for each word in each document. A MapReduce reduce() would then add up the count of each word in each document without mixing them into a single pile. So you get a list of words paired with their counts.
MapReduce is a framework built around splitting a computation into parallelizable mappers and reducers. It builds on the familiar idiom of map and reduce - if you can structure your tasks such that they can be performed by independent mappers and reducers, then you can write it in a way which takes advantage of a MapReduce framework.
Imagine a Python interpreter which recognized tasks which could be computed independently, and farmed them out to mapper or reducer nodes. If you wrote
reduce(lambda x, y: x+y, map(int, ['1', '2', '3']))
or
sum([int(x) for x in ['1', '2', '3']])
you would be using functional map and reduce methods in a MapReduce framework. With current MapReduce frameworks, there's a lot more plumbing involved, but it's the same concept.

Resources