Decision tree for significant variables in R

How can I use a decision tree graph to determine the significant variables? I know that the variable with the largest information gain (i.e., the greatest reduction in entropy) should be at the root of the tree. Given my graph, how do I interpret it to find out which variables are significant?

What does significant mean to you? At each node, the variable selected is the most significant given the context, assuming that selecting by information gain actually works (it's not always the case). For example, at node 11, BB is the most significant discriminator given AA > 20.
Clearly, AA and BB are the most useful, assuming that selecting by information gain gives the best way to partition the data. The rest give further refinement; C and N would be next.
What you should be asking is: Should I keep all the nodes?
The answer depends on many things and there is likely no best answer.
One way would be to use the total case count of each leaf and merge leaves accordingly.
Not sure how I would do this given your image. It's not really clear what is being shown at the leaves and what 'n' is. Also not sure what 'p' is.
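For a concrete (if simplified) way to read "significance" out of a fitted tree, here is a minimal sketch in Python/scikit-learn rather than R; the column names AA, BB, C and N and the toy data are placeholders standing in for whatever your real variables are. The feature_importances_ attribute aggregates the entropy reduction each variable contributes across all of its splits, which is the same notion of importance discussed above (in R, rpart exposes a similar variable.importance field).

    # Minimal sketch (Python/scikit-learn, not R): toy data with placeholder
    # column names standing in for the variables in the original graph.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.DataFrame({
        "AA": [5, 25, 30, 10, 40, 22, 8, 35],
        "BB": [1, 0, 1, 1, 0, 0, 1, 0],
        "C":  [3, 7, 2, 9, 4, 6, 1, 8],
        "N":  [0, 1, 1, 0, 1, 0, 0, 1],
        "y":  [0, 1, 1, 0, 1, 1, 0, 1],
    })

    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(df[["AA", "BB", "C", "N"]], df["y"])

    # Larger values mean more of the total entropy reduction is attributed to
    # that variable, i.e. it is more "significant" in the sense discussed above.
    for name, imp in zip(["AA", "BB", "C", "N"], tree.feature_importances_):
        print(f"{name}: {imp:.3f}")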

Related

Representation of graphs in a hash table

I'm currently writing my master's thesis about clusterings in graphs. My professor said he wants the graph to be represented as a hash table, because it needs less space than an adjacency matrix and checking whether an edge exists between two vertices is faster than with adjacency lists.
Anyway, I have a lot of trouble understanding how a graph can be built with (perfect) hash functions. I know there should be two tables nested inside each other: the first contains every node and the second contains all the adjacent vertices. But how do I find a hash function that does this correctly?
After I build the graph I have to assign a weight to each edge. Is it better to build a new graph or keep the old one? How can I assign the weights correctly to each edge, and how do I store them?
And the last question: How fast can I do a degree query for one vertex? O(1)?
Sorry for all these questions but I read so many papers and I'm still confused.
Thank you in advance for any help!!!
Lisa
You have to ask your professor, but I would assume it is something simple.
E.g., let's say you have a triangle A, B, C; then in the hash you just represent it as
A {B,C}
B {A,C}
C {A,B}
So the edge A,B can be reached from both A's entry and B's entry.
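To make that concrete, here is a small sketch of the nested-table idea using Python's built-in dicts as a stand-in for the (perfect) hash tables: the outer table is keyed by vertex, the inner table by neighbour, and the edge weight is simply the stored value, so no second graph is needed for the weights. Edge checks and degree queries are expected O(1).

    # Outer dict: vertex -> inner dict of neighbours; inner dict: neighbour -> weight.
    graph = {}

    def add_edge(u, v, weight=1.0):
        """Insert the undirected edge {u, v}; it is stored once from each endpoint."""
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight

    def has_edge(u, v):
        """Expected O(1): one hash lookup for u, one for v."""
        return v in graph.get(u, {})

    def degree(u):
        """Expected O(1): the inner table already knows its own size."""
        return len(graph.get(u, {}))

    # The triangle A, B, C from the answer above, now with made-up weights:
    add_edge("A", "B", 2.5)
    add_edge("B", "C", 1.0)
    add_edge("A", "C", 4.0)

    print(has_edge("A", "B"))   # True
    print(graph["A"]["B"])      # 2.5
    print(degree("A"))          # 2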

Possibilities of dividing a class in groups with several criteria

I have to divide a class of 50 students writing a dissertation into 10 different discussion groups of 5 members each. In theory, there are 1.35363 x 10^37 possible ways of doing this, which is just the result of 50! / ((5!)^10 * 10!), if it is already decided that the groups will consist of 5 students.
However, each group is to be led by a facilitator. This reduces the number of possible combinations considerably, because each facilitator has one field of expertise among 5 possible ones, which should be matched to the topics the students are writing about as much as possible. If there are three facilitators with competence A, three with competence B, two with competence C, one with competence D and one with competence E, and 15 students are assigned to A, 15 to B, 10 to C, 5 to D and 5 to E, the number of possible combinations comes down to 252,505.
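As a quick sanity check of the first figure, the count can be reproduced with exact integer arithmetic; the second figure (252,505) depends on grouping assumptions that are not fully spelled out above, so only the unconstrained count is verified in this sketch.

    from math import factorial

    ways = factorial(50) // (factorial(5) ** 10 * factorial(10))
    print(f"{ways:.5e}")   # ~1.35363e+37, matching the figure above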
But both students and facilitators keep advocating for the use of more criteria, instead of just focusing on field of expertise: for example, wanting to be in a group of students who know each other, or being in a group with a facilitator who has particular knowledge of a specific research method.
I am trying to illustrate my intuitive reasoning, which tells me that each new criterion increases the complexity/impossibility of the task if the objective is a completely efficient solution. But I can't get my head around expressing this analytically in a satisfactory manner.
Is my reasoning correct that adding criteria would reduce the number of possibilities that can be discarded following the inclusion-exclusion principle, thus making the task more complex by adding possible combinations? I also think that if the criteria are not compatible (for example, if students who know each other are writing about different topics and there aren't enough competent facilitators), certain constraints become infeasible.
You need to distinguish between computational complexity and human complexity. Adding constraints almost automatically increases the human complexity of the problem, in the sense that there is more to wrap your mind around. But it isn't true that the computational complexity always increases; at least sometimes it decreases.
For example, say you have a set of 200 items and you want to determine if there is a subset of them which satisfies some constraint. Depending on the constraint, there might be no feasible way to do it: after all, 2^200 is much too large to brute-force. Now add the constraint that the subset needs to have exactly 3 elements. All of a sudden it is possible to brute-force (just run through all 1,313,400 3-element subsets until you either find a solution or determine that none exists). This is enough to show that it isn't true that adding a constraint always makes a problem intrinsically more difficult. In the discrete case a new constraint can cut down the size of the search space in a way that can be exploited; in the continuous case it can reduce degrees of freedom and thus lower the dimension of the problem. This isn't to say that a new constraint always makes things easier; as a rule of thumb, additional constraints probably tend to make a problem more difficult.
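To make the 200-item example concrete, here is a small sketch; the particular constraint (three items summing to 42) is invented purely for illustration.

    from itertools import combinations
    from math import comb

    items = range(200)

    print(2 ** 200)        # ~1.6e60 subsets in total: hopeless to brute-force
    print(comb(200, 3))    # 1313400 three-element subsets: easy to brute-force

    # Hypothetical constraint just for illustration: the three items sum to 42.
    solution = next((s for s in combinations(items, 3) if sum(s) == 42), None)
    print(solution)        # (0, 1, 41), the first such triple in lexicographic order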
Your actual problem isn't spelled out enough to give concrete advice. One possibility (and one way to handle a proliferation of somewhat extraneous constraints) is to divide the constraints into hard constraints, which need to be satisfied, and soft constraints, which are merely desired but not strictly needed. Turn it into an optimization problem: find the solution which maximizes the number of soft constraints that are satisfied, subject to the condition that it satisfies the hard constraints. Perhaps you can formulate it as an integer programming problem and hopefully find an exact solution. Or, if it is easy to generate solutions that satisfy the hard constraints and it is easy to mutate one such solution to obtain another (e.g. swap two students who are in different groups), then an evolutionary algorithm would be a reasonable heuristic, as sketched below.
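A minimal sketch of that hard/soft split, assuming a made-up instance (topics, facilitator expertise, random "who knows whom" pairs): start from an assignment that satisfies the hard constraint, then hill-climb by swapping students between groups, keeping a swap only if the hard constraint still holds and the soft score improves. An integer program or a proper metaheuristic would replace this naive loop in practice.

    import random
    random.seed(0)

    # Hypothetical instance: 10 groups with facilitator expertise, 50 students
    # with topics, and some random "these two students know each other" pairs.
    group_expertise = ["A", "A", "A", "B", "B", "B", "C", "C", "D", "E"]
    topics = ["A"] * 15 + ["B"] * 15 + ["C"] * 10 + ["D"] * 5 + ["E"] * 5
    friends = {frozenset(random.sample(range(50), 2)) for _ in range(40)}

    def initial_assignment():
        """Hard constraint satisfied by construction: each group gets 5 students
        whose topic matches the facilitator's expertise."""
        by_topic = {}
        for s, t in enumerate(topics):
            by_topic.setdefault(t, []).append(s)
        return [[by_topic[e].pop() for _ in range(5)] for e in group_expertise]

    def hard_ok(groups):
        return all(topics[s] == e for g, e in zip(groups, group_expertise) for s in g)

    def soft_score(groups):
        """Soft constraint: number of friend pairs placed in the same group."""
        return sum(1 for g in groups for pair in friends if pair <= set(g))

    groups = initial_assignment()
    best = soft_score(groups)
    for _ in range(20000):
        g1, g2 = random.sample(range(10), 2)
        i, j = random.randrange(5), random.randrange(5)
        groups[g1][i], groups[g2][j] = groups[g2][j], groups[g1][i]
        if hard_ok(groups) and soft_score(groups) > best:
            best = soft_score(groups)
        else:  # revert the swap
            groups[g1][i], groups[g2][j] = groups[g2][j], groups[g1][i]

    print("friend pairs kept together:", best)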

More info needed on number of nodes generated by Breadth First Search

I am new to AI and was going through Peter Norvig's book. I've already looked into this question: What is the number of nodes generated by breadth-first search?
It says that if we apply the goal test to each node when it is selected for expansion, then the number of nodes generated is 1 + b + b^2 + b^3 + ... + b^d + (b^(d+1) - b).
But what if my goal state is a leaf node at the final depth, so there is no depth at all after the goal? Then how can b^(d+1) be evaluated? E.g., in a tree with max depth 3, if my goal lies at depth 3, how would I evaluate b^(3+1) when there is no 4th level at all? Please clear up my doubt. Thanks in advance!
Note that the answer you linked mentions that this is the number of nodes that will be generated in the worst case.
Generated means that not all of those nodes are tested to see if they are the goal; they're simply generated and stored so that they can eventually be compared to the goal in case the goal is not found yet.
Worst case has two important implications. Try to visualize the Breadth-First Search going from left to right, then down one level, then left to right again, then down, etc. With worst case we assume that, on whatever depth level d the goal is located, the goal is the very last (rightmost) node. This means that all nodes to the left of it are compared to the goal node, and any successors/children of them are generated as well.
Now, I know you said that in your case there are no nodes at any depth level beyond d, but the second implication of saying worst case is that we do assume there are essentially infinitely many depth levels.
Indeed, for your case that equation is not entirely correct, but this is simply because you don't have the worst case. In your case, the search process would indeed not have to generate the last (b^(d+1) - b) nodes of the equation.
A final note on the terminology you used: you asked how b^(d+1) (for example, b^(3+1)) can be evaluated if there is no depth level beyond d = 3. There is no problem mathematically evaluating that term. Even though in your case there is no depth level 4, we can still evaluate b^(3+1) just fine; it simply would not make sense to include it, because it does not count anything in your tree.
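A small sketch that evaluates both counts, with example values for b and d, may make the difference concrete: the worst-case formula from the linked answer versus the count when the tree simply stops at depth d, as in the question.

    def worst_case_generated(b, d):
        """1 + b + ... + b^d + (b^(d+1) - b): goal tested on expansion, goal is the
        rightmost node at depth d, and children at depth d+1 are still generated."""
        return sum(b ** i for i in range(d + 1)) + (b ** (d + 1) - b)

    def generated_when_tree_stops_at_d(b, d):
        """Same search, but no level exists beyond depth d, as in the question."""
        return sum(b ** i for i in range(d + 1))

    b, d = 2, 3
    print(worst_case_generated(b, d))            # 1 + 2 + 4 + 8 + (16 - 2) = 29
    print(generated_when_tree_stops_at_d(b, d))  # 1 + 2 + 4 + 8 = 15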

Finding a minimum spanning tree solution (with conditions) when extending the graph

I have a logic question, so choose whichever of the two explanations suits you:
Mathematical:
I have an undirected weighted complete graph over 2-14 nodes. The nodes always come in pairs (start point and end point). For this I already have the minimum spanning tree, which respects the constraint that each pair's start point always comes before its end point. Now I want to add another pair of nodes.
Real life explanation:
I already have an optimal taxi route for 1-7 people. Each person joins (start point) and leaves (end point) at a different place. Now I want to find the optimal route when I add another person to the taxi. I already have the calculated subpaths from each point to each other point in my database (hence the weighted graph). All calculated path costs are real values, not heuristic estimates.
Now I try to find the most performant solution to solve this. My current idea:
Find the existing point nearest to the new start point. Insert the new start point a) before and b) after that point. Choose the faster option.
Find the existing point nearest to the new end point. Insert the new end point a) before and b) after that point. Choose the faster option.
Ignoring the case where the new end point comes before the new start point, this seems feasible.
I expect the taxi to travel in one general direction, which eliminates that edge case.
Is there any case I'm missing in which this algorithm wouldn't calculate the optimal solution?
There are definitely many cases where this algorithm (which is a First Fit construction heuristic) won't find the optimal solution. Given a reasonably sized dataset, in my experience, I would expect improvements of 10-20% from simply taking that result and adding metaheuristics (or other optimization algorithms).
Explanation:
If you have multiple taxis with a limited passenger capacity, the problem has an inherent bin packing component, which is NP-complete (and no known polynomial-time construction heuristic is guaranteed to solve it optimally).
But even if you have just 1 taxi, it is similar to TSP: if you have the optimal solution for 10 locations and add 1 location, it can create a snowball effect that makes the new optimal solution look completely different. (Sorry, no visual image of this yet.)
And if you need to add any additional constraints on top of that later on, you need to be aware of these false assumptions.
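For the single-taxi case, a slightly stronger construction step than the nearest-point rule is to try every feasible pair of insertion positions for the new pickup and drop-off (pickup never after drop-off) and keep the cheapest, while leaving the existing stop order untouched. The sketch below does exactly that, using made-up coordinates and Euclidean distance in place of the precomputed subpath costs from the database; as explained above, even this is only optimal relative to the fixed order of the existing stops.

    from math import dist

    route = [(0, 0), (2, 1), (4, 0), (6, 2), (8, 0)]   # current stop sequence
    new_start, new_end = (3, 3), (7, 3)                # new passenger's pickup / drop-off

    def cost(stops):
        return sum(dist(a, b) for a, b in zip(stops, stops[1:]))

    best = None
    for i in range(len(route) + 1):           # insertion position of the new pickup
        for j in range(i, len(route) + 1):    # drop-off is never before the pickup
            candidate = route[:i] + [new_start] + route[i:j] + [new_end] + route[j:]
            if best is None or cost(candidate) < cost(best):
                best = candidate

    print(best, round(cost(best), 2))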

What Are Large Graphs? What is Large Graph Analysis? What Is Big Data? What is Big Data Analysis?

I know what these are, as I have started working with them. But for now, I just want to know the formal definitions of these terms and formal answers to these questions.
Any help in this regard is highly appreciated.
In my opinion, there is no absolute, formal criterion for when a graph becomes 'large' or when the amount of data becomes 'big'. These adjectives are meaningless without a frame of reference.
For instance, when you say someone is 'tall', it is implicitly assumed that you are either comparing this person to yourself or to the perceived average height of people. If you change your frame of reference and compare this person to, let's say, Mount Everest, this person's height becomes negligible. I could give a billion other examples, but the take-home message is: there is no absolute notion of 'bigness' or 'smallness'. Scale is a relative notion. It is a simple concept, but with very strong implications: in a sense, physics has been so successful because physicists understood this very early.
So, to answer this question, I think a good rule of thumb is:
'Large graphs' are graphs whose exploration requires long computation times on a typical quad-core machine compared to what people judge reasonable (an hour, a day; your patience may vary).
'Big data' is typically data that takes too much memory to be stored on a single hard drive.
Of course, these are just rules of thumb.
Usually, a graph whose nodes and arrows form sets is a small graph; otherwise, it is a large graph.
For example, if we denote the collection of nodes of a graph G by G0 and the collection of arrows by G1, let G0 = {1, 2}, G1 = {a, b, c}, with source(a) = target(a) = source(b) = target(c) = 1 and target(b) = source(c) = 2. This is a small graph. By contrast, the graph of sets and functions has all sets as nodes and all functions between sets as arrows; the source of a function is its domain, and its target is its codomain. In this example, unlike the previous one, the nodes do not form a set. Thus the graph of sets and functions is a large graph.
More generally we refer to any kind of mathematical structure as ‘small’ if the collection(s) it is built on form sets, and ‘large’ otherwise.
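To make the small example above concrete: its node and arrow collections are finite sets and source/target are ordinary functions, so the whole graph can be written down as data, as in this rough sketch. The graph of all sets and functions cannot be written this way, since its nodes do not form a set.

    G0 = {1, 2}                        # nodes
    G1 = {"a", "b", "c"}               # arrows
    source = {"a": 1, "b": 1, "c": 2}
    target = {"a": 1, "b": 2, "c": 1}

    for arrow in sorted(G1):
        print(f"{arrow}: {source[arrow]} -> {target[arrow]}")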
