Node's position in tree as a Feature Vector? - vector

Background
I have a tree of nodes and I'm trying to run some machine learning algorithms to classify them. One of the features I want to use is the position of the nodes in the tree, i.e. closer nodes are likely to be in the same class.
My Problem
I'm representing all the features as a vector of numbers. Any thoughts on how I can represent position in the tree as a vector? So that distance b/n two vectors corresponds to distance between nodes in the tree? (I have a small tree of depth around 5-7 and branching around 2-3)
What I tried
P.S. I read about algorithms to find shortest distance between 2 nodes (finding each one's distance to their closest common ancestor) One idea I found was to have a vector x where each index corresponds to possible ancestors in the tree. Then set x[i] = numbers of levels from that ancestor. The problem with that is- I don't know what to do with nodes that aren't ancestors.

just put the path of the tree as the vector. then simply calculate the length of the difference between the two paths. so for example. 2,3,1,5,3 is one path. and 2,3,3,5,9,5 is another path. so 2,3 they have in common. so the length of the difference is 1,5,3 and 3,5,9,5 which is 7. good luck

So there is actually a very nice way to derive the features that you want; you can do so with MDS.
What MDS does is that it takes a N by N matrix (here N is number of nodes) where entry a_{i,j} is the distance between item i and item j (node i and node j) and for each item i it will return a D (pre-specified) position vector, D_i, such that the distance between D_i and D_j is approximately a_{i,j}.
Thus, we can have your feature vector with a bit of pre-processing. First, find the shortest distance (in hops) for each pair of nodes (you could use Floyd-Warshall) then use the distance matrix as input for MDS and specify the number of dimensions for you're position vector, and MDS will output position vectors of said dimensions.
If you search the web I'm sure you can find open sourced implementations to both Floyd-Warshall and MDS.

Related

Shortest distance to cover all the (N+1) points. All the N points lie on x- axis. Remaining one point lies anywhere in the coordinate plane

Given (N+1) points. All the N points lie on x- axis. Remaining one point (HEAD point) lies anywhere in the coordinate plane.
Given a START point on x- axis.
Find the shortest distance to cover all the points starting from START point.We can traverse a point multiple times.
Example N+1=4
points on x axis
(0,1),(0,2),(0,3)
HEAD Point
(1,1) //only head point can lie anywhere //Rest all on x axis
START Point
(0,1)
I am looking for a method as of how to approach this problem.
Whether we should visit HEAD point first or HEAD point in between.
I tried to find a way using Graph Theory to simplify this problem and reduce the paths that need to be considered. If there is an elegant way to represent this problem using graphs to identify a solution, I was not able to find it. This approach becomes very inefficient as the n increases - the time and memory is O(2^n).
Looking at this as a tree graph, the root node would be the START point, then each of its child nodes would be the points it is connected to.
Since the START point and the rest of the points aside from the HEAD all lay on the x-axis, all non-HEAD points only need to be connected to adjacent points on the x-axis. This is because the distance of the path between any two points is the sum of the distances between any adjacent points along the path between those two points (the subset of nodes representing points on the x-axis does not need to form a complete graph). This reduces the brute force approach some.
Here's a simple example:
The upper left shows the original problem: points on the x-axis along with the START and HEAD points.
In the upper right, this has been transformed into a graph with each node representing a point from the original problem. The edges represent the paths that can be taken between points. This assumes that the START point only represents the first point in the path. Unlike the other nodes, it is only included in the path once. If that is not the case and the path can return to the START point, this would approximately double the possible paths, but the same approach can be followed.
In the bottom left, the START point, a, is the root of a tree graph, and each node connected to the START point is a child node. This process is repeated for each child node until either:
A path that is obviously not optimal is identified, in which case that node can just be excluded from the graph. See the nodes in red boxes; going back and forth between the same nodes is unnecessary.
All points are included when traversing the tree from the root to that node, producing a potential solution.
Note that when creating the tree graph, each time a node is repeated, its "potential" child nodes are the same as the first time the node was included. By "potential", I mean cases above still need to be checked, because the result might include a nonsensical path, in which case that node would not be included. It is also possible a potential solution results from the path after its child nodes are included.
The last step is to add up the distances for each of the potential solutions to determine which path is shortest.
This requires a careful examination of the different cases.
Assume for now START (S) is on the far left, and HEAD (H) is somewhere in the middle the path maybe something like
H
/ \
S ---- * ----*----* * --- * ----*
Or it might be shorter to from H to and from the one of the other node
H
//
S ---- * --- * -- *----------*---*
If S is not at one end you might have something like
H
/ \
* ---- * ----*----* * --- * ----*
--------S
Or even going direct from S to H on the first step
H
/ |
* ---- * ----*----* |
S
A full analysis of cases would be quite extensive.
Actually solving the problem, might depend on the number of nodes you have. If the number is small < 10, then compete enumeration might be possible. Just work out every possible path, eliminate the ones which are illegal, and choose the smallest. The number of paths is I think in the order of n!, so its computable for small n.
For large n you can break the problem into small segments. I think its enough just to consider a small patch with nodes either side of H and a small patch with nodes either side of S.
This is not really a solution, but a possible way to think about tackling the problem.
(To be pedantic stackoverflow.com is not the right site for this question in the stack exchange network. Computational Science : algorithms might be a better place.
This is a fun problem. First, lets try to find a brute force solution, as Poosh did.
Observations about the Shortest Path
No repeated points
You are in an Euclidean geometry, thus the triangle inequality holds: For all points a,b,c, the distance d(a,b) + d(b,c) <= d(a,c). Thus, whenever you have an optimal path that contains a point that occurs more than once, you can remove one of them, which means it is not an optimal path, which leads to a contradiction and proves that your optimal path contains each point exactly once.
Permutations
Our problem is thus to find the permutation, lets call it M_i, of the numbers 1...n for points P1...Pn (where P0 is the fixed start point and Pn the head point, P1...Pn-1 are ordered by increasing x value) that minimizes the sum of |(P_M_i)-(P_M_(i-1))| for i from 1 to n, || being the vector length sqrt(v_x²+v_y²).
The number of permutations of a set of size n is n!. In this case we have n+1 points, so a brute force approach testing all permutations would have complexity (n+1)!, which is higher than even 2^n and definitely not practical, so we need further observations to improve this.
Next Steps
My next step would now be to see if there are any other sequences that can be proven to be not optimal, leading to a reduction in the number of candidates to be tested.
Paths of non-head points
Lets look at all paths (sequences of indices of points that don't contain a head point and that are parts of the optimal path. If we don't change the start and end point of a path, then any other transpositions have no effect on the outside environment and we can perform purely local optimizations. We can prove that those sequences must have monotonic (increasing or decreasing) x coordinate values and thus monotonic indices (as they are ordered by ascending x coordinate between indices 0 and n-1):
We are in a purely one dimensional subspace and the total distance of the path is thus equal to the sum of the absolute values of the differences in x coordinates between one such point and the next. It is clear that this sum is minimized by ordering by x coordinate in either ascending or descending order and thus ordering the indices in the same way. Note that this is true for maximal such paths as well as for all continuous "subpaths" of them.
Wrapping it up
The only choices we have left are:
where do we place the head node in the optimal path?
which way do we order the two paths to the left and right?
This means we have n values for the index of the head node (1...n, 0 is fixed as the start node) and 2x2 values for sort order. So we have 4n choices which we can all calculate and pick the shortest one. One of the sort orders probably determines the other but I leave that to you.
Anyways, the complexity of this algorithm is O(4n) = O(n). Because reading in the input of the problems is in O(n) and writing the output is as well, I believe that is an algorithm of optimal complexity. However, if we could reformulate the problem somewhat, so that we could read and write the input and output in some compressed form, as in only the parameters that we actually need to solve the problem, then it is possible that we could do better.
P.S.: I'm not a mathematician so I probably used wrong words for some concepts and missed the usual notation for the variables and functions. I would be glad for some expert to check this for any obvious errors.

algorithm for 'generalized' matching in complete graphs

My problem is a generalization of a task solved by [Blossom algorithm] by Edmonds. The original task is the following: given a complete graph with weighted undirected edges, find a set of edges such that
1) every vertex of the graph is adjacent to only one edge from this set (i.e. vertices are grouped into pairs)
2) sum over weights of edges in this set is minimal.
Now, I would like to modify the first goal into
1') vertices are grouped into sets of 3 vertices (or in general, d vertices), and leave condition 2) unchanged.
My questions:
Do you know if this 'generalised' problem has a name?
Do you know about an algorithm solving it in number of steps being polynomial of number of vertices (like Blossom algorithm for an original problem)? I don't see a straightforward generalisation of Blossom algorithm, as it is based on looking for augmenting paths on a graph compressed to a bipartite graph (and uses here Hungarian algorithm). But augmenting paths do not seem to point to groups of vertices different than pairs.
Best regards,
Paweł

Number of triangles with N points inside

Given some points in plane (upto 500 points), no 3 collinear. We have to determine the number of triangles whose vertices are from the given points and that contain exactly N points inside them. How to efficiently solve this problem? The naive O(n^4) algorithm is too slow. Any better approach?
You could try thinking of the triangle as the intersection of three half-spaces. To find the number of points inside a triangle A, B, C first consider the set of points on one side of the infinite line in direction AB. Let these sets L(AB) and R(AB) for points of the left and right. Similarly you the same with other two edges and build sets L(AC) and R(AC) and sets L(BC) and R(BC).
So the number of points in ABC will be the number of points in the intersection of L(AB), L(AC) and L(BC). (You might want to consider R(AB) instead depending on the orientation of the triangle).
Now if we want to consider the full set of 500 points. First take all pairs of points AB and construct the sets L(AB) and R(AB). This will take O(n^3) operations.
Next we test all triangles and find the intersections of the three sets. If we use some hash table structure for the sets then to find the intersection points is like a hashtable lookups. If L(AB) has l elements, L(AC) has m elements and L(BC) n elements. Say l > m > n. For each point in L(BC) we need to do a lookup in L(AC) and L(BC) so thats a maximum of 2n hashtable lookups.
It might be faster to consider a geometric lookup table.
Divide your whole domain into a coarse grid say a 10 by 10 grid. We can then put each point into a set G(i,j). We can then split the sets L(AB) into each grid cell. Say call these sets L(AB,i,j) and R(AB,i,j). In testing for intersections first workout which grid cells lie in the intersection. This dramatically reduces the search space and as each set L(AB,i,j) contain fewer members there will be fewer hashtable lookups.
Actually I happened to encounter similar problem recently but the only difference was that there were around 300 pts and I solved it using bitset (C++ STL). For every pair of points, say (x[i],y[i]) and (x[j],y[j]), I formed a bitset<302>B[i][j] and B[i][j][k] stores 1 if k-th point is above line segment from point i to point j else I would store 0.
Now in a brute force manner I get three points so as to form a triangle, lets say (x[i],y[i]), (x[j],y[j]) and (x[k],y[k]), then a point,say z-th point ,would be inside triangle if B[i][j][z]==B[i][j][k] && B[j][k][z]==B[j][k][i] && B[k][i][z]==B[k][i][j] because a point inside triangle would show similar sign w.r.t. a side of triangle as the third point of triangle(one which is not on this side).
So i get three bitset variables P=B[i][j], Q=B[j][k] and R=B[k][i] and there taking there bitwise AND then applying count() function to give me the active number of bits and hence the number of points within the triangle. But make sure you change variable P such that it gives B[i][j][k]=1 if not then take bitwise not (~) of this variable.
Though the above solution is problem specific, i hope it helps. This is the problem link: http://usaco.org/current/index.php?page=viewproblem&cpid=660

R: Find cliques of special sizes efficiently using igraph

I have a large graph(100000 nodes) and i want to find its cliques of size 5.
I use this command for this goal:
cliques(graph, min=5, max=5)
It takes lots of time to calculate this operation. It seems that it first tries to find all of the maximal cliques of the graph and then pick the cliques with size 5; I guess this because of the huge difference of run time between these two commands while both of them are doing a same job:
adjacent.triangles (graph) # takes about 30s
cliques(graph, min=3, max=3) # takes more than an hour
I am looking for a command like adjacent.triangles to find clique with size 5 efficiently.
Thanks
There is a huge difference between adjacent.triangles() and cliques(). adjacent.triangles() only needs to count the triangles, while cliques() needs to store them all. This could easily account for the time difference if there are lots of triangles. (Another factor is that the algorithm in cliques() has be generic and not limited to triangles - it could be the case that adjacent.triangles() contains some optimizations since we know that we are interested in triangles only).
For what it's worth, cliques() does not find all the maximal cliques; it starts from 2-cliques (i.e. edges) and then merges them into 3-cliques, 4-cliques etc until it reaches the maximum size that you specified. But again, if you have lots of 3-cliques in your graph, this could easily become a bottleneck as there is one point in the algorithm where all the 3-cliques have to be stored (even if you are not interested in them) since we need them to find the 4-cliques.
You are probably better off with maximal.cliques() first to get a rough idea of how large the maximal cliques are in your graph. The idea here is that you have a maximal clique of size k, then all its subsets of size 5 are 5-cliques. This means that it is enough to search for the maximal cliques, keep the ones that are at least size 5, and then enumerate all their subsets of size 5. But then you have a different problem as some cliques may be counted more than once.
Update: I have checked the source code for adjacent.triangles and basically all it does is that it loops over all vertices, and for each vertex v it enumerates all the (u, w) pairs of its neighbors, and checks whether u and w are connected. If so, there is a triangle adjacent on vertex v. This is an O(nd2) operation if you have n vertices and the average degree is d, but it does not generalize to groups of vertices of arbitrary size (because you would need to hardcode k-1 nested for loops in the code for a group of size k).

Representing a graph where nodes are also weights

Consider a graph that has weights on each of its nodes instead of between two nodes. Therefore the cost of traveling to a node would be the weight of that node.
1- How can we represent this graph?
2- Is there a minimum spanning path algorithm for this type of graph (or could we modify an existing algorithm)?
For example, consider a matrix. What path, when traveling from a certain number to another, would produce a minimum sum? (Keep in mind the graph must be directed)
if one don't want to adjust existing algorithms and use edge oriented approaches, one could transform node weights to edge weights. For every incoming edge of node v, one would save the weight of v to the edge. Thats the representation.
well, with the approach of 1. this is now easy to do with well known algorithms like MST.
You could also represent the graph as wished and hold the weight at the node. The algorithm simply didn't use Weight w = edge.weight(); it would use Weight w = edge.target().weight()
simply done. no big adjustments are necessary.
if you have to use adjacency matrix, you need a second array with node weights and in adjacency matrix are just 0 - for no edge or 1 - for an edge.
hope that helped

Resources