For loop number path calculation [closed] - r

I have a bidirectional network, that is, a network where flow exists both from i->j and j->i. I want to calculate the number of simple paths between each pair [i,j] and report the counts in one matrix per path length: a reporting matrix with the number of simple paths of length 2 between i and j, another with the number of simple paths of length 3 between i and j, and so on.
The solution I found was to write code that looks at the original input matrix and searches for paths of length x from i->n by following the connections of i to the other variables, then from these to the remaining variables excluding i, and so on for x+1 variables until reaching n. E.g., for paths of length two it looks for an i->x connection and an x->n connection. If both exist, there is a simple path of length two between i and n. Left like this, when analysing bidirectional matrices or matrices with self-loops, the code would count self-loops as simple paths and pass through the same vertex more than once. To solve this problem, the conditions in the code need one more check. This check restricts the assignment of the variables of the original matrix to our general variables: when assigning a new general variable during the path search, it cannot be a variable already assigned to another general variable in that same path search:
* when looking for a path of length 2 between i and n, the variable assigned to x cannot be the one already assigned to i (this stops self-loops from being counted as paths), and in the same way n cannot be assigned a variable already used by either i or x (this eliminates the reporting of cases such as i->x->i as paths of length 2, and also the reporting of paths that pass through the same variable more than once [e.g. i->x->x2->i for paths of length 3]). So the code I use is basically this:
#the adjacency matrix
> MM<-matrix(c(1,1,0,0,0,1,1,1,1,0,0,1,1,1,0,0,1,0,1,1,0,0,0,1,1), 5, byrow=T)
> colnames(MM)<-c("A", "B", "C", "D", "E")
> row.names(MM)=colnames(MM)
> MM
  A B C D E
A 1 1 0 0 0
B 1 1 1 1 0
C 0 1 1 1 0
D 0 1 0 1 1
E 0 0 0 1 1
#this is the reporting matrix where the results will be reported
> MMres2<-matrix(rep(0,length(MM)), sqrt(length(MM)))
> colnames(MMres2)=colnames(MM)
> row.names(MMres2)=row.names(MM)
#this is the code for the calculation and report of simple paths of length 2
> for(i in seq_len(nrow(MM))){
    for(j in seq_len(nrow(MM))){
      for(k in seq_len(nrow(MM))){
        #i->j and j->k must exist, and i, j, k must all be distinct
        if(MM[i,j]==1 && MM[j,k]==1 && j!=i && k!=i && k!=j){
          MMres2[i,k]=MMres2[i,k]+1
        }
      }
    }
  }
#the reported results
> MMres2
  A B C D E
A 0 0 1 1 0
B 0 0 0 1 2
C 1 1 0 2 1
D 1 0 1 0 0
E 0 1 0 0 0
If I want to calculate the number of simple paths of length 3 between any i->n, I just need to add the condition [x2,n]==1 and make sure the new variable is not equal to any of the previously assigned ones.
And here, at last, lies my problem. I don't want to calculate only the number of paths of length 2, 3 or 4, but of every possible length (the maximum possible length of a path is the total number of variables minus 1). Obviously, writing separate code for each path length x would be cumbersome, and more so for matrices with an ever higher number N of variables. Ideally, a single piece of code would go through all pairs i and j and calculate the number of paths between them for every possible number of links per path, up to tot.var-1 links (the maximum number of links on a path between any pair i and j).
Take again the M2 matrix: the ideal code would look for the existence of a link between i and a variable x and then between a variable x’ and j, and whenever the condition is met, it would report the result each time a path was found:
[i,x]==1 & [x’,j]==1 -> Res.mat[i,j] + 1
where x and x’ stand for any variables (and any number of them) between i and j.
What distinguishes this approach from the original one above is that here x can stand for several variables: when looking for a path of 2 links, x is one variable; when looking for a path of 3 links, x is two variables; and so forth.
E.g.:
For a path of length 2:
[i,xa]==1 & [xa,j]==1 -> Res.mat2[i,j] +1
For a path of length 3:
[i,xa]==1 & [xa,xb]==1 & [xb,j]==1 -> Res.mat3[i,j] +1
For a path of length 4:
[i,xa]==1 & [xa,xb]==1 & [xb,xc]==1 & [xc,j]==1 -> Res.mat4[i,j] +1
In this code, x would progressively take on all the other variables excluding i and j, and each path found would be reported in the respective reporting matrix: paths of length two in the length-2 reporting matrix, and so on.
Sorry for the very, very long post. I have been searching for this for a long time and have talked with colleagues, but no one seems to understand the problem or be able to help, which is why I wrote a long post to be as clear as possible.
So, does anyone know how I can do this?
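The generalisation asked for above can be sketched with a depth-first search that extends a path one link at a time and never revisits a vertex. This is only a minimal sketch in Python, not the asker's R code: the function name count_simple_paths and the dictionary-of-matrices return value are assumptions made for illustration. The number of simple paths can grow exponentially with the number of variables, so this is only practical for small matrices.

```python
def count_simple_paths(adj):
    """Count simple paths by number of links.

    adj is a square 0/1 adjacency matrix (list of lists).
    Returns {length: matrix}, where matrix[i][j] is the number of
    simple paths from i to j that use exactly `length` links.
    """
    n = len(adj)
    # One reporting matrix per possible length: 1 .. n-1 links.
    counts = {length: [[0] * n for _ in range(n)] for length in range(1, n)}

    def extend(start, current, visited, length):
        # Try every vertex reachable from `current` that is not yet on the path.
        for nxt in range(n):
            if adj[current][nxt] == 1 and nxt not in visited:
                counts[length + 1][start][nxt] += 1
                visited.add(nxt)
                extend(start, nxt, visited, length + 1)
                visited.remove(nxt)  # backtrack

    for start in range(n):
        extend(start, start, {start}, 0)
    return counts


# The adjacency matrix MM from the question (rows/columns A..E).
MM = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
paths = count_simple_paths(MM)
print(paths[2][0])  # row A of the length-2 reporting matrix
print(paths[3][0])  # row A of the length-3 reporting matrix
```

Because the start vertex is placed in the visited set before the search begins, self-loops and returns to the start are excluded without any extra conditions.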

Related

Is there a closed form available for the following table?

Below is a table which has a recursive relation as current cell value is the sum of the upper and left cell.
I want to find the odd positions for any given row denoted by v(x) as represented in the first column.
Currently, I am maintaining two one-dimensional arrays which I update with the new sum values, literally checking whether each position's value is odd or even.
Is there a closed form that would let me say directly which positions are odd (say, for the 4th row, it should tell me that p1 and p4 are the odd places)?
Since it is following a particular pattern I feel very certain that a closed form should exist which would mathematically tell me the positions rather than calculating each value and checking it.
The numbers that you're looking at are the numbers in Pascal's triangle, just rotated ninety degrees. You more typically see it written out like this:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
...
You're cutting Pascal's triangle into diagonal stripes running down the left (or right, depending on your perspective) side, and the question you're asking is how to find the positions of the odd numbers in each stripe.
There's a mathematical result called Lucas's theorem which is useful for determining whether a given entry in Pascal's triangle is even or odd. The entry in row m, column n of Pascal's triangle is given by (m choose n), and Lucas's theorem says that (m choose n) mod 2 (1 if the number is odd, 0 otherwise) can be found by comparing the bits of m and n. If n has a bit that's set in a position where m doesn't have that bit set, then (m choose n) is even. Otherwise, (m choose n) is odd.
As an example, let's try (5 choose 3). The bits in 5 are 101. The bits in 3 are 011. Since the 2's bit of 3 is set and the 2's bit of 5 is not set, the quantity (5 choose 3) should be even. And since (5 choose 3) = 10, we see that this is indeed the case!
In pseudocode using bitwise operators, you essentially want the following:
if ((~m & n) != 0) {
    // Row m, entry n is even.
} else {
    // Row m, entry n is odd.
}
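The bit test above can be checked against directly computed binomial coefficients. The following is a small Python sketch; the function names binomial_is_odd and odd_positions are made up for illustration, and the indices refer to row m, entry n of Pascal's triangle (both starting at 0), not to the v(x)/p labels of the question's table, whose exact layout is not reproduced here.

```python
from math import comb


def binomial_is_odd(m, n):
    """Lucas's theorem mod 2: C(m, n) is odd iff every set bit of n is also set in m."""
    return (n & ~m) == 0


def odd_positions(m):
    """Entries n (0 <= n <= m) of row m of Pascal's triangle that are odd."""
    return [n for n in range(m + 1) if binomial_is_odd(m, n)]


# Cross-check the bit test against the actual binomial coefficients.
for m in range(32):
    for n in range(m + 1):
        assert binomial_is_odd(m, n) == (comb(m, n) % 2 == 1)

print(odd_positions(5))  # row 1 5 10 10 5 1 -> [0, 1, 4, 5]
```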

Generate sets from given overlap matrix

Note: I edited the original question to explain more precisely.
While running a simulation for my new method, I needed to generate a special type of dataset consisting of multiple subsets. The problem is that some variables are "shared" across the subsets, and the number of shared variables is called "overlap" here. Since the distribution of the overlap proportion is given, I need to generate a list of variables whose overlap follows the given distribution. But I have failed to implement such an algorithm...
I am not sure whether there is a specific algorithm for this kind of problem, but I have not found one after a long search.
I prefer an R solution, but anything else would also be very much appreciated. Please help me to solve this problem! Thank you so much in advance!
Below is a standardized explanation of my problem. I tried to explain it as generally as I can, but please make a suggestion if it is not sufficient.
Purpose: Generate n sets from given overlap matrix of elements. Each set contains k elements.
Input: An n*n matrix whose (i,j)th cell value is the number of elements shared by the (i)th set and the (j)th set.
Output: A list of k element identifiers (anything can be used, such as numbers) for each of the n sets.
Assumption: The number of elements in each set is k, and it is the same across all n sets. Hence, the input matrix is symmetric.
Example (assumes k=3 and n=3)
Input
3 1 0
1 3 1
0 1 3
Output
Set 1: A B C
Set 2: A D E
Set 3: D F G
In the above example input, (1,2)th and (2,1)th cells are 1 because set 1 and 2 share "A" element and vice versa, and diagonal cells are 3(=k) because each set shares all elements with itself.
I would repeat the following process until I had accounted for all the matrix entries:
1) Treat the matrix as the adjacency matrix of a graph, and find the largest clique in it. That is, find the largest possible set S of indexes such that M(i,j) > 0 for all i, j in S.
2) Create an item that is in all of the sets which correspond to the indexes in S. In fact, if the minimum value of M(i,j) over S is v, create v such items.
3) Subtract v from M(i,j) for all i, j in S, accounting for the counts generated by the items you have just created.
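The three steps above might be sketched as follows in Python (the question prefers R, so treat this purely as an illustration). The function name generate_sets, the integer item identifiers, and the brute-force clique search over all index subsets are assumptions; the brute-force search is only feasible for a small number of sets. The diagonal is treated like any other entry, so a single index with a positive diagonal counts as a clique of size one and yields items unique to that set.

```python
from itertools import combinations


def generate_sets(overlap):
    """Greedy clique-based construction of sets from an overlap matrix.

    overlap[i][j] is the required number of items shared by set i and set j
    (overlap[i][i] is the size of set i). Returns one list of item ids per set.
    """
    n = len(overlap)
    m = [row[:] for row in overlap]  # working copy that gets decremented
    sets = [[] for _ in range(n)]
    next_id = 0
    while True:
        # Step 1: largest index set S with m[i][j] > 0 for all i, j in S.
        clique = None
        for size in range(n, 0, -1):
            for candidate in combinations(range(n), size):
                if all(m[i][j] > 0 for i in candidate for j in candidate):
                    clique = candidate
                    break
            if clique is not None:
                break
        if clique is None:
            break  # nothing left that can be accounted for
        # Step 2: create v items shared by every set in S.
        v = min(m[i][j] for i in clique for j in clique)
        for _ in range(v):
            for i in clique:
                sets[i].append(next_id)
            next_id += 1
        # Step 3: subtract v from every entry covered by S.
        for i in clique:
            for j in clique:
                m[i][j] -= v
    return sets


example = [
    [3, 1, 0],
    [1, 3, 1],
    [0, 1, 3],
]
result = generate_sets(example)
for index, items in enumerate(result, start=1):
    print("Set", index, items)
```

This is a greedy heuristic: for some inputs, entries may remain unaccounted for when the loop stops, so it is worth recomputing the overlaps of the result and comparing them with the input.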

Computing the size of UID possibilities

Per DICOM specification, a UID is defined by: 9.1 UID Encoding Rules. In other words the following are valid DICOM UIDs:
"1.2.3.4.5"
"1.3.6.1.4.35045.103501438824148998807202626810206788999"
"1.2.826.0.1.3680043.2.1143.5028470438645158236649541857909059554"
while the following are illegal DICOM UIDs:
".1.2.3.4.5"
"1..2.3.4.5"
"1.2.3.4.5."
"1.2.3.4.05"
"12345"
"1.2.826.0.1.3680043.2.1143.50284704386451582366495418579090595540"
Therefore I know that the string is at most 64 bytes long and should match the regex [0-9\.]+. However, this regex really describes a superset, since there are far fewer than (10+1)^64 (=4457915684525902395869512133369841539490161434991526715513934826241L) possibilities.
How would one precisely compute the number of possibilities that respect the DICOM UID rules?
Reading the org root / suffix rule clearly indicates that I need at least one dot ('.'). The shortest UID is therefore 3 bytes (chars) long, of the form [0-9]\.[0-9], which gives 10x10=100 possibilities for UIDs of length 3.
Looking at the first answer, there seems to be something unclear about:
The first digit of each component shall not be zero unless the
component is a single digit.
What this means is that:
"0.0" is valid
"00.0" or "1.01" are not valid
Thus I would say a proper expression would be:
(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+
Using some simple C code, I could find:
f(0) = 0
f(1) = 0
f(2) = 0
f(3) = 100
f(4) = 1800
f(5) = 27100
f(6) = 369000
f(7) = 4753000
f(8) = 59049000
The validation of the Root UID part is outside the scope of this question. A second validation step could take care of rejecting some OID that cannot possibly be registered (some people mention restriction on first and second arc for example). For simplicity we'll accept all possible (valid) Root UID.
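The small values of f listed above can be reproduced by brute force: enumerate every string of a given length over the digits and the dot, and count those that match the expression. This is a Python sketch rather than the C program mentioned in the question, and the name count_uids_of_length is made up for illustration. It is only usable for short lengths, since it inspects 11^n strings.

```python
import re
from itertools import product

# The expression proposed in the question: two or more dot-separated
# components, each either "0" or a number without a leading zero.
UID_PATTERN = re.compile(r"(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+")


def count_uids_of_length(n):
    """Count strings of exactly n characters over [0-9.] that match UID_PATTERN."""
    alphabet = "0123456789."
    return sum(
        1
        for chars in product(alphabet, repeat=n)
        if UID_PATTERN.fullmatch("".join(chars))
    )


for n in range(6):
    print("f(%d) = %d" % (n, count_uids_of_length(n)))
```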
While my other answer takes good care of this specific application, here is a more generic approach. It takes care of situations where you have a different regular expression describing the language in question. It also allows for considerably longer string lengths, since it only requires O(log n) arithmetic operations to compute the number of combinations for strings of length up to n. In this case the number of strings grows so quickly that the cost of these arithmetic operations will grow dramatically, but that may not be the case for other, otherwise similar situations.
Build a finite state automaton
Start with a regular expression description of your language in question. Translate that regular expression into a finite state automaton. In your case the regular expression can be given as
(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+
The automaton could look like this:
Eliminate ε-transitions
This automaton usually contains ε-transitions (i.e. state transitions which do not correspond to any input character). Remove those, so that one transition corresponds to one character of input. Then add an ε-transition to the accepting state(s). If the accepting states have other outgoing transitions, don't add ε-loops to them, but instead add an ε-transition to an accepting state with no outgoing edges and then add the loop to that. This can be seen as padding the input with ε at its end, without allowing ε in the middle. Taken together, this transformation ensures that performing exactly n state transitions corresponds to processing an input of n characters or less. The modified automaton might look like this:
Note that both the construction of the first automaton from the regular expression and the elimination of ε-transitions can be performed automatically (and perhaps even in a single step). The resulting automata might be more complicated than what I constructed here manually, but the principle is the same.
Ensuring unique paths
You don't have to make the automaton deterministic in the sense that for every combination of source state and input character there is only one target state. That's not the case in my manually constructed one either. But you have to make sure that every complete input has only one possible path to the accepting state, since you'll essentially be counting paths. Making the automaton deterministic would ensure this weaker property, too, so go for that unless you can ensure unique paths without this. In my example the length of each component clearly dictates which path to use, so I didn't make it deterministic. But I've included an example with a deterministic approach at the end of this post.
Build transition matrix
Next, write down the transition matrix. Associate the rows and columns with your states (in order a, b, c, d, e, f in my example). For each arrow in your automaton, write the number of characters included in the label of that arrow in the column associated with the source state and the row associated with the target state of that arrow.
⎛ 0 0 0 0 0 0⎞
⎜ 9 10 0 0 0 0⎟
⎜10 10 0 10 10 0⎟
⎜ 0 0 1 0 0 0⎟
⎜ 0 0 0 9 10 0⎟
⎝ 0 0 0 10 10 1⎠
Read result off that matrix
Multiplying this matrix by a column vector once has the following meaning: if the number of possible ways to arrive in a given state is encoded in the input vector, the output vector gives you the number of ways one transition later. Take the 64th power of that matrix, concentrate on the first column (since the start situation is encoded as (1,0,0,0,0,0), meaning only one way to end up in the start state) and sum up all the entries that correspond to accepting states (only the last one in this case). The bottom left element of the 64th power of this matrix is
1474472506836676237371358967075549167865631190000000000000000000000
which confirms my other answer.
Compute matrix powers efficiently
In order to actually compute the 64th power of that matrix, the easiest approach would be repeated squaring: after squaring the matrix 6 times you have an exponent of 2^6 = 64. If in some other scenario your exponent (i.e. maximal string length) is not a power of two, you can still perform exponentiation by squaring by multiplying the relevant squares according to the bit pattern of the exponent. This is what makes this approach take O(log n) arithmetic operations to compute the result for string length n, assuming a fixed number of states and therefore fixed cost for each matrix squaring.
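A minimal Python sketch of this computation, using the 6x6 transition matrix given above and plain lists of arbitrary-precision integers. The helper names mat_mul, mat_pow and count_up_to are assumptions for illustration, not part of the original answer.

```python
# Transition matrix from the text: column = source state, row = target state,
# states ordered a, b, c, d, e, f.
M = [
    [0, 0, 0, 0, 0, 0],
    [9, 10, 0, 0, 0, 0],
    [10, 10, 0, 10, 10, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 9, 10, 0],
    [0, 0, 0, 10, 10, 1],
]


def mat_mul(a, b):
    """Multiply two square matrices of Python integers."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(m, exponent):
    """Exponentiation by squaring, driven by the bit pattern of the exponent."""
    size = len(m)
    result = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    base = m
    while exponent > 0:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result


def count_up_to(length):
    """Number of accepted strings with at most `length` characters.

    This is the bottom-left entry of M**length: start state a is column 0,
    accepting state f is row 5.
    """
    return mat_pow(M, length)[5][0]


print(count_up_to(3))   # only strings of the form d.d
print(count_up_to(64))  # all UIDs of length up to 64
```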
Example with deterministic automaton
If you were to make my automaton deterministic using the usual powerset construction, you'd end up with
and sorting the states as a, bc, c, d, cf, cef, f one would get the transition matrix
⎛ 0 0 0 0 0 0 0⎞
⎜ 9 10 0 0 0 0 0⎟
⎜ 1 0 0 0 0 0 0⎟
⎜ 0 1 1 0 1 1 0⎟
⎜ 0 0 0 1 0 0 0⎟
⎜ 0 0 0 9 0 10 0⎟
⎝ 0 0 0 0 1 1 1⎠
and could sum the last three elements of the first column of its 64th power to obtain the same result as above.
Single component
Start by looking for ways to form a single component. The corresponding regular expression for a single component is
0|[1-9][0-9]*
so it is either zero or a non-zero digit followed by arbitrarily many digits. (I had missed the possible sole zero case at first, but the comment by malat made me aware of this.) If the total length of such a component is to be n, and you write h(n) to denote the number of ways to form such a component of length exactly n, then you can compute that h(n) as
h(n) = if n = 1 then 10 else 9 * 10^(n - 1)
where the n = 1 case allows for all possible digits, and the other cases ensure a non-zero first digit.
One or more components
Subsection 9.1 only writes that a UID is a bunch of dot-separated number components, as outlined above. So in regular expressions that would be
(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*
Suppose f(n) is the number of ways to write a UID of length n. Then you have
f(n) = h(n) + sum h(i) * f(n-i-1) for i from 1 to n-2
The first term describes the case of a single component, while the sum takes care of the case where it consists of more than one component. In that case you have a first component of length i, then a dot which accounts for the -1 in the formula, and then the remaining digits form one or more components which is expressed via the recursive use of f.
Two or more components
As the comment by cneller indicates, the part of section 9 before subsection 9.1 indicates that there have to be at least two components. So the proper regular expression would be more like
(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))+
with a + at the end indicating that we want at least one repetition of the parenthesized expression. Deriving an expression for this simply means leaving out the one-component-only case in the definition of f:
g(n) = sum h(i) * f(n-i-1) for i from 1 to n-2
If you sum all the g(n) for n from 3 (the minimal possible UID length) through 64 you get the number of possible UIDs as
1474472506836676237371358967075549167865631190000000000000000000000
or approximately 1.5e66. In terms of absolute difference, this is considerably less than the 4.5e66 you get from your computation, although it's definitely on the same order of magnitude. By the way, your estimate doesn't explicitly mention UIDs shorter than 64, but you can always consider padding them with dots in your setup. I did the computation using a few lines of Python code:
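f = [0]  # f[n]: UIDs of length n with one or more components
g = [0]  # g[n]: UIDs of length n with two or more components
h = [0, 10] + [9 * (10**(n-1)) for n in range(2, 65)]  # h[n]: single components of length n
s = 0    # running total of g[n] over all lengths
for n in range(1, 65):
    x = 0
    if n >= 3:
        # first component of length i, one dot, then n-i-1 remaining characters
        for i in range(1, n - 1):
            x += h[i] * f[n-i-1]
    g.append(x)
    f.append(x + h[n])
    s += x
print(h)
print(f)
print(g)
print(s)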

Determine how different are some vectors

I want to differentiate data vectors to find those that are similar. For example:
A=[4,5,6,7,8];
B=[4,5,6,6,8];
C=[4,5,6,7,7];
D=[1,2,3,9,9];
E=[1,2,3,9,8];
In the previous example I want to detect that vectors A, B and C are similar (not the same) to each other and that D and E are similar to each other. The result should be something like: A, B, C are similar and D, E are similar, but the group A, B, C is not similar to the group D, E. Can MATLAB do this?
I was thinking of using some classification algorithm or k-means, ROC, etc., but I'm not sure which one would be best.
Any suggestions? Thanks in advance
One of my new favourite methods for this sort of thing is agglomerative clustering.
First, concatenate all your vectors into a matrix, where each row is a separate vector. This makes such methods much easier to use:
F = [A; B; C; D; E];
Then the linkages can be found:
Z = linkage(F, 'ward', 'euclidean');
This can be plotted using:
dendrogram(Z);
This shows a tree, where each leaf at the bottom is one of the original vectors. Lengths of the branches show similarities and dissimilarities.
As you can see, 1, 2 and 3 are shown to be very close, as are 4 and 5. This even gives a measure of closeness, and shows that vectors 1 and 3 are deemed to be closer than vectors 2 and 3 (in the sense that, percentagewise, 7 is closer to 8 than 6 is to 7).
If all the vectors you are comparing are of the same length, a suitable norm on pairwise differences may well be enough. The norm to choose will depend on your particular criteria of closeness, of course, but with the examples you show, simply summing the absolute values of the components of the pairwise differences gives:
   A  B  C  D  E
A  0  1  1 12 11
B     0  2 13 12
C        0 13 12
D           0  1
E              0
which doesn't need a particularly well-tuned threshold to work.
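The table of summed absolute differences and a simple threshold grouping can be reproduced in a few lines. This is a Python sketch rather than MATLAB; the threshold value 5 and the helper names l1_distance and group_by_threshold are assumptions chosen for illustration (any threshold between 2 and 11 separates the two groups for these vectors).

```python
vectors = {
    "A": [4, 5, 6, 7, 8],
    "B": [4, 5, 6, 6, 8],
    "C": [4, 5, 6, 7, 7],
    "D": [1, 2, 3, 9, 9],
    "E": [1, 2, 3, 9, 8],
}


def l1_distance(u, v):
    """Sum of the absolute values of the component-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def group_by_threshold(named_vectors, threshold):
    """Put two vectors in the same group when a chain of pairs within `threshold` links them."""
    groups = []
    for name, vec in named_vectors.items():
        # Existing groups that contain at least one vector close to this one.
        close = [
            g for g in groups
            if any(l1_distance(vec, named_vectors[other]) <= threshold for other in g)
        ]
        merged = [name]
        for g in close:
            merged = g + merged
            groups.remove(g)
        groups.append(merged)
    return groups


names = list(vectors)
for a in names:
    print(a, [l1_distance(vectors[a], vectors[b]) for b in names])
print(group_by_threshold(vectors, threshold=5))
```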
You can use pdist(); this function gives you the pairwise distances.
Various distance (the opposite of similarity) metrics are already implemented; 'euclidean' seems appropriate for your situation, although you may want to try out the effect of different metrics.
Here is the solution I propose based on your results:
Z = [A;B;C;D;E];
Y = pdist(Z);
matrix = squareform(Y);
matrix_round = round(matrix);
Now that we have the matrix, we can set the threshold based on the maximum value and decide which threshold is the most appropriate.
It would be nice to create a cluster plot showing the differences between them.
Best regards

Matrix operations to enumerate all paths through n-partite graph

I have an n-partite (undirected) graph, given as an adjacency matrix, for instance this one here:
  a b c d
a 0 1 1 0
b 0 0 0 1
c 0 0 0 1
d 0 0 0 0
I would like to know if there is a set of matrix operations that I can apply to this matrix, which will result in a matrix that "lists" all paths (of length n, i.e. through all the partitions) in this graph. For the above example, there are paths a->b->d and a->c->d. Hence, I would like to get the following matrix as a result:
a b c d
1 1 0 1
1 0 1 1
The first path contains nodes a,b,d and the second one nodes a,c,d. If necessary, the result matrix may have some all-0 lines, as here:
a b c d
1 1 0 1
0 0 0 0
1 0 1 1
0 0 0 0
Thanks!
P.S. I have looked at algorithms for computing the transitive closure, but these usually only tell if there is a path between two nodes, and not directly which nodes are on that path.
One thing you can do is to compute the nth power of your matrix A. The result will tell you how many paths there are of length n from any one vertex to any other.
Now if you're interested in knowing all of the vertices along the path, I don't think that using purely matrix operations is the way to go. Bearing in mind that you have an n-partite graph, I would set up a data structure as follows: (Bear in mind that space costs will be expensive for all but small values.)
Each column will have one entry for each of the nodes in our graph. The n-th column will contain 1 if this node is reachable on the n-th iteration from our designated start vertex or start set, and zero otherwise. Each column entry will also contain a list of back pointers to the vertices in the n-1 column which led to this vertex in the nth column. (This is like the Viterbi algorithm, except that we have to maintain a list of backpointers for each entry rather than just one.) The complexity of doing this is (m^2)*n, where m is the number of vertices in the graph, and n is the length of the desired path.
I'm a little bit confused by your top matrix: with an undirected graph, I would expect the adjacency matrix to be symmetric.
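For a graph as small as the one in the question, the result matrix can also be produced by a plain depth-first search, which is a simpler substitute for the back-pointer table described above rather than an implementation of it. A Python sketch, where the function name enumerate_paths and the choice of a as start and d as end vertex are assumptions for illustration:

```python
def enumerate_paths(adj, start, end):
    """Return one 0/1 row per simple path from start to end.

    Each row has a 1 in the column of every vertex that lies on the path.
    """
    n = len(adj)
    rows = []

    def walk(node, on_path):
        if node == end:
            rows.append([1 if v in on_path else 0 for v in range(n)])
            return
        for nxt in range(n):
            if adj[node][nxt] == 1 and nxt not in on_path:
                walk(nxt, on_path | {nxt})

    walk(start, {start})
    return rows


# Adjacency matrix from the question (rows/columns a, b, c, d).
A = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
for row in enumerate_paths(A, start=0, end=3):
    print(row)
```

The first printed row corresponds to a->b->d and the second to a->c->d, matching the desired result in the question.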
No, there is no pure matrix way to generate all paths. Please use purely combinatorial algorithms.
'One thing you can do is to compute the nth power of your matrix A. The result will tell you how many paths there are of length n from any one vertex to any other.'
The power of a matrix counts walks, not paths.
