I want to differentiate data vectors to find those that are similar. For example:
A=[4,5,6,7,8];
B=[4,5,6,6,8];
C=[4,5,6,7,7];
D=[1,2,3,9,9];
E=[1,2,3,9,8];
In the previous example I want to distinguish that A,B,C vectors are similar (not the same) to each other and D,E are similiar to each other. The result should be something like: A,B,C are similar and D,E are similar, but the group A,B,C is not similar to the group of D,E. Matlab can do this?
I was thinking using some classification algorithm or Kmeans,ROC,etc.. but I'm not sure which one will be the best one.
Any suggestion? Thanks in advance
One of my new favourite methods for this sort of thing is agglomerate clustering.
First, concatenate all your vectors into a matrix, where each row is a separate vector. This makes such methods much easier to use:
F = [A; B; C; D; E];
Then the linkages can be found:
Z = linkage(F, 'ward', 'euclidean');
This can be plotted using:
dendrogram(Z);
This shows a tree, where each leaf at the bottom is one of the original vectors. Lengths of the branches show similarities and dissimilarities.
As you can see, 1, 2 and 3 are shown to be very close, as are 4 and 5. This even gives a measure of closeness, and shows that vectors 1 and 3 are deemed to be closer than vectors 2 and 3 (in the sense that, percentagewise, 7 is closer to 8 than 6 is to 7).
If all the vectors you are comparing are of the same length, a suitable norm on pairwise differences may well be enough. The norm to choose will depend on your particular criteria of closeness, of course, but with the examples you show, simply summing the absolute values of the components of the pairwise differences gives:
A B C D E
A 0 1 1 12 11
B 0 2 13 12
C 0 13 12
D 0 1
E 0
which doesn't need a particularly well-tuned threshold to work.
You can use pdist(), this function gives you the pairwise distances.
Various distance (opposite of similarity) metrics are already implemented, 'euclidean' seems appropriate for your situation, although you may want to try out the effect of different metrics.
Here it goes the solution I propose based on your results:
Z = [A;B;C;D;E];
Y = pdist(Z);
matrix = SQUAREFORM(Y);
matrix_round = round(matrix);
Now that we have the vector we can set the threshold based on the maximun value and decide with which theshold is the most appropriate.
It would be nice to create some cluster plot showing the differences between them.
Best regards
Related
I have a challenge. This may be little tricky or even not possible but wanted to check if anyone has any thoughts on this?
PS : This question is in general and not related to only to R. May be I can say its general mathematics
I have a data
df
ColA ColB ColC
6 9 27
1 4 32
4 8 40
If you observe closely, there is some relationship between these columns.
Example, (ColC/ColB)+ColA will give you number 9.
df
ColA ColB ColC ColD
6 9 27 9
1 4 32 9
4 8 40 9
However this data is manipulated and I made sure there is some relation.
But in general, lets us take any numbers, is there a way to find if there is any relationship between these numbers. Need not be (ColC/ColB)+ColA . It could be anything.
Say we have 5 columns of numeric data. I need to find mathematical operation between these so that common number exists.
This is more into mathematics(algebra).
Can anyone let me know is this even possible ?
For some types of relationships this is doable. But when such a method fails to find a relationship, it typically just means there could be a relationship of a kind not covered by your approach.
One common tool for finding relationships is linear algebra, and linear dependencies in particular. Write your data in a matrix like you did. Consider that a linear equation
a*ColA + b*ColB + c*ColC = 0
Use standard techniques such as Gaussian elimination to find coefficients a, b, c which satisfy this equation but are not all zero themselves. You probably can find a library to compute the kernel of a matrix which you can use for that. Now you know whether one of the columns can be expressed as a linear combination of the other two.
This is a very limited class of relationships, and doesn't cover your example yet. But you can improve it by including more columns. Include a column with ones everywhere to allow for a constant term in your formula. Include all pair wise products.
x + a*ColA + b*ColB + c*ColC + ab*ColA*ColB + ac*ColA*ColC + bc*ColB*ColC + aa*ColA^2 + bb*ColB^2 + cc*ColC^2 = 0
Now for your data this could tell you that there is a solution of the form
b=-9 c=1 ab=1 x=a=ac=bc=aa=bb=cc=0
-9*ColB + ColC + ColA*ColB = 0
which is equivalent to the relationship you described in your question.
But also observed that you are now using 3 data points to determine 10 variables. So this one relationship is by far not the only one.
In general you want at least as many data points as you have variables in your equation. You want at least as many rows as you have columns in your extended matrix. Only then can you say that a relationship between them us indeed a property of the underlying data and not merely an artifact of having too much flexibility and too little data.
In R you might want to look into using linear models for determining coefficients in the presence of imprecise data. You can also use powers of formulas to include all interactions between columns, i.e. those higher degree terms which I included above as well.
Let's say I have a vector V, and I want to either turn this vector into multiple m x n matrices, or get multiple m x n matrices from this Vector V.
For the most basic example: Turn V = collect(1:75) into 3 5x5 matrices.
As far as I am aware this can be done by first using reshape reshape(V, 5, :) and then looping through it. Is there a better way in Julia without using a loop?
If possible, a solution that can easily change between row-major and column-major results is preferrable.
TL:DR
m, n, n_matrices = 4, 2, 5
V = collect(1:m*n*n_matrices)
V = reshape(V, m, n, :)
V = permutedims(V, [2,1,3])
display(V)
From my limited knowledge about Julia:
When doing V = collect(1:m*n), you initialize a contiguous array in memory. From V you wish to create a container of m by n matrices. You can achieve this by doing reshape(V, m, n, :), then you can access the first matrix with V[:,:,1]. The "container" in this case is just another array (thus you have a three dimensional array), which in this case we interpret as "an array of matrices" (but you could also interpret it as a box). You can then transpose every matrix in your array by swapping the first two dimensions like this: permutedims(V, [2,1,3]).
How this works
From what I understand; n-dimensional arrays in Julia are contiguous arrays in memory when you don't do any "skipping" (e.g. V[1:2:end]). For example the 2 x 4 matrix A:
1 3 5 7
2 4 6 8
is in memory just 1 2 3 4 5 6 7 8. You simply interpret the data in a specific way, where the first two numbers makes up the first column, then the second two numbers makes the next column so on so forth. The reshape function simply specifies how you want to interpret the data in memory. So if we did reshape(A, 4, 2) we basically interpret the numbers in memory as "the first four values makes the first column, the second four values makes the second column", and we would get:
1 5
2 6
3 7
4 8
We are basically doing the same thing here, but with an extra dimension.
From my observations it also seems to be that permutedims in this case reallocates memory. Also, feel free to correct me if I am wrong.
Old answer:
I don't know much about Julia, but in Python using NumPy I would have done something like this:
reshape(V, :, m, n)
EDIT: As #BatWannaBe states, the result is technically one array (but three dimensional). You can always interpret a three dimensional array as a container of 2D arrays, which from my understanding is what you ask for.
Note: I edited the original question to explain more precisely.
While I was doing a simulation for my new method, I needed to generate a special type of dataset consists of multiple subset. The problem is that there is some "shared" variables across the subsets, and the number of shared variable is called "overlap" here. Since the distribution of overlap proportion is given, I need to generate an appropriate list of variables and their overlap follows the given distribution. But I have failed to implement such algorithm...
I am not sure whether there is a specific algorithm for this kind of question,
but I have failed to find such thing after a long search.
I prefer R solution, but anything others also will be very appreciated. Please help me to solve this problem! Thank you so much in advance!
The below is a standardized explanation for my problem. I tried to explain as general as possible I can, but please give me any suggestion if it is not sufficient.
Purpose: Generate n sets from given overlap matrix of elements. Each set contains k elements.
Input: There is a n*n matrix whose (i,j)th cell value represents a number of overlapped elements from (i)th set to (j)th set.
Output: A list of k element identifiers (whatever can be used such as number) for n sets.
Assumption: The number of elements for each set is k, and it is same across all n sets. Hence, the input matrix is symmetric.
Example (assumes k=3 and n=3)
Input
3 1 0
1 3 1
0 1 3
Output
Set 1: A B C
Set 2: A D E
Set 3: D F G
In the above example input, (1,2)th and (2,1)th cells are 1 because set 1 and 2 share "A" element and vice versa, and diagonal cells are 3(=k) because each set shares all elements with itself.
I would repeat the following process until I had accounted for all the matrix entries:
1) Treat the matrix as the adjacency matrix of a graph, and find the largest clique in it. That is, find the largest possible set S of indexes such that for all i, j in set S M(i,j) > 0
2) Create an item that is in all of the sets which correspond to the indexes in S - in fact, if the minimum value of M(i,j) = v, create v such items.
3) subtract v from M(i,j) for all i, j in set S, accounting for the counts generated by the items you have just created.
I'd like to understand better how outer works and how to vectorize functions. Below is a minimal example for what I am trying to do:I have a set of numbers 2,3,4. for each combination (a,b) of create a diagonal matrix with a a b b b on the diagonal, and then do something with it, e.g. calculating its determinant (this is just for demonstration purposes). The results of the calculation should be written in a 3 by 3 matrix, one field for each combination.
The code below isn't working - apparently, outer (or my.func) doesn't understand that I don't want the whole lambdas vector to be applied - you can see that this is the case when you uncomment the print command included.
lambdas <- c(1:2)
my.func <- function(lambda1,lambda2){
# print(diag(c(rep(lambda1,2),rep(lambda2,2))))
det(diag(c(rep(lambda1,2),rep(lambda2,2))))
}
det.summary <- outer(lambdas,lambdas, FUN="my.func")
How do I need to modify my function or the call of outer so things behave like I'd like to?
I guess I need to vectorize my function somehow, but I don't know how, and in which way the outer call would be processed differently.
Edit:
I've changed size of the matrices to make it a bit less messy. I'd like to generate 4 diagonal 4 by 4 matrices, with the following diagonals; in are brackets the corresponding parameters lambda1, lambda2:
1 1 1 1 (1,1), 1 1 2 2 (1,2), 2 2 1 1 (2,1), 2 2 2 2 (2,2).
Then, I want to calculate their determinants (which is an arbitrary choice here) and put the results into a matrix, whose first column corresponds to lambda1=1, the second to lambda1=2, and the rows correspond to the choice of lambda2. det.summary should be a 2 by to matrix with the following values:
1 4
4 16
as these are the determinants of the diagonal matrices listed above.
What do you know, there is a Vectorize function (capital "V")!
outer(lambdas,lambdas, Vectorize(my.func))
# [,1] [,2]
# [1,] 1 4
# [2,] 4 16
As you figured out (and as it took me a while to figure out) outer requires the function to be vectorized. In some ways, it is the opposite of the *pply functions which effectively vectorize an operation by feeding the operator/function each value in turn. But this is easily dealt with, as shown above.
I am interested in deriving dominance metrics (as in a dominance hierarchy) for nodes in a dominance directed graph, aka a tournament graph. I can use R and the package igraph to easily construct such graphs, e.g.
library(igraph)
create a data frame of edges
the.froms <- c(1,1,1,2,2,3)
the.tos <- c(2,3,4,3,4,4)
the.set <- data.frame(the.froms, the.tos)
set.graph <- graph.data.frame(the.set)
plot(set.graph)
This plotted graph shows that node 1 influences nodes 2, 3, and 4 (is dominant to them), that 2 is dominant to 3 and 4, and that 3 is dominant to 4.
However, I see no easy way to actually calculate a dominance hierarchy as in the page: https://www.math.ucdavis.edu/~daddel/linear_algebra_appl/Applications/GraphTheory/GraphTheory_9_17/node11.html . So, my first and main question is does anyone know how to derive a dominance hierarchy/node-based dominance metric for a graph like this using some hopefully already coded solution in R?
Moreover, in my real case, I actually have a sparse matrix that is missing some interactions, e.g.
incomplete.set <- the.set[-2, ]
incomplete.graph <- graph.data.frame(incomplete.set)
plot(incomplete.graph)
In this plotted graph, there is no connection between species 1 and 3, however making some assumptions about transitivity, the dominance hierarchy is the same as above.
This is a much more complicated problem, but if anyone has any input about how I might go about deriving node-based metrics of dominance for sparse matrices like this, please let me know. I am hoping for an already coded solution in R, but I'm certainly MORE than willing to code it myself.
Thanks in advance!
Not sure if this is perfect or that I fully understand this, but it seems to work as it should from some trial and error:
library(relations)
result <- relation_consensus(endorelation(graph=the.set),method="Borda")
relation_class_ids(result)
#1 2 3 4
#1 2 3 4
There are lots of potential options for method= for dealing with ties etc - see ?relation_consensus for more information. Using method="SD/L" which is a linear order might be the most appropriate for your data, though it can suggest multiple possible solutions due to conflicts in more complex examples. For the current simple data this is not the case though - try:
result <- relation_consensus(endorelation(graph=the.set),method="SD/L",
control=list(n="all"))
result
#An ensemble of 1 relation of size 4 x 4.
lapply(result,relation_class_ids)
#[[1]]
#1 2 3 4
#1 2 3 4
Methods of dealing with this are again provided in the examples in ?relation_consensus.