I am interested in deriving dominance metrics (as in a dominance hierarchy) for nodes in a dominance directed graph, aka a tournament graph. I can use R and the package igraph to easily construct such graphs, e.g.
library(igraph)
create a data frame of edges
the.froms <- c(1,1,1,2,2,3)
the.tos <- c(2,3,4,3,4,4)
the.set <- data.frame(the.froms, the.tos)
set.graph <- graph.data.frame(the.set)
plot(set.graph)
This plotted graph shows that node 1 influences nodes 2, 3, and 4 (is dominant to them), that 2 is dominant to 3 and 4, and that 3 is dominant to 4.
However, I see no easy way to actually calculate a dominance hierarchy as in the page: https://www.math.ucdavis.edu/~daddel/linear_algebra_appl/Applications/GraphTheory/GraphTheory_9_17/node11.html . So, my first and main question is does anyone know how to derive a dominance hierarchy/node-based dominance metric for a graph like this using some hopefully already coded solution in R?
Moreover, in my real case, I actually have a sparse matrix that is missing some interactions, e.g.
incomplete.set <- the.set[-2, ]
incomplete.graph <- graph.data.frame(incomplete.set)
plot(incomplete.graph)
In this plotted graph, there is no connection between species 1 and 3, however making some assumptions about transitivity, the dominance hierarchy is the same as above.
This is a much more complicated problem, but if anyone has any input about how I might go about deriving node-based metrics of dominance for sparse matrices like this, please let me know. I am hoping for an already coded solution in R, but I'm certainly MORE than willing to code it myself.
Thanks in advance!
Not sure if this is perfect or that I fully understand this, but it seems to work as it should from some trial and error:
library(relations)
result <- relation_consensus(endorelation(graph=the.set),method="Borda")
relation_class_ids(result)
#1 2 3 4
#1 2 3 4
There are lots of potential options for method= for dealing with ties etc - see ?relation_consensus for more information. Using method="SD/L" which is a linear order might be the most appropriate for your data, though it can suggest multiple possible solutions due to conflicts in more complex examples. For the current simple data this is not the case though - try:
result <- relation_consensus(endorelation(graph=the.set),method="SD/L",
control=list(n="all"))
result
#An ensemble of 1 relation of size 4 x 4.
lapply(result,relation_class_ids)
#[[1]]
#1 2 3 4
#1 2 3 4
Methods of dealing with this are again provided in the examples in ?relation_consensus.
Related
I have a challenge. This may be little tricky or even not possible but wanted to check if anyone has any thoughts on this?
PS : This question is in general and not related to only to R. May be I can say its general mathematics
I have a data
df
ColA ColB ColC
6 9 27
1 4 32
4 8 40
If you observe closely, there is some relationship between these columns.
Example, (ColC/ColB)+ColA will give you number 9.
df
ColA ColB ColC ColD
6 9 27 9
1 4 32 9
4 8 40 9
However this data is manipulated and I made sure there is some relation.
But in general, lets us take any numbers, is there a way to find if there is any relationship between these numbers. Need not be (ColC/ColB)+ColA . It could be anything.
Say we have 5 columns of numeric data. I need to find mathematical operation between these so that common number exists.
This is more into mathematics(algebra).
Can anyone let me know is this even possible ?
For some types of relationships this is doable. But when such a method fails to find a relationship, it typically just means there could be a relationship of a kind not covered by your approach.
One common tool for finding relationships is linear algebra, and linear dependencies in particular. Write your data in a matrix like you did. Consider that a linear equation
a*ColA + b*ColB + c*ColC = 0
Use standard techniques such as Gaussian elimination to find coefficients a, b, c which satisfy this equation but are not all zero themselves. You probably can find a library to compute the kernel of a matrix which you can use for that. Now you know whether one of the columns can be expressed as a linear combination of the other two.
This is a very limited class of relationships, and doesn't cover your example yet. But you can improve it by including more columns. Include a column with ones everywhere to allow for a constant term in your formula. Include all pair wise products.
x + a*ColA + b*ColB + c*ColC + ab*ColA*ColB + ac*ColA*ColC + bc*ColB*ColC + aa*ColA^2 + bb*ColB^2 + cc*ColC^2 = 0
Now for your data this could tell you that there is a solution of the form
b=-9 c=1 ab=1 x=a=ac=bc=aa=bb=cc=0
-9*ColB + ColC + ColA*ColB = 0
which is equivalent to the relationship you described in your question.
But also observed that you are now using 3 data points to determine 10 variables. So this one relationship is by far not the only one.
In general you want at least as many data points as you have variables in your equation. You want at least as many rows as you have columns in your extended matrix. Only then can you say that a relationship between them us indeed a property of the underlying data and not merely an artifact of having too much flexibility and too little data.
In R you might want to look into using linear models for determining coefficients in the presence of imprecise data. You can also use powers of formulas to include all interactions between columns, i.e. those higher degree terms which I included above as well.
I'm using the R package mclust to estimate the number of clusters in my data and get this result:
Clustering table:
2 7 8 9
205693 4465 2418 91
Warning messages:
1: In map(z) : no assignment to 1,3,4,5,6
2: In map(z) : no assignment to 1,3,4,5,6
I have 9 clusters as the best, but it has no assignment to 5 of the clusters.
So does this mean I want to use 9 or 5 clusters?
If the answer can be found somewhere online, a link would be appreciated. Thanks in advance.
Most likely, the method just did not work at all on your data...
You may try other seeds, because when you "lose" clusters (i.e. they become empty) this usually means your seeds were not chosen well enough. And your cluster 9 is also pretty much gone, too.
However, if your data is actually generated by a mixture of Gaussians, it's hard to find such a bad starting point... so most likely, all of your results are bad, because the data does not satisfy your assumptions.
Judging from your cluster sizes, I'd say you have 1 cluster and a lot of noise...
Have you visualized and validated the results?
Don't blindly follow some number. Validate.
I have been trying to build this program or find out how to access what KKNN does to produce its results. I am using the KKNN function and package to help predict future baseball stats. It takes in 11 predictor variables (previous 3 year stats, PA and level, along with age and another predictor). The predictions work great but what I am hoping to do is when I am predicting only one player (as this would be ridiculous while predicting 100s of players), I would like to see maybe the 3 closest neighbors to the player in question and their previous stats with what they produced the next year. I am most concerned with the name of the nearest neighbors as knowing which players are closest will give context to the prediction that it makes.
I am fine with trying to edit the actual code to the function if that is the only way to get at these. Even finding the indices would be helpful as I can backsolve from there to get the names. Thank you so much for all of your help!
Here is some sample code that should help:
name=c("McGwire,Mark","Bonds,Barry","Helton,Todd","Walker,Larry","Pujols,Albert","Pedroia,Dustin")
z
lag1=c(100,90,75,89,95,70)
lag2=c(120,80,95,79,92,90)
Runs=c(65,120,105,99,65,100)
full=cbind(name,lag1,lag2,Runs)
full=data.frame(full)
learn=full
learn
learn$lag1=as.numeric(as.character(learn$lag1))
learn$lag2=as.numeric(as.character(learn$lag2))
learn$Runs=as.numeric(as.character(learn$Runs))
valid=learn[5,]
learn=learn[-5,]
valid
k=kknn(Runs~lag1+lag2,learn,valid,k=2,distance=1)
summary(k)
fit=fitted(k)
fit
Here is the function that I am actually calling if that helps you tailor your answers for workarounds!
kknn(RVPA~(lag1*lag1LVL*lag1PA)+(lag2*lag2LVL*lag2PA)+(lag3*lag3LVL*lag3PA)+Age1+PAsize, RV.learn, RV.valid,k=86, distance = 1,kernel = "optimal")
Here's a slightly modified version of your example:
full= data.frame(
name=c("McGwire,Mark","Bonds,Barry","Helton,Todd","Walker,Larry","Pujols,Albert","Pedroia,Dustin"),
lag1=c(100,90,75,89,95,70),
lag2=c(120,80,95,79,92,90),
Runs=c(65,120,105,99,65,100)
)
library(kknn)
train=full[full$name!="Bonds,Barry",]
test=full[full$name=="Bonds,Barry",]
k=kknn(Runs~lag1+lag2,train=train, test=test,k=2,distance=1)
This predicts Bonds to have 80.2 runs. The Runs variable acts like a class label and if you call k$CL you'll get back 65 and 99 (the number of runs corresponding to the two nearest neighbors). There are two players (McGwire, Pujols) with 65 runs and one with 99, so you can't tell directly who the neighbors are. It appears that the output for kknn does not include a list of the nearest neighbors to the test set (though you could probably back it out from the various outputs).
The FNN package, however, will let you do a query against your training data in the way you want:
library(FNN)
get.knnx(data=train[,c("lag1","lag2")], query=test[,c("lag1","lag2")],k=2)
$nn.index
[,1] [,2]
[1,] 3 4
$nn.dist
[,1] [,2]
[1,] 1.414214 13
train[c(3,4),"name"]
[1] Walker,Larry Pujols,Albert
So nearest neighbors to Bonds are Pujols and Walker.
I'm working on a dataset that consists of ~10^6 values which clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering, but keeping bin size constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. It's complexity should be primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list( one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:16])
As mentioned above the "no.of.randomizations" may have been the number of repeated applications of this proces, in which case you may want to wrap replicate around that:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector of a uniqe permutation of "data" and then split it into a list of vectors of lengths "sizes" by going via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then by provided as argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
I want to differentiate data vectors to find those that are similar. For example:
A=[4,5,6,7,8];
B=[4,5,6,6,8];
C=[4,5,6,7,7];
D=[1,2,3,9,9];
E=[1,2,3,9,8];
In the previous example I want to distinguish that A,B,C vectors are similar (not the same) to each other and D,E are similiar to each other. The result should be something like: A,B,C are similar and D,E are similar, but the group A,B,C is not similar to the group of D,E. Matlab can do this?
I was thinking using some classification algorithm or Kmeans,ROC,etc.. but I'm not sure which one will be the best one.
Any suggestion? Thanks in advance
One of my new favourite methods for this sort of thing is agglomerate clustering.
First, concatenate all your vectors into a matrix, where each row is a separate vector. This makes such methods much easier to use:
F = [A; B; C; D; E];
Then the linkages can be found:
Z = linkage(F, 'ward', 'euclidean');
This can be plotted using:
dendrogram(Z);
This shows a tree, where each leaf at the bottom is one of the original vectors. Lengths of the branches show similarities and dissimilarities.
As you can see, 1, 2 and 3 are shown to be very close, as are 4 and 5. This even gives a measure of closeness, and shows that vectors 1 and 3 are deemed to be closer than vectors 2 and 3 (in the sense that, percentagewise, 7 is closer to 8 than 6 is to 7).
If all the vectors you are comparing are of the same length, a suitable norm on pairwise differences may well be enough. The norm to choose will depend on your particular criteria of closeness, of course, but with the examples you show, simply summing the absolute values of the components of the pairwise differences gives:
A B C D E
A 0 1 1 12 11
B 0 2 13 12
C 0 13 12
D 0 1
E 0
which doesn't need a particularly well-tuned threshold to work.
You can use pdist(), this function gives you the pairwise distances.
Various distance (opposite of similarity) metrics are already implemented, 'euclidean' seems appropriate for your situation, although you may want to try out the effect of different metrics.
Here it goes the solution I propose based on your results:
Z = [A;B;C;D;E];
Y = pdist(Z);
matrix = SQUAREFORM(Y);
matrix_round = round(matrix);
Now that we have the vector we can set the threshold based on the maximun value and decide with which theshold is the most appropriate.
It would be nice to create some cluster plot showing the differences between them.
Best regards