Finding if there relationship between numbers - r

I have a challenge. This may be little tricky or even not possible but wanted to check if anyone has any thoughts on this?
PS : This question is in general and not related to only to R. May be I can say its general mathematics
I have a data
df
ColA ColB ColC
6 9 27
1 4 32
4 8 40
If you observe closely, there is some relationship between these columns.
Example, (ColC/ColB)+ColA will give you number 9.
df
ColA ColB ColC ColD
6 9 27 9
1 4 32 9
4 8 40 9
However this data is manipulated and I made sure there is some relation.
But in general, lets us take any numbers, is there a way to find if there is any relationship between these numbers. Need not be (ColC/ColB)+ColA . It could be anything.
Say we have 5 columns of numeric data. I need to find mathematical operation between these so that common number exists.
This is more into mathematics(algebra).
Can anyone let me know is this even possible ?

For some types of relationships this is doable. But when such a method fails to find a relationship, it typically just means there could be a relationship of a kind not covered by your approach.
One common tool for finding relationships is linear algebra, and linear dependencies in particular. Write your data in a matrix like you did. Consider that a linear equation
a*ColA + b*ColB + c*ColC = 0
Use standard techniques such as Gaussian elimination to find coefficients a, b, c which satisfy this equation but are not all zero themselves. You probably can find a library to compute the kernel of a matrix which you can use for that. Now you know whether one of the columns can be expressed as a linear combination of the other two.
This is a very limited class of relationships, and doesn't cover your example yet. But you can improve it by including more columns. Include a column with ones everywhere to allow for a constant term in your formula. Include all pair wise products.
x + a*ColA + b*ColB + c*ColC + ab*ColA*ColB + ac*ColA*ColC + bc*ColB*ColC + aa*ColA^2 + bb*ColB^2 + cc*ColC^2 = 0
Now for your data this could tell you that there is a solution of the form
b=-9 c=1 ab=1 x=a=ac=bc=aa=bb=cc=0
-9*ColB + ColC + ColA*ColB = 0
which is equivalent to the relationship you described in your question.
But also observed that you are now using 3 data points to determine 10 variables. So this one relationship is by far not the only one.
In general you want at least as many data points as you have variables in your equation. You want at least as many rows as you have columns in your extended matrix. Only then can you say that a relationship between them us indeed a property of the underlying data and not merely an artifact of having too much flexibility and too little data.
In R you might want to look into using linear models for determining coefficients in the presence of imprecise data. You can also use powers of formulas to include all interactions between columns, i.e. those higher degree terms which I included above as well.

Related

Calculation of possible combinations if the number of possible properties and their possible states is unknown

I have an unknown number of properties, and each property has an unknown number of possible states. How can I calculate the number of possible combinations?
It's hard enough for me to formulate it mathematically. That's why I can't get it into my code.
If all properties could have the same number of states, the number of possible combinations would be simply number_of_possible_combinations = number_possible_states^number_possible_properties.
However, that is not the case.
A coded example would be helpful, or a mathematical formula.
Just multiply all the possible number of states; for example
3 states
2 states
11 states
gives a total of 2 * 3 * 11 = 66 possible combinations
The case where the number of states is fixed is just a special case of this formula.
In mathematical terms is the product of the cardinality of the sets of states:

How to compare the difference between 2 factors in r

Let's say that I have 2 words, Happy and Happiness, and I store them each as a factor. I want to find the difference of those 2 factors, so I will preferably want a function that will spit out yiness, inessy, iness and y in any combination, just iness, or just y.
Let's also consider if the code would work for the 2 phrases:
al,y,a€a%f,a,s$lf,askdʇjg,asfg
and
al,y,a€a%f,a,s$lf,askd/879,876/jg,asfg
both without spaces, notice the only difference is ʇ is replaced for /879,876/.
Many thanks in advance!

Dominance-directed (tournament) graph metrics

I am interested in deriving dominance metrics (as in a dominance hierarchy) for nodes in a dominance directed graph, aka a tournament graph. I can use R and the package igraph to easily construct such graphs, e.g.
library(igraph)
create a data frame of edges
the.froms <- c(1,1,1,2,2,3)
the.tos <- c(2,3,4,3,4,4)
the.set <- data.frame(the.froms, the.tos)
set.graph <- graph.data.frame(the.set)
plot(set.graph)
This plotted graph shows that node 1 influences nodes 2, 3, and 4 (is dominant to them), that 2 is dominant to 3 and 4, and that 3 is dominant to 4.
However, I see no easy way to actually calculate a dominance hierarchy as in the page: https://www.math.ucdavis.edu/~daddel/linear_algebra_appl/Applications/GraphTheory/GraphTheory_9_17/node11.html . So, my first and main question is does anyone know how to derive a dominance hierarchy/node-based dominance metric for a graph like this using some hopefully already coded solution in R?
Moreover, in my real case, I actually have a sparse matrix that is missing some interactions, e.g.
incomplete.set <- the.set[-2, ]
incomplete.graph <- graph.data.frame(incomplete.set)
plot(incomplete.graph)
In this plotted graph, there is no connection between species 1 and 3, however making some assumptions about transitivity, the dominance hierarchy is the same as above.
This is a much more complicated problem, but if anyone has any input about how I might go about deriving node-based metrics of dominance for sparse matrices like this, please let me know. I am hoping for an already coded solution in R, but I'm certainly MORE than willing to code it myself.
Thanks in advance!
Not sure if this is perfect or that I fully understand this, but it seems to work as it should from some trial and error:
library(relations)
result <- relation_consensus(endorelation(graph=the.set),method="Borda")
relation_class_ids(result)
#1 2 3 4
#1 2 3 4
There are lots of potential options for method= for dealing with ties etc - see ?relation_consensus for more information. Using method="SD/L" which is a linear order might be the most appropriate for your data, though it can suggest multiple possible solutions due to conflicts in more complex examples. For the current simple data this is not the case though - try:
result <- relation_consensus(endorelation(graph=the.set),method="SD/L",
control=list(n="all"))
result
#An ensemble of 1 relation of size 4 x 4.
lapply(result,relation_class_ids)
#[[1]]
#1 2 3 4
#1 2 3 4
Methods of dealing with this are again provided in the examples in ?relation_consensus.

Does R randomForest's rfcv method actually say which features it selected, or not?

I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,
> data(fgl, package="MASS")
> tst <- rfcv(trainx = fgl[,-10], trainy = fgl[,10], scale = "log", step=0.7)
> tst$error.cv
9 6 4 3 2 1
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458
In this case, if I understand the result correctly, it seems that we can remove three variables without negative side effects. However,
> attributes(tst)
$names
[1] "n.var" "error.cv" "predicted"
None of these slots tells me what those first three variables that can be harmlessly removed from the dataset actually were.
I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
As you probably found out, this code
rf<-randomForest(type ~ .,data=fgl)
importance(rf)
gives you the relative importance of each of the variables.

Determine how different are some vectors

I want to differentiate data vectors to find those that are similar. For example:
A=[4,5,6,7,8];
B=[4,5,6,6,8];
C=[4,5,6,7,7];
D=[1,2,3,9,9];
E=[1,2,3,9,8];
In the previous example I want to distinguish that A,B,C vectors are similar (not the same) to each other and D,E are similiar to each other. The result should be something like: A,B,C are similar and D,E are similar, but the group A,B,C is not similar to the group of D,E. Matlab can do this?
I was thinking using some classification algorithm or Kmeans,ROC,etc.. but I'm not sure which one will be the best one.
Any suggestion? Thanks in advance
One of my new favourite methods for this sort of thing is agglomerate clustering.
First, concatenate all your vectors into a matrix, where each row is a separate vector. This makes such methods much easier to use:
F = [A; B; C; D; E];
Then the linkages can be found:
Z = linkage(F, 'ward', 'euclidean');
This can be plotted using:
dendrogram(Z);
This shows a tree, where each leaf at the bottom is one of the original vectors. Lengths of the branches show similarities and dissimilarities.
As you can see, 1, 2 and 3 are shown to be very close, as are 4 and 5. This even gives a measure of closeness, and shows that vectors 1 and 3 are deemed to be closer than vectors 2 and 3 (in the sense that, percentagewise, 7 is closer to 8 than 6 is to 7).
If all the vectors you are comparing are of the same length, a suitable norm on pairwise differences may well be enough. The norm to choose will depend on your particular criteria of closeness, of course, but with the examples you show, simply summing the absolute values of the components of the pairwise differences gives:
A B C D E
A 0 1 1 12 11
B 0 2 13 12
C 0 13 12
D 0 1
E 0
which doesn't need a particularly well-tuned threshold to work.
You can use pdist(), this function gives you the pairwise distances.
Various distance (opposite of similarity) metrics are already implemented, 'euclidean' seems appropriate for your situation, although you may want to try out the effect of different metrics.
Here it goes the solution I propose based on your results:
Z = [A;B;C;D;E];
Y = pdist(Z);
matrix = SQUAREFORM(Y);
matrix_round = round(matrix);
Now that we have the vector we can set the threshold based on the maximun value and decide with which theshold is the most appropriate.
It would be nice to create some cluster plot showing the differences between them.
Best regards

Resources