How to define the difference between the similarity (e.g. measured with cosine) of entities from two given sets?

I have two groups of entities: a "valid" group, which is expected to contain similar entities, and a "random" group, which is expected to contain less similar entities than the valid group. I have a long list of similarity/distance measures (cosine, Canberra, Manhattan, etc.) that are all computed within each group to produce an average similarity score for its entities. My question is: how should the difference between the scores of the two groups be defined? I know comparing results from different metrics is not advised, but this problem mainly stems from the fact that the entities have two different descriptions (a vector-based one and a knowledge-graph-based one) that require different similarity metrics, and my aim is to compare the two types of description through the difference in average similarity between the valid and random groups of entities.
My first intuition was to simply take
similarity_gain = (valid_score - random_score) / np.abs(random_score)
However, this largely ignores the scale of the metrics. For example, the dot product produced huge gains because of the large magnitudes of the vectors, while measures with logarithmic-like scales showed little difference. Would a simple normalization to [0, 1] be a proper solution? Or is there another way to represent the gain across the different metrics for a fair comparison?
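One possible way to put all metrics on a common footing before taking the difference is to min-max normalize the pooled scores of both groups per metric. A minimal sketch in Python/numpy; the function name and the example scores are made up for illustration, not taken from the question:

```python
import numpy as np

def normalized_gain(valid_scores, random_scores):
    """Min-max normalize both groups' scores onto [0, 1] using a shared
    scale per metric, then take the difference of the group means."""
    combined = np.concatenate([valid_scores, random_scores])
    lo, hi = combined.min(), combined.max()
    if hi == lo:                      # degenerate case: all scores equal
        return 0.0
    valid_n = (np.mean(valid_scores) - lo) / (hi - lo)
    random_n = (np.mean(random_scores) - lo) / (hi - lo)
    return valid_n - random_n         # gain on a common [0, 1] scale

# Made-up scores from two metrics on very different scales:
dot_gain = normalized_gain(np.array([120.0, 150.0]), np.array([20.0, 40.0]))
cos_gain = normalized_gain(np.array([0.82, 0.90]), np.array([0.55, 0.60]))
```

With this rescaling, the dot product and cosine gains land in comparable ranges even though their raw scales differ by two orders of magnitude.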

Related

Generating a completely new vector based on other vectors

Assume I have four vectors (v1, v2, v3, v4), and I want to create a new vector (vec_new) that is not close to any of those four vectors. I was thinking about interpolation and extrapolation. Do you think they are suitable? Do they also apply to vectors, and can they generate a vector of, say, 300 dimensions? Another possible option would be a transformation matrix, but I am not sure it fits my problem. I think averaging and concatenation are not good options, as I might end up close to some of those four vectors.
To restate my problem: imagine I divided my vectors into two categories. I need to find a vector which belongs to neither of those categories.
Any other ideas?
Per my comment, I wouldn't expect the creation of synthetic "far away" examples to be useful for realistic goals.
Even things like word antonyms are not maximally cosine-dissimilar from each other, because among the realm of all word-meaning-possibilities, antonyms are quite similar to each other. For example, 'hot' and 'cold' are considered opposites, but are the same kind of word, describing the same temperature-property, and can often be drop-in replacements for each other in the same sentences. So while they may show an interesting contrast in word-vector space, the "direction of difference" isn't going to be through the origin -- as would create maximal cosine-dissimilarity.
And in classification contexts, even a simple 2-category classifier will need actual 'negative' examples. With only positive examples, the 'vector space' won't necessarily model anything about hypothesized-but-not-actually-present negative examples. (It's nearly impossible to divide the space into two categories without training examples showing the real "boundaries".)
Still, there's an easy way to make a vector that is maximally dissimilar to another single vector: negate it. That creates a vector that's in the exact opposite direction from the original, and thus will have a cosine-similarity of -1.0.
If you have a number of vectors against which you want to find a maximally-dissimilar vector, I suspect you can't do much better than negating the average of all the vectors. That is, average the vectors, then negate that average-vector, to find the vector that's pointing exactly-opposite the average.
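The negate-the-average recipe can be sketched in a few lines of numpy; the four 300-dimensional vectors here are randomly generated purely for illustration:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 300))   # four example 300-d vectors

far = -vecs.mean(axis=0)           # negate the average of all vectors

# far has cosine similarity exactly -1.0 to the average vector,
# and (for typical inputs) strongly negative similarity to each v_i:
sims = [cosine(far, v) for v in vecs]
```

Note that `far` is only guaranteed to be exactly opposite the *average*; its similarity to each individual vector depends on how spread out they are.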
Good luck!

A Neverending cforest

How can I decouple the time cforest/ctree takes to construct a tree from the number of columns in the data?
I thought the option mtry could be used to do just that, i.e. the help says
number of input variables randomly sampled as candidates at each node for random forest like algorithms.
But while that does randomize the output trees, it doesn't decouple the CPU time from the number of columns. For example:
p <- proc.time()
ctree(gs.Fit ~ .,
      data = Aspekte.Fit[, 1:60],
      controls = ctree_control(mincriterion = 0,
                               maxdepth = 2,
                               mtry = 1))
proc.time() - p
takes twice as long as the same call with Aspekte.Fit[,1:30] (by the way, all variables are boolean). Why? Where does it scale with the number of columns?
As I see it, the algorithm should:
1) At each node, randomly select two columns.
2) Use them to split the response (no scaling because of mincriterion=0).
3) Proceed to the next node (for a total of 3 nodes due to maxdepth=2).
all without being influenced by the total number of columns.
Thanks for pointing out the error of my ways.

Multiple events in TraMineR

I'm trying to analyse multiple sequences with TraMineR at once. I've had a look at seqdef but I'm struggling to understand how I'd create a TraMineR dataset when I'm dealing with multiple variables. I guess I'm working with something similar to the dataset used by Aassve et al. (as mentioned in the tutorial), whereby each wave has information about several states (e.g. children, marriage, employment). All my variables are binary. Here's an example of a dataset with three waves (D,W2,W3) and three variables.
D<-data.frame(ID=c(1:4),A1=c(1,1,1,0),B1=c(0,1,0,1),C1=c(0,0,0,1))
W2<-data.frame(A2=c(0,1,1,0),B2=c(1,1,0,1),C2=c(0,1,0,1))
W3<-data.frame(A3=c(0,1,1,0),B3=c(1,1,0,1),C3=c(0,1,0,1))
L<-data.frame(D,W2,W3)
I may be wrong, but the material I found deals with the data management and analysis of one variable at a time (e.g. employment status across several waves). My dataset is much larger than the above, so I can't really input these manually as shown on page 48 of the tutorial. Has anyone dealt with this type of data using TraMineR (or a similar package)?
1) How would you feed the data above to TraMineR?
2) How would you compute the substitution costs and then cluster them?
Many thanks
When using sequence analysis, we are interested in the evolution of one variable (for instance, a sequence of one variable across several waves). You then have several possibilities to analyze multiple variables:
Create one sequence object per variable and then analyze the links between the clusters of sequences. In my opinion, this is the best way to go if your variables measure different concepts (for instance, family and employment).
Create a new variable for each wave that is the interaction of the different variables of that wave, using the interaction function. For instance, for wave one, use L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop=T) (use drop=T to remove unused combinations of answers). Then analyze the sequence of this newly created variable. In my opinion, this is the preferred way if your variables are different dimensions of the same concept. For instance, marriage, children and union are all related to family life.
Create one sequence object per variable and then use seqdistmc to compute the distances (multi-channel sequence analysis). This is equivalent to the previous method, depending on how you set the substitution costs (see below).
If you use the second strategy, you could use the following substitution costs. Count the differences between the original variables to set the substitution costs. For instance, between the states "Married, Child" and "Not married, Child", you could set the substitution cost to 1 because there is a difference only on the "marriage" variable. Similarly, you would set the substitution cost between "Married, Child" and "Not married, No Child" to 2 because both variables differ. Finally, set the indel cost to half the maximum substitution cost. This is the strategy used by seqdistmc.
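The counting scheme described above amounts to a Hamming distance between the component states. A small language-agnostic sketch of the idea in Python (the two-component "married/child" states are hypothetical examples, not TraMineR output):

```python
from itertools import product

# Hypothetical composite states: (married, child), each component 0/1
states = list(product([0, 1], repeat=2))   # (0,0), (0,1), (1,0), (1,1)

def sub_cost(s, t):
    """Substitution cost = number of component variables that differ."""
    return sum(a != b for a, b in zip(s, t))

# Full substitution cost matrix over all composite states
costs = {(s, t): sub_cost(s, t) for s in states for t in states}

max_cost = max(costs.values())   # here 2: both components differ
indel = max_cost / 2             # indel set to half the max substitution cost
```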
Hope this helps.
Biemann and Datta (2013) discuss multidimensional sequence analysis, which means creating multiple sequences for the same "individuals".
I used the following approach to do so:
1) define 3 dimensional sequences
comp.seq <- seqdef(comp,NULL,states=comp.scodes,labels=comp.labels, alphabet=comp.alphabet,missing="Z")
titles.seq <- seqdef(titles,NULL,states=titles.scodes,labels=titles.labels, alphabet=titles.alphabet,missing="Z")
member.seq <- seqdef(member,NULL,states=member.scodes,labels=member.labels, alphabet=member.alphabet,missing="Z")
2) Compute the multi channel (multi dimension) distance
mcdist <- seqdistmc(channels=list(comp.seq,member.seq,titles.seq),method="OM",sm=list("TRATE","TRATE","TRATE"),with.missing=TRUE)
3) cluster it with ward's method:
library(cluster)
clusterward<- agnes(mcdist,diss=TRUE,method="ward")
plot(clusterward,which.plots=2)
Never mind parameters like "missing", "left", etc., but I hope the brief code sample helps.

The approach to calculating 'similar' objects based on certain weighted criteria

I have a site that has multiple Project objects. Each project has (for example):
multiple tags
multiple categories
a size
multiple types
etc.
I would like to write a method to grab all 'similar' projects based on the above criteria. I can easily retrieve similar projects for each criterion singly (i.e. projects of a similar size, or projects that share a category, etc.), but I would like it to be more intelligent than just choosing projects that either have all of the above in common or have at least one of the above in common.
Ideally, I would like to weight each of the criteria: a project that has a tag in common is less 'similar' than a project that is close in size, a project that has two tags in common is more similar than a project that has only one tag in common, and so on.
What approach (practical and mathematical) can I take to do this?
The common way to handle this (in machine learning, at least) is to create a metric which measures the similarity. A Jaccard metric seems like a good match here, given that you have types, categories, tags, etc., which are not really numbers.
Once you have a metric, you can speed up searching for similar items by using a k-d tree, vp-tree, or another metric tree structure, provided your metric obeys the triangle inequality (d(a,b) ≤ d(a,c) + d(c,b)).
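For set-valued attributes like tags, the Jaccard distance mentioned above is a true metric (it satisfies the triangle inequality). A minimal sketch with made-up project tag sets:

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|. A proper metric, so it
    can back a vp-tree or other metric tree structure."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical projects described by their tag sets:
p1 = {"web", "django", "api"}
p2 = {"web", "api", "mobile"}
p3 = {"games", "c++"}

d12 = jaccard_distance(p1, p2)   # shares 2 of 4 distinct tags -> 0.5
d13 = jaccard_distance(p1, p3)   # shares nothing              -> 1.0
```

The same function works for categories or types; a per-attribute Jaccard distance can then feed into a weighted combination.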
The problem is that there are obviously an infinite number of ways of solving this.
First of all, define a similarity measure for each of your attributes (tag similarity, category similarity, description similarity, ...).
Then try to normalize all these similarities to a common scale, e.g. 0 to 1, with a consistent orientation (say, 1 being most similar) and similarly shaped distributions.
Next, assign each feature a weight, e.g. tag similarity is more important than description similarity.
Finally, compute a combined similarity as the weighted sum of the individual similarities.
There is an infinite number of ways because you can assign arbitrary weights, you already have various choices for the single-attribute similarities, and there are infinitely many ways of normalizing the individual values. And so on.
There are methods for learning the weights; see ensemble methods. However, to learn the weights you need user input on what is a good result and what is not. Do you have such training data?
1) Start with a value of 100 in each category.
2) Apply penalties, like -1 for each kB difference in size, or -2 for each tag not found in the other project. You end up with a value of 0..100 in each category.
3) Multiply each category's value by the "weight" of the category (e.g., similarity in size is multiplied by 1, similarity in tags by 3, similarity in types by 2).
4) Add up the weighted values.
5) Divide by the sum of the weight factors (in my example, 1 + 3 + 2 = 6) to get an overall similarity of 0..100.
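The penalty scheme above can be sketched directly; the weights (1, 3, 2) and penalty rates are the made-up examples from the answer, not fixed rules:

```python
def category_score(penalty):
    """Start at 100 and subtract penalties, floored at 0."""
    return max(0, 100 - penalty)

def overall_similarity(size_diff_kb, missing_tags, missing_types,
                       weights=(1, 3, 2)):
    # Per-category scores: -1 per kB of size difference, -2 per missing
    # tag, -2 per missing type (example rates only).
    scores = (category_score(1 * size_diff_kb),
              category_score(2 * missing_tags),
              category_score(2 * missing_types))
    weighted = sum(w * s for w, s in zip(weights, scores))
    return weighted / sum(weights)   # back onto a 0..100 scale

sim = overall_similarity(size_diff_kb=10, missing_tags=2, missing_types=0)
```

Flooring each category at 0 keeps one wildly dissimilar attribute from dragging the overall score negative.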
How to reduce the comparison of projects below the initial O(n^2) (i.e. comparing each project with each other) depends heavily on context. It might be the real crux of your software, or it might not be necessary at all if n is low.

What makes these two R data frames not identical?

I have two small data frames, this_tx and last_tx. They are, in every way that I can tell, completely identical. this_tx == last_tx results in a frame of identical dimensions, all TRUE. this_tx %in% last_tx, two TRUEs. Inspected visually, clearly identical. But when I call
identical(this_tx, last_tx)
I get a FALSE. Hilariously, even
identical(str(this_tx), str(last_tx))
will return a TRUE. If I set this_tx <- last_tx, I'll get a TRUE.
What is going on? I don't have the deepest understanding of R's internal mechanics, but I can't find a single difference between the two data frames. If it's relevant, the two variables in the frames are both factors - same levels, same numeric coding for the levels, both just subsets of the same original data frame. Converting them to character vectors doesn't help.
Background (because I wouldn't mind help on this, either): I have records of drug treatments given to patients. Each treatment record essentially specifies a person and a date. A second table has a record for each drug and dose given during a particular treatment (usually, a few drugs are given each treatment). I'm trying to identify contiguous periods during which the person was taking the same combinations of drugs at the same doses.
The best plan I've come up with is to check the treatments chronologically. If the combination of drugs and doses for treatment[i] is identical to the combination at treatment[i-1], then treatment[i] is a part of the same phase as treatment[i-1]. Of course, if I can't compare drug/dose combinations, that's right out.
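The chronological check described above can be sketched in plain Python, comparing each treatment's drug/dose combination as a frozenset (which sidesteps the data-frame identity issue entirely). Record layout and field names here are hypothetical:

```python
# Each treatment: (date, {(drug, dose), ...}); assumed sorted by date.
treatments = [
    ("2011-01-01", frozenset({("aspirin", 100), ("statin", 20)})),
    ("2011-02-01", frozenset({("aspirin", 100), ("statin", 20)})),
    ("2011-03-01", frozenset({("aspirin", 100), ("statin", 40)})),
]

phases = []
for date, combo in treatments:
    if phases and phases[-1]["combo"] == combo:
        phases[-1]["end"] = date          # same combo: extend current phase
    else:                                  # combo changed: start a new phase
        phases.append({"start": date, "end": date, "combo": combo})

# phases now holds contiguous periods of identical drug/dose combinations
```

Because frozensets compare by value, equality holds regardless of row order or factor-level bookkeeping.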
Generally, in this situation it's useful to try all.equal, which will give you some information about why the two objects are not equivalent.
Well, the jaded cry of "moar specifics plz!" may win in this case:
Check the output of dput() and post it if possible. str() just summarizes the contents of an object, whilst dput() dumps out all the gory details in a form that may be copied and pasted into another R interpreter to regenerate the object.
