The approach to calculating 'similar' objects based on certain weighted criteria - math

I have a site that has multiple Project objects. Each project has (for example):
multiple tags
multiple categories
a size
multiple types
etc.
I would like to write a method to grab all 'similar' projects based on the above criteria. I can easily retrieve similar projects for each criterion on its own (i.e. projects of a similar size, or projects that share a category, etc.), but I would like it to be more intelligent than just choosing projects that either have all of the above in common or have at least one of the above in common.
Ideally, I would like to weight each of the criteria, i.e. a project that has a tag in common is less 'similar' than a project that is close in size, etc. A project that has two tags in common is more similar than a project that has one tag in common, and so on.
What approach (practically and mathematically) can I take to do this?

The common way to handle this (in machine learning, at least) is to create a metric which measures the similarity -- a Jaccard metric seems like a good match here, given that you have types, categories, tags, etc., which are not really numbers.
Once you have a metric, you can speed up searching for similar items by using a KD-tree, vp-tree or another metric tree structure, provided your metric obeys the triangle inequality: d(a, b) <= d(a, c) + d(c, b).
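As a rough sketch (Python; the project_a/project_b objects and their .tags attribute are hypothetical stand-ins for your Project model), a Jaccard distance over one of the set-valued attributes could look like this:
def jaccard_distance(a, b):
    # 1 - |intersection| / |union|; 0 means identical sets, 1 means disjoint sets.
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# e.g. d = jaccard_distance(set(project_a.tags), set(project_b.tags))
Jaccard distance satisfies the triangle inequality, so it can be used with the metric tree structures mentioned above.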

The problem is that there is obviously an infinite number of ways of solving this.
First of all, define a similarity measure for each of your attributes (tag similarity, category similarity, description similarity, ...)
Then try to normalize all these similarities to a common scale, e.g. 0 to 1, with 0 being most similar (in effect a dissimilarity) and with the values having a similar distribution.
Next, assign each feature a weight. E.g. tag similarity is more important than description similarity.
Finally, compute a combined similarity as weighted sum of the individual similarities.
There is an infinite number of ways to do this, as you can obviously assign arbitrary weights, you already have various choices for the single-attribute similarities, and there is an infinite number of ways to normalize the individual values. And so on.
There are methods for learning the weights; see ensemble methods. However, to learn the weights you need user input on what is a good result and what is not. Do you have such training data?
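As an illustration of the steps above (a sketch only: the attribute names, the per-attribute measures and the weights are placeholders you would tune to your data):
def set_dissimilarity(a, b):
    # Jaccard distance: 0 means identical, 1 means nothing in common.
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def combined_dissimilarity(p, q):
    # Per-attribute dissimilarities, each normalized to the common 0..1 scale.
    parts = {
        "tags": set_dissimilarity(set(p.tags), set(q.tags)),
        "categories": set_dissimilarity(set(p.categories), set(q.categories)),
        "size": abs(p.size - q.size) / max(p.size, q.size, 1),
    }
    weights = {"tags": 3.0, "categories": 2.0, "size": 1.0}   # illustrative weights
    return sum(weights[k] * parts[k] for k in parts) / sum(weights.values())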

Start with a value of 100 in each category.
Apply penalties. Like, -1 for each kB difference in size, or -2 for each tag not found in the other project. You end up with a value of 0..100 in each category.
Multiply each category's value by the "weight" of the category (e.g., similarity in size is multiplied by 1, similarity in tags by 3, similarity in types by 2).
Add up the weighted values.
Divide by the sum of weight factors (in my example, 1 + 3 + 2 = 6) to get an overall similarity of 0..100.
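A sketch of this scheme (Python; the attribute names and the exact penalty values are illustrative, not prescriptive):
def category_scores(p, q):
    # Start each category at 100 and subtract penalties, clamping at 0.
    return {
        "size":  max(0, 100 - abs(p.size_kb - q.size_kb)),           # -1 per kB of difference
        "tags":  max(0, 100 - 2 * len(set(p.tags) ^ set(q.tags))),   # -2 per tag not shared
        "types": max(0, 100 - 2 * len(set(p.types) ^ set(q.types))),
    }

def overall_similarity(p, q, weights={"size": 1, "tags": 3, "types": 2}):
    # Weighted average brings the result back to the 0..100 range.
    scores = category_scores(p, q)
    return sum(weights[k] * scores[k] for k in scores) / sum(weights.values())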
How far you can reduce the comparison of projects below the initial O(n^2) (i.e. comparing each project with every other project) depends heavily on context. It might be the real crux of your software, or it might not be necessary at all if n is low.

Related

How to define the difference between the similarity (e.g.: measured with cosine) of entities from two given sets?

I have two groups of entities: a "valid" group which is expected to contain similar entities, and a "random" group which is expected to contain less similar entities than the valid group. I have a long list of similarity/distance measures, like cosine, Canberra, Manhattan, etc., all computed within the groups to produce an average similarity score for the entities in each. My question is: how should the difference between the scores of the two groups be defined? I know comparing results from different metrics is not advised, but this problem mainly stems from the fact that the entities have two different descriptions (a vector-based one and a knowledge-graph-based one) that require different similarity metrics, and my aim is to compare the two types of description through the difference of average similarities between the valid and random groups of entities.
My first intuition was to simply take
similarity_gain = (valid_score - random_score) / np.abs(random_score)
However, this largely ignores the scale of the metrics; e.g. the dot-product measure resulted in huge gains due to the large differences in the vectors, while other measures with logarithmic-like scales showed little difference. Would a simple normalization to [0, 1] be a proper solution? Or is there another way to represent the gain across the different metrics for a proper comparison?
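One way to make the gain comparable across metrics is the [0, 1] normalization mentioned in the question: rescale the pooled pairwise scores of both groups before averaging. A sketch (variable names are hypothetical; for distance measures, where lower means more similar, the sign of the gain flips):
import numpy as np

def normalized_gain(valid_pairwise, random_pairwise):
    # Min-max scale the pooled scores so every metric lives on the same 0..1 range,
    # then take the difference of the group means on that common scale.
    pooled = np.concatenate([valid_pairwise, random_pairwise])
    lo, hi = pooled.min(), pooled.max()
    if hi == lo:
        return 0.0
    valid_norm = (np.asarray(valid_pairwise, dtype=float) - lo) / (hi - lo)
    random_norm = (np.asarray(random_pairwise, dtype=float) - lo) / (hi - lo)
    return valid_norm.mean() - random_norm.mean()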

How to add weight to attribute inside feature vector for distance measure?

I am attempting to measure the distance between two feature vectors, but I want to give more importance to one attribute inside the feature vector beyond the rest. For example, if the vector I had below were filled with numeric features, how would I place more value on "taste"?
V = [ Taste, Smell, Feel, Look ]
I know I could just isolate that value and perform the distance measure on it alone, but I wasn't sure whether that is the best way, or whether I would lose the "rest of the picture" by doing so. When I search for weighted distance measures, I tend to land on pages where the weight is just being used for normalization or standardization of the data, which doesn't appear to carry the same meaning as what I would like.
Am I better off using the distance measure on the full vector and then applying something like KNN with weights later on?
I think you can try a matrix multiplication: define a weight matrix and multiply your data by it before computing the distance.
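Concretely, a weighted Euclidean distance just scales each squared difference by a per-feature weight; a sketch (the example weights are arbitrary):
import numpy as np

def weighted_euclidean(u, v, w):
    # Equivalent to multiplying each feature by sqrt(w_i) before a plain Euclidean distance.
    u, v, w = (np.asarray(x, dtype=float) for x in (u, v, w))
    return float(np.sqrt(np.sum(w * (u - v) ** 2)))

# V = [Taste, Smell, Feel, Look]; weight Taste four times as heavily as the rest:
# weighted_euclidean([7, 3, 5, 4], [6, 2, 5, 5], w=[4.0, 1.0, 1.0, 1.0])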

Multiple events in traminer

I'm trying to analyse multiple sequences with TraMineR at once. I've had a look at seqdef but I'm struggling to understand how I'd create a TraMineR dataset when I'm dealing with multiple variables. I guess I'm working with something similar to the dataset used by Aassve et al. (as mentioned in the tutorial), whereby each wave has information about several states (e.g. children, marriage, employment). All my variables are binary. Here's an example of a dataset with three waves (D,W2,W3) and three variables.
D<-data.frame(ID=c(1:4),A1=c(1,1,1,0),B1=c(0,1,0,1),C1=c(0,0,0,1))
W2<-data.frame(A2=c(0,1,1,0),B2=c(1,1,0,1),C2=c(0,1,0,1))
W3<-data.frame(A3=c(0,1,1,0),B3=c(1,1,0,1),C3=c(0,1,0,1))
L<-data.frame(D,W2,W3)
I may be wrong, but the material I found deals with the data management and analysis of one variable at a time only (e.g. employment status across several waves). My dataset is much larger than the above, so I can't really input these manually as shown on page 48 of the tutorial. Has anyone dealt with this type of data using TraMineR (or a similar package)?
1) How would you feed the data above to TraMineR?
2) How would you compute the substitution costs and then cluster them?
Many thanks
When using sequence analysis, we are interested in the evolution of one variable (for instance, a sequence of one variable across several waves). You then have several possibilities for analyzing multiple variables:
Create one sequence object per variable and then analyze the links between the resulting clusters of sequences. In my opinion, this is the best way to go if your variables measure different concepts (for instance, family and employment).
Create a new variable for each wave that is the interaction of the different variables of that wave, using the interaction function. For instance, for wave one, use L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop=TRUE) (drop=TRUE removes unused combinations of answers). Then analyze the sequence of this newly created variable. In my opinion, this is the preferred way if your variables are different dimensions of the same concept. For instance, marriage, children and union are all related to family life.
Create one sequence object per variable and then use seqdistmc to compute the distances (multi-channel sequence analysis). This is equivalent to the previous method, depending on how you set the substitution costs (see below).
If you use the second strategy, you could set the substitution costs by counting the differences between the original variables. For instance, between the states "Married, Child" and "Not married, Child", you could set the substitution cost to 1, because there is a difference on only the "marriage" variable. Similarly, you would set the substitution cost between the states "Married, Child" and "Not married, No Child" to 2, because both variables differ. Finally, set the indel cost to half the maximum substitution cost. This is the strategy used by seqdistmc.
Hope this helps.
Biemann and Datta (2013) talk about multidimensional sequence analysis, which means creating multiple sequences for the same "individuals".
I used the following approach to do so:
1) Define the three dimensions as separate sequence objects:
comp.seq <- seqdef(comp, NULL, states=comp.scodes, labels=comp.labels, alphabet=comp.alphabet, missing="Z")
titles.seq <- seqdef(titles, NULL, states=titles.scodes, labels=titles.labels, alphabet=titles.alphabet, missing="Z")
member.seq <- seqdef(member, NULL, states=member.scodes, labels=member.labels, alphabet=member.alphabet, missing="Z")
2) Compute the multi-channel (multidimensional) distances:
mcdist <- seqdistmc(channels=list(comp.seq, member.seq, titles.seq), method="OM", sm=list("TRATE", "TRATE", "TRATE"), with.missing=TRUE)
3) Cluster with Ward's method:
library(cluster)
clusterward <- agnes(mcdist, diss=TRUE, method="ward")
plot(clusterward, which.plots=2)
Never mind parameters like "missing" or "left" etc.; I hope the brief code sample helps.

Analyzing Path Data

I have data representing the paths people take across a fixed set of points (discrete, e.g., nodes and edges). So far I have been using igraph.
I haven't found a good way yet (in igraph or another package) to create canonical paths summarizing what significant sub-groups of respondents are doing.
A canonical path can be operationalized in any reasonable way and is just meant to represent a typical path or sub-path for a significant portion of the population.
Does there already exist a function to create these within igraph or another package?
One option: represent each person's movement as directed edges. Create an aggregate graph in which each edge has a weight corresponding to the number of times that edge occurred across respondents. The edges with large weights will be the "typical" 1-paths.
Of course, it gets more interesting to find common k-paths or explore how paths vary among individuals. The naive approach for 2-paths would be to create N additional nodes that correspond to nodes when visited in the middle of the 2-path. For example, if you have nodes a_1, ..., a_N you would create nodes b_1, ..., b_N. The aggregate network might have an edge (a_3, b_5, 10) and an edge (b_5, a_7, 10); this would represent the two-path (a_3, b_5, a_7) occurring 10 times. The task you're interested in corresponds to finding those two-paths with large weights.
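A minimal sketch of the aggregation (plain Python, no graph library; each path is assumed to be a list of node ids in visit order):
from collections import Counter

def aggregate_k_paths(paths, k=1):
    # Count every length-k sub-path (k=1 gives edge weights, k=2 the two-paths above).
    counts = Counter()
    for path in paths:
        for i in range(len(path) - k):
            counts[tuple(path[i:i + k + 1])] += 1
    return counts

# aggregate_k_paths(paths, 1).most_common(10) lists the ten heaviest edges;
# aggregate_k_paths(paths, 2) gives the weighted two-paths described above.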
Both the igraph and network packages would suffice for this sort of analysis.
If you have some bound on k (i.e. only 6-paths occur in your dataset), I might also suggest enumerating all the paths that are taken and computing a histogram of each unique path. I don't know of any functions that do this automagically for you.

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, Soundex might help to distinguish the T and the R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification: start out with one node per entry, then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation within it. As many of your strings will be identical, you will soon get large subtrees with identical entries. The recursion ends with a single supernode at the root of the tree.
Now map the canonical names onto this tree. You'll quickly see that each one matches an entire subtree. Now use the distances between these subtrees to pick the distance cutoff for each entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
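A sketch of this bottom-up grouping using off-the-shelf agglomerative clustering (SciPy; distance is whatever string metric you already have, and for simplicity this uses one global cutoff where the answer above suggests a per-entry one):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_names(names, distance, cutoff):
    # Pairwise distance matrix from your string metric.
    n = len(names)
    dm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dm[i, j] = dm[j, i] = distance(names[i], names[j])
    # Build the tree of supernodes, then cut it at the chosen distance threshold.
    tree = linkage(squareform(dm), method="average")
    return fcluster(tree, t=cutoff, criterion="distance")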
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalence classes for things like "T-400", "T400", "T 400", etc. Maybe a set of rules saying "numbers bind more strongly than the letters attached to those numbers."
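One such equivalence rule could be implemented as a normalization step; a sketch (the regex is deliberately blunt):
import re

def normalize_model_tokens(name):
    # Glue a letter prefix onto the digits that follow it: "T-400" and "T 400" -> "T400".
    return re.sub(r"\b([A-Za-z]+)[\s-]+(\d+)\b", r"\1\2", name).lower()

# normalize_model_tokens("New Lenovo T-400") -> "new lenovo t400"
# (it would also turn "Core 2" into "core2", which shows why such rules need tuning)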
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
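For reference, the pairwise score behind a trigram search can be sketched like this (a real implementation would index the trigram sets rather than compare names one by one):
def trigrams(s):
    s = " " + s.lower() + " "                       # pad so word boundaries contribute
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

# Misspellings and reorderings still share most trigrams; e.g. "powershot" contributes
# the same trigrams wherever it appears in the name.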
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages:
Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance, or something like the cosine distance that compares the number of common words.
Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
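A generic sketch of those three stages (this is not the dedupe API; the single comparator, the logistic weights and the 0.5 threshold are placeholders you would normally fit to labelled pairs):
import itertools
import math

def match_probability(a, b, compare, bias=-4.0, weight=8.0):
    # Stage 1: compare the name field; stage 2: squash the score into a probability.
    return 1.0 / (1.0 + math.exp(-(bias + weight * compare(a, b))))

def link_records(names, compare, threshold=0.5):
    # Stage 3: a crude clustering -- union the pairs whose probability clears the threshold.
    parent = list(range(len(names)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in itertools.combinations(range(len(names)), 2):
        if match_probability(names[i], names[j], compare) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(names))]   # cluster label per input name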
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
I don't have any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
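A sketch of that token-based lookup as an inverted index (the catalog structure is made up):
from collections import defaultdict

def build_token_index(catalog):
    # catalog: {item_id: canonical_name}
    index = defaultdict(set)
    for item_id, name in catalog.items():
        for token in name.lower().split():
            index[token].add(item_id)
    return index

def candidate_matches(query, index):
    # Rank items by how many of the query's tokens they share; expect many false matches.
    hits = defaultdict(int)
    for token in query.lower().split():
        for item_id in index.get(token, ()):
            hits[item_id] += 1
    return sorted(hits, key=hits.get, reverse=True)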
Spell-checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell-checking algorithm to come up with satisfactory results, i.e. working with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length, and make that the word's key.
When a suspect word comes up, look for words with the same or a similar word key (see the sketch below).
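A sketch of the word-key idea, applied literally (the stop-word list is just the example above):
def word_key(word):
    # First letter plus length, as suggested above.
    return (word[0], len(word)) if word else ("", 0)

STOPWORDS = frozenset({"a", "an", "the", "new"})

def keys_for(name):
    # Strip common words, then reduce every remaining word to its key.
    return {word_key(w) for w in name.lower().split() if w not in STOPWORDS}

# keys_for("Canon PowerShot A20IS") and keys_for("NEW powershot A20 IS from Canon")
# share the keys for "canon" and "powershot", but "a20is" vs "a20" + "is" still differ.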
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
Based on keywords, narrow down the scope of the search:
In this case you could have a hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for the type and
"Canon" for the company, and then you'd be left with a much narrower scope to search.
You could narrow this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold confidence score.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data-matching automation the next time you import new data.
I worked on exactly the same thing in the past. What I did was use an NLP method, a TF-IDF vectorizer, to assign weights to each word. For example, in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This tells your model which words to care about and which to ignore. I got quite good matches thanks to TF-IDF.
But note this: a20IS will not be recognized as a20 IS, so you may consider using some kind of regex to filter such cases.
After that, you can use a numeric measure like cosine similarity.
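A sketch of that pipeline with scikit-learn (the product names are the ones from the question; note the caveat above: the default tokenizer will not equate "T-400" with "T400"):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

canonical = ["Canon PowerShot A20 IS", "Lenovo T400", "Lenovo R400"]
incoming = ["NEW powershot A20 IS from Canon", "Digital Camera Canon PS A20IS"]

vectorizer = TfidfVectorizer().fit(canonical + incoming)    # learns vocabulary and IDF weights
scores = cosine_similarity(vectorizer.transform(incoming),
                           vectorizer.transform(canonical))
best_match = scores.argmax(axis=1)    # index of the most similar canonical name per input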

Resources