Is there any way to combine multiple distance metrics in a similarity function?

I need to find a way to code a similarity function between two vectors (data instances). These data instances have categorical features as well as numeric quantities. Thus, I'd like to find a way to combine, say, Hamming distance and Euclidean distance to use in my association problem.
There are ways of merging results for k-NN, such as voting, but my association problem cannot be solved by a voting approach.

You can use Gower distance to compute distances when you have mixed data types.
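In R, a minimal sketch uses daisy() from the cluster package (the data frame df and its columns are invented for illustration): numeric columns contribute a range-normalised difference and categorical columns a 0/1 mismatch, so the two kinds of features are combined into a single dissimilarity, much like mixing Euclidean-style and Hamming-style terms.

library(cluster)

# toy mixed-type data (invented for illustration)
df <- data.frame(
  weight = c(1.2, 3.4, 2.2, 5.0),
  colour = factor(c("red", "blue", "red", "green"))
)

# Gower dissimilarity: per-feature distances averaged into one value per pair
d <- daisy(df, metric = "gower")
as.matrix(d)

daisy() also takes a weights argument if some features should count more than others.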

Related

What is the difference between metric and non-metric MDS for a beginner?

I am fairly new to data science and would like to know in simple words (like teaching your grandmother) what the difference between metric and non-metric Multidimensional scaling is.
I have been googling for 2 days and watching different videos, and I wasn't able to quite understand some of the terms people use to describe the difference. Maybe I am lacking some basic knowledge, but I don't know in which area, so if you have an idea of what I should have a firm understanding of before tackling this subject, I would appreciate the advice. Here is what I know:
Multidimensional scaling is a way of reducing dimensions to be able to visualize or represent data in a friendlier manner. I know that there are several approaches to MDS, like metric and non-metric, as well as PCA and FA (maybe FA is a part of PCA, I'm not sure).
The example I am trying to apply this to is a set of data showing different cities and attributes related to these cities. For example, on a scale from 1-7 (1 lowest, 7 highest), this is the score of each city for each attribute:
| City | **Clean** | **Friendly** | **Expensive** | **Beautiful** |
| --- | --- | --- | --- | --- |
| Berlin | 4 | 2 | 5 | 6 |
| Geneva | 6 | 3 | 7 | 7 |
| Paris | 3 | 4 | 6 | 7 |
| Barcelona | 2 | 6 | 3 | 4 |
How do I know if I should be using metric or non-metric MDS? Are there general rules of thumb or simple logic that I can use to decide without going deep into the technical process?
Thank you.
Well, I might not be able to give you a precise answer, but a simple one would be that in metric MDS the input matrix already contains distances (i.e. actual distances between cities), so the distances have meaning in themselves, and the method creates a map of actual physical locations from them.
In non-metric MDS, the distances are just a representation of the rankings (i.e. high as in 7 or low as in 1); they do not have any meaning on their own, but they are used to create the map with Euclidean geometry, and the map then simply shows the similarity in rankings as distances between coordinates on the map.
Metric MDS deals with an item x item input matrix whose entries represent Euclidean distance (a special case of metric MDS called classical MDS, which is equivalent to PCA) or any other distance between items.
Non-metric MDS deals with some distance-like measure (let's call it dissimilarity) between items. There is no requirement for the dissimilarity to satisfy formal properties of a distance/metric (see this wiki for needed properties). The only requirement is that it should be possible to order the dissimilarity values for all item x item pairs in non-decreasing order.
In your case, the item x attribute matrix contains ordinal data (data on a scale 1-7). Euclidean distance won't be appropriate here, but e.g. Pearson "distance" or cosine "distance" are usually used for such data and, as they're not proper distances, non-metric MDS should then be chosen.
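As an illustration in R (a sketch only; with just the four cities from the question the result is trivial, and MASS::isoMDS is one common implementation of non-metric MDS):

library(MASS)

# city x attribute ratings from the question (ordinal, scale 1-7)
ratings <- rbind(
  Berlin    = c(4, 2, 5, 6),
  Geneva    = c(6, 3, 7, 7),
  Paris     = c(3, 4, 6, 7),
  Barcelona = c(2, 6, 3, 4)
)
colnames(ratings) <- c("Clean", "Friendly", "Expensive", "Beautiful")

# Pearson "distance" between cities: 1 - correlation of their rating profiles
d <- as.dist(1 - cor(t(ratings)))

# Kruskal's non-metric MDS in 2 dimensions; only the rank order of d matters
fit <- isoMDS(d, k = 2)
plot(fit$points, type = "n")
text(fit$points, labels = rownames(ratings))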

Difference between nearest-neighbour clustering and k-nearest neighbour clustering

We're two students working on a seminar paper (topic: Marketing in the Age of Big Data) in which we have to conduct a cluster analysis using nearest-neighbour clustering. Unfortunately, we cannot differentiate between nearest-neighbour clustering and k-nearest neighbours. At first we thought they were the same thing under different names. But after reading many papers which say that KNN is a supervised machine-learning algorithm, while our professor said that nearest neighbour is an unsupervised algorithm, we realised there must be a difference. There are a lot of conflicting statements on the internet, which is why we are confused now.
Hopefully, someone can help us resolve this misunderstanding.
Many thanks in advance and best regards.
"Nearest Neighbour" is merely "k Nearest Neighbours" with k=1.
What may be confusing is that "nearest neighbour" applies to both supervised classification and unsupervised clustering. In the supervised case, a "new", unclassified element is assigned to the same class as its nearest neighbour (or the modal class of its k nearest neighbours).
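A small illustration of the supervised case in R (a sketch using class::knn on a built-in data set; the 100/50 split is arbitrary):

library(class)

set.seed(4)
# arbitrary split of iris into labelled "training" points and unlabelled "new" points
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# each new point gets the modal class of its k = 5 nearest labelled neighbours
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
table(pred, actual = iris$Species[-idx])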
In the unsupervised case, we generally apply "hierarchical clustering": take the two points with the least distance between them; declare a new class to contain the two points.
Now iterate through the distances, smallest to largest; if neither point is yet in a class, make a new class to contain them; if one point is already in a class, then add the other point to that class; if both points are in classes, then merge the classes. Continue this process until you have the desired quantity of classes.
Note: when you add a point to a class, remove (from your iteration list) the distances from that point to other class members. When you merge classes, remove all the distances between points that used to be in opposite classes.
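In R, this "merge the closest pair first" process corresponds to single-linkage agglomerative clustering; a small sketch (the toy matrix m is invented for illustration):

set.seed(2)
# toy 2-D data in three loose groups (invented)
m <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
           matrix(rnorm(20, mean = 5),  ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))

# single linkage = always merge the two closest points/clusters next
hc <- hclust(dist(m), method = "single")

# stop the merging when 3 classes remain
classes <- cutree(hc, k = 3)
table(classes)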
Does that help?
The nearest-neighbor algorithm basically returns the training example that is at the least distance from the given test sample. k-nearest neighbors returns the k (a positive integer) training examples at the least distance from the given test sample.

Use a KNN-regression algorithm in R

I am working on using the k nearest neighbours among individuals for which a certain variable is known (training set) to determine the value of that same variable for an individual for which it is unknown (test set). Two possible approaches can then be taken:
first (the easy one), calculate the mean value of the variable over the k nearest individuals; second (the better one), calculate a distance-weighted value according to the proximity of the individuals.
My first approach has been to use the knn.index function in the FNN package to identify the nearest neighbours, and then to use the indexes to look up the values in the dataset and take the mean. This was very slow, as the dataset is quite big. Is there any algorithm already implemented that does this calculation faster, and would it be possible to add weights according to distance?
After a week of trying to solve the problem, I found a function in R which answers my question; this might help others who have struggled with the same issue.
The function is named kknn, and it is in the kknn package. It lets you do KNN regression while weighting the points by distance.
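For illustration, a minimal sketch of distance-weighted KNN regression with kknn (the built-in data set and the train/test split are arbitrary):

library(kknn)

set.seed(1)
# arbitrary split of a built-in data set (illustrative only)
idx   <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# predict mpg from the other columns; the "triangular" kernel weights
# neighbours by their distance to the query point
fit <- kknn(mpg ~ ., train = train, test = test, k = 5, kernel = "triangular")
fit$fitted.values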

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I am working with strings, previous answers to How does clustering (especially String clustering) work? informed me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go rather than k-means.
Although I understand the problem in its abstract form, I do not know what the easiest way is to actually do it. For example, is MATLAB or R a better choice for the actual implementation of hierarchical clustering with the custom function (Levenshtein distance)?
For both pieces of software, one may easily find a Levenshtein distance implementation. The clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)

# rstr(n, k): vector of n random strings of k lowercase letters
rstr <- function(n, k) {
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}

# 30 strings in 3 artificial groups, prefixed "aa", "bb" and "cc"
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str

# hierarchical clustering on the distance matrix, cut into 3 clusters
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
df <- data.frame(str, cluster = cutree(hc, k = 3))
In this example, we create a set of 30 random char(5) strings artificially in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. Word segmentation through cross-lingual word-to-phoneme alignment. Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
We would of course appreciate additional contributions.
While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).
Most often, OMA is carried out in three steps. First, you define your sequences. From your description I assume that each letter is a separate "state", the building block of a sequence. Second, you employ one of several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to be gaining popularity due to the additional information it provides on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in sequence analysis.
In R, the most convenient package, with a great number of functions, is TraMineR; the website can be found here. Its user guide is very accessible, and the developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, apart from the decision on the number of clusters. The TraMineR guide shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA; the clustering result is stored in the clusterward1 object, with which you can do whatever you want: plotting, recoding cluster membership as variables, etc. The diss = TRUE option indicates that the data object is a dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is to choose the right distance algorithm, suitable for your particular application. Once you have that, and are able to justify the choice, the rest is quite easy. Good luck!
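For completeness, a rough sketch of the full three-step pipeline in R (the object names are invented, and it assumes each string has already been split into one character per column of the data frame seq_df):

library(TraMineR)
library(cluster)

# 1) define the sequences (each column of seq_df holds one character/state)
seqs <- seqdef(seq_df)

# 2) optimal matching distances: constant substitution cost, indel cost 1
dist.om1 <- seqdist(seqs, method = "OM", indel = 1, sm = "CONSTANT")

# 3) hierarchical clustering on the distance matrix, cut into 4 groups
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
membership <- cutree(as.hclust(clusterward1), k = 4)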
If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

Determining the optimal number of clusters with the daisy function and Gower similarity

I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the Gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of each cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and to use the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me modify the existing code to accept my distance measurements?
3) Or is there another, better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could change your initial dataset with 2 columns (numeric, nominal)
into a new one with 4 columns (numeric, numeric (representing reptile), numeric (representing mammal), numeric (representing bird)); an instance (23.4, "mammal") would be mapped to (23.4, 0, 1, 0).
Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
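A small sketch of that mapping in R (the data frame, column names, and values are made up for illustration):

# toy mixed data (invented): one numeric column, one nominal column
animals <- data.frame(
  mass  = c(23.4, 0.3, 1200),
  class = factor(c("mammal", "bird", "reptile"))
)

# one 0/1 indicator column per nominal value ("- 1" drops the intercept column)
dummies <- model.matrix(~ class - 1, data = animals)

# combine with the numeric column and standardise so no column dominates
x <- scale(cbind(mass = animals$mass, dummies))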
About 2)
daisy returns an object of class dissimilarity; you can use it with other clustering algorithms from the cluster package (so you may not have to implement anything more). For example, the function pam can take the object returned by daisy directly.
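A sketch (assuming your trait data frame is called traits and that you try 4 clusters):

library(cluster)

# Gower dissimilarities on the mixed-type trait data (traits is assumed)
d <- daisy(traits, metric = "gower")

# PAM accepts the dissimilarity object directly; medoids are actual species
fit <- pam(d, k = 4, diss = TRUE)
fit$medoids          # the representative species of each cluster
table(fit$clustering)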
About 3)
Clusters are really subjective, and most clustering algorithms depend on the initial conditions, so "significant clusters" is not a term everyone would be comfortable using. PAM could be useful in your case because clusters are centred on medoids, which is good for nominal data (because a medoid is an actual, interpretable instance). k-means, for example, has the disadvantage that the centroids are not interpretable (what does it mean to be 1/2 reptile and 1/2 mammal?), whereas PAM builds clusters centred on instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree and removes the longest edges to split it into clusters.
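A rough sketch of that idea in R using igraph (the points and the choice of 3 clusters are invented for illustration):

library(igraph)

set.seed(3)
# toy 2-D points in three groups (invented)
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 6),  ncol = 2),
             matrix(rnorm(20, mean = 12), ncol = 2))

# complete weighted graph from pairwise distances, then its minimum spanning tree
g    <- graph_from_adjacency_matrix(as.matrix(dist(pts)),
                                     mode = "undirected", weighted = TRUE)
tree <- mst(g)

# removing the 2 longest tree edges leaves 3 connected components = 3 clusters
longest  <- order(E(tree)$weight, decreasing = TRUE)[1:2]
clusters <- components(delete_edges(tree, longest))$membership
table(clusters)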
