What's the difference between Pearson correlation similarity and adjusted cosine similarity?

While they look very similar, I am sure there is some difference between Pearson correlation similarity and adjusted cosine similarity, because all the papers and web pages treat them as two distinct measures.
However, none of them provides a clear definition. Here is one of the pages.
Can anyone tell the difference?
Thanks

The two definitions look very similar, but be aware that:
In Pearson correlation, the mean that is subtracted is the mean of the particular item itself (that item's ratings from all users), mean(Ri).
In adjusted cosine similarity, the mean that is subtracted is the mean of the particular user (that user's ratings across all items), mean(Ru).
This small difference can lead to very different results.
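A minimal sketch to make the distinction concrete (the small ratings matrix below is made up purely for illustration): both measures compute a cosine between centered vectors for two items, but Pearson centers each item column by that item's mean, while adjusted cosine centers each rating by its user's mean.

```r
# Made-up user x item rating matrix, purely for illustration.
ratings <- matrix(c(5, 3, 4,
                    4, 2, 5,
                    1, 5, 2,
                    3, 4, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4), paste0("item", 1:3)))

# Pearson correlation between items i and j: each item is centered by its own
# mean over all users, mean(Ri).
pearson_sim <- function(R, i, j) {
  cor(R[, i], R[, j])
}

# Adjusted cosine between items i and j: every rating is centered by the mean
# of the user who gave it, mean(Ru).
adjusted_cosine_sim <- function(R, i, j) {
  Rc <- R - rowMeans(R)
  sum(Rc[, i] * Rc[, j]) / (sqrt(sum(Rc[, i]^2)) * sqrt(sum(Rc[, j]^2)))
}

pearson_sim(ratings, 1, 2)          # item-mean centering
adjusted_cosine_sim(ratings, 1, 2)  # user-mean centering: a different value
```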

Related

Determining the ideal number of clusters by comparing the agglomeration coefficients of several cluster numbers (cluster analysis)

I am currently stuck on a cluster analysis. I would like to determine the ideal number of clusters by comparing the agglomerative coefficients for different numbers of clusters with each other. The percentage change in the coefficient from one level to the next should be the basis for my decision, as in the following screenshot:
[Screenshot: percentage change in coefficient to the next level]
But I do not know how to get these coefficients. I was only able to compute the AC for the whole dendrogram by using the AGNES program. How can I get the agglomeration coefficient for different numbers of clusters?
I'd appreciate any advice, thank you.
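Not from the original question, but a minimal sketch of one way to get step-wise values, assuming the "coefficients" in the screenshot are the merge heights of the agglomeration schedule: take the heights from an agnes fit (the same AGNES algorithm) and compute their percentage change from one fusion step to the next.

```r
# Minimal sketch (assumption: the screenshot's "coefficients" are the merge
# heights of the agglomeration schedule), using agnes from the cluster package.
library(cluster)

agn <- agnes(USArrests, method = "ward")   # example data shipped with R
agn$ac                                     # the single whole-dendrogram AC

heights <- as.hclust(agn)$height           # one merge height per fusion step
n <- nrow(USArrests)                       # merge step i leaves n - i clusters

# Percentage change in the coefficient from each fusion step to the next;
# a large jump suggests stopping before that merge.
pct_change <- 100 * diff(heights) / heights[-length(heights)]
tail(data.frame(clusters_left = n - seq_along(pct_change),
                pct_change_to_next_level = pct_change), 10)
```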

Document similarity and self-plagiarism

I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated the cosine similarity of each author's texts with one another. For example, author x has 100 texts, so I end up with a 100 x 100 similarity matrix; author y has 50 texts, so I end up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I average the columns or rows and then average that vector of means, I arrive at a single number, so I can compare these two means of means, but I am not sure whether this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then some summary of the distribution of document similarities within each author is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution with a mean. To capture the spread, I would also report the standard deviation of these similarities.
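A minimal sketch of this (the similarity matrix below is simulated, standing in for the matrix you already computed with quanteda): summarise the off-diagonal values of the within-author similarity matrix, then plot them as a density.

```r
# Summarise a within-author cosine similarity matrix by its off-diagonal
# distribution (the diagonal is each text compared with itself, always 1).
summarise_self_similarity <- function(sim) {
  s <- sim[upper.tri(sim)]
  list(mean = mean(s), sd = sd(s), quantiles = quantile(s, c(.25, .5, .75)))
}

# Fake 100 x 100 similarity matrix standing in for "author x".
set.seed(1)
sim_x <- matrix(runif(100 * 100, 0.2, 0.9), 100, 100)
sim_x[lower.tri(sim_x)] <- t(sim_x)[lower.tri(sim_x)]   # make it symmetric
diag(sim_x) <- 1

summarise_self_similarity(sim_x)
plot(density(sim_x[upper.tri(sim_x)]),
     main = "Within-author cosine similarities (author x)")
```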
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for the sort of text-reuse analysis you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between "kitten" and "sitting" is 3, but this says nothing in substantive terms about their semantic relationship or about one being an example of "re-use" of the other. An argument could be made that LD computed over words might show re-use, but that's not how most tools, e.g. http://turnitin.com, implement plagiarism detection.
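To make that Wikipedia example concrete, base R's adist() computes exactly this edit distance:

```r
# Generalized Levenshtein (edit) distance from base R: 3 single-character edits
# turn "kitten" into "sitting", which says nothing about semantics or re-use.
adist("kitten", "sitting")
```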

How to calculate False Accept and Reject Rates in pattern recognition

I am working on a vein pattern recognition project based on the SURF algorithm and Euclidean distance. I have completed my program to find the maximum and minimum distance between vein features, and it finds a match exactly when there is an identical image, i.e. the max and min distance between the two images is zero. In this case, how would I find my FAR and FRR? Will they be 0%, or am I missing a big concept here?
Even if there is a slight variation it wouldn't match, in which case I guess I need a threshold value to compare against. I have calculated the max and min distance between all combinations of images from the same hand and from different hands. In this case, how do I compute the FAR and FRR? This is my first biometrics project, and it would be helpful to be directed to any resource that could help me with this. Thank you.
Kindly help me out.
Usually, it's impossible to find a model in the training set that exactly matches a sample in the test set (a Euclidean distance of exactly zero essentially never happens in practice). So we use various similarity measures, such as, in your case, Euclidean distance.
We then need a threshold to decide whether a test sample and a given model in the training set belong to the same person. If the distance is larger than the threshold, we reject the hypothesis that they are the same person, and we accept it otherwise. This is what gives rise to false reject and false accept cases.
In your case, you may be able to avoid choosing a threshold, since every test sample is compared to all the training models; you can simply take the nearest training model as your recognition result.
To calculate FAR and FRR, you need the true labels of your test samples so that you can compare your recognition results against them.
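A minimal sketch of the threshold-based calculation (the genuine and impostor distances below are simulated, not taken from your project): FRR is the fraction of same-hand comparisons the threshold rejects, and FAR is the fraction of different-hand comparisons it accepts.

```r
# FAR and FRR for a distance-based matcher at a given threshold.
# genuine_d:  distances between image pairs from the same hand
# impostor_d: distances between image pairs from different hands
far_frr <- function(genuine_d, impostor_d, threshold) {
  frr <- mean(genuine_d  > threshold)   # genuine pairs wrongly rejected
  far <- mean(impostor_d <= threshold)  # impostor pairs wrongly accepted
  c(FAR = far, FRR = frr)
}

set.seed(2)
genuine_d  <- rnorm(200, mean = 0.3, sd = 0.10)  # simulated same-hand distances
impostor_d <- rnorm(200, mean = 0.8, sd = 0.15)  # simulated different-hand distances

# FAR/FRR trade-off at a few candidate thresholds.
sapply(c(0.3, 0.5, 0.7), function(t) far_frr(genuine_d, impostor_d, t))
```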

Interpreting negative LDA classifier scores

The post
Classification functions in linear discriminant analysis in R
from user Tyler provides a function to produce the classification functions (not discriminant functions!) from an LDA model generated with lda().
I used these classification functions to calculate all classification scores for my data. I want to use this additional information, e.g. to find out which was the second most probable class, and to understand the development across different time slices.
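For reference, here is a minimal sketch of how such scores arise, using the built-in iris data rather than my own and assuming the usual Fisher classification functions C_k(x) = mu_k' S^-1 x - 0.5 mu_k' S^-1 mu_k + log(pi_k), where S is the pooled within-class covariance:

```r
# Minimal sketch on iris (not my data); assumes the standard Fisher
# classification functions built from class means, pooled covariance and priors.
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4]); g <- iris$Species

means <- rowsum(X, g) / as.vector(table(g))          # class mean vectors
S <- Reduce(`+`, lapply(levels(g), function(k) {
  crossprod(scale(X[g == k, ], center = means[k, ], scale = FALSE))
})) / (nrow(X) - nlevels(g))                          # pooled within-class covariance
Sinv <- solve(S)

# Classification score C_k(x) for every observation and class; these live on an
# unnormalised linear scale, so negative or all-negative values can occur.
scores <- X %*% Sinv %*% t(means) -
  matrix(0.5 * diag(means %*% Sinv %*% t(means)) - log(fit$prior),
         nrow(X), nlevels(g), byrow = TRUE)
colnames(scores) <- levels(g)
head(round(scores, 1))

# The class with the highest score is the predicted class, matching lda().
table(manual = levels(g)[max.col(scores)], lda = predict(fit)$class)
```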
Now I would like to ask you for your help to interpret the following scenarios:
Scores close to or exactly zero (is it possible to claim that this particular class was effectively not recognized?)
A single negative score whose absolute value is higher than the highest positive value (does this mean anything at all?)
Results in which all scores are negative (in the original interpretation, the highest score determines the classification; is this intended by LDA, or does it mean that none of the classes is really a good fit, so that one could say no known pattern was identified?)
A single very low positive value while all the others are large negative values (can I argue that the "signal strength" is low in this case?)
I know this is more of a statistical than a programming problem. I think of it as a follow-up to the post linked at the beginning of this entry.
Thank you very much for your help!

Simple algorithm for online outlier detection of a generic time series

I am working with a large number of time series.
These time series are basically network measurements coming in every 10 minutes; some of them are periodic (e.g. bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average to remove some noise, but then what next? Simple things like the standard deviation, MAD, etc. computed against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample "value".
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest in this and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise-cancelling headphones. You have a filter that constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short-term model of the signal source, thereby reducing the mean squared error of the output. This gives you a low-level output signal (the residual error), except when you get an outlier, which will show up as a spike that is easy to detect (threshold). Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
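A minimal sketch of that idea (an illustrative LMS one-step-ahead predictor on a simulated series, not a production implementation): the residual of the adaptive predictor serves as the anomaly score, and it spikes at the injected outlier.

```r
# LMS adaptive predictor: the one-step-ahead residual is the anomaly score.
lms_outlier_scores <- function(x, order = 5, mu = 0.05) {
  w <- rep(0, order)                       # filter coefficients, adapted online
  scores <- rep(NA_real_, length(x))
  for (t in (order + 1):length(x)) {
    u <- x[(t - 1):(t - order)]            # most recent "order" samples
    pred <- sum(w * u)                     # one-step-ahead prediction
    e <- x[t] - pred                       # residual: large => possible outlier
    w <- w + mu * e * u                    # LMS coefficient update
    scores[t] <- abs(e)
  }
  scores
}

set.seed(3)
x <- sin(seq(0, 20, by = 0.1)) + rnorm(201, sd = 0.05)  # simulated periodic series
x[150] <- x[150] + 3                                    # inject an outlier
scores <- lms_outlier_scores(x)
head(order(scores, decreasing = TRUE), 3)  # largest residuals; index 150 should rank on top
```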
I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample of which you want to know its "outlierness":
Retrieve the means, correlation matrix and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
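A minimal sketch of the scheme above (on simulated multivariate samples; it standardises the attributes so that the correlation matrix plays the role of the covariance, as suggested in the PS):

```r
# Training: store means/scales, the correlation matrix, and the Mahalanobis
# distances of all training samples.
train_outlier_model <- function(X) {
  Xs <- scale(X)                                  # standardise each attribute
  R <- cor(Xs)
  d_train <- mahalanobis(Xs, center = rep(0, ncol(Xs)), cov = R)
  list(center = attr(Xs, "scaled:center"),
       scale  = attr(Xs, "scaled:scale"),
       R = R, d_train = d_train)
}

# "Outlierness" of a new sample: its percentile among the training distances.
outlier_score <- function(model, x) {
  z <- (x - model$center) / model$scale
  d <- mahalanobis(t(as.matrix(z)), center = rep(0, length(z)), cov = model$R)
  mean(model$d_train <= d)                        # close to 1 => extreme outlier
}

set.seed(4)
X <- matrix(rnorm(500 * 3), ncol = 3)             # simulated training samples
m <- train_outlier_model(X)
outlier_score(m, c(0.1, -0.2, 0.3))               # typical point: low score
outlier_score(m, c(6, 6, 6))                      # extreme point: near 1
```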
