How to calculate False Accept and Reject Rates in pattern recognition

How to calculate False Accept and Reject Rates in pattern recognition - pattern-recognition

I am working on a vein pattern recognition project based on SURF algorithm and euclidean distance. I have completed my program to find the maximum and minimum distance between vein features and find a match exactly when there is an identical image. i.e max and min distance between two images is zero. In this case, how would I find my FAR and FRR. Will it be 0% or am I missing a big concept here?
Even if there is a slight variation it wouldn't match in which case, I guess I need to have a threshold value to compare to. I have calculate the max and min distance between all combination of images with the same hand, with different hands. In this case, how do I computer the FAR and FRR. This is my first biometrics project and it would be helpful if I am directed to any resource that would help me in this. Thank you.
Kindly help me out.

Usually, it's impossible to find a model int the training set that exactly matches a sample in the testing set. (The situation that the euclidean distance is zero is impossible in the reality.) So we use many kinds of similarity measure methods, such as in your case, use euclidean distance.
And then we should have a threshold to determine whether a test sample and a certain model in training set is belong to the same person. If the distance is larger than the threshold, we reject the hypothesis that they are the same person and accept otherwise. So we have false reject and false accept cases.
In your case, maybe you can avoid choosing a threshold as every test sample is compare to all the training models. So you can just choose the nearest training model as your recognition result.
To calculate FAR and FRR, you should have the key labels to your test samples so that you can compare you recognition results with the keys.

Related

Different bandwidth specification in mean-shift clustering with different packages in R

I want to perform mean-shift clustering in R and found out that there are at least two packages that have this functionality: MeanShift and meanShiftR. As showed here the latter is much faster and as I tried out the first one and it took a long time to perform a clustering, I'm keen on choosing meanShiftR. However meanShiftR::meanShift function has rather uncommon way of bandwidth specification, see part of documentation:
queryData A matrix or vector of points to be classified by the mean
shift algorithm. Values must be finite and non-missing.
bandwidth A vector of length equal to the number of columns in the queryData matrix, or length one when queryData is a vector. This
value will be used in the kernel density estimate for steepest ascent
classification. The default is one for each dimension.
I'm not an expert in mean-shift clustering, but the only banwidth specifications I have found in the literature is that bandwidth is scalar or positive definite, symmetric matrix, not a vector. So is this the technical trick to represent the bandwidth and the value of bandwidth have to be the same for each dimension? Or maybe it can vary?
The other issue is that even setting the same value of bandwidth in meanShiftR package as in MeanShift::msClustering, but just replicated to match the number of columns, I've obtained totally different results, in particular much larger number of cluster. Also, the modes were rather very similar and not representative of the dataset. That made me wonder if this package works correct. Have someone even used meanShiftR? If so, maybe you could present any example as the documentation is not clear enough for me?

This isn't actually different.
One scalar per query point.

pop.size argument in GenMatch() respectively genoud()

I am using genetic matching in R using GenMatch in order to find comparable treatment and control groups to estimate a treatment effect. The default code for matching looks as follows:
GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
pop.size = 100, max.generations=100,...)
The description for the pop.size argument in the package is:
Population Size. This is the number of individuals genoud uses to
solve the optimization problem. The theorems proving that genetic
algorithms find good solutions are asymptotic in population size.
Therefore, it is important that this value not be small. See genoud
for more details.
Looking at gnoud the additional description is:
...There are several restrictions on what the value of this number can
be. No matter what population size the user requests, the number is
automatically adjusted to make certain that the relevant restrictions
are satisfied. These restrictions originate in what is required by
several of the operators. In particular, operators 6 (Simple
Crossover) and 8 (Heuristic Crossover) require an even number of
individuals to work on—i.e., they require two parents. Therefore, the
pop.size variable and the operators sets must be such that these three
operators have an even number of individuals to work with. If this
does not occur, the population size is automatically increased until
this constraint is satisfied.
I want to know how gnoud (resp. GenMatch) incorporates the population size argument. Does the algorithm randomly select n individuals from the population for the optimization?
I had a look at the package description and the source code, but did not find a clear answer.

The word "individuals" here does not refer to individuals in the sample (i.e., individual units your dataset), but rather to virtual individuals that the genetic algorithm uses. These individuals are individual draws of a set of the variables to be optimized. They are unrelated to your sample.
The goal of genetic matching is to choose a set of scaling factors (which the Matching documentation calls weights), one for each covariate, that weight the importance of that covariate in a scaled Euclidean distance match. I'm no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors, keeps the ones that "do the best" in the sense of optimizing the criterion (which is determined by fit.func in GenMatch()), and creates new guesses as slight perturbations of the kept guesses. It then repeats this process many times, simulating what natural selection does to optimize traits in living things. Each guess is what the word "individual" refers to in the description for pop.size, which corresponds to the number of guesses at each generation of the algorithm.
GenMatch() always uses your entire sample (unless you have provided a restriction like a caliper, exact matching requirement, or common support rule); it does not sample units from your sample to form each guess (which is what bagging in is other machine learning contexts).
Results will change over many runs because the genetic algorithm itself is a stochastic process. It may converge to a solution asymptotically, but because it is optimizing over a lumpy surface, it will find different solutions each time in finite samples with finite generations and a finite population size (i.e., pop.size).

Testing CSR on lpp with R

I have recently posted a "very newie to R" question about the correct way of doing this, if you are interested in it you can find it [here].1
I have now managed to develop a simple R script that does the job, but now the results are what troubles me.
Long story short I'm using R to analyze lpp (Linear Point Pattern) with mad.test.That function performs an hypothesis test where the null hypothesis is that the points are randomly distributed. Currently I have 88 lpps to analyze, and according to the p.value 86 of them are randomly distributed and 2 of them are not.
These are the two not randomly distributed lpps.
Looking at them you can see some kind of clusters in the first one, but the second one only has three points, and seems to me that there is no way one can assure only three points are not corresponding to a random distribution. There are other tracks with one, two, three points but they all fall into the "random" lpps category, so I don't know why this one is different.
So here is the question: how many points are too little points for CSR testing?
I have also noticed that these two lpps have a much lower $statistic$rank than the others. I have tried to find what that means but I'm clueless righ now, so here is another newie question: Is the $statistic$rank some kind of quality analysis indicator, and thus can I use it to group my lpp analysis into "significant ones" and "too little points" ones?
My R script and all the shp files can be downloaded from here(850 Kb).
Thank you so much for your help.

It is impossible to give an universal answer to the question about how many points is needed for an analysis. Usually 0, 1 and 2 are too few for a standalone analysis. However, if they are part of repeated measurements of the same thing they might be interesting still. Also, I would normally say that your example with 3 points is too few to say anything interesting. However, an extreme example would be if you have a single long line segment where one point occurs close to one end and two other occur close to each other at the other end. This is not so likely to happen for CSR and you may be inclined to not believe that hypothesis. This appears to be what happened in your case.
Regarding your question about the rank you might want to read a bit more up on the Monte Carlo test you are preforming. Basically, you summarise the point pattern by a single number (maximum absolute deviation of linear K) and then you look at how extreme this number is compared to numbers generated at random from CSR. Assuming you use 99 simulations of CSR you have 100 numbers in total. If your data ranks as the most extreme ($statistic$rank==1) among these it has p-value 1%. If it ranks as the 50th number the p-value is 50%. If you used another number of simulations you have to calculate accordingly. I.e. with 199 simulations rank 1 is 0.5%, rank 2 is 1%, etc.

There is a fundamental problem here with multiple testing. You are applying a hypothesis test 88 times. The test is (by default) designed to give a false positive in 5 percent (1 in 20) of applications, so if the null hypothesis is true, you should expect 88 /20 = 4.4 false positives to have occurred your 88 tests. So getting only 2 positive results ("non-random") is entirely consistent with the null hypothesis that ALL of the patterns are random. My conclusion is that the patterns are random.

What is the meaning of "Inf" in S_Dbw output in R commander?

I have ran clv package which consists of S_Dbw and SD validity indexes for clustering purposes in R commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from DBSCAN, K-Means, Kohonen algorithms with S_Dbw index. but for all these three algorithms S_Dbw is "Inf".
Is it "Infinite" meaning? Why did i confront with "Inf". Is there any problem in my clustering results?
In general, when is S_Dbw index result "Inf"?

Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering will find the partition with the optimal minimum distance between partitions. Well, DBSCAN will find the partitioning of the dataset, where all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously. Do not assume that because one algorithm scores higher than another in a particular measure means that the algorithm works better. All that you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure for comparing different results of the same algorithm is different. Then obviously there shouldn't be a benefit from one algorithm over itself. There might still be a similar effect with respect to parameters. For example the in-cluster distances in k-means obviously should go down when you increase k.
In fact, many of the measures are not even well-defined on DBSCAN results. Because DBSCAN has the concept of noise points, which the indexes do not AFAIK.
Do not assume that the measure will either give you an indication of what is "true" or "correct". And even less, what is useful or new. Because you should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data. Which probably is not some measure number.
Back to the indices. They usually are totally designed around k-means. From a short look at S_Dbw I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value will become infinity - aka: undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this, without diverting from the original index and instead turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfect, or every clustering can trivially be improved by putting each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster" then I do not know any appropriate index.

The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.

Simple algorithm for online outlier detection of a generic time series

I am working with a large amount of time series.
These time series are basically network measurements coming every 10 minutes, and some of them are periodic (i.e. the bandwidth), while some other aren't (i.e. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average in order to remove some noise, but then what next? Simple things like standard deviation, mad, ... against the whole data set doesn't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of double containing the historical data, and the return value is the anomaly score for the new sample "value" .

This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest in this and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise cancelling headphones, etc. You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short term model of the signal source, thereby reducing mean square error output. This then gives you a low level output signal (the residual error) except for when you get an outlier, which will result in a spike, which will be easy to detect (threshold). Read up on adaptive filtering, LMS filters, etc, if you're serious about this kind of technique.

I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample of which you want to know its "outlierness":
Retrieve the means, covariance matrix and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex