I have two heavily unbalanced datasets labelled as positive and negative, and I am able to generate a confusion matrix that yields a ~95% true positive rate (and inherently a 5% false negative rate) and a ~99.5% true negative rate (0.5% false positive rate).
The problem when I try to build an ROC graph is that the x-axis does not range from 0 to 1 in intervals of 0.1. Instead, it ranges from 0 to something like 0.04, given my very low false positive rate.
Any insight as to why this happens?
Thanks
In an ROC graph, the two axes are the rate of false positives (F) and the rate of true positives (T). T is the probability that, given a positive data item, your algorithm classifies it as positive. F is the probability that, given a negative data item, your algorithm incorrectly classifies it as positive. The axes always run from 0 to 1, and if your algorithm is not parametric you should end up with a single point (or two, one for each dataset) on the ROC graph instead of a curve. You get a curve if your algorithm is parametric; the curve is then induced by different values of the parameter(s).
See http://www2.cs.uregina.ca/~dbd/cs831/notes/ROC/ROC.html
I have figured it out. I used Platt's algorithm to extract the probability of a positive classification and sorted the dataset, highest probability first. I then iterated through the dataset: any positive example (a real positive, not one merely classified as positive) increments the true-positive count, while any negative example (a real negative) increments the false-positive count.
Think of it as the SVM's decision boundary, which separates the two classes (+ve and -ve), moving gradually from one side of the data to the other (here I'm imagining points on a 2D plane). As the boundary moves, it uncovers examples: any uncovered examples whose true label is positive are true positives, any negatives are false positives.
Hope this helps. It took me days to figure out something so trivial due to the lack of information on the net (or just my lack of understanding of SVMs in general). This is especially aimed at those who are using CvSVM in the OpenCV package. As you might be aware, CvSVM does not return probability values; instead, it returns a value based on the distance function. You do not need to use Platt's algorithm to extract an ROC curve based on probabilities; you could use the distance values themselves. For example, start the threshold at a distance of 10 and decrement it slowly until you have covered the whole dataset. I found probabilities easier to visualise, so to each his own.
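For anyone who wants to see the sweep spelled out, here is a minimal sketch in Python/NumPy rather than OpenCV (the scores and labels are hypothetical; any per-example score, whether a Platt probability or a raw SVM distance, works the same way):

import numpy as np

def roc_points(scores, labels):
    # Sort examples by score, highest first, and sweep the threshold down.
    # labels: 1 for truly positive examples, 0 for truly negative ones.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels == 1)          # true positives uncovered so far
    fp = np.cumsum(labels == 0)          # false positives uncovered so far
    tpr = tp / max(tp[-1], 1)            # normalise by total positives
    fpr = fp / max(fp[-1], 1)            # normalise by total negatives
    return fpr, tpr

# hypothetical scores (Platt probabilities or SVM decision values) and true labels
scores = [0.95, 0.90, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr, tpr)))

Plotting fpr against tpr gives the curve, and both axes then naturally run from 0 to 1.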
Please mind my English, as it's not my first language.
I am trying to learn some facial landmark detection models, and I notice that many of them use NME (Normalized Mean Error) as a performance metric.
The formula is straightforward: it calculates the L2 distance between the ground-truth points and the model's predictions, then divides it by a normalization factor, which varies from dataset to dataset.
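In symbols (a standard way to write the metric just described, assuming N landmarks per image, ground-truth points p_i, predictions \hat{p}_i, and a dataset-dependent normalization factor d such as the inter-ocular distance):

\mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lVert p_i - \hat{p}_i \rVert_2}{d}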
However, when adopting this formula for a landmark detector that someone else developed, I have to deal with a non-trivial situation: some detectors may not be able to generate the full number of landmarks for some input images (perhaps because of NMS, inherent model problems, image quality, etc.). Thus some of the ground-truth points might not have a corresponding point in the prediction result.
So how should I solve this problem? Should I just add such missing points to a "failure result set" and use the failure rate (FR) to measure the model, ignoring them when doing the NME calculation?
If the output of your neural network is, for example, a 10x1 vector, those are your points, [x1,y1,x2,y2,...,x5,y5]. This vector has a fixed length because of the number of neurons in your model.
If you have missing points (say you only see 4 of the 5 points), it is because some points fall outside the image width and height, or have negative coordinates like [-0.1, -0.2, 0.5, 0.7, ...]; the first two values there correspond to a point you cannot see in the image, so it looks missing, but it is still in the vector and you can still calculate the NME.
In some custom neural nets this can be handled by treating missing values as maximum-error points.
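If a detector really does return nothing for some landmarks, one possible convention, close to what the question proposes, is to mask those points out of the NME and report them separately as a failure rate. A minimal sketch (the arrays, function name and normalization factor are all hypothetical):

import numpy as np

def nme(pred, gt, norm_factor, valid=None):
    # Normalized Mean Error over the landmarks flagged as valid.
    # pred, gt: (N, 2) arrays of predicted / ground-truth points.
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if valid is None:
        valid = np.ones(len(gt), dtype=bool)
    errors = np.linalg.norm(pred[valid] - gt[valid], axis=1)  # per-point L2 error
    return errors.mean() / norm_factor

# hypothetical image with 5 landmarks, one of which the detector failed to produce
gt    = np.array([[10, 10], [20, 10], [15, 15], [12, 20], [18, 20]])
pred  = np.array([[11, 10], [19, 11], [15, 16], [ 0,  0], [18, 21]])
valid = np.array([True, True, True, False, True])
failure_rate = 1 - valid.mean()
print(nme(pred, gt, norm_factor=10.0, valid=valid), failure_rate)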
Suppose I have formatted the classification results of a model as follows:
actual.class score.actual.class
A 1
A 1
A 0.6
A 0.1
B 0.5
B 0.3
. .
. .
1- If I understand correctly, the ROC curve plots the trade-off between true positives and false positives. Does this imply that I need to vary the score threshold for just one class (the true class) and not both? I mean, if I pick A to be the true class here, would I then use only subset(results, actual.class == "A") to plot the ROC curve?
2- If I wanted to generate the curve manually (without libraries), would the thresholds be each possible score from that subset?
3- Are the following points generated correctly from the above data for the purposes of plotting the ROC curve? (I'm using class A as the true class.)
threshold fpr tpr
1 1 0
0.6 1/2 1/2
0.1 1/4 3/4
0 0 1
Are these the points that are going to form my ROC?
"This implies that I need to be varying the score threshold for just
one class(the true class) and not both, right?"
There seems to be a misunderstanding, since there is no such thing as a separate threshold for the positive or the negative class. ROC curves are used in the context of evaluating binary classification algorithms. In such algorithms, elements that are not assigned to one class (TRUE) are automatically assigned to the other (FALSE).
The choice of the threshold only shifts the balance, so that more observations are assigned to one class rather than the other. This variation of the threshold is the parameter that allows you to draw an ROC curve; otherwise it would be just one point.
Concerning your third point: yes, as far as I can tell from your example, this kind of data is what typically constitutes an ROC curve.
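To make the manual construction in point 2 concrete, here is a minimal sketch (Python rather than R, with hypothetical probabilities for the positive class A; every distinct score is used as a candidate threshold):

import numpy as np

# hypothetical scores: predicted probability of the positive class A
labels = np.array(["A", "A", "A", "A", "B", "B"])
scores = np.array([1.0, 1.0, 0.6, 0.1, 0.5, 0.3])
y = (labels == "A").astype(int)

for t in sorted(set(scores), reverse=True):          # each distinct score is a threshold
    pred = scores >= t                                # classify as A at or above the threshold
    tpr = (pred & (y == 1)).sum() / (y == 1).sum()    # true positive rate
    fpr = (pred & (y == 0)).sum() / (y == 0).sum()    # false positive rate
    print(t, fpr, tpr)

Both classes contribute at every threshold: the positives determine the TPR and the negatives the FPR, which is exactly why there is no separate threshold per class.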
Please refer to this image:
I believe it was generated using R or SAS or something similar. I want to make sure I understand what it is depicting and be able to recreate it from scratch.
I understand the left-hand side, the ROC curve, and I have generated my own using my probit model at varying thresholds.
What I do not understand is the right-hand graph. What is meant by the 'cost' function? What are the units? I assume the x-axis, labelled 'threshold', is the success cutoff threshold I used in the ROC. My only guess is that the y-axis is the sum of squared residuals, but if that's the case, I'd have to get the residuals after each iteration of the threshold?
Please explain what the axes are and how one goes about computing them.
--Edit--
For clarity, I don't need a proof or a line of code. Because I use different statistical software, it's much more useful to have someone explain conceptually (with minimal jargon) how to compute the y-axis. That way I can write it in my software's language.
Thank you
I will try to make this as clear as possible. The term cost function can be used in multiple contexts and can have multiple meanings. Usually, when we use the term in the context of a regression model, we naturally think of minimizing the sum of the squared residuals.
However, that is not the case here: we are still interested in minimizing the function, but it is not minimized inside a fitting algorithm the way the sum of squared residuals is. Let me elaborate on what the second graph means.
As #oshun correctly mentioned, the author of the R-bloggers post (where these graphs came from) wanted a measure (i.e. a number) to compare the "mistakes" of the classification at different threshold values. To create such a measure he did something very intuitive and simple: he counted the false positives and false negatives at different levels of the threshold. The function he used is:
sum(df$pred >= threshold & df$survived == 0) * cost_of_fp + #false positives
sum(df$pred < threshold & df$survived == 1) * cost_of_fn #false negatives
I deliberately split the above into two lines. The first line counts the false positives (prediction >= threshold means the algorithm classified the passenger as survived, but in reality they didn't, i.e. survived equals 0). The second line does the same thing but counts the false negatives (i.e. those that were predicted as not survived but in reality did survive).
Now that leaves us with what cost_of_fp and cost_of_fn are. These are nothing more than weights, and they are set arbitrarily by the user. In the example above the author used cost_of_fp = 1 and cost_of_fn = 3, which just means that, as far as the cost function is concerned, a false negative is 3 times more important than a false positive. So in the cost function every false negative simply counts 3 times in the weighted sum of false positives and false negatives, which is the value the function returns.
To sum up, the y-axis in the graph above is just:
false_positives * weight_fp + false_negatives * weight_fn
for every value of the threshold (which is used to calculate the false_positives and false_negatives).
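As a concrete illustration, here is a minimal sketch in Python rather than R (the predicted probabilities and 0/1 survival labels are hypothetical; the weights match the 1 and 3 used above):

import numpy as np

# hypothetical predicted probabilities and true 0/1 survival labels
pred     = np.array([0.9, 0.8, 0.65, 0.4, 0.3, 0.2, 0.1])
survived = np.array([1,   1,   0,    1,   0,   0,   1  ])
cost_of_fp, cost_of_fn = 1, 3                      # arbitrary user-chosen weights

thresholds = np.linspace(0, 1, 101)
costs = [((pred >= t) & (survived == 0)).sum() * cost_of_fp    # false positives
         + ((pred < t) & (survived == 1)).sum() * cost_of_fn   # false negatives
         for t in thresholds]
# plotting costs against thresholds reproduces the right-hand graph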
I hope this is clear now.
I want to cluster some curves that contain daily click rates.
The dataset is click-rate data as time series, for example:
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure the similarity of two curves when using k-means.
Is there any paper or library for this purpose?
For similarity, you could use any kind of time-series distance. Many of these will also perform alignment of sequences of different lengths.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It does not actually use a distance for assignment, but the least sum of squares (which happens to be the squared Euclidean distance), a.k.a. variance.
The mean must be consistent with this objective, and it is not hard to see that the mean also minimizes the sum of squares. This guarantees the convergence of k-means: in every single step (both assignment and mean update) the objective is reduced, so it must converge after a finite number of steps (as there are only finitely many discrete assignments).
But what is the mean of multiple time series of different length?
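For the distance part, here is a minimal dynamic time warping (DTW) sketch, one common time-series distance that copes with sequences of different lengths (the function and the example curves are my own):

import numpy as np

def dtw(a, b):
    # Dynamic time warping distance between two 1-D sequences of possibly different length.
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# hypothetical daily click-rate curves of different lengths
print(dtw([0.10, 0.22, 0.344, 0.30], [0.10, 0.22, 0.344]))

A precomputed distance like this fits naturally with methods that accept arbitrary distances, such as hierarchical clustering or k-medoids, which sidesteps the question of what the "mean" of variable-length series would be.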
Possible Duplicate:
how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements?
The user wants to impose a unique, non-trivial upper/lower bound on the correlation between every pair of variables in a variance/covariance matrix.
For example: I want a variance matrix in which all pairs of variables satisfy 0.9 > |rho(x_i,x_j)| > 0.6, where rho(x_i,x_j) is the correlation between variables x_i and x_j.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, (in MATLAB) here is a simple positive definite 3x3 matrix. As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the statistics toolbox (mvnrnd should do the trick). Just as easy is to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
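A minimal sketch of such a rejection scheme, in Python/NumPy rather than MATLAB (the bounds, sample size and retry limit are hypothetical, and it can only ever succeed if the population correlations themselves already lie in the target range):

import numpy as np

rng = np.random.default_rng(0)

def sample_with_corr_in_range(cov, n, lo=0.6, hi=0.9, max_tries=10000):
    # Redraw n multivariate-normal samples until every off-diagonal
    # sample correlation has absolute value strictly between lo and hi.
    L = np.linalg.cholesky(cov)                      # cov = L @ L.T
    p = cov.shape[0]
    for _ in range(max_tries):
        X = rng.standard_normal((n, p)) @ L.T        # samples with covariance cov
        C = np.corrcoef(X, rowvar=False)
        off = np.abs(C[~np.eye(p, dtype=bool)])
        if np.all((off > lo) & (off < hi)):
            return X
    raise RuntimeError("no sample satisfied the constraints")

# hypothetical usage: X = sample_with_corr_in_range(np.array([[1.0, 0.7], [0.7, 1.0]]), n=200)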
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme above to generate a random sample. Compute the correlations. If they fail to lie in the proper ranges, identify the perturbation one would need to make to the actual (measured) covariance matrix of your data so that the correlations would be as desired. Then find a zero-mean random perturbation to your sampled data that moves the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem with quadratic constraints: find the smallest perturbation to a matrix X such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix), I think one of the most affordable approaches could be to use Sylvester's criterion.
You can start with a trivial 1x1 random matrix with a positive determinant and expand it by one row and one column at a time, step by step, while ensuring that the new matrix also has a positive determinant (how to achieve that is up to you ^_^).
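One naive way to fill in the "up to you" part is rejection at each growth step. This is only a sketch of the idea, not a complete method; the target range is hypothetical and, for simplicity, only positive correlations are proposed:

import numpy as np

rng = np.random.default_rng(0)

def grow_correlation_matrix(p, lo=0.6, hi=0.9, max_tries=100000):
    # Build a p x p correlation matrix one row/column at a time, keeping every
    # leading principal minor positive (Sylvester's criterion => positive definite).
    C = np.array([[1.0]])
    while C.shape[0] < p:
        k = C.shape[0]
        for _ in range(max_tries):
            r = rng.uniform(lo, hi, size=k)          # proposed new off-diagonal entries
            candidate = np.block([[C, r[:, None]],
                                  [r[None, :], np.array([[1.0]])]])
            if np.linalg.det(candidate) > 0:         # earlier minors are already positive
                C = candidate
                break
        else:
            raise RuntimeError("could not extend; try a narrower range or smaller p")
    return C

print(grow_correlation_matrix(4))

Note that the acceptance probability drops quickly as the matrix grows, which is consistent with the difficulties discussed later in the thread.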
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
Yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty.
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample-size issue, because with p>2 you do not even observe convergence to the right range for the correlations as the sample size grows: I tried the technique you suggest before posting here, and it is obviously flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option; for large p (say, larger than 10) this option is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, I understand the constraints, but I'm not sure about the way you define the objective function; by using the "smallest perturbation" of some initial matrix, you will always end up getting the same (solution) matrix: all the off-diagonal entries will be exactly equal to one of the two bounds (i.e. not pseudo-random). Plus, it is kind of overkill, isn't it?
Come on people, there must be something simpler.