Suppose I have formatted the classification results of a model as follows:
actual.class score.actual.class
A 1
A 1
A 0.6
A 0.1
B 0.5
B 0.3
. .
. .
1. If I understand correctly, the ROC curve plots the trade-off between true positives and false positives. Does this imply that I need to vary the score threshold for just one class (the true class) and not both? I mean, if I pick A as the true class here, would I use only subset(results, actual.class == "A") to plot the ROC curve?
2. If I wanted to generate the curve manually (without libraries), would the thresholds be each possible score from that subset?
3. Are the following points generated correctly from the above data for the purpose of plotting the ROC curve? (I'm using class A as the true class.)
threshold fpr tpr
1 1 0
0.6 1/2 1/2
0.1 1/4 3/4
0 0 1
Are these the points that are going to form my ROC?
"This implies that I need to be varying the score threshold for just
one class(the true class) and not both, right?"
There seems to be a misunderstanding: there is no such thing as a separate threshold for the positive or the negative class. ROC curves are used to evaluate binary classification algorithms. In such algorithms, elements that are not assigned to one class (TRUE) are automatically assigned to the other (FALSE).
The choice of threshold only shifts the balance, so that more observations are assigned to one class rather than the other. Varying this threshold is what allows you to draw an ROC curve; otherwise you would have just a single point.
Concerning your third point: Yes, as far as I can tell from your example I would say that this kind of data is what typically constitutes an ROC curve.
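For the manual construction asked about in point 2, here is a minimal R sketch (not the only way to do it). It assumes you have, for every observation, the score assigned to the positive class A and a flag saying whether the observation really is A; if your scores are stored as "score of the actual class" as in the question, the class-B rows would first need converting (e.g. 1 - score, if the two class scores are complementary probabilities). The vectors below are made up for illustration:

# Hypothetical data: score for class A per observation, and the true membership
score_pos <- c(1, 1, 0.6, 0.1, 0.5, 0.7)
is_pos    <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

# Candidate thresholds: every distinct score, plus -Inf so the point (1, 1) appears
thresholds <- sort(unique(c(score_pos, -Inf)), decreasing = TRUE)

roc <- t(sapply(thresholds, function(th) {
  pred_pos <- score_pos >= th                       # classified as A at this threshold
  c(threshold = th,
    fpr = sum(pred_pos & !is_pos) / sum(!is_pos),   # false positive rate
    tpr = sum(pred_pos &  is_pos) / sum(is_pos))    # true positive rate
}))

roc
plot(roc[, "fpr"], roc[, "tpr"], type = "b", xlab = "FPR", ylab = "TPR")

Note that both classes are needed: the TPR comes from the A rows and the FPR from the non-A rows, so you cannot build the curve from the A subset alone.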
I need to build some models in R and am having trouble with some of my predictors. They are distributed between 0 and 1 and give the percentage of land-cover types, e.g. 0.3 means 30% of the area is covered by forest.
Here are a histogram and a density plot of one of them:
[histogram and density plot of the predictor]
I want to transform these predictors towards a uniform distribution in R (it does not have to be perfect). I don't know which transformation to use, since many data points lie close to the minimum and the maximum.
Any help is appreciated, thanks!
It's not clear to me why you need to do this - most statistical methods don't make demands about the distribution of the predictor variables - but
rank(x)/(length(x)+1)
will give you a new variable that's uniformly distributed between 0 and 1 (and is never exactly 0 or 1)
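For example, a minimal sketch with a made-up, U-shaped predictor of the kind described in the question:

set.seed(1)
x <- rbeta(500, 0.3, 0.3)            # many points near 0 and near 1

x_unif <- rank(x) / (length(x) + 1)  # approximately uniform on (0, 1)

hist(x_unif, breaks = 20)            # should look roughly flat

If the variable has many tied values (e.g. exact 0s or 1s), rank() averages the ties and the result will be correspondingly less uniform.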
I've trained a model to predict a certain variable. When I now use this model to predict that value and compare these predictions to the actual values, I get the following two distributions.
The corresponding R Data Frame looks as follows:
x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted
These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.
Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.
I've tried just plotting the difference predicted - actual, and also the difference relative to the mean of the respective group. However, neither approach has produced the desired result. With the plotted distribution, it is especially important to be able to make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?
Let's call the two distributions df_actual and df_predicted, then calculate:
# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)
Then find the relative % difference by:
x_diff <- mean((df_diff$x - df_actual$x) / df_actual$x) * 100
y_diff <- mean((df_diff$y - df_actual$y) / df_actual$y) * 100
This will give you the % difference, positive or negative, in x as well as y. This is my opinion; also see this thread for displaying and measuring the area between two distribution curves.
I hope this helps.
ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:
x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...
you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
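A minimal sketch of that pairing approach, using the small example above (with explicit merge suffixes instead of the default .x/.y; with only three pairs the quantiles are purely illustrative):

df <- data.frame(
  x    = c(0, 1, 2, 0, 1, 2),
  y    = c(10.9, 15.7, 25.3, 10, 17, 23),
  type = c("actual", "actual", "actual", "predicted", "predicted", "predicted")
)

# Pair the actual and predicted values that belong to the same x
paired <- merge(df[df$type == "actual", ],
                df[df$type == "predicted", ],
                by = "x", suffixes = c(".actual", ".predicted"))

# Relative difference of each prediction from its actual value, in percent
rel_diff <- 100 * (paired$y.predicted - paired$y.actual) / paired$y.actual

plot(density(rel_diff), main = "Relative prediction error (%)")

# "50% of the predictions are within -X% and +Y% of the actual values"
quantile(rel_diff, probs = c(0.25, 0.75))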
To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test
Please refer to this image:
I believe it was generated using R or SAS or something similar. I want to make sure I understand what it is depicting and recreate it from scratch.
I understand the left-hand side, the ROC curve, and I have generated my own using my probit model at varying thresholds.
What I do not understand is the right-hand graph. What is meant by the 'cost' function? What are its units? I assume the x-axis labeled 'threshold' is the success cutoff threshold I used for the ROC. My only guess is that the y-axis is the sum of squared residuals, but if that's the case, wouldn't I have to recompute the residuals at each threshold?
Please explain what the axes are and how one goes about computing them.
--Edit--
For clarity, I don't need a proof or a line of code. Because I use a different statistical software package, it's much more useful to have someone explain conceptually (with minimal jargon) how to compute the y-axis. That way I can write it in terms of my software's language.
Thank you
I will try to make this as clear as possible. The term cost function can be used in multiple cases and it can have multiple meanings. Usually, when we use the term in the context of a regression model, it is natural that we think of minimizing the sum of the squared residuals.
However, this is not the case here (we are still interested in minimizing the function, but that function is not minimized inside a fitting algorithm the way the sum of squared residuals is). Let me elaborate on what the second graph means.
As @oshun correctly mentioned, the author of the R-blogger post (where these graphs came from) wanted a measure (i.e. a number) to compare the "mistakes" of the classification at different threshold values. To create that measure he did something very intuitive and simple: he counted the false positives and false negatives at different threshold levels. The function he used is:
sum(df$pred >= threshold & df$survived == 0) * cost_of_fp + #false positives
sum(df$pred < threshold & df$survived == 1) * cost_of_fn #false negatives
I deliberately split the above into two lines. The first line counts the false positives (prediction >= threshold means the algorithm classified the passenger as survived, but in reality they didn't, i.e. survived equals 0). The second line does the same thing but counts the false negatives (i.e. those that were predicted as not survived but in reality did survive).
That leaves us with what cost_of_fp and cost_of_fn are. These are nothing more than weights, set arbitrarily by the user. In the example above the author used cost_of_fp = 1 and cost_of_fn = 3, which simply means that as far as the cost function is concerned a false negative is three times as important as a false positive. So in the cost function each false negative is multiplied by 3 before it is added to the count of false positives, and that weighted sum is the value of the cost function.
To sum up, the y-axis in the graph above is just:
false_positives * weight_fp + false_negatives * weight_fn
for every value of the threshold (which is used to calculate the false_positives and false_negatives).
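Putting that together, here is a minimal sketch of how such a cost curve could be computed; df, pred, survived and the weights 1 and 3 follow the snippet above, while the simulated data frame is just a stand-in so the code runs on its own:

cost_of_fp <- 1   # weight of a false positive
cost_of_fn <- 3   # weight of a false negative

# Stand-in data in the same shape as the author's: a prediction and a 0/1 outcome
set.seed(1)
df <- data.frame(pred = runif(200), survived = rbinom(200, 1, 0.4))

cost_at <- function(threshold, df) {
  sum(df$pred >= threshold & df$survived == 0) * cost_of_fp +  # false positives
  sum(df$pred <  threshold & df$survived == 1) * cost_of_fn    # false negatives
}

thresholds <- seq(0, 1, by = 0.01)
costs <- sapply(thresholds, cost_at, df = df)

plot(thresholds, costs, type = "l", xlab = "threshold", ylab = "cost")
thresholds[which.min(costs)]   # threshold with the lowest weighted cost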
I hope this is clear now.
I'm trying to calculate the AUC for a largish data set and am having trouble finding a package that both handles values that aren't just 0s or 1s and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0s and 1s; the pROC package will give me an answer, but it can take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1, but are not necessarily exactly 0 or 1.
EDIT: both the actual values and the predictions fall between 0 and 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
So far I've tried the ROCR package, but it only handles 0's and 1's
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR; have a look at its example data set ROCR.simple.
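For instance, a quick check with ROCR's own example data (ROCR.simple$predictions is continuous in [0, 1]; only the labels are 0/1):

library(ROCR)
data(ROCR.simple)

pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred, "tpr", "fpr")   # ROC curve
plot(perf)

performance(pred, "auc")@y.values[[1]]    # AUC as a single number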
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you come up with an optimized algorithm (as the ROCR developers did), it will probably take a long time, too. In that case you'll also have to think about what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that aggregating the reference (actual) values loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values really came from actual = 5/10 and prediction = 8/10.
By summarizing the 10 tests into two numbers, you lose the information about whether the same 5 out of the 10 were meant or not. Without it, actual = 5/10 and prediction = 8/10 is consistent with anything between 30% and 70% correct recognition (the overlap between the 5 actual positives and the 8 predicted positives can be anywhere from 3 to 5, giving between (3 + 0)/10 and (5 + 2)/10 correct classifications)!
Here's an illustration where the sensitivity is discussed (i.e. correct recognition e.g. of click-through):
You can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of mean absolute, mean squared, root mean squared etc. errors can be used as well.
However, all those different ways to express the same performance characteristic of the model (e.g. sensitivity = % correct recognitions of actual click-through events) have different meanings, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they will react differently to ambiguous reference / partial reference class membership.
Note also that, since you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance (e.g. mean absolute error, root mean square error)?
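For example, computed directly on the Ex.3 values above (a minimal sketch):

actual    <- c(0.20, 0.60, 0.98, 0.05, 0.72)
predicted <- c(0.25, 0.10, 0.90, 0.01, 0.88)

mae  <- mean(abs(predicted - actual))        # mean absolute error
rmse <- sqrt(mean((predicted - actual)^2))   # root mean square error

c(MAE = mae, RMSE = rmse)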
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html
I have two heavily unbalanced datasets which are labelled as positive and negative, and I am able to generate a confusion matrix which yields a ~95% true positive rate (and inherently a 5% false negative rate) and a ~99.5% true negative rate (0.5% false positive rate).
The problem when I try to build an ROC graph is that the x-axis does not range from 0 to 1 with intervals of 0.1. Instead, it ranges from 0 to something like 0.04, given my very low false positive rate.
Any insight as to why this happens?
Thanks
In an ROC graph, the two axes are the rate of false positives (F) and the rate of true positives (T). T is the probability that, given a positive data item, your algorithm classifies it as positive. F is the probability that, given a negative data item, your algorithm incorrectly classifies it as positive. The axes always run from 0 to 1, and if your algorithm is not parametric you should end up with a single point (or two, for the two datasets) on the ROC graph instead of a curve. You get a curve if your algorithm is parametric; the curve is then induced by different values of the parameter(s).
See http://www2.cs.uregina.ca/~dbd/cs831/notes/ROC/ROC.html
I have figured it out. I used Platt's algorithm to extract the probability of a positive classification and sorted the dataset with the highest probability first. I then iterated through the dataset: any positive example (a real positive, not merely one classified as positive) increments the true-positive count, while any negative example (a real negative, not merely one classified as negative) increments the false-positive count.
Think of it as the SVM's separating boundary between the two classes (+ve and -ve) moving gradually from one side of the data to the other. Here I'm imagining points on a 2D plane. As the boundary moves, it uncovers examples: any examples which are labelled positive are true positives, any negatives are false positives.
Hope this helps. It took me days to figure out something so trivial due to the lack of information on the net (or just my lack of understanding of SVMs in general). This is especially aimed at those who are using CvSVM in the OpenCV package. As you might be aware, CvSVM does not return probability values; instead, it returns a value based on the distance function. You do not need to use Platt's algorithm to extract an ROC curve based on probabilities; you can use the distance values themselves. For example, you could start the threshold at a distance of 10 and decrement it slowly until you've covered the entire dataset. I found probabilities easier to visualise, so to each his own.
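In R terms, that sweep could look like the following minimal sketch (the score and label vectors are made up; score stands for whatever the classifier returns, e.g. a probability or a signed distance):

# Made-up decision values from the classifier and the true labels (1 = positive)
score <- c(2.3, 1.7, 0.4, -0.2, -1.1, -2.5)
label <- c(1,   1,   0,   1,    0,    0)

lab_sorted <- label[order(score, decreasing = TRUE)]

# As the cutoff sweeps from the highest score downward, each real positive
# raises the TPR and each real negative raises the FPR
tpr <- cumsum(lab_sorted == 1) / sum(lab_sorted == 1)
fpr <- cumsum(lab_sorted == 0) / sum(lab_sorted == 0)

plot(c(0, fpr), c(0, tpr), type = "l", xlab = "FPR", ylab = "TPR")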
Please excuse my English, as it's not my first language.