Precision-recall datasets - information retrieval

Can anyone point me at datasets of precision and recall generated by binary classifiers at different confidence levels?
I'm not trying to develop a classifier. I'm doing an empirical investigation into common patterns that emerge when these data sets are evaluated.
Ideally, what I'd like are three columns: precision, recall, confidence threshold. Alternatively, precision and recall could be replaced by the four underlying tallies: true positives, true negatives, false positives and false negatives.
I could also work from lists of ranked results where each result is flagged as true or false.
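In case it helps to make that last format concrete, here is a rough sketch in R (the score and relevant column names are placeholders, not anything from a specific dataset) of how a ranked list of true/false-flagged results could be converted into the three-column form:
pr_from_ranked <- function(score, relevant) {
  ## score: classifier confidence, higher = more confident positive (placeholder name)
  ## relevant: logical flag, TRUE if the result is actually positive (placeholder name)
  ord <- order(score, decreasing = TRUE)   # rank results, most confident first
  relevant <- relevant[ord]
  tp <- cumsum(relevant)                   # true positives at or above each cutoff
  data.frame(threshold = score[ord],       # cutoff = score of the k-th ranked result
             precision = tp / seq_along(relevant),
             recall    = tp / sum(relevant))
}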
The actual classifiers and source data sets aren't that important but I would like to know what they are.
Many thanks in advance.

Related

Add or calculate weights for meta-analysis of single proportions using GLMM

I am conducting a meta-analysis of sensitivities and tried using the metaprop() function in the meta package to pool single proportions.
With the logit link function, the function will by default choose a GLMM; however, no weights are provided.
Even though this is the best solution according to the literature, the studies I want to pool vary greatly in sample size (n = 6 to n = 568). I think weighting might be essential in this case and wondered whether there is a solution that also offers weights.
In short, I want pooled, weighted subgroup summary estimates that I can show in a forest plot.
My dataset is based on 2x2 table data (true positives, false negatives, true negatives, false positives), with a proportional outcome (sensitivity or specificity [%]) and the subgroup "exp" (in vitro/bacterial):
metaprop(E6sens$TP, E6sens$TP + E6sens$FN, comb.fixed=FALSE, comb.random=TRUE,
         sm="PLOGIT", method.ci="CP", studlab=E6sens$title, byvar=E6sens$exp)
I would be grateful for any suggestions.
Regards,
Julia

Understanding Graph of Binary Response Regression

Please refer to this image:
I believe it was generated using R or SAS or something similar. I want to make sure I understand what it is depicting and be able to recreate it from scratch.
I understand the left-hand side, the ROC curve, and I have generated my own using my probit model at varying thresholds.
What I do not understand is the right-hand graph. What is meant by a 'cost' function? What are the units? I assume the x-axis labelled 'threshold' is the success cutoff threshold I used in the ROC. My only guess is that the y-axis is the sum of squared residuals, but if that's the case I'd have to get the residuals after each iteration of the threshold.
Please explain what the axes are and how one goes about computing them.
--Edit--
For clarity, I don't need a proof or a line of code. Because I use different statistical software, it's much more useful to have someone explain conceptually (with minimal jargon) how to compute the y-axis. That way I can write it in terms of my software's language.
Thank you
I will try to make this as clear as possible. The term cost function is used in multiple contexts and can have multiple meanings. Usually, when we use the term in the context of a regression model, it is natural to think of minimizing the sum of squared residuals.
However, that is not the case here: we are still interested in minimizing the function, but it is not a quantity minimized inside the fitting algorithm the way the sum of squared residuals is. Let me elaborate on what the second graph means.
As @oshun correctly mentioned, the author of the R-bloggers post (where these graphs came from) wanted a measure (i.e. a single number) to compare the "mistakes" of the classification at different threshold values. To create that measure he did something very intuitive and simple: he counted the false positives and false negatives for different levels of the threshold. The function he used is:
sum(df$pred >= threshold & df$survived == 0) * cost_of_fp + #false positives
sum(df$pred < threshold & df$survived == 1) * cost_of_fn #false negatives
I deliberately split the above into two lines. The first line counts the false positives (prediction >= threshold means the algorithm classified the passenger as survived, but in reality they didn't, i.e. survived equals 0). The second line does the same thing but counts the false negatives (those predicted as not survived who in reality did survive).
That leaves us with what cost_of_fp and cost_of_fn are. They are nothing more than weights, set arbitrarily by the user. In the example above the author used cost_of_fp = 1 and cost_of_fn = 3, which just means that, as far as the cost function is concerned, a false negative is 3 times more costly than a false positive. So each false negative contributes 3 to the total instead of 1, and the result of the cost function is a weighted count of misclassifications rather than a raw count.
To sum up, the y-axis in the graph above is just:
false_positives * weight_fp + false_negatives * weight_fn
for every value of the threshold (which is used to calculate the false_positives and false_negatives).
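Putting it together, a minimal self-contained sketch of that calculation (assuming, as in the example above, a data frame df with a predicted-probability column pred and a 0/1 outcome column survived; the weights are just the illustrative values used by the author):
cost_of_fp <- 1
cost_of_fn <- 3
cost_at <- function(threshold, df) {
  fp <- sum(df$pred >= threshold & df$survived == 0)  # false positives at this cutoff
  fn <- sum(df$pred <  threshold & df$survived == 1)  # false negatives at this cutoff
  fp * cost_of_fp + fn * cost_of_fn
}
thresholds <- seq(0, 1, by = 0.01)
costs <- sapply(thresholds, cost_at, df = df)
plot(thresholds, costs, type = "l", xlab = "threshold", ylab = "cost")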
I hope this is clear now.

Is exhaustive model selection in R with high interaction terms and inclusion of main effects possible with regsubsets() or other functions?

I would like to perform automated, exhaustive model selection on a dataset with 7 predictors (5 continuous and 2 categorical) in R. I would like all continuous predictors to have the potential for interaction (at least up to 3-way interactions) and also to have non-interacting squared terms.
I have been using regsubsets() from the leaps package and have gotten good results; however, many of the models contain interaction terms without the corresponding main effects (e.g., g*h is an included model predictor but g is not). Since inclusion of the main effect will affect the model score (Cp, BIC, etc.), it is important to include main effects in comparisons with the other models even if they are not strong predictors.
I could manually weed through the results and cross off models that include interactions without main effects, but I'd prefer an automated way to exclude those. I'm fairly certain this isn't possible with regsubsets() or leaps(), and probably not with glmulti either. Does anyone know of another exhaustive model selection function that allows such a specification, or have a suggestion for a script that will sort the model output and keep only models that fit my specs?
Below is simplified output from my model searches with regsubsets(). You can see that model 3 and 4 do include interaction terms without including all the related main effects. If no other functions are known for running a search with my specs then suggestions on easily sub-setting this output to exclude models without the necessary main effects included would be helpful.
Model adjR2       BIC          CP          n_pred X.Intercept. x1    x2    x3    x1.x2 x1.x3 x2.x3 x1.x2.x3
1     0.470344346 -41.26794246 94.82406866 1      TRUE         FALSE TRUE  FALSE FALSE FALSE FALSE FALSE
2     0.437034361 -36.5715963  105.3785057 1      TRUE         FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
3     0.366989617 -27.54194252 127.5725366 1      TRUE         FALSE FALSE FALSE TRUE  FALSE FALSE FALSE
4     0.625478214 -64.64414719 46.08686422 2      TRUE         TRUE  FALSE FALSE FALSE FALSE FALSE TRUE
You can use the dredge() function from the MuMIn package.
See also Subsetting in dredge (MuMIn) - must include interaction if main effects are present.
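A minimal sketch of that approach (the response y, predictors x1 to x3, and the data frame dat are placeholders; note that dredge() requires na.action = na.fail on the global model):
library(MuMIn)
## Global model containing all candidate terms; dredge() will not keep an
## interaction term without its main effects (marginality is respected).
global <- lm(y ~ x1 * x2 * x3, data = dat, na.action = na.fail)
all_models <- dredge(global)  # exhaustive search over submodels of the global model
head(all_models)              # ranked model table (AICc by default)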
After working with dredge I found that my models have too many predictors and interactions to run dredge in a reasonable time (I calculated that with 40+ potential predictors it might take 300k hours to complete the search on my computer). But it does exclude models where interactions don't have matching main effects, so I imagine it might still be a good solution for many people.
For my needs I've moved back to regsubsets and have written some code to parse through the search output and exclude models that contain interaction terms whose main effects are not also included. This code seems to work well so I'll share it here. Warning: it was written with human expediency in mind, not computational efficiency, so it could probably be re-coded to be faster. If you've got hundreds of thousands of models to test you might want to make it sleeker. (I've been working on searches with ~50,000 models and up to 40 factors, which take my 2.4 GHz i5 a few hours to process.)
reg.output.search.with.test <- function(search_object) { ## input: an object returned by leaps::regsubsets()
  ## First build a data frame listing model components and metrics of interest
  search_comp <- data.frame(R2 = summary(search_object)$rsq,
                            adjR2 = summary(search_object)$adjr2,
                            BIC = summary(search_object)$bic,
                            CP = summary(search_object)$cp,
                            n_predictors = row.names(summary(search_object)$which),
                            summary(search_object)$which)
  ## Categorise predictors as main effects or higher-order terms based on whether '.' is present
  predictors <- colnames(search_comp)[(match("X.Intercept.", names(search_comp)) + 1):dim(search_comp)[2]]
  main_pred <- predictors[grep(pattern = ".", x = predictors, invert = TRUE, fixed = TRUE)]
  higher_pred <- predictors[grep(pattern = ".", x = predictors, fixed = TRUE)]
  ## Flag indicating whether a model should be rejected; FALSE for all models initially
  search_comp$reject_model <- FALSE
  for (main_eff_n in 1:length(main_pred)) { ## iterate through main effects
    ## Find column numbers of higher-order terms containing the main effect
    search_cols <- grep(pattern = main_pred[main_eff_n], x = higher_pred)
    ## Subset models that are not yet flagged for rejection; only test these
    valid_model_subs <- search_comp[search_comp$reject_model == FALSE, ]
    ## Subset data frames with only main-effect or higher-order predictor columns
    main_pred_df <- valid_model_subs[, colnames(valid_model_subs) %in% main_pred]
    higher_pred_df <- valid_model_subs[, colnames(valid_model_subs) %in% higher_pred]
    if (length(search_cols) > 0) { ## if there are higher-order terms, test each one
      for (high_eff_n in search_cols) { ## iterate through higher-order terms
        ## Test whether the interaction term is present without its main effect (whole column of models at once)
        test_responses <- (main_pred_df[, main_eff_n] == FALSE) & (higher_pred_df[, high_eff_n] == TRUE)
        valid_model_subs[test_responses, "reject_model"] <- TRUE ## set reject flag where appropriate
      } ## End high_eff for
      ## Transfer changes in the reject flag back to the primary data frame:
      search_comp[row.names(valid_model_subs), "reject_model"] <- valid_model_subs[, "reject_model"]
    } ## End if
  } ## End main_eff for
  ## Write the resulting table of all models, named for the original search object and the current
  ## time/date, into the folder "model_search_reg"
  current_time_date <- format(Sys.time(), "%m_%d_%y at %H_%M_%S")
  write.table(search_comp,
              file = paste("./model_search_reg/",
                           paste(current_time_date, deparse(substitute(search_object)),
                                 "regSS_model_search.csv", sep = "_"),
                           sep = ""),
              row.names = FALSE, col.names = TRUE, sep = ",")
} ## End reg.output.search.with.test fn
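A hypothetical call, just to show how the function slots in after a regsubsets() search (the formula, data frame dat, and the nbest/nvmax values are placeholders):
library(leaps)
search <- regsubsets(y ~ (x1 + x2 + x3)^3, data = dat, nbest = 10, nvmax = 8)
dir.create("model_search_reg", showWarnings = FALSE)  # the function writes its CSV here
reg.output.search.with.test(search)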

How to Find False Positive Prediction Count using R Script

Assuming "test" and "train" are two data frames for testing and traininig respectively, and "model" is a classifier that was generated using training data. I can find the number of misclassified examples like this:
n = sum(test$class_label != predict(model, test))
How can I find the number of examples that are predicted as positive but are actually negative (i.e. false positives)?
NOTE: The above example assumes a binary classification problem whose classes are, say, "yes" (the positive class) and "no". Additionally, predict here is the predict function from the caret package.
This will get you a 2x2 table showing true positives, false positives, false negatives and true negatives.
> table(Truth = test$class_label, Prediction = predict(model, test))
     Prediction
Truth yes no
  yes  32  3
  no    8 27
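To pull the individual counts out of that table (assuming, as in the question, that "yes" is the positive class), something like the following should work:
cm <- table(Truth = test$class_label, Prediction = predict(model, test))
fp <- cm["no", "yes"]   # truth is "no" but prediction is "yes": false positives
fn <- cm["yes", "no"]   # truth is "yes" but prediction is "no": false negatives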

ROC Graph Construction

I have two heavily unbalanced datasets which are labelled as positive and negative, and I am able to generate a confusion matrix which yields a ~95% true positive rate (and hence a ~5% false negative rate) and a ~99.5% true negative rate (~0.5% false positive rate).
The problem when I try to build an ROC graph is that the x-axis does not range from 0 to 1 with intervals of 0.1. Instead, it ranges from 0 to something like 0.04, given my very low false positive rate.
Any insight as to why this happens?
Thanks
In an ROC graph, the two axes are the rate of false positives (F) and the rate of true positives (T). T is the probability that, given a positive data item, your algorithm classifies it as positive. F is the probability that, given a negative data item, your algorithm incorrectly classifies it as positive. The axes always run from 0 to 1, and if your algorithm is not parametric you should end up with a single point (or two, one for each dataset) on the ROC graph instead of a curve. You get a curve if your algorithm is parametric, and then the curve is traced out by different values of the parameter(s).
See http://www2.cs.uregina.ca/~dbd/cs831/notes/ROC/ROC.html
I have figured it out. I used Platt's algorithm to extract the probability of a positive classification and sorted the dataset, highest probability first. I then iterated through the sorted dataset: any positive example (positive by its real label, not by its classification) increments the true-positive count, while any negative example (negative by its real label) increments the false-positive count.
Think of it as the decision boundary of the SVM, which separates the two classes (+ve and -ve), moving gradually from one side of the data to the other. Here I'm imagining points on a 2D plane. As the boundary moves, it uncovers examples: any examples labelled positive are true positives, any negatives are false positives.
Hope this helps. It took me days to figure out something so trivial due to the lack of information on the net (or just my lack of understanding of SVMs in general). This is especially aimed at those who are using CvSVM in the OpenCV package. As you might be aware, CvSVM does not return probability values; instead, it returns a value based on the distance function. You do not need Platt's algorithm to extract an ROC curve based on probabilities; you can use the distance values themselves. For example, you start the distance threshold at 10 and decrement it slowly until you've covered the whole dataset. I found probabilities easier to visualise, so to each his own.
Please mind my English as it's not my first language.
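For anyone who wants to reproduce that sweep outside OpenCV, here is a minimal sketch in R (the score can be a Platt probability or a raw decision/distance value, higher meaning more positive; the column names are placeholders):
roc_points <- function(score, label) {      # label: 1 = real positive, 0 = real negative
  ord <- order(score, decreasing = TRUE)    # most confident positives first
  label <- label[ord]
  tp <- cumsum(label == 1)                  # running true-positive count
  fp <- cumsum(label == 0)                  # running false-positive count
  data.frame(fpr = fp / sum(label == 0),    # x-axis: false positive rate, 0 to 1
             tpr = tp / sum(label == 1))    # y-axis: true positive rate, 0 to 1
}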
