I am interested in optimising predictions for a multinomial regression model with 3 (or more) classes according to various measures.
For two-class models (logistic regression), this can be done in the pROC package using the coords function with best.method="youden" or best.method="closest.topleft". This chooses the threshold on the probability of success predicted by the logistic regression model that either maximises sensitivity + specificity (youden) or gives the point on the ROC curve closest in Euclidean distance to the ideal point (1,1) (closest.topleft).
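For concreteness, here is a minimal sketch of that two-class workflow, assuming a hypothetical fitted logistic regression fit and a data frame d with a 0/1 response y:

library(pROC)
# ROC curve built from the predicted probabilities of the logistic regression
r <- roc(d$y, predict(fit, newdata = d, type = "response"))
# threshold maximising sensitivity + specificity (Youden)
coords(r, x = "best", best.method = "youden")
# threshold closest in Euclidean distance to the point (specificity = 1, sensitivity = 1)
coords(r, x = "best", best.method = "closest.topleft")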
In the three (or more) class case, it is possible to generalise sensitivity + specificity to the sum over classes of the sensitivity for each class. We can then ask: if we choose a vector of weights on the probabilities of the non-reference classes, which vector of weights maximises this quantity? That set of weights gives the analogue, in the three (or more) class case, of the Youden index.
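To make the idea concrete, here is a rough sketch (not taken from any existing package) of how this quantity could be computed for a candidate weight vector. The names sum_sensitivity, probs, truth and w are all hypothetical: probs is an n x K matrix of predicted class probabilities with columns named by class (reference class first), and truth is a factor with matching levels.

# sum over classes of the per-class sensitivity, after re-weighting the
# non-reference class probabilities by w (length K - 1) and predicting the
# class with the largest weighted probability
sum_sensitivity <- function(probs, truth, w) {
  scaled <- sweep(probs, 2, c(1, w), FUN = "*")
  pred   <- colnames(probs)[max.col(scaled)]
  sens   <- sapply(levels(truth), function(k) mean(pred[truth == k] == k))
  sum(sens)
}

# a grid search over w then plays the role of choosing the optimal threshold
# in the two-class case, e.g.
# grid <- expand.grid(w2 = seq(0.5, 2, 0.1), w3 = seq(0.5, 2, 0.1))
# best <- grid[which.max(apply(grid, 1, function(w) sum_sensitivity(probs, truth, w))), ]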
My questions are:
Is there an R package and command that implement this? If not, I will write one, but I want to make sure I am not duplicating work that has already been done.
If not, what other functionality would be useful to build into such a package? For example, it would be possible to find the best set of weights that ensures one of the sensitivities is above some set threshold. It would also be possible to find the analogue of closest.topleft: the set of weights that gives sensitivities closest to (1,1,1), and so on. Also, it would be possible to include some plotting capabilities, e.g., for the 3-class situation, a 3D version of an ROC curve that plots the three sensitivities on three axes.
Thanks!
I have been using the variofit function from R's geoR package to fit semivariogram models to some spatial data I have, and I am confused by a couple of the models that have been generated. Basically, for these few models I get a fit that has a range for autocorrelation but no partial sill. I was told that even without a sill the model should still have some sort of shape reflecting the range, but plotting the model gives the flat lines shown in the attached screenshot. I do not think it is a matter of bad initial values, as I let variofit search over a matrix of candidate initial values built with expand.grid.

I wanted to know whether this is being plotted correctly, contrary to what I've been told, and what exactly it means to have a range but no partial sill. I know that when I used the alternative fitting function fit.variogram from gstat, these models could be fitted with a periodic or wave model, though poorly (probably overfitted); would this be some indication of that, which variofit just cannot plot? I unfortunately can't share the data, but I have included an example of the code I used to make these models, in case it helps to answer my question:
# build a geodata object from the jittered coordinates and the 5th column of log.PC
# (assuming jitteryPC holds the two coordinate columns, so the data end up in column 3)
geo.entPC <- as.geodata(cbind(jitteryPC, log.PC[, 5]), coords.col = 1:2, data.col = 3)

# grid of initial values for (partial sill, range) passed to variofit
test.pc.grid2 <- expand.grid(seq(0, 2, 0.2), seq(0, 100, 10))

variog.function.col2 <- function(x) {
  # binned empirical variogram, classical (method-of-moments) estimator
  vario.cloud <- variog(x, estimator.type = "classical", option = "bin")
  # least-squares fit with equal weights, searching over the grid of initial values
  variogram.mod <- variofit(vario.cloud, ini.cov.pars = test.pc.grid2,
                            fix.nug = FALSE, weights = "equal")
  plot(vario.cloud)
  lines(variogram.mod, col = "red")
  summary(x)
}

variog.function.col2(geo.entPC)
From the attached plot showing the empirical variogram, I would not expect to find any sensible spatial correlation. This is in accordance with the fitted variogram, which is essentially a pure nugget model. The non-zero range might be an artifact of the numerical optimisation, or the partial sill might differ from 0 only at a decimal place that is not shown in the summary of the fitted variogram. Either way, no matter what the range is, with an irrelevantly small partial sill the spatial correlation is negligible.
Depending on the data, it is sometimes beneficial to limit the maximum distance of pairs used to calculate the empirical variogram - but make sure to have "enough" pairs in each bin.
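As a hedged illustration of that suggestion, reusing geo.entPC from the question (the cutoff of half the maximum inter-point distance is only a common rule of thumb, not a requirement):

# restrict the empirical variogram to shorter lags and check the pairs per bin
max.d <- 0.5 * max(dist(geo.entPC$coords))
vario.short <- variog(geo.entPC, estimator.type = "classical",
                      option = "bin", max.dist = max.d)
vario.short$n   # number of pairs in each bin, keep these reasonably large
plot(vario.short)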
I am currently stuck on a cluster analysis. I would like to determine the ideal number of clusters by comparing the agglomerative coefficients for different numbers of clusters to each other. The percentage change in the coefficient to the next level should be the basis for my decision, as in the following screenshot:
Percentage Change in Coefficient to next level
But I do not know how to get these coefficients. I was only able to compute the AC for the whole dendrogram by using the AGNES program. How can I get the agglomeration coefficient for different numbers of clusters?
I'd appreciate any advice, thank you.
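For what it is worth, here is a hedged sketch of one possible reading, assuming the per-level "coefficients" in the screenshot are the merge heights from the agglomeration schedule (the height component returned by agnes in the cluster package); that is an assumption about the screenshot, not a documented equivalence.

library(cluster)
ag <- agnes(scale(USArrests), method = "ward")  # USArrests is only example data
ag$ac                                           # overall agglomerative coefficient
h <- sort(ag$height, decreasing = TRUE)         # merge heights, largest merge first
tab <- data.frame(
  clusters    = 2:(length(h) + 1),              # clusters remaining just before each merge
  coefficient = h,
  pct_change  = c(abs(diff(h)) / head(h, -1) * 100, NA)  # percentage change to the next level
)
head(tab)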
How can I find out whether the clusters have different variances in a parameter after k-modes has been applied? We applied k-modes clustering and then plotted the clusters using clusplot, and the clusters came out heavily overlapping. To test this we wanted to apply a t-test, for which we need the variances of the variables within each cluster. But since we have categorical data, how should we calculate the variance?
Variance (and clusplot, which is also based on this concept) mostly makes sense for continuous variables. It's rooted in least-squares estimation, so you need something on which you can compute squares and square roots.
Just trying to apply these things to categorical or binary variables usually fails to give good results. So you will have to come up with an approach that makes sense for your domain/application. The "turnkey" solution of variance won't do.
I have a problem where I'm trying to use supervised learning in Python. I have a series of x,y coordinates which I know belong to a label in one data set; in the other I have only the x,y coordinates. I am going to use the labelled set to train a model for the other. My approach is supervised learning with a classification algorithm (linear discriminant analysis), since the labels are discrete. Although they are discrete, they are large in number (n ≈ 80,000). My question: at what number of labels should I consider regression, which is better suited to continuous labels, over classification? I'm using scikit-learn as my machine learning package and astronml.org's excellent tutorial as a guide.
It is not about the number. It is about the labels being continuous or not. It does not matter if you have 80,000 classes or even more; as long as there is no correlation between neighbouring classes (e.g., class i and class i+1), you should use classification, not regression.
Regression only makes sense when the labels are continuous (e.g., real numbers), or at least when there is a correlation between adjacent classes (e.g., when the labels are counts of something, you can do regression and then round the results).
Let Y be a binary variable.
If we use logistic regression for modelling, we can use cv.glm for cross-validation, and there we can specify the cost function via the cost argument. By specifying the cost function, we can assign different unit costs to the two types of error: predicted Yes | reference No, and predicted No | reference Yes.
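A minimal sketch of that cv.glm setup, assuming a hypothetical data frame d with a 0/1 response y and an illustrative 10:1 cost ratio:

library(boot)
fit <- glm(y ~ ., data = d, family = binomial)
# first argument: observed 0/1 response; second: predicted probability of 1
cost <- function(obs, pred) {
  mean(10 * (obs == 1 & pred < 0.5) +  # cost 10 for predicted No | reference Yes
        1 * (obs == 0 & pred >= 0.5))  # cost 1 for predicted Yes | reference No
}
cv.glm(d, fit, cost = cost, K = 10)$delta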
I am wondering if I could achieve the same with an SVM. In other words, is there a way to specify a cost (loss) function instead of using the built-in loss function?
Besides the answer by Yueguoguo, there are also three more solutions: the standard wrapper approach, hyperplane tuning, and the one in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy, is then optimal under the costs.
The second idea is frequently used in text mining. In SVMs the classification is derived from the distance to the hyperplane. For linearly separable problems this distance is ±1 for the support vectors. The classification of a new example is then basically whether the distance is positive or negative. However, one can also shift this decision and not make it at 0 but move it, for example, towards 0.8. That way the classifications are shifted in one direction or the other, while the general shape of the data is not altered.
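A hedged illustration of that shift, using e1071 on a two-class subset of iris purely as stand-in data (the 0.8 offset is the example value from the paragraph above):

library(e1071)
d2  <- droplevels(subset(iris, Species != "setosa"))
fit <- svm(Species ~ ., data = d2, kernel = "linear")
dv  <- attr(predict(fit, d2, decision.values = TRUE), "decision.values")
# the single column is named "A/B"; positive values vote for A, negative for B
classes <- strsplit(colnames(dv), "/")[[1]]
pred_at_0   <- ifelse(dv > 0,   classes[1], classes[2])  # standard decision at 0
pred_at_0.8 <- ifelse(dv > 0.8, classes[1], classes[2])  # shifted decision boundary
table(pred_at_0, pred_at_0.8)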
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken there (it is the regularisation parameter).
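For example, a minimal sketch of those class-specific weights in e1071, with purely illustrative weight values:

library(e1071)
wts <- c(setosa = 1, versicolor = 1, virginica = 5)  # penalise errors on virginica more heavily
fit <- svm(Species ~ ., data = iris, kernel = "radial",
           cost = 1, class.weights = wts)
table(predicted = predict(fit), observed = iris$Species)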
The loss function for the SVM hyperplane parameters is tuned automatically thanks to the beautiful theoretical foundation of the algorithm. SVM applies cross-validation to tune hyperparameters. Say an RBF kernel is used; cross-validation then selects the optimal combination of C (cost) and gamma (kernel parameter) for the best performance, measured by some metric (e.g., mean squared error). In e1071, this can be done with the tune method, where the range of hyperparameters as well as the cross-validation settings (i.e., 5-, 10- or more-fold cross-validation) can be specified.
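For instance, a hedged sketch of that tuning step with e1071's tune function on the iris data (the parameter ranges are illustrative only):

library(e1071)
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(cost = 2^(0:4), gamma = 2^(-4:0)),
              tunecontrol = tune.control(cross = 10))  # 10-fold cross-validation
summary(tuned)         # cross-validation error for each (cost, gamma) combination
tuned$best.parameters  # combination with the lowest error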
To obtain comparable cross-validation results with an area-under-curve type of error measure, one can train models with different hyperparameter configurations and then validate each against sets of pre-labelled data.
Hope the answer helps.