This question may seem weird, so let me explain it with an example.
We train a particular classification model to determine if an image contains a person or not.
After the model is trained, we use a new image for prediction.
The prediction shows a 94% probability that the image contains a person.
Thus, could I say that the confidence level is 94% that the image contains a person?
Your third statement is not quite the right interpretation. The model returns a normalized score of 0.94 for the category "person". Although this score correlates reasonably well with our intuitive notions of "probability" and "confidence", do not confuse it with either of those. It's a convenient metric with some generally useful properties, but it is not an accurate prediction to two digits of precision.
Granted, there may well be models whose prediction scores are accurate in this sense. For instance, the RealOdds models you'll find on 538 are built and tested to that standard. However, that is a directed effort of more than a decade; your everyday deep learning model is not held to the same standard ... unless you work to tune it to that, making the accuracy of that number part of your training (i.e., incorporating it into the error function).
You can run a simple (although voluminous) experiment: collect all of the predictions and bin them, say, into 10 bins of width 0.1. Now, if this "prediction" were indeed a probability, then your 0.6-0.7 bin should correctly identify a person about 65% of the time. Check that against ground truth: did that bin get 65% correct and 35% wrong? Is the discrepancy within the expected range? Do this for each of the 10 bins and run your favorite applicable statistical measures on the result.
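A minimal sketch of that binning check in R, assuming hypothetical vectors score (the model's 0-1 outputs) and truth (1 if the image really contains a person, 0 otherwise):

# Bin the scores into 10 bins of width 0.1 and compare the average score
# in each bin with the observed fraction of true positives.
bins <- cut(score, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
calib <- data.frame(
  n          = as.vector(table(bins)),
  mean_score = tapply(score, bins, mean),   # average predicted score per bin
  frac_true  = tapply(truth, bins, mean)    # observed fraction of positives per bin
)
print(calib)
# If the score behaved like a probability, mean_score and frac_true would
# agree within sampling error in every bin.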
I expect that this will convince you that the inference score is neither a calibrated probability nor a confidence level. However, I'm also hoping it will give you some ideas for future work.
I am testing different models to find the best and most robust fit for my data. My dataset contains over 50,000 observations; roughly 99.3% of them are zeroes, and only about 0.7% are actual events.
If helpful, see: https://imgur.com/a/CUuTlSK
I am searching for the best fit among the following models: Logistic, Poisson, NB, ZIP, ZINB, PLH, NBLH (NB: negative binomial, ZI: zero-inflated, P: Poisson, LH: logit hurdle).
The first approach I tried was estimating the binary response with logistic regression.
My questions: Can I use Poisson regression on the binary variable, or should I instead replace the binary outcome with some integer values, for instance the associated loss (if y=1 then y_val = y*loss)? In my case, the variance of y_val becomes approximately 2.5E9. I chose to keep the binary variable because, for this purpose, it does not matter what amount the company defaulted with; a default is a default no matter the amount.
With both logistic regression and Poisson I got some terrible statistics: a very high deviance (with a p-value of 0), terrible estimates (many of the estimated parameters are 0, so the odds ratio is 1), very low confidence intervals; everything seems to be 'wrong'. If I transform the response variable to log(y_val) for y>1 in the Poisson model, the statistics seem to improve; however, this violates the assumption of an integer count response in Poisson regression.
I briefly tested the ZINB; it does not change the statistics significantly (i.e., it does not help at all in this case).
Does a proper way of dealing with such a dataset exist? I am interested in achieving the best fit for my data (on startup businesses and their default status).
The data are cleaned and ready to be fitted. Is there anything I should be aware of that I haven't mentioned?
I use the genmod procedure in SAS with dist=Poisson, zinb, zip etc.
Thanks in advance.
Sorry, my rep is too low to comment, so it has to be an answer.
You should consider an undersampling technique before using any regression/model, because your target rate is below 5%, which makes it extremely difficult to predict.
Undersampling is a method of cutting out non-target events in order to increase the target ratio. I really recommend considering it; I used it once in my practice, and it seemed pretty helpful.
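A sketch of what that could look like in R (hypothetical data frame df with a 0/1 column named default; the 1:4 ratio is just an illustrative choice):

set.seed(1)
events     <- df[df$default == 1, ]
non_events <- df[df$default == 0, ]

# Keep all events and a random subset of non-events, e.g. a 1:4 ratio.
sampled <- non_events[sample(nrow(non_events), 4 * nrow(events)), ]
train   <- rbind(events, sampled)

fit <- glm(default ~ ., data = train, family = binomial)
summary(fit)
# Note that undersampling shifts the intercept, so predicted probabilities
# need re-calibration if their absolute level matters.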
I am working on a decision tree model. The dataset is related to cars. I have 80% of the data in the training set and 20% in the test set. The model summary (based on the training data) shows a misclassification rate of around 0.02605, whereas when I run the model on the test set it comes out at about 0.0289, a difference of roughly 0.003. Is this difference acceptable, and what is causing it? I am new to R/statistics. Please share your feedback.
What counts as an acceptable misclassification rate is more art than science. If your data are generated from a single population, there is almost certainly some unavoidable overlap between the groups, which makes linear classification error-prone. This doesn't mean it's a problem. For instance, if you are classifying credit card charges as possibly fraudulent or not, and the consequence of flagging a charge isn't too harsh, it may be advantageous to err on the safe side and accept more false positives rather than chase the lowest misclassification rate. You could (1) visualize your data to identify overlap, or (2) compute N*0.03 to see how many cases are actually misclassified; if you understand what you are classifying, you can assess the seriousness of those misclassifications that way.
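If it helps, a minimal R sketch of how that train/test comparison is typically computed for a tree (hypothetical data frame cars_df with a factor outcome column class):

library(rpart)
set.seed(1)
idx   <- sample(nrow(cars_df), 0.8 * nrow(cars_df))
train <- cars_df[idx, ]
test  <- cars_df[-idx, ]

fit <- rpart(class ~ ., data = train, method = "class")

train_err <- mean(predict(fit, train, type = "class") != train$class)
test_err  <- mean(predict(fit, test,  type = "class") != test$class)
c(train = train_err, test = test_err)
# A small train/test gap (on the order of 0.003) usually just reflects the
# mild optimism of the training error plus sampling variation.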
I want to improve, step by step as unevenly-sampled data arrive, the estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity of a projectile but do not know its final position and velocity; however, you are (slowly) receiving measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, data have almost the same sampling frequency, i.e., it is about 15 min. Thus, there are some measurements without changes, because of the nature of the phenomenon (heat transfer). It gives an exponential tendency and I can fit data to a known model, but an important amount of information is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progressive weighted least squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the index of the last measurement
but it was applied to preprocessed data (i.e., after jitter removal, smoothing, and fitting with the theoretical functional form). It gave me the following result:
This is a real example of the problem and its "current solution"
This is good for me, but I would like to know whether there is an optimal way of doing this using the raw data (or smoothed data).
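For completeness, an R equivalent of that weighting scheme would look roughly like this (hypothetical data frame d with columns t and y, ordered by time):

kk <- nrow(d)
w  <- 0.5^(1:kk)                  # same geometric weights as the MATLAB-style snippet:
                                  # early samples dominate, later ones are downweighted
fit <- lm(y ~ t, data = d, weights = w)
slope_at_zero <- coef(fit)[["t"]] # initial-slope estimate under a locally linear model
# Each time a new sample arrives, append it to d, recompute w, and refit.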
IMO, additional data are not relevant for improving the estimate at zero, because perturbations come into play and the correlation between the first and last samples decreases.
Also, the asymptotic behavior of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can introduce a bias.
I would stick to the first points (say up to t=20) and fit a simple model, say quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
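A minimal sketch of that suggestion in R (hypothetical data frame d with columns t and y holding the raw measurements):

library(MASS)                       # rlm() for the robust alternative
early <- d[d$t <= 20, ]             # keep only the first points

fit_ls  <- lm(y ~ t + I(t^2), data = early)   # ordinary least squares, quadratic
fit_rob <- rlm(y ~ t + I(t^2), data = early)  # robust fit, if outliers are a concern

# For y = a + b*t + c*t^2, the derivative at t = 0 is simply b.
c(least_squares = coef(fit_ls)[["t"]], robust = coef(fit_rob)[["t"]])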
I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took first differences, the ACF plot was completely insignificant (except at lag 0, which is always 1, of course), which told me that the differenced series behave like white noise.
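For reference, those checks were along these lines (a sketch with placeholder series names stocks and gtrends):

library(tseries)
adf.test(stocks);  kpss.test(stocks)    # levels: consistent with I(1)
adf.test(gtrends); kpss.test(gtrends)

d_stocks  <- diff(stocks)               # first differences
d_gtrends <- diff(gtrends)
acf(d_stocks)                           # essentially flat ACF after differencing
acf(d_gtrends)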
Nevertheless I tried to estimate a VAR model which you can see attached.
As you can see, only one constant is significant. I have already read that, because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic relationship between the two time series, and each can also be modeled independently with an AR(p) model.
I checked the residuals of the model. They fulfill the assumptions (normality of the residuals is not fully satisfied but acceptable; there is homoscedasticity, the model is stable, and there is no autocorrelation).
But what does it mean if no coefficient is significant, as in the Stocks.ts equation? Is the model simply inappropriate for the data because the data don't follow an AR process? Or is the model so bad that a constant would describe the data better than the model does? Or a combination of the above? Any suggestions on how I could proceed with my analysis?
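For completeness, the estimation and residual checks were along these lines (a sketch with placeholder names; d_stocks and d_gtrends are the differenced series):

library(vars)
y <- cbind(Stocks = d_stocks, GoogleTrends = d_gtrends)

p   <- VARselect(y, lag.max = 8, type = "const")$selection[["AIC(n)"]]
fit <- VAR(y, p = p, type = "const")
summary(fit)

serial.test(fit)      # residual autocorrelation
arch.test(fit)        # heteroscedasticity
normality.test(fit)   # normality of residuals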
Thanks in advance
The post
Classification functions in linear discriminant analysis in R
from user Tyler provides a function to produce the classification functions (not discriminant functions!) from an LDA model generated with lda().
I used these classification functions to calculate all classification scores for my data. I want to use the additional information, e.g., to find out which class was the second most probable and to understand the development across different time slices.
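For concreteness, extracting the second most probable class from those scores could look like this (hypothetical matrix scores with one row per observation and one named column per class, as produced by those classification functions):

ranked <- t(apply(scores, 1, function(s) colnames(scores)[order(s, decreasing = TRUE)]))
best   <- ranked[, 1]   # most probable class per observation
second <- ranked[, 2]   # second most probable class per observation
# For comparison, predict() on the lda() fit also returns posterior
# probabilities (pred$posterior) that can be ranked the same way.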
Now I would like to ask for your help in interpreting the following scenarios:
scores close to or exactly zero (is it possible to claim that this class was effectively not recognized?)
a single negative score with a larger absolute value than the highest positive score (does it mean anything at all?)
results where all scores are negative (in the original interpretation, the highest score determines the classification; is this intended by LDA, or does it mean that none of the classes is really a good fit, so that one could say no known pattern was identified?)
a single very low positive score while all others are negative with large absolute values (can I argue that the "signal strength" is low in this case?)
I know this is more of a statistics problem than a programming one. I think of it as a follow-up to the post referenced at the beginning of this entry.
Thank you very much for your help!