Please refer to this image:
I believe it was generated using R or SAS or something similar. I want to make sure I understand what it is depicting and can recreate it from scratch.
I understand the left-hand side, the ROC curve, and I have generated my own using my probit model at varying thresholds.
What I do not understand is the right-hand graph. What does it mean by 'cost' function? What are the units? I assume the x-axis, labeled 'threshold', is the success cutoff threshold I used for the ROC. My only guess is that the y-axis is the sum of squared residuals, but if that's the case I'd have to get the residuals after each iteration of the threshold?
Please explain what the axes are and how one goes about computing them.
--Edit--
For clarity, I don't need a proof or a line of code. Because I use a different statistical software, it's much more useful to have someone explain conceptually (with minimal jargon) how to compute the Y axis. That way I can write it in terms of my software's language.
Thank you
I will try to make this as clear as possible. The term 'cost function' is used in multiple contexts and can have multiple meanings. Usually, when we use the term for a regression model, we naturally think of minimizing the sum of squared residuals.
That is not what is happening here. We are still interested in minimizing the function, but it is not something minimized inside a fitting algorithm the way the sum of squared residuals is. Let me elaborate on what the second graph means.
As @oshun correctly mentioned, the author of the R-bloggers post (where these graphs came from) wanted a single number with which to compare the "mistakes" of the classification at different threshold values. To create that measure he did something very intuitive and simple: he counted the false positives and false negatives at each threshold level. The function he used is:
sum(df$pred >= threshold & df$survived == 0) * cost_of_fp + #false positives
sum(df$pred < threshold & df$survived == 1) * cost_of_fn #false negatives
I deliberately split the above into two lines. The first line counts the false positives (prediction >= threshold means the algorithm classified the passenger as survived, but in reality they didn't - i.e. survived equals 0). The second line does the same thing but counts the false negatives (i.e. those predicted as not survived who in reality did survive).
Now that leaves us with what cost_of_fp and cost_of_fn are. These are nothing more than weights, set arbitrarily by the user. In the example above the author used cost_of_fp = 1 and cost_of_fn = 3. This simply means that, as far as the cost function is concerned, a false negative is 3 times more costly than a false positive: each false negative contributes 3 to the total (the weighted sum of false positives and false negatives, which is what the cost function returns) instead of 1.
To sum up, the y-axis in the graph above is just:
false_positives * weight_fp + false_negatives * weight_fn
for every value of the threshold (which is used to calculate the false_positives and false_negatives).
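If it helps, here is a rough R sketch of how that whole curve could be computed (assuming, as in the post, a data frame df with a column pred of predicted probabilities and a column survived of 0/1 outcomes, and the author's weights of 1 and 3):
cost_of_fp <- 1
cost_of_fn <- 3
thresholds <- seq(0, 1, by = 0.01)
cost <- sapply(thresholds, function(threshold) {
  sum(df$pred >= threshold & df$survived == 0) * cost_of_fp +  # false positives
  sum(df$pred <  threshold & df$survived == 1) * cost_of_fn    # false negatives
})
plot(thresholds, cost, type = "l", xlab = "threshold", ylab = "cost")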
I hope this is clear now.
I'm a bit puzzled by the behavior of the R density() function in an edge case...
Suppose I add more and more points with x=0 into a simulated data set. What I expect is that the density estimate will very quickly converge (I'm being deliberately vague about what that means...) to a delta function at x=0. In practice, the fit certainly gets narrower, but very slowly, as shown by this sequence of plots:
plot(density(c(0,0)), xlim=c(-2,2))
plot(density(c(0,0,0,0)), xlim=c(-2,2))
plot(density(c(rep(0,10000))), xlim=c(-2,2))
plot(density(c(rep(0,10000000))), xlim=c(-2,2))
But if you add a tiny bit of noise to the simulated data, the behavior is much better:
plot(density(0.0000001*rnorm(10000000) + c(rep(0,10000000))), xlim=c(-2,2))
Just let sleeping dogs lie? Or am I missing something about the usage of density()?
Per ?bw.nrd0, the default bandwidth selector for density:
bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman's ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31)) unless the quartiles coincide when a positive result will be guaranteed.
When your data are constant, the quartiles coincide, so the last clause guaranteeing a positive result kicks in. This basically means that the chosen bandwidth is not a continuous function of the spread of the data at zero spread.
To illustrate:
> bw.nrd0(rep(0, 1e6))
[1] 0.05678616
> bw.nrd0(rnorm(1e6, s=1e-6))
[1] 5.672872e-08
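To make the rule concrete, here is a rough sketch of the computation described in that help text (a simplification of what bw.nrd0 does, not its actual source):
silverman <- function(x) {
  spread <- min(sd(x), IQR(x) / 1.34)
  if (spread == 0) spread <- 1    # crude stand-in for the "positive result" fallback
  0.9 * spread * length(x)^(-1/5)
}
silverman(rep(0, 1e6))            # 0.9 * 1e6^(-1/5) = 0.0568..., matching bw.nrd0 above
silverman(rnorm(1e6, sd = 1e-6))  # tiny, because the spread itself is tiny
For constant data the spread is zero, so the fallback keeps the bandwidth at a fixed, data-independent value instead of letting it shrink with the spread.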
Actually (...tail between legs...) I now realize that my entire question was misguided. Being fairly new to R, I had instantly assumed that density() tries to fit Gaussians of different widths to the data points, optimizing both the number of Gaussians and their individual widths. But in fact it is doing something much simpler. It just smears out each data point, and adds up the smears to give a smoothed estimate of the data. density() is just a simple smoothing algorithm. So, yes indeed, RTFM :)
How to treat p-values in R?
I am expecting very low p-values, like:
1.00E-80
and I need to take -log10 of them:
-log10(1.00E-80)
-log10(0) is Inf, but that Inf is only an artifact of underflow/rounding.
It seems that beyond 1.00E-308, R yields 0:
1/10^308
[1] 1e-308
1/10^309
[1] 0
Is the precision of the p-values reported by the lm function limited by the same cutoff, 1e-308? Or is it simply that a cutoff is needed, and I should choose a different one - such as 1e-100 - and replace 0 with <1e-100?
There are a variety of possible answers -- which one is most useful depends on the context:
R is indeed incapable under ordinary circumstances of storing floating-point values closer to zero than .Machine$double.xmin, which varies by platform but is typically (as you discovered) on the order of 1e-308. If you really need to work with numbers this small and can't find a way to work on the log scale directly, search Stack Overflow or the R wiki for methods of dealing with arbitrary/extended-precision values (but you should probably try to work on the log scale -- it will be much less of a hassle).
in many circumstances R actually computes p-values on the (natural) log scale internally, and can, if requested, return the log values rather than exponentiating them before giving the answer. For example, dnorm(-100,log=TRUE) gives -5000.919. You can convert directly to the log10 scale (without exponentiating and then taking log10) by dividing by log(10): dnorm(-100,log=TRUE)/log(10) is about -2172; the corresponding value (on the order of 1e-2172) would be far too small to represent in floating point. For the p*** (cumulative distribution function) functions, use log.p=TRUE rather than log=TRUE. (This particular point depends heavily on your particular context; even if you are not using built-in R functions, you may be able to find a way to extract results on the log scale.)
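A minimal illustration of log.p=TRUE (not part of the original example): pnorm(-40) is roughly 7e-350, which underflows to zero in double precision, but the log-scale version still works:
pnorm(-40)                          # 0: the true value underflows
pnorm(-40, log.p = TRUE)            # about -804.6 (natural-log scale)
pnorm(-40, log.p = TRUE) / log(10)  # about -349.4, i.e. p is roughly 1e-349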
in some cases R presents p-value results as being <2.2e-16 even when a more precise value is known: (t1 <- t.test(rnorm(10,100),rnorm(10,80)))
prints
....
t = 56.2902, df = 17.904, p-value < 2.2e-16
but you can still extract the precise p-value from the result
> t1$p.value
[1] 1.856174e-18
(in many cases this behaviour is controlled by the format.pval() function)
An illustration of how all this would work with lm:
d <- data.frame(x=rep(1:5,each=10))
set.seed(101)
d$y <- rnorm(50,mean=d$x,sd=0.0001)
lm1 <- lm(y~x,data=d)
summary(lm1) prints the p-value of the slope as <2.2e-16, but if we use coef(summary(lm1)) (which does not use the p-value formatting), we can see that the value is 9.690173e-203.
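For completeness, a quick way to pull that exact number out of the fitted model above:
coef(summary(lm1))["x", "Pr(>|t|)"]   # 9.690173e-203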
A more extreme case:
set.seed(101); d$y <- rnorm(50,mean=d$x,sd=1e-7)
lm2 <- lm(y~x,data=d)
coef(summary(lm2))
shows that the p-value has actually underflowed to zero. However, we can still get an answer on the log scale:
tval <- coef(summary(lm2))["x","t value"]
(log(2) + pt(abs(tval), df = 48, lower.tail = FALSE, log.p = TRUE)) / log(10)
gives roughly -346, i.e. a p-value on the order of 1e-346. (Note that the two-sided p-value is 2*P, so on the log scale we add log(2) rather than multiplying the log tail probability by 2. You can check this approach on the previous example, where the p-value doesn't underflow, and see that you recover the value reported by coef(summary(lm1)).)
Small numbers are generally hard to deal with.
The limit you are running into is caused by R's use of double-precision floating point. From ?double:
All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308.
http://en.wikipedia.org/wiki/Double_precision_floating-point_format
You may find the Rmpfr package helpful here as it allows you to create multiple precision numbers.
install.packages("Rmpfr")
require(Rmpfr)
# Note: the small number must be built with mpfr arithmetic; 1/10^309 would already
# underflow to 0 in double precision before mpfr ever saw it.
log(1/mpfr(10, precBits = 500)^309)
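And since the original question was about -log10, the same idea gives the base-10 exponent directly (a quick sketch, assuming Rmpfr handles the arithmetic as above):
-(log(1/mpfr(10, precBits = 500)^309) / log(10))   # 309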
I have two heavily unbalanced datasets, labelled positive and negative, and I am able to generate a confusion matrix which yields a ~95% true positive rate (and hence a 5% false negative rate) and a ~99.5% true negative rate (0.5% false positive rate).
The problem when I try to build an ROC graph is that the x-axis does not range from 0 to 1 in intervals of 0.1. Instead, it only ranges from 0 to something like 0.04, given my very low false positive rate.
Any insight as to why this happens?
Thanks
In an ROC graph, the two axes are the rate of false positives (F) and the rate of true positives (T). T is the probability that, given a positive data item, your algorithm classifies it as positive. F is the probability that, given a negative data item, your algorithm incorrectly classifies it as positive. The axes always run from 0 to 1, and if your algorithm is not parametric you should end up with a single point (or two, one per dataset) on the ROC graph instead of a curve. You get a curve if your algorithm is parametric, in which case the curve is traced out by different values of the parameter(s).
See http://www2.cs.uregina.ca/~dbd/cs831/notes/ROC/ROC.html
I have figured it out. I used Platt's algorithm to extract the probability of a positive classification and sorted the dataset, highest probability first. I iterated through the dataset: any positive example (positive by its true label, not by its classification) increments the true-positive count, while any negative example increments the false-positive count.
Think of it as the SVM's decision boundary, which separates the two classes (+ve and -ve), moving gradually from one side to the other. Here I'm imagining points on a 2D plane. As the boundary moves, it uncovers examples; any uncovered examples whose true label is positive count as true positives, and any negatives count as false positives.
Hope this helps. It took me days to figure out something so trivial due to the lack of information on the net (or just my lack of understanding of SVMs in general). This is especially aimed at those who are using CvSVM in the OpenCV package. As you might be aware, CvSVM does not return probability values; instead, it returns a value based on the distance function. You do not need to use Platt's algorithm to extract an ROC curve based on probabilities - you can use the distance values themselves. Say, for example, you start the distance threshold at 10 and decrement it slowly until you've covered the whole dataset. I found probabilities easier to visualise, so to each his own.
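For what it's worth, here is a rough sketch of that sweep written in R (scores and labels are hypothetical vectors holding the classifier scores - Platt probabilities or SVM distances - and the 0/1 ground-truth labels):
ord <- order(scores, decreasing = TRUE)               # highest score first
labels_sorted <- labels[ord]
tpr <- cumsum(labels_sorted == 1) / sum(labels == 1)  # true positive rate after each cut
fpr <- cumsum(labels_sorted == 0) / sum(labels == 0)  # false positive rate after each cut
plot(fpr, tpr, type = "l", xlim = c(0, 1), ylim = c(0, 1))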
Please mind my english as it's not my first language
I am new to R and cointegration, so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the western power system in Canada/US. The frequency is hourly (common in power), and cointegrated combinations can involve as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see, ca.jo finds linear combinations of the 3 variables but forces the coefficient on the first variable (in this case V1) to be 1, i.e. treats it as the dependent variable. My understanding was that ca.jo would try all combinations, so that every variable gets a turn as the dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them were cointegrated (i.e. V1 ~ V2 + V3), then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to take a stab at this one. EDIT: I notice I have just answered a 4-year-old question; hopefully it will still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure, but I will try to give some general insight. The first thing the Johansen procedure does is build a VECM from the VAR model that best corresponds to the data (this is why you need the lag length of the VAR as an input to the procedure). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, the rank of that matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the comparability with the ADF procedure for each distinct row of the model.
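In standard notation (a generic sketch, not tied to the author's data), the VECM for a VAR of lag order p is
Δy(t) = Π y(t-1) + Γ1 Δy(t-1) + ... + Γ(p-1) Δy(t-p+1) + ε(t),   with Π = α β'
The "non-lagged component matrix" referred to here is Π; its rank equals the number of cointegrating vectors, and the columns of β are those vectors.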
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of less than or equal to r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
library(vars)   # for VAR()
library(urca)   # for ca.jo()

# Fit a VAR to the data to obtain the optimal lag length; use the SC (BIC) information criterion.
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")
# Obtain the lag length of the VAR that best fits the data (ca.jo needs K >= 2)
lagLength <- max(2, varest$p)
# Perform the Johansen procedure for cointegration
# Allow intercepts in the cointegrating vector: data without zero mean
# Use the trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval
# If the test statistic for r <= 0 is greater than the corresponding critical value,
# then r <= 0 is rejected and we have at least one cointegrating vector.
# We use the 90% confidence level to make our decision.
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1])
{
  # Return the eigenvector that has the maximum eigenvalue. Note: we throw away the constant!
  return(res@V[1:ncol(yourData), which.max(res@lambda)])
}
This piece of code checks whether there is at least one cointegrating vector (i.e. whether r <= 0 is rejected) and then returns the vector with the strongest cointegrating properties - in other words, the vector associated with the largest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you get 3 different vectors. It is my understanding that the method simply scales/normalises each vector so that its first component equals 1.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations, or linear transformations of them (combinations of the cointegrating vectors), will be stationary. However, I'm not aware of how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
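As a rough sketch of that strategy in R (assuming yourData is a data frame of the series; this brute-force scan is only practical for a modest number of series):
library(urca)
# Test every pair of series first; the same idea extends to triples, etc.
for (subsetVars in combn(colnames(yourData), 2, simplify = FALSE)) {
  res <- ca.jo(yourData[, subsetVars], type = "trace", ecdet = "const", K = 2, spec = "longrun")
  # Compare the r = 0 trace statistic (the last entry) with its 10% critical value
  if (res@teststat[length(res@teststat)] >= res@cval[nrow(res@cval), 1]) {
    cat("Possible cointegration between:", paste(subsetVars, collapse = ", "), "\n")
  }
}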
I've been searching for an answer to this and I think I found one so I'm sharing with you hoping it's the right solution.
By using the Johansen test you test for the rank (the number of cointegrating vectors), and it also returns the eigenvectors, along with the alphas and betas used to build those vectors.
In theory, if you reject r=0 and accept r=1 (the test value for r=0 is above its critical value and the one for r=1 is below), you would search for the highest eigenvalue and build your vector from it. In this case, if the highest eigenvalue were the first, the vector would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables.
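As a small sketch in R (assuming yourData contains columns V1, V2 and V3, and using the first eigenvector from the output above):
cointVec <- c(1, -0.2597057, -0.6443270)                            # first column of the eigenvector matrix
spread <- as.matrix(yourData[, c("V1", "V2", "V3")]) %*% cointVec   # the cointegration residuals
plot(spread, type = "l")                                            # should look mean-reverting if the relationship holds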
Again, I'm not 100% sure, but I'm pretty sure the above is how it works.
Nonetheless, you can always use the cajools function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the number of ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.
Consider a vector V riddled with noisy elements. What would be the fastest (or any) way to find a reasonable maximum element?
For e.g.,
V = [1 2 3 4 100 1000]
rmax = 4;
I was thinking of sorting the elements and finding the second differential {i.e. diff(diff(unique(V)))}.
EDIT: Sorry about the delay.
I can't post any representative data since it contains 6.15e5 elements. But here's a plot of the sorted elements.
By just looking at the plot, a piecewise linear function may work.
Anyway, regarding my previous conjecture about using differentials, here's a plot of diff(sort(V));
I hope it's clearer now.
EDIT: Just to be clear, the desired "maximum" value would be the value right before the step in the plot of the sorted elements.
NEW ANSWER:
Based on your plot of the sorted amplitudes, your diff(sort(V)) algorithm would probably work well. You would simply have to pick a threshold for what constitutes "too large" a difference between the sorted values. The first point in your diff(sort(V)) vector that exceeds that threshold is then used to get the threshold to use for V. For example:
diffThreshold = 2e5;
sortedVector = sort(V);
index = find(diff(sortedVector) > diffThreshold,1,'first');
signalThreshold = sortedVector(index);
Another alternative, if you're interested in toying with it, is to bin your data using HISTC. You would end up with groups of highly-populated bins at both low and high amplitudes, with sparsely-populated bins in between. It would then be a matter of deciding which bins you count as part of the low-amplitude group (such as the first group of bins that contain at least X counts). For example:
binEdges = min(V):1e7:max(V); % Create vector of bin edges
n = histc(V,binEdges); % Bin amplitude data
binThreshold = 100; % Pick threshold for number of elements in bin
index = find(n < binThreshold,1,'first'); % Find first bin whose count is low
signalThreshold = binEdges(index);
OLD ANSWER (for posterity):
Finding a "reasonable maximum element" is wholly dependent upon your definition of reasonable. There are many ways you could define a point as an outlier, such as simply picking a set of thresholds and ignoring everything outside of what you define as "reasonable". Assuming your data has a normal-ish distribution, you could probably use a simple data-driven thresholding approach for removing outliers from a vector V using the functions MEAN and STD:
nDevs = 2; % The number of standard deviations to use as a threshold
index = abs(V-mean(V)) <= nDevs*std(V); % Index of "reasonable" values
maxValue = max(V(index)); % Maximum of "reasonable" values
I would not sort then difference. If you have some reason to expect continuity or bounded change (the vector is of consecutive sensor readings), then sorting will destroy the time information (or whatever the vector index represents). Filtering by detecting large spikes isn't a bad idea, but you would want to compare the spike to a larger neighborhood (2nd difference effectively has you looking within a window of +-2).
You need to describe formally the expected information in the vector, and the type of noise.
You need to know the frequency and distribution of errors and non-errors. In the simplest model, the elements in your vector are independent and identically distributed, and errors are all or none (you randomly choose to store the true value, or an error). You should be able to figure out for each element the chance that it's accurate, vs. the chance that it's noise. This could be very easy (error data values are always in a certain range which doesn't overlap with non-error values), or very hard.
To simplify: don't make any assumptions about what kind of data an error produces (the worst case is that you can't rule out any of the error data points as ridiculous, and they all lie at or above the maximum among the non-error measurements). Then, if the probability of error is p and your vector has n elements, the chance that the k-th highest element in the vector is less than or equal to the true maximum is given by the cumulative binomial distribution - http://en.wikipedia.org/wiki/Binomial_distribution
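Spelled out (under those worst-case assumptions, treating errors as lying strictly above the true maximum): the k-th highest element is a genuine measurement, and hence no larger than the true maximum, exactly when fewer than k of the n elements are errors, so
P(k-th highest <= true max) = sum over i = 0..k-1 of C(n, i) * p^i * (1 - p)^(n - i)
which is just the binomial CDF evaluated at k - 1.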
First, pick your favorite method for identifying outliers...
If you expect the numbers to come from a normal distribution, you can use, say, 2 standard deviations above the mean to determine your max.
Do you have access to bounds on your noise-free elements? For example, do you know that your noise-free elements are between -10 and 10?
In that case, you could remove the noise and then find the max:
max( V( V >= -10 & V <= 10 ) )