library(ROCR)
pred1 <- prediction(predictions = glm.prob2, labels = test_data$Direction)
perf1 <- performance(pred1, measure = "TP.rate", x.measure = "FP.rate")
plot(perf1)
I keep getting the following error message:
Wrong argument types: First argument must be of type 'prediction'; second and optional third argument must be available performance measures!
How can I get the ROC curve for this?
As the error suggests, your measure and x.measure arguments are invalid.
The documentation of the performance function lists the following options to choose from:
‘acc’: Accuracy. P(Yhat = Y). Estimated as: (TP+TN)/(P+N).
‘err’: Error rate. P(Yhat != Y). Estimated as: (FP+FN)/(P+N).
‘fpr’: False positive rate. P(Yhat = + | Y = -). Estimated as:
FP/N.
‘fall’: Fallout. Same as ‘fpr’.
‘tpr’: True positive rate. P(Yhat = + | Y = +). Estimated as:
TP/P.
‘rec’: Recall. Same as ‘tpr’.
‘sens’: Sensitivity. Same as ‘tpr’.
‘fnr’: False negative rate. P(Yhat = - | Y = +). Estimated as:
FN/P.
‘miss’: Miss. Same as ‘fnr’.
‘tnr’: True negative rate. P(Yhat = - | Y = -). Estimated as: TN/N.
‘spec’: Specificity. Same as ‘tnr’.
‘ppv’: Positive predictive value. P(Y = + | Yhat = +). Estimated
as: TP/(TP+FP).
‘prec’: Precision. Same as ‘ppv’.
‘npv’: Negative predictive value. P(Y = - | Yhat = -). Estimated
as: TN/(TN+FN).
‘pcfall’: Prediction-conditioned fallout. P(Y = - | Yhat = +).
Estimated as: FP/(TP+FP).
‘pcmiss’: Prediction-conditioned miss. P(Y = + | Yhat = -).
Estimated as: FN/(TN+FN).
‘rpp’: Rate of positive predictions. P(Yhat = +). Estimated as:
(TP+FP)/(TP+FP+TN+FN).
‘rnp’: Rate of negative predictions. P(Yhat = -). Estimated as:
(TN+FN)/(TP+FP+TN+FN).
‘phi’: Phi correlation coefficient. (TP*TN -
FP*FN)/(sqrt((TP+FN)*(TN+FP)*(TP+FP)*(TN+FN))). Yields a
number between -1 and 1, with 1 indicating a perfect
prediction, 0 indicating a random prediction. Values below 0
indicate a worse than random prediction.
‘mat’: Matthews correlation coefficient. Same as ‘phi’.
‘mi’: Mutual information. I(Yhat, Y) := H(Y) - H(Y | Yhat), where
H is the (conditional) entropy. Entropies are estimated
naively (no bias correction).
‘chisq’: Chi square test statistic. ‘?chisq.test’ for details.
Note that R might raise a warning if the sample size is too
small.
‘odds’: Odds ratio. (TP*TN)/(FN*FP). Note that odds ratio produces
Inf or NA values for all cutoffs corresponding to FN=0 or
FP=0. This can substantially decrease the plotted cutoff
region.
‘lift’: Lift value. P(Yhat = + | Y = +)/P(Yhat = +).
‘f’: Precision-recall F measure (van Rijsbergen, 1979). Weighted
harmonic mean of precision (P) and recall (R). F = 1/
(alpha*1/P + (1-alpha)*1/R). If alpha=1/2, the mean is
balanced. A frequent equivalent formulation is F = (beta^2+1)
* P * R / (R + beta^2 * P). In this formulation, the mean is
balanced if beta=1. Currently, ROCR only accepts the alpha
version as input (e.g. alpha=0.5). If no value for alpha is
given, the mean will be balanced by default.
‘rch’: ROC convex hull. A ROC (=‘tpr’ vs ‘fpr’) curve with
concavities (which represent suboptimal choices of cutoff)
removed (Fawcett 2001). Since the result is already a
parametric performance curve, it cannot be used in
combination with other measures.
‘auc’: Area under the ROC curve. This is equal to the value of the
Wilcoxon-Mann-Whitney test statistic and also the probability
that the classifier will score a randomly drawn positive
sample higher than a randomly drawn negative sample. Since
the output of ‘auc’ is cutoff-independent, this measure
cannot be combined with other measures into a parametric
curve. The partial area under the ROC curve up to a given
false positive rate can be calculated by passing the optional
parameter ‘fpr.stop=0.5’ (or any other value between 0 and 1)
to ‘performance’.
‘prbe’: Precision-recall break-even point. The cutoff(s) where
precision and recall are equal. At this point, positive and
negative predictions are made at the same rate as their
prevalence in the data. Since the output of ‘prbe’ is just a
cutoff-independent scalar, this measure cannot be combined
with other measures into a parametric curve.
‘cal’: Calibration error. The calibration error is the absolute
difference between predicted confidence and actual
reliability. This error is estimated at all cutoffs by
sliding a window across the range of possible cutoffs. The
default window size of 100 can be adjusted by passing the
optional parameter ‘window.size=200’ to ‘performance’. E.g.,
if for several positive samples the output of the classifier
is around 0.75, you might expect from a well-calibrated
classifier that the fraction of them which is correctly
predicted as positive is also around 0.75. In a
well-calibrated classifier, the probabilistic confidence
estimates are realistic. Only for use with probabilistic
output (i.e. scores between 0 and 1).
‘mxe’: Mean cross-entropy. Only for use with probabilistic output.
MXE := - 1/(P+N) ( sum_{y_i=+} ln(yhat_i) + sum_{y_i=-} ln(1-yhat_i) ). Since the output of ‘mxe’ is just a
cutoff-independent scalar, this measure cannot be combined
with other measures into a parametric curve.
‘rmse’: Root-mean-squared error. Only for use with numerical class
labels. RMSE := sqrt(1/(P+N) sum_i (y_i - yhat_i)^2). Since
the output of ‘rmse’ is just a cutoff-independent scalar,
this measure cannot be combined with other measures into a
parametric curve.
‘sar’: Score combining performance measures of different
characteristics, in the attempt of creating a more "robust"
measure (cf. Caruana R., ROCAI2004): SAR = 1/3 * ( Accuracy +
Area under the ROC curve + Root mean-squared error ).
‘ecost’: Expected cost. For details on cost curves, cf.
Drummond & Holte 2000, 2004. ‘ecost’ has an obligatory x axis,
the so-called 'probability-cost function'; thus it cannot be
combined with other measures. While using ‘ecost’ one is
interested in the lower envelope of a set of lines, it might
be instructive to plot the whole set of lines in addition to
the lower envelope. An example is given in ‘demo(ROCR)’.
‘cost’: Cost of a classifier when class-conditional
misclassification costs are explicitly given. Accepts the
optional parameters ‘cost.fp’ and ‘cost.fn’, by which the
costs for false positives and negatives can be adjusted,
respectively. By default, both are set to 1.
So you should do something like:
perf1 <- performance(pred1, measure = "tpr", x.measure = "fpr")
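Putting it all together, a minimal sketch (assuming the glm.prob2 probabilities and the test_data$Direction labels from your code):

library(ROCR)
pred1 <- prediction(predictions = glm.prob2, labels = test_data$Direction)
perf1 <- performance(pred1, measure = "tpr", x.measure = "fpr")
plot(perf1)
abline(0, 1, lty = 2)  # reference line for a random classifier
# cutoff-free single-number summary:
performance(pred1, measure = "auc")@y.values[[1]]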
I want to plot the posterior distribution for data sampled from gamma(2,3) with a prior distribution of gamma(3,3). I am assuming alpha=2 is known. But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3. I even tried with a uniform prior to make things simpler. Can you please spot what's wrong? Thank you.
set.seed(101)
dat <- rgamma(100, shape = 2, rate = 3)
alpha <- 3
n <- 100
post <- function(beta_1) {
  posterior <- (((beta_1^alpha)^n) / gamma(alpha)^n) *
    prod(dat^(alpha - 1)) * exp(-beta_1 * sum(dat))
  return(posterior)
}
vlogl <- Vectorize(post)
curve(vlogl, from = 2, to = 6)
A tricky question, and possibly more related to statistics than to programming =). I initially made the same reasoning mistake as you, but subsequently realised I had to be more careful with the posterior and the roles of alpha and beta_1.
The prior is uniform (or flat) so the posterior distribution is proportional (not equal) to the likelihood.
The quantity you have assigned to the posterior is indeed the likelihood. Plugging in alpha=3, this evaluates to
(prod(dat^2)/(gamma(alpha)^n)) * beta_1^(3*n)*exp(-beta_1*sum(dat)).
This is the crucial step. The last two terms in the product depend on beta_1 only, so these two parts determine the shape of the posterior. The posterior distribution is thus gamma distributed with shape parameter 3*n + 1 and rate parameter sum(dat). The mode of a gamma distribution is (shape - 1)/rate, and sum(dat) is about 66 for this seed, so we get a mode of 300/66 (about 4.55). This coincides perfectly with the "posterior plot" produced by your code (again, you plotted the likelihood, which is not properly scaled, i.e. does not integrate to 1).
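A quick numerical check, reusing dat and n from the question's code:

(3 * n) / sum(dat)  # (shape - 1)/rate = 300/sum(dat), about 4.55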
I hope LifeisBetter now =).
But a graph of my posterior for different values of the rate parameter
centers around 4. It should be 3.
The mean of your data is 0.659 (~2/3). Given a gamma distribution with a shape parameter alpha = 3, we are trying to find likely values of the rate parameter, beta, that gave rise to the observed data (subject to our prior information). The mean of a gamma distribution is the shape parameter divided by the rate parameter. 100 observations should be enough to mostly overcome the somewhat informative prior (which had a mean of 1), so we should expect beta to take values somewhere in the region of alpha/mean(dat), not 3.
alpha/mean(dat)
#> [1] 4.54915
I'm not going to show the derivation of the posterior distribution for beta without TeX, but it is a gamma distribution whose shape is the prior shape plus n*alpha and whose rate is sum(dat) plus the rate parameter from the prior distribution of beta (betaPrior = 3):
set.seed(101)
n <- 100
dat <- rgamma(n, 2, 3)
alpha <- 3
betaPrior <- 3
post <- function(x) dgamma(x, alpha*(n + 1), sum(dat) + betaPrior)
curve(post, 2, 6)
Notice that the mean of beta is at ~4.39 rather than ~4.55 because of the informative prior that had a mean of 1.
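The posterior mean can be verified directly from the same quantities:

alpha * (n + 1) / (sum(dat) + betaPrior)  # about 4.39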
I am using the useful gratia package by Gavin Simpson to extract the difference between two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example:
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the smooths for each pair of levels of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether or where the y = 0 line crosses the CI, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is reported for each fitted smooth, testing the null hypothesis that the smooth's coefficients are all zero, based on a chi square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
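For reference, a quick way to inspect that output (exact column names vary by gratia version):

head(diff1)  # columns like smooth, by, level_1, level_2, diff, se, lower, upper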
There is a post on the release of gratia 0.4 that explains the difference_smooths() function, though gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the levels of the by factor is to manipulate the difference_smooths() function via its ci_level option (the default is 0.95). The ci_level can be varied to find the level at which y = 0 first falls outside the CI bands. If this occurs at ci_level = my_level, the approximate p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. It takes some manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI, although a function could be written to search the data frame returned by difference_smooths() as the ci_level is varied (see the sketch below). That is not totally satisfactory either, because the detection of a CI excluding zero depends on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM in mgcv, so that shouldn't be too much of a problem.
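A minimal sketch of such a search (assuming the model m1 from the question; the lower/upper column names are what gratia around version 0.6 returns from difference_smooths(), so check names(diff1) for your version):

library(gratia)
approx_p <- function(model, smooth, levels = seq(0.99, 0.50, by = -0.01)) {
  for (lv in levels) {
    d <- difference_smooths(model, smooth = smooth, ci_level = lv)
    # zero outside the band anywhere means the difference is excluded
    # at level lv, giving an approximate p value of 1 - lv
    if (any(d$lower > 0 | d$upper < 0)) return(1 - lv)
  }
  NA_real_  # zero never left the band over the levels searched
}
approx_p(m1, "s(dep_var)")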
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89, so an approximate p value would be 1 - 0.88 = 0.12.
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure that using the criterion >= 0 (for negative diffs) is a good way to go, because with draws from the posterior there are likely to be many diffs that meet it. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, and take that percentage as the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45-0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the difference did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get the posterior covariance matrices for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
# run the simulation to get the distribution of each
# difference coefficient using the joint variance
library(MASS)
no.draws <- 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
               (vp_level1 + vp_level0))
# sim is a no.draws x no. of coefficients (8 in my case) matrix
# put the results into a data.frame
y.group <- data.frame(y = as.vector(sim),
                      group = rep(1:8, each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if all coefficients are not equal to each other then they all can't be equal to zero but that isn't what we want here.
The basis of the smooth tests of fit given by summary() for an mgcv gam is a joint test that all of a smooth's coefficients are zero, akin to a likelihood ratio test comparing model fits with and without the term. I would appreciate some ideas on how to do this for the difference between two smooths.
Now that I have got this far, I have had a rethink of your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted this as: for each simulated coefficient-difference distribution (eight in my case), count how often this occurs, and build a table with one row per distribution and two columns holding this count and (number of simulation draws minus count); then run a chi square test on that table. When I did this, I got a very low p value when I believe I shouldn't have, as zero was well within the smooth-difference CI across almost all levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow-up thought, we could create a variable that represents the interaction between the by factor and the continuous variable:
library(dplyr)
my.dat <- my.dat %>%
  mutate(interact.var = ifelse(factor.2levels == "yes", 1, 0) * cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), with "no" as the reference level. The ifelse() call creates a dummy variable which is multiplied by the continuous variable to generate the interaction variable. Then we place this interaction variable in the GAM and get the usual statistical test for fit, that is, testing all of its coefficients == 0, as sketched below.
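For example, a sketch with the placeholder names above (outcome stands for whatever the response is; this is one possible form of the model, not necessarily the exact one I fitted):

library(mgcv)
# interact.var equals cont.var in the "yes" group and 0 in the "no" group,
# so its smooth captures the departure of "yes" from the reference curve
m.int <- gam(outcome ~ s(cont.var) + s(interact.var) + factor.2levels,
             data = my.dat)
summary(m.int)  # the test of s(interact.var) is the test of interest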
@GavinSimpson actually posted a method for getting the difference between two smooths and assessing its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
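A minimal sketch of that ordered-factor approach, using the placeholder names from my first example (and assuming fact_var has two levels, the first being the reference):

library(mgcv)
my.data$fact_ord <- ordered(my.data$fact_var)
m.ord <- gam(outcome ~ s(dep_var) + s(dep_var, by = fact_ord) + fact_ord,
             data = my.data)
summary(m.ord)
# s(dep_var) is now the reference-level smooth; the by-smooth is the
# difference from the reference, with its own approximate p value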
However, and correct me if I have misunderstood, the ordered-factor approach causes the smooth for the main effect to become the smooth for the reference level of the ordered factor.
The approach I suggested (see the main post under the heading "Follow up thought Feb 24"), where the interaction variable is created, gives an almost identical p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept or the linear term for the by categorical variable, both of which changed with the ordered-factor approach.
I am stuck on a problem reproducing data from an article (https://www.nature.com/articles/s41598-019-40993-w).
The attached picture shows the concentration-response curves (left); the right graph plots the Emax of each individual curve (E'MAX) against its logEC50 (pEC50). The lower equation is Eq. (6) in the manuscript.
From the manuscript,
"Simulated data (Fig. 1, left) were fitted with Eq. (10) and resulting E’MAX values were plotted against corresponding EC50 values (Fig. 1, right). Fitting Eq. (6) to the data yielded system EMAX = 0.98 ± 0.01 and logarithm of the equilibrium dissociation constant of the agonist-receptor complex KA = −5.99 ± 0.01 (parameter estimate ± SD). ."
To sum up, E'MAX and EC50 are given values; I need to fit Eq. (6) to them to estimate the EMAX and KA values.
I tried this in GraphPad Prism. First I made X and Y columns (pEC50, E'MAX) and entered the data from the graph. Then I set up a user-defined non-linear equation like Eq. (6):
Y= EMAX - (EMAX * 10^X)/KA
The rules for the initial values are EMAX = 1 (initial value, to be fit) and KA = 0 (initial value, to be fit).
The constraints for both EMAX and KA are left at the default, "No constraint".
When I run the fit, the parameters simply come back at their initial values (EMAX = 1.0, KA = 0.0) rather than the published estimates.
How can I correctly estimate the EMAX and KA values?
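For what it's worth, here is a sketch of the same fit in R with nls() (dat6 is a hypothetical data frame with columns pEC50 and Emax_prime read off the individual curve fits; this is not the authors' code). Note that starting KA at 0 puts a zero in the denominator of Eq. (6), which is likely why the fit never moves off the initial values:

# dat6: hypothetical data frame with pEC50 (log10 EC50) and Emax_prime (E'MAX)
fit6 <- nls(Emax_prime ~ Emax - (Emax * 10^pEC50) / KA,
            data = dat6,
            start = list(Emax = 1, KA = 1e-6))  # KA must start above zero
summary(fit6)  # compare log10 of the KA estimate with the published -5.99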
I have a continuous independent variable (let's say 'height') and a binary dependent variable (let's say 'gets a job'). I want to see what cutoff value for height best predicts one's ability to get a job, and also how accurate this model is. I assumed a multinomial logistic model. I wanted a ROC curve, so I used the ROCR package in R. This was my code:
library(nnet)   # for multinom()
library(ROCR)
mymodel <- multinom(job ~ height, data = dataset)
pred <- predict(mymodel, dataset, type = 'prob')
roc_pred <- prediction(pred, dataset$job)
roc <- performance(roc_pred, "tpr", "fpr")
plot(roc, colorize = T)
Now, this is my question. When I colorize the plot, it gives me the range of cutoff values used to make the plot. I'm a little confused as to what the cutoff values actually are, though. Are the cutoff values the heights, or the probability that a certain data point (person) with a certain height is able to get a job? I have a feeling it's the latter, but I am interested in the former. If it is the latter, how do I obtain the cutoff value for the height?
I found a video that explains the cutoffs you see: https://www.youtube.com/watch?v=YdNhNfJ4Vl8
There are many different ways to estimate optimal cutoffs: the Youden index, Sensitivity + Specificity, Distance to Corner, and many others (see this article).
I suggest you use the pROC package to do so:
library(pROC)
roc_obj <- roc(obs, fit, percent = TRUE)  # response first, then predictor
roc.out <- coords(roc_obj, "best", ret = c("threshold", "sens", "spec"),
                  transpose = TRUE)
method "best" uses the Younden index (J- index) The maximum value of the Youden index is 1 (perfect test) and the minimum is 0 when the test has no diagnostic value. The minimum occurs when sensitivity=1−specificity, i.e., represented by the equal line (the diagonal) in the ROC diagram. The vertical distance between the equal line and the ROC curve is the J-index for that particular cutoff. The J-index is represented by the ROC-curve itself.
I have a gamma distribution fit to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data falls?
Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define confidence ranges as [mean - k*sd, mean + k*sd]. To see why, consider if we selected k = 2 in the example above. This would yield the confidence range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of the data falls above 1.609. This is, at the very least, not a well-balanced confidence interval.
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as a confidence range. We would expect 2.5% of the data to fall below this range and 2.5% to fall above it. We can use qgamma to determine that, for our example parameters, the confidence range is [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446
The expected value of a gamma distribution is E[X] = k * theta and the variance is Var[X] = k * theta^2, where k is the shape and theta is the scale parameter. But typically I would use the central 95% quantile range to indicate data spread.
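Tying this back to the fitdistrplus fit in the question, a short sketch (fit.gamma is a hypothetical object returned by fitdist(x, "gamma")):

library(fitdistrplus)
pars <- coef(fit.gamma)  # named estimates: shape and rate
# central 95% range of the fitted gamma distribution
qgamma(c(0.025, 0.975), shape = pars["shape"], rate = pars["rate"])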