I am doing some analysis on a binomial GLM that I fitted earlier in R. While looking at my data, I figured out that the suitable cutoff point for my binary outcome should be 0.75 instead of 0.5. I am trying to get the cost() function passed to cv.glm() (from the boot package) to use the 0.75 cutoff point, but I have failed to get the right syntax.
I know for the 0.5 cutoff we normally use:
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
Can someone show me the right way to change the cutoff point in this function (let's stick with 0.75)?
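My best guess so far is something along these lines, where an observation is classed as 1 when the fitted probability pi exceeds 0.75 and the misclassification rate is taken against the observed outcome r, but I am not sure it is right (my.data and my.glm are just placeholders):
cost75 <- function(r, pi = 0) mean(as.numeric(pi > 0.75) != r)
# cv.err <- cv.glm(my.data, my.glm, cost = cost75, K = 10)$delta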
I am using the useful gratia package by Gavin Simpson to extract the difference between two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance at the 0.05 level, or regions of statistical significance, is assessed by whether and where the y = 0 line falls outside the CI, the y axis representing the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is reported for each fitted smooth, testing the null hypothesis that its coefficients are all equal to zero, based on a chi square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The release post for gratia 0.4 explains the difference_smooths() function, although gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the by-factor interaction is to use the ci_level option of difference_smooths(); the default is 0.95. The ci_level can be varied to find the level at which the y = 0 line is no longer within the CI band. If, for example, this occurs at ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory: it takes a little manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI. A function could be written to search the data frame returned by difference_smooths() as ci_level is varied (see the sketch below), but that is not totally satisfactory either, because detecting a CI that excludes zero depends on the 100 points difference_smooths() chooses to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM using mgcv, so that shouldn't be too much of a problem.
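Something along these lines might serve as that search; it is only a sketch, and it assumes the interval columns returned by difference_smooths() are named lower and upper (check names(diff1) for your gratia version):
find_drop_level <- function(model, smooth, levels = seq(0.99, 0.50, by = -0.01)) {
  for (lev in levels) {
    d <- difference_smooths(model, smooth = smooth, ci_level = lev)
    # zero lies outside the band somewhere if the lower limit is above 0
    # or the upper limit is below 0 at any evaluation point
    if (any(d$lower > 0 | d$upper < 0)) return(lev)
  }
  NA
}
# approximate p value: 1 - find_drop_level(m1, "s(dep_var)")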
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89, so an approximate p value would be 1 - 0.88 = 0.12.
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure that using the criterion >= 0 (for negative diffs) is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, and take that percentage as the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear that the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth difference did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
(vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(1:8, each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if the coefficients are not all equal to each other then they cannot all be equal to zero, but that is not quite the test we want here.
The smooth tests of fit given by summary(gam.model1) in mgcv are based on a joint test that all of a smooth's coefficients are equal to zero. This would be akin to a likelihood ratio type test in which model fit with and without a term is compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I have got this far, I have had a rethink of your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient-difference distribution (in my case 8), count how often this occurs, and build a table in which each row (8 in my case) corresponds to one of these distributions, with two columns holding this count and (number of simulation draws minus count). Then run a chi square test on this table (a sketch of what I did is below). When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
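For concreteness, here is roughly what I did, continuing from the sim matrix created in the simulation above (my own reconstruction, so I may have the details of the suggestion wrong):
# for each of the 8 coefficient differences, count how many posterior draws
# are >= 0, then run a chi square test on the resulting 8 x 2 table of counts
counts <- colSums(sim >= 0)
tab <- cbind(ge.zero = counts, lt.zero = no.draws - counts)
chisq.test(tab)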
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>%
  mutate(interact.var = ifelse(factor.2levels == "yes", 1, 0) * cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), with "no" as the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interaction variable.
Then we place this interaction variable in the GAM and get the usual statistical test of fit, that is, a test that all of its coefficients are equal to zero.
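A sketch of what I mean (the model formula and variable names here are my own assumptions based on the description above, not a tested model):
m2 <- gam(outcome ~ s(cont.var) + s(interact.var) + factor.2levels,
          data = my.dat)
summary(m2)
# the approximate p value reported for s(interact.var) tests whether the
# smooths for the two factor levels differ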
@GavinSimpson actually posted a method for getting the difference between two smooths and assessing its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable, which causes mgcv::gam to produce difference smooths relative to the reference level. Statistical significance of the difference smooths is then tested in the usual way with summary() on the gam model.
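As a rough sketch of that ordered-factor approach using the variable names from my example above (so the details are assumptions):
my.data$fact_var_ord <- ordered(my.data$fact_var)
m_ord <- gam(outcome ~ fact_var_ord + s(dep_var) +
               s(dep_var, by = fact_var_ord), data = my.data)
summary(m_ord)
# with an ordered by-factor, mgcv drops the reference level, so the by-smooth
# represents the difference from the reference smooth and its summary() p value
# tests that difference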
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested (see the main post under the heading 'Follow up thought Feb 24', where the interaction variable is created) gives an almost identical p value for the difference smooth but does not change the smooth for the main effect. Nor does it change the intercept or the linear term for the by categorical variable, both of which also changed with the ordered-factor approach.
I am trying to plot a ROC curve for an identifier used to determine positive incidences against a background dataset. The identifier is a list of probability scores, with some overlap between the two groups.
FG BG
0.02 0.10
0.03 0.25
0.02 0.12
0.04 0.16
0.05 0.45
0.12 0.31
0.13 0.20
(where FG = Positive and BG = Negative.)
I am plotting a ROC curve using PRROC in R to assess how well the identifier classifies the data into the correct group. Although there is a clear distinction between the classifier values produced for the positive and negative datasets, my current ROC plot in R shows a low AUC value. My probability scores for the positive data are lower than those for the background, so if I switch the classification around and use the background as the foreground points, I get a high-scoring AUC curve. I am not 100% clear why this is the case, which plot is best to use, or whether there is an additional step I have missed before analysing my data.
roc <- roc.curve(scores.class0 = FG, scores.class1 = BG, curve = T)
ROC curve
Area under curve:
0.07143
roc2 <- roc.curve(scores.class0 = BG, scores.class1 = FG, curve = T)
ROC curve
Area under curve:
0.92857
As you have indeed noticed, most ROC analysis tools assume that the scores in your positive class are higher than those of the negative class. More formally, an instance is classified as "positive" if X > T, where T is the decision threshold, and negative otherwise.
There is no fundamental reason for it to be so. It is perfectly valid to have a decision rule such as X < T; however, most ROC software doesn't offer that option.
Your first call, which gives AUC = 0.07143, would imply that your classifier performs worse than random, which is not really the case here.
As you noticed, swapping the class labels yields the expected value.
This works because ROC curves are insensitive to class distributions, so the classes can be swapped without a problem.
However, I wouldn't personally recommend that option. I can see two cases where it can be misleading:
someone else looking at the code (or you in a few months) might think the classes are wrong and "fix" them;
or you might want to apply the same code to PR curves, which are sensitive to class distributions and where you cannot swap the classes.
An alternative, and preferable, approach is to invert your scores for this analysis so that the positive class effectively has higher scores; negating the scores simply turns the decision rule X < T into -X > -T:
roc <- roc.curve(scores.class0 = -FG, scores.class1 = -BG, curve = T)
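With the example data above, negating the scores should reproduce the 0.92857 area from the swapped call, while keeping FG as the positive class.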
I have a vector of observations and would like to obtain an empirical p value for each observation with R. I don't know the underlying distribution, and what I currently do is just
ay <- runif(100, 0, 1000)
quantile(ay)
However, that does not really give me a p-value. How can I obtain a p-value empirically?
I think this is what you're looking for:
rank(ay)/length(ay)
I think what you want is the ecdf function. This returns an empirical cumulative distribution function, which you can apply directly
ay <- runif(100)
aycdf <- ecdf(ay)
And then
> aycdf(c(.1, .5, .7))
[1] 0.09 0.51 0.73
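If what is actually wanted is an upper-tail empirical p value for each observation (my assumption about the goal), that would be one minus the empirical CDF evaluated at the observations, e.g.
p.emp <- 1 - aycdf(ay)  # proportion of the sample strictly greater than each value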
The data follow a gamma-like distribution.
Replicating the data would go something like this:
a) first find the distribution parameters of the true data:
library(fitdistrplus)
fitdist(datag, "gamma", optim.method = "Nelder-Mead")
b) Use the parameters shape, rate, scale to simulate data:
data <- rgamma(10000, shape = 0.6, rate = 4.8)  # scale = 1/rate, so give only one of rate/scale
Finding quantiles using the qgamma function in R would just be:
EDIT:
qgamma(seq(1, 0.1, by = -0.1), shape = 0.6, rate = 4.8)
How can I find quantiles for my true data (not simulated with rgamma)?
Please note that the quantile function in R returns the desired quantiles of the true data (datag), but these are, as I understand it, calculated assuming the data are normally distributed. As you can see, they are clearly not.
quantile(datag, seq(0,1, by=0.1), type=7)
What function in R should I use, or how else can I obtain the quantiles for this highly skewed data?
In addition, would the following make some sense? I am still not getting the lower values, though.
Fn <- ecdf(datag)
Fn(seq(0.1,1,by=0.1))
Quantiles are returned by the "q" functions, in this case qgamma. For your data, eyeball integration suggests that most of the data is to the left of 0.2, and if we ask for the 0.8 quantile we see that 80% of the data in the estimated distribution is to the left of:
qgamma(.8, shape=0.6, rate=4.8)
#[1] 0.20604
Seems to agree with what you have plotted. If you wanted the 0.8 quantile in the sample you have, then just:
quantile(datag, 0.8)
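If instead the aim is quantiles of the gamma distribution fitted to the real data, the estimated parameters can be plugged into qgamma. A sketch, assuming the fit comes from fitdistrplus::fitdist as used above:
library(fitdistrplus)
fit <- fitdist(datag, "gamma", optim.method = "Nelder-Mead")
qgamma(seq(0.1, 0.9, by = 0.1),
       shape = fit$estimate["shape"], rate = fit$estimate["rate"])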
So I am trying to see how close the sample size calculations (for two-sample independent proportions with unequal sample sizes) are between proc power in SAS and some sample size functions in R. I am using the data found here at a UCLA website.
The UCLA site gives parameters as follows:
p1 = .3, p2 = .15, power = .8, null difference = 0; for the two-sided tests it assumes equal sample sizes.
For the unequal sample size tests the parameters are the same, with group weights of 1 for group 1 and 2 for group 2, and the tests they perform are one-sided.
I am using the R function
pwr.t.test(n = NULL, d = 0, power = 0.8, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
from the pwr package.
So if I input the parameter selections as the UCLA site has for their first example, I get the following error:
Error in uniroot(function(n) eval(p.body) - power, c(2, 1e+07)) :
f() values at end points not of opposite sign.
This appears to be because the difference is undetectable by R. I set d = .5 and it ran. Would SAS give an error as well for too small a difference? It doesn't in the example, as their null difference is zero too.
I also get the error above when using
pwr.2p.test(h = 0, n = NULL, sig.level = .05, power = .8)
and
pwr.chisq.test(w = 0, N = NULL, df = 1, sig.level = .05, power = .8).
I may be doing something horribly wrong, but I can't seem to find a way to make this work when the hypothesized difference is 0.
I understand that SAS and r are using different methods for calculating the power, so I shouldn't expect to get the same result. I am really just trying to see if I can replicate proc power results in r.
I have been able to get near identical results for the first example with equal sample sizes and a two-sided alternative using
bsamsize(p1=.30,p2=.15,fraction=.5, alpha=.05, power=.8)
from the Hmisc package. But when they do 1-sided tests with unequal sample sizes I can't replicate those.
Is there a way to replicate the process in r for the 1-sided sample size calculations for unequal group sizes?
Cheers.
In pwr.t.test and its derivatives, d is not the null difference (that's assumed to be zero), but the effect size/hypothesized difference between the two populations. If the difference between population means is zero, no sample size will let you detect a nonexistent difference.
If population A has a proportion of 15% and population B has a proportion of 30%, then you use the function pwr::ES.h to calculate the effect size and do a test of proportions like:
> pwr.2p.test(h=ES.h(0.30,0.15),power=0.80,sig.level=0.05)
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.3638807
n = 118.5547
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: same sample sizes
> pwr.chisq.test(w=ES.w1(0.3,0.15),df=1,sig.level=0.05,power=0.80)
Chi squared power calculation
w = 0.2738613
N = 104.6515
df = 1
sig.level = 0.05
power = 0.8
NOTE: N is the number of observations
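As for the original question about one-sided tests with unequal (1:2) group sizes, one possibility, offered only as a sketch of my own rather than part of the answer above, is to fix the allocation ratio and solve for n1 using pwr.2p2n.test:
library(pwr)
h <- ES.h(0.30, 0.15)
# power of a one-sided two-proportion test with group 2 twice the size of group 1
pow.gap <- function(n1) {
  pwr.2p2n.test(h = h, n1 = n1, n2 = 2 * n1,
                sig.level = 0.05, alternative = "greater")$power - 0.80
}
n1 <- uniroot(pow.gap, c(2, 1e6))$root
ceiling(n1)      # group 1 sample size
ceiling(2 * n1)  # group 2 sample size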