Is using 'adjust = "tukey"' in emmeans equivalent to a Tukey HSD test?

I have been looking around and I am quite confused about Tukey adjustment in emmeans.
The documentation of emmeans doesn't mention Tukey HSD at all, but here it is said that "For most contrast() results, adjust is often something else, depending on what type of contrasts are created. For example, pairwise comparisons default to adjust = "tukey", i.e., the Tukey HSD method.".
As I understand it, Tukey HSD is essentially a series of pairwise t-tests with an adjustment for type I error.
But the emmeans function calculates estimated marginal means (EMMs), which I assume are not pairwise t-tests; so applying the Tukey adjustment to emmeans output would not be equivalent to a Tukey HSD post hoc test.
A second, related question: what does the function "tukey.emmc", also from emmeans, do?
[Update]
I guess my second question is, what is the difference between tukey.emmc and contrast() with 'adjust = "tukey"'?

Using adjust = "tukey" means that critical values and adjusted P values are obtained from the Studentized range distribution, via qtukey() and ptukey() respectively. Those are the same critical values used in the Tukey HSD test. But to put a very fine edge on it, the Tukey HSD method is really defined only for independent samples of equal size, which may or may not be the case for emmeans() results. For more details, see ?summary.emmGrid and refer to the section on P-value adjustments.
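As a minimal sketch of where this shows up (using the built-in warpbreaks data, my choice of example rather than anything from the question):
library(emmeans)
fit <- lm(breaks ~ tension, data = warpbreaks)
emm <- emmeans(fit, "tension")
pairs(emm)                    # pairwise comparisons; the default is adjust = "tukey"
pairs(emm, adjust = "tukey")  # identical, with the adjustment spelled out
The printed summary notes which adjustment was used (something like "P value adjustment: tukey method for comparing a family of 3 estimates").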
Regarding the second question: pairwise.emmc() generates contrast coefficients for pairwise comparisons, as does revpairwise.emmc(). Here is a third possibility:
> emmeans:::tukey.emmc
function(levs, reverse = FALSE, ...) {
    if (reverse)
        revpairwise.emmc(levs, ...)
    else
        pairwise.emmc(levs, ...)
}
That is, tukey.emmc() invokes one of those pairwise-comparison methods depending on reverse. Thus, contrast(..., method = "tukey", reverse = TRUE) is equivalent to contrast(..., method = "revpairwise").
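A quick check of that equivalence, reusing the emm object from the sketch above:
contrast(emm, method = "tukey", reverse = TRUE)  # comparisons in reversed order ...
contrast(emm, method = "revpairwise")            # ... identical to this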
Every .emmc function passes a default adjustment method to contrast(), and in the case of pairwise.emmc() and tukey.emmc(), that default is adjust = "tukey". Thus, calling contrast(..., method = "pairwise") is the same as calling contrast(..., method = "pairwise", adjust = "tukey"). Other contrast functions may pass different defaults; for example, consec.emmc() passes the "mvt" adjustment by default:
> emmeans:::consec.emmc(1:4)
  2 - 1 3 - 2 4 - 3
1    -1     0     0
2     1    -1     0
3     0     1    -1
4     0     0     1
> attributes(.Last.value)
$names
[1] "2 - 1" "3 - 2" "4 - 3"

$row.names
[1] 1 2 3 4

$class
[1] "data.frame"

$desc
[1] "changes between consecutive levels"

$adjust
[1] "mvt"
An additional comment about the Tukey adjustment: That adjustment is only appropriate for a single set of pairwise comparisons. If you specify adjust = "tukey" for non-pairwise comparisons or arbitrary contrasts, it will overrule you and use the "sidak" adjustment instead.
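To see that override in action (again reusing emm from above): consecutive comparisons are not a single set of pairwise comparisons, so emmeans should switch the adjustment and say so in a note.
contrast(emm, method = "consec", adjust = "tukey")  # expect a note that "sidak" was used instead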

Related

Why does the Ljung-Box test return NA for Q (and hence the p-value) when the residuals are 0 (in R)?

Just like the question title.
I have done Ljung-Box tests in R for model fitting on a time series with constant values (i.e., 0), and unsurprisingly I got a perfect model fit and 0 residuals. But I want to know why the test returns NA for Q and the p-value instead of, for example, p = 0.99999 or something like that.
I would like a theoretical interpretation of this.
Given you are using stats::Box.test() you can take a look at the code yourself:
utils::getAnywhere(Box.test)
The Ljung-Box Q-statistic is NaN because
cor <- acf(x, lag.max = lag, plot = FALSE, na.action = na.pass)
already returns NaN. So the subsequent computations
obs <- cor$acf[2:(lag+1)]
STATISTIC <- n*(n+2)*sum(1/seq.int(n-1, n-lag)*obs^2)
are NaN too, and so on. So it seems you should look at stats::acf() to see what's going on in there:
utils::getAnywhere(acf)
You should also be able to find the code on Github.
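A minimal reproduction (a sketch assuming a constant series, as in the question): a constant series has zero sample variance, so every autocorrelation is the indeterminate form 0/0 = NaN, and Box.test() propagates that into Q and the p-value.
x <- rep(0, 100)                          # constant "residual" series
acf(x, lag.max = 5, plot = FALSE)$acf     # all NaN: each autocovariance is divided by 0
Box.test(x, lag = 5, type = "Ljung-Box")  # X-squared = NaN, p-value = NA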

Is it possible to change Type I error threshold using t.test() function?

I am asked to compute a test statistic using the t.test() function, but I need to reduce the type I error. My prof showed us how to change the confidence level for this function, but not the acceptable type I error for null hypothesis testing. The goal is for the function to automatically compute a p-value based on a .01 error rate rather than the usual .05.
The R code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
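For example, with the asker's own call (mid$log_radius_area is the asker's data, not reproduced here), conf.level changes only the reported interval, never the p-value:
t.test(mid$log_radius_area, mu = 8.456)                     # default 95% CI
t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)  # 99% CI, same p-value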
From ?t.test:
t.test(1:10, y = c(7:20))
        Welch Two Sample t-test

data:  1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.052802  -4.947198
sample estimates:
mean of x mean of y
      5.5      13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha = 0.05) {
    cat("the null hypothesis is ")
    if (tt$p.value > alpha) {
        cat("**NOT** ")
    }
    cat("rejected at alpha=", alpha, "\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.

Custom contrasts in R: contrast coefficient matrix or contrast matrix / coding scheme? And how to get there?

Custom contrasts are very widely used in analyses, e.g.: "Do DV values at level 1 and level 3 of this three-level factor differ significantly?"
Intuitively, this contrast is expressed in terms of cell means as:
c(1,0,-1)
One or more of these contrasts, bound as columns, form a contrast coefficient matrix, e.g.
mat = matrix(ncol = 2, byrow = TRUE, data = c(
     1,  0,
     0,  1,
    -1, -1)
)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
[3,]   -1   -1
However, when it comes to running these contrasts specified by the coefficient matrix, there is a lot of (apparently contradictory) information on the web and in books. My question is which information is correct?
Claim 1: contrasts(factor) takes a coefficient matrix
In some examples, the user is shown that the intuitive contrast coefficient matrix can be used directly via the contrasts() or C() functions. So it's as simple as:
contrasts(myFactor) <- mat
Claim 2: Transform coefficients to create a coding scheme
Elsewhere (e.g. UCLA stats) we are told the coefficient matrix (or basis matrix) must be transformed from a coefficient matrix into a contrast matrix before use. This involves taking the inverse of the transpose of the coefficient matrix: (mat')⁻¹, or, in Rish:
contrasts(myFactor) = solve(t(mat))
This method requires padding the matrix with an initial column of means for the intercept. To avoid this, some sites recommend using a generalized inverse function which can cope with non-square matrices, i.e., MASS::ginv()
contrasts(myFactor) = ginv(t(mat))
Third option: premultiply by the transpose, take the inverse, and postmultiply by the transpose
Elsewhere again (e.g. a note from SPSS support), we learn the correct algebra is: (mat'mat)⁻¹ mat'
Implying to me that the correct way to create the contrast matrix should be:
x = solve(t(mat) %*% mat) %*% t(mat)
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    0   -1
[3,]    0    1   -1
contrasts(myFactor) = x
My question is, which is right? (If I am interpreting and describing each piece of advice accurately.) How does one specify custom contrasts in R for lm, lme, etc.?
Claim 2 is correct (see the answers here and here) and sometimes claim 1, too. This is because there are cases in which the generalized inverse of the (transposed) coefficient matrix is equal to the matrix itself.
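As a quick numeric check with the mat from the question (a sketch; MASS provides ginv()):
library(MASS)
mat <- matrix(ncol = 2, byrow = TRUE, data = c(
   1,  0,
   0,  1,
  -1, -1))
ginv(t(mat))
##            [,1]       [,2]
## [1,]  0.6666667 -0.3333333
## [2,] -0.3333333  0.6666667
## [3,] -0.3333333 -0.3333333
Here the generalized inverse is not equal to mat itself, so claims 1 and 2 give different codings for these contrasts, and only the claim-2 coding makes the regression coefficients estimate the intended contrasts of cell means.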
For what it's worth....
If you have a factor with 3 levels (levels A, B, and C) and you want to test the following orthogonal contrasts: A vs B, and the avg. of A and B vs C, your contrast codes would be:
Cont1 <- c(1, -1, 0)
Cont2 <- c(.5, .5, -1)
If you do as directed on the UCLA site (transform the coefficients to make a coding scheme), as such:
contrasts(Variable) <- solve(t(cbind(c(1,1,1), Cont1, Cont2)))[,2:3]
then your results are IDENTICAL to those you would get if you had created two dummy variables, e.g.:
Dummy1 <- ifelse(Variable=="A", 1, ifelse(Variable=="B", -1, 0))
Dummy2 <- ifelse(Variable=="A", .5, ifelse(Variable=="B", .5, -1))
and entered them both into the regression equation instead of your factor, which makes me inclined to think that this is the correct way.
PS I don't write the most elegant R code, but it gets the job done. Sorry, I'm sure there are easier ways to recode variables, but you get the gist.
I'm probably missing something, but in each of your three examples, you specify the contrast matrix in the same way, i.e.
## Note it should be the plural: contrasts
contrasts(myFactor) = x
The only thing that differs is the value of x.
Using the data from the UCLA website as an example
hsb2 = read.table('http://www.ats.ucla.edu/stat/data/hsb2.csv', header=T, sep=",")
#creating the factor variable race.f
hsb2$race.f = factor(hsb2$race, labels=c("Hispanic", "Asian", "African-Am", "Caucasian"))
We can specify either the treatment version of the contrasts
contrasts(hsb2$race.f) = contr.treatment(4)
summary(lm(write ~ race.f, hsb2))
or the sum version
contrasts(hsb2$race.f) = contr.sum(4)
summary(lm(write ~ race.f, hsb2))
Alternatively, we can specify a bespoke contrast matrix.
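For instance, a bespoke matrix built with the claim-2 recipe from the question above (a sketch; the three coefficient columns are my illustrative choices, not from the original post):
library(MASS)
## Columns: Hispanic vs Asian; mean(Hispanic, Asian) vs African-Am;
## mean of the first three vs Caucasian
coef.mat <- cbind(c(1, -1, 0, 0),
                  c(0.5, 0.5, -1, 0),
                  c(1/3, 1/3, 1/3, -1))
contrasts(hsb2$race.f) <- ginv(t(coef.mat))
summary(lm(write ~ race.f, hsb2))  # the slopes now estimate exactly those contrasts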
See ?contr.sum for other standard contrasts.

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples from a geometric distribution with PMF P(X = n) = p(1 - p)^(n-1), n = 1, 2, ..., with p = 0.3.
Using the inverse transform method, I came up with the following expression for generating the values: X = ceiling(log(U)/log(1 - p)),
where U represents a value (or n values, depending on the size of the sample) drawn from a Unif(0,1) distribution, and p is 0.3 as stated in the PMF above.
I have the algorithm and the implementation in R, and I have already generated QQ plots to visually assess how well the empirical values match the theoretical ones (generated with R), i.e., whether the generated sample indeed follows the geometric distribution.
Now I wanted to submit the generated sample to a goodness-of-fit test, namely the chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier to subtract 1 from all your geometric values and test those. I will proceed as if your sample has had 1 subtracted, so that it starts from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. (For chisq.test below we will also need these expected counts converted to probabilities, i.e. expec/1000.)
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 <NA>
 292  203  150   96   79   59   47   25   16   10    6    7    0    2    5    3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way, estimating the parameter from the data (by maximum likelihood or minimum chi-square) and then testing as above, but with one fewer degree of freedom because of the estimated parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
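Here is a sketch of the unspecified-p case under the same setup as above (values shifted to start at 0, 16 cells with a lumped 15+ tail); for a geometric starting at 0, the maximum-likelihood estimate is p.hat = 1/(1 + mean(y)):
y <- rgeom(1000, 0.3)                 # stand-in for your (shifted) sample
p.hat <- 1 / (1 + mean(y))            # ML estimate of p
expec <- dgeom(0:14, p.hat) * 1000
expec <- c(expec, 1000 - sum(expec))  # lump 15+ into one tail cell
obs <- table(factor(y, levels = 0:14), exclude = NULL)
chisqstat <- sum((obs - expec)^2 / expec)
pchisq(chisqstat, df = 16 - 1 - 1, lower.tail = FALSE)  # one extra df lost to estimating p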
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
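The fitted object can then be inspected with the usual methods (a sketch; both have goodfit methods in vcd):
summary(G.fit)  # likelihood-ratio goodness-of-fit test
plot(G.fit)     # rootogram of observed vs fitted frequencies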
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
  p <- 0.3
  u <- runif(nvals)
  ceiling(log(u)/log(1-p))
}
for which I want to test its distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--------reposted answer----------
A QQ plot should be approximately a straight line when the sample is compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points for each distribution. (I added jittering to get a better idea of where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500), prob = 0.3)), jitter(sim.res),
       main = expression("Q-Q plot for" ~~ {G}[n == 500]),
       ylim = c(0, max(qgeom(ppoints(500), prob = 0.3), sim.res)),
       xlim = c(0, max(qgeom(ppoints(500), prob = 0.3), sim.res)))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
       prob = c(0.25, 0.75), col = "red")

cost function in cv.glm of boot library in R

I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a GLM logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
with the first two denoting the data and the model, and K specifying the number of folds.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification, something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge has cost me hours on this, and I was wondering if someone could point me in the right direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost() given above is, in my opinion, unnecessarily obfuscated; the following alternative does exactly the same thing in a more expressive way:
cost <- function(r, pi = 0) {
    mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
}
I will try to explain the cost function in simple words. Let's take the cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consist of many observations; think of them as a series of numbers.
glmfit
The generalized linear model fitted to the data. There is a catch, though: cv.glm splits the data into K parts, refits the model with each part held out in turn (the test set), using the remaining parts as the training set. The predictions for a test set form a series with the same number of elements as that part of the input.
cost
The cost function. It takes two arguments: first the observed responses in the test set, and second the predictions from the refitted model for that test set. The default is the mean squared error function: it sums the squared differences between observed and predicted data points (conceptually, a loop over the test set computes each difference, squares it, and adds it to a running total), then averages. A sketch of that default is given below.
K
The number of groups into which the data should be split. The default gives leave-one-out cross-validation.
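The default cost described under cost above is, in code, just the average squared error; if I recall the boot source correctly, this is literally the default argument of cv.glm:
cost_default <- function(y, yhat) mean((y - yhat)^2)  # observed y, predicted yhat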
Judging from your description of the cost function: the predictions (x) would be a set of numbers between 0 and 1 (0-0.5 = no and 0.5-1 = yes), and the observed responses (y) are 'yes' or 'no'. So the error (e) between observation and prediction would be something like:
cost <- function(y, x){
  # y: observed responses ('yes'/'no'); x: predicted probabilities.
  # Argument order matches cv.glm, which calls cost(observed, predicted).
  e <- 0
  for (i in 1:length(x)){
    if (x[i] > 0.5){
      if (y[i] != 'yes') e <- e + (x[i] - 0.5)^2  # squared error past the cutoff
    } else {
      if (y[i] != 'no') e <- e + (0.5 - x[i])^2
    }
  }
  e/length(x)  # mean squared error
}
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you wanted to do so, you would write a function that returns the cost you want to minimize, using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post, you are looking for a function that returns the average proportion of accurate classifications, which would look something like this:
cost <- function(labels, pred){
    mean(labels == ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your cv.glm() call, although I wouldn't recommend using your own cost function over the default one unless you have a reason to. Your example isn't reproducible, so here is another one:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which requires the type = "response" argument in the predict function. The documentation in library(boot) should state this explicitly; otherwise one is left to assume that the default type = "link" is used inside cv.glm, in which case the cost functions above would not work as intended.
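Rather than assuming, you can print the function's source; my reading of it is that the fold-level predictions are made on the response (probability) scale, so the probability-based cost functions above work as intended:
library(boot)
getAnywhere(cv.glm)  # look for predict(d.glm, ..., type = "response") in the K-fold loop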
