Notation of categorical variables in regression analysis - r

While studying logistic regression using caret's mdrr data, some questions arose.
I created a full model using a total of 19 variables, and I have questions about the notation of the categorical variables.
In my regression model, the categorical variables are:
nDB : 0 or 1 or 2
nR05 : 0 or 1
nR10 : 1 or 2
I created a full model using glm, but I do not understand why the names of the categorical variables have one of the category values appended to them.
-------------------------------------------------------------------------------
glm(formula = mdrrClass ~ ., family = binomial, data = train)
#Coefficients:
#(Intercept) nDB1 nDB2 nX nR051 nR101 nBnz2
#5.792e+00 5.287e-01 -3.103e-01 -2.532e-01 -9.291e-02 9.259e-01 -2.108e+00
#SPI BLI PW4 PJI2 Lop BIC2 VRA1
#3.222e-05 -1.201e+01 -3.754e+01 -5.467e-01 1.010e+00 -5.712e+00 -2.424e-04
# PCR H3D FDI PJI3 DISPm DISPe G.N..N.
# -6.397e-02 -4.360e-04 3.458e+01 -6.579e+00 -5.690e-02 2.056e-01 -7.610e-03
#Degrees of Freedom: 263 Total (i.e. Null); 243 Residual
#Null Deviance: 359.3
#Residual Deviance: 232.6 AIC: 274.6
-------------------------------------------------------------------------------
In the results above, nDB appears with numbers attached (nDB1, nDB2), and likewise nR05 and nR10 show up with a category value appended (nR051, nR101).
I am wondering why numbers are attached as above.

When you have categorical predictors in any regression model, you need to create dummy variables. R does this for you, and the output you see shows the contrasts.
Your variable nDB has 3 levels: 0, 1, 2
One of those needs to be chosen as the reference level (R chose 0 for you in this case, but this can also be specified manually). Then dummy variables are created to compare every other level against your reference level: 0 vs 1 and 0 vs 2.
R names these dummy variables nDB1 and nDB2: nDB1 is for the 0 vs 1 contrast, and nDB2 is for the 0 vs 2 contrast. The numbers after the variable names just indicate which contrast you're looking at.
The coefficient values are interpreted as the difference in your y (outcome) value between groups 0 and 1 (nDB1), and separately between groups 0 and 2 (nDB2). In other words, what change in the outcome would you expect when moving from one group to the other?
Your other categorical variables have 2 levels and are just a simpler case of the above.
For example, nR05 only has 0 and 1 as values. 0 was chosen as your reference, and because there's only one possible contrast here, a single dummy variable is created comparing 0 vs 1. In the output that dummy variable is called nR051.
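If you want to see this coding explicitly, model.matrix() shows the dummy columns R builds, and relevel() lets you pick the reference level yourself. A minimal sketch (the factor values here are made up):
nDB <- factor(c(0, 1, 2, 0, 1, 2))
model.matrix(~ nDB) # columns nDB1 and nDB2; level 0 is the reference
nDB_releveled <- relevel(nDB, ref = "2") # choose level 2 as the reference instead
model.matrix(~ nDB_releveled) # now levels 0 and 1 are contrasted against 2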

It's always the case for categorical variables, especially when they are not binary (like your nDB). It's so that you know which level the coefficient refers to. For the nDB variable the model has created two new variables: nDB1, which equals 1 if nDB = 1 and 0 if nDB = 0 or nDB = 2, and nDB2, which equals 1 if nDB = 2 and 0 otherwise.

To analyze a binary variable (whose values would be TRUE/FALSE, 0/1, or YES/NO) as a function of a quantitative explanatory variable, a logistic regression can be used.
Consider for example the following data, where x is the age of 40 people, and y the variable indicating whether they bought a death metal album in the last 5 years (1 if "yes", 0 if "no").
Graphically, we can see that the older people are, the less likely they are to have bought death metal.
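The original 40 observations are not reproduced here, so as a stand-in, here is a hypothetical simulation with the same structure (the coefficients in the simulation are made up to roughly match the fit shown below):
set.seed(1)
x <- sample(18:70, 40, replace = TRUE) # ages of 40 people
y <- rbinom(40, 1, plogis(5.9 - 0.12 * x)) # 1 = bought a death metal album
plot(x, y)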
Logistic regression is a special case of the Generalized Linear Model (GLM).
With a classical linear regression model, we consider the following model:
Y = αX + β
The expectation of Y is therefore predicted as follows:
E(Y) = αX + β
Here, because of the binary distribution of Y, the above relations cannot apply. To "generalize" the linear model, we therefore consider that
g(E(Y)) = αX + β
where g is a link function.
In this case, for a logistic regression, the link function corresponds to the logit function:
logit(p) = log(p / (1 - p))
Note that this logit function transforms a value p between 0 and 1 (such as a probability, for example) into a value between -∞ and +∞.
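In R, the logit and its inverse are available as qlogis() and plogis(), so you can check this behaviour directly:
qlogis(c(0.1, 0.5, 0.9)) # maps (0, 1) onto the whole real line; 0.5 maps to 0
plogis(qlogis(0.25)) # the inverse transform recovers 0.25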
Here's how to do the logistic regression in R:
myreg <- glm(y ~ x, family = binomial(link = logit))
summary(myreg)
glm(formula = y ~ x, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8686 -0.7764 0.3801 0.8814 2.0253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.9462 1.9599 3.034 0.00241 **
## x -0.1156 0.0397 -2.912 0.00360 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 52.925 on 39 degrees of freedom
## Residual deviance: 39.617 on 38 degrees of freedom
## AIC: 43.617
##
## Number of Fisher Scoring iterations: 5
We obtain the following model:
logit(E(Y)) = -0.12X + 5.95
and we note that the (negative) influence of age on the purchase of death metal albums is significant at the 5% level (Pr(>|z|) = 0.0036 < 0.05).
Thus, logistic regression is often used to bring out risk factors (like age, but also BMI, sex, and so on…).

Related

How to fit a known linear equation to my data in R?

I used a linear model to obtain the best fit to my data, via the lm() function.
From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how good the equation y=x fits the data presented above (R^2 and p-value)?
I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on Stack Overflow.
Best regards Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled, na.rm = TRUE) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
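As a quick check (not part of the original answer), confint() shows both intervals directly:
# the slope interval should cover 1 and the intercept interval should cover 0
confint(lm(measured ~ modelled))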
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and the intercept = 0, essentially you have a fit and your modelled is already the predicted value. You need the r-square from this fit. R squared is defined as:
R2 = MSS/TSS = (TSS − RSS)/TSS
See this link for definition of RSS and TSS.
We can only work with observations that are complete (non-NA), so first we identify them:
nonNA = !is.na(modelled) & !is.na(measured)
# residuals from your prediction
RSS = sum((modelled[nonNA] - measured[nonNA])^2,na.rm=T)
# total residuals from data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2,na.rm=T)
1 - RSS/TSS
[1] 0.7116585
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm is the lm object for that undisclosed model then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether or not there is an intercept in the undisclosed model. The signs are that there is no intercept, since if there were an intercept, sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case, R^2 and the p value are shown in the output of summary(fm), where fm is the undisclosed linear model, so there is no point in restricting the discussion to measured and modelled if you have the lm object.
For example, suppose the undisclosed model is the following, using the built-in CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we get the following output, where the last two lines show R squared and the p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16
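If you want those two numbers programmatically rather than by reading the printout, they can be pulled out of the summary object (a small sketch, not part of the original answer):
s <- summary(fm)
s$r.squared # multiple R-squared
f <- s$fstatistic # F value plus numerator/denominator df
pf(f[1], f[2], f[3], lower.tail = FALSE) # overall p-value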

incorrect logistic regression output

I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by the medv column is more than 25 or not.
Below is my code for logistic regression.
high.medv <- ifelse(Boston$medv > 25, "Y", "N") # apply the condition to medv and store the result in a new variable, high.medv
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
Also, I need the following:
Now I'm required to use the misclassification rate as the measure of error for the two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
But I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you will determine whether the result is positive or negative.
When you predict your outcomes in the test data set, you estimate probabilities of the response variable being either 1 or 0. Therefore, you need to tell it where to cut, i.e. the threshold at which the prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g., true positives and false positives against each other, and then choose the best trade-off for your purposes.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(nrow(df), floor(2 * nrow(df) / 3))] = 1 # randomly assign two thirds of the rows to training
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use misclassification error to evaluate your model, you first need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates =roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # ratio of true negatives over all negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # confusion matrix and misclassifications at the 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can be computed as the ratio of the number of observations you got right against the size of your sample.
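For example, continuing the simulation above:
cm <- table(dat$y, dat$jj) # rows = truth, columns = prediction
sum(diag(cm)) / sum(cm) # accuracy
mean(dat$y == dat$jj) # the same thing, computed directly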

How to get probability from GLM output

I'm extremely stuck at the moment as I am trying to figure out how to calculate the probability from my glm output in R. I know the data is very insignificant but I would really love to be shown how to get the probability from an output like this. I was thinking of trying inv.logit() but didn't know what variables to put within the brackets.
The data is from occupancy study. I'm assessing the success of a hair trap method versus a camera trap in detecting 3 species (red squirrel, pine marten and invasive grey squirrel). I wanted to see what affected detection (or non detection) of the various species. One hypotheses was the detection of another focal species at the site would affect the detectability of red squirrel. Given that pine marten is a predator of the red squirrel and that the grey squirrel is a competitor, the presence of those two species at a site might affect the detectability of the red squirrel.
Would this show the probability? inv.logit(-1.1455 - 0.1322 * NonRSevents_before1stRS)
glm(formula = RS_sticky ~ NonRSevents_before1stRS, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7432 -0.7432 -0.7222 -0.3739 2.0361
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1455 0.4677 -2.449 0.0143 *
NonRSevents_before1stRS -0.1322 0.1658 -0.797 0.4255
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.575 on 33 degrees of freedom
Residual deviance: 33.736 on 32 degrees of freedom
(1 observation deleted due to missingness)
AIC: 37.736
Number of Fisher Scoring iterations: 5
If you want to predict the probability of response for a specified set of values of the predictor variable:
pframe <- data.frame(NonRSevents_before1stRS=4)
predict(fitted_model, newdata=pframe, type="response")
where fitted_model is the result of your glm() fit, which you stored in a variable. You may not be familiar with the R approach to statistical analysis, which is to store the fitted model as an object/in a variable, then apply different methods to it (summary(), plot(), predict(), residuals(), ...)
(This is obviously only a made-up example: I don't know if 4 is a reasonable value for the NonRSevents_before1stRS variable.)
You can specify several different values to predict for at the same time (data.frame(NonRSevents_before1stRS=c(4,5,6,7,8))).
If you have multiple predictors, you have to specify a value for every predictor for every prediction, e.g. data.frame(x=4:8, y=mean(orig_data$y), ...).
If you want the predicted probabilities for the observations in your original data set, just predict(fitted_model, type="response")
You're correct that inv.logit() (from a bunch of different packages, don't know which you're using) or plogis() (from base R, essentially the same) will translate from the logit or log-odds scale to the probability scale, so
plogis(predict(fitted_model))
would also work (predict provides predictions on the link-function [in this case logit/log-odds] scale by default).
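As a sanity check, you can also plug the coefficients from the summary above into plogis() by hand, again using NonRSevents_before1stRS = 4:
plogis(-1.1455 - 0.1322 * 4) # predicted probability at 4 prior non-RS events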
A logistic regression models the log odds of the outcome, so its coefficients are log odds ratios. We'll illustrate how to interpret the coefficients with the space shuttle autolander data from the MASS package.
After loading the data, we'll create a binary dependent variable where:
1 = autolander used,
0 = autolander not used.
We will also create a binary independent variable for shuttle stability:
1 = stable positioning
0 = unstable positioning.
Then, we'll run glm() with family=binomial(link="logit"). Since the coefficients are log odds ratios, we'll exponentiate them to turn them back into odds ratios.
library(MASS)
str(shuttle)
shuttle$stable <- 0
shuttle[shuttle$stability =="stab","stable"] <- 1
shuttle$auto <- 0
shuttle[shuttle$use =="auto","auto"] <- 1
fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
summary(fit)
exp(fit$coefficients)
...and the output:
> fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
>
> summary(fit)
Call:
glm(formula = use ~ factor(stable), family = binomial(link = "logit"),
data = shuttle)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1774 -1.0118 -0.9566 1.1774 1.4155
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.747e-15 1.768e-01 0.000 1.0000
factor(stable)1 -5.443e-01 2.547e-01 -2.137 0.0326 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 350.36 on 255 degrees of freedom
Residual deviance: 345.75 on 254 degrees of freedom
AIC: 349.75
Number of Fisher Scoring iterations: 4
> exp(fit$coefficients)
(Intercept) factor(stable)1
1.0000000 0.5802469
>
The intercept of 0 is the log odds for unstable, and the coefficient of -0.5443 is the log odds ratio for stable versus unstable. After exponentiating the coefficients, we observe that the odds of autolander use when the shuttle is unstable are 1.0, and are multiplied by 0.58 if the shuttle is stable. This means that the autolander is less likely to be used if the shuttle has stable positioning.
Calculating probability of autolander use
We can do this in two ways. First, the manual approach: exponentiate the coefficients and convert the odds to probabilities using the following equation.
p = odds / (1 + odds)
With the shuttle autolander data it works as follows.
# convert intercept to probability
odds_i <- exp(fit$coefficients[1])
odds_i / (1 + odds_i)
# convert stable="stable" to probability
odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
odds_p / (1 + odds_p)
...and the output:
> # convert intercept to probability
> odds_i <- exp(fit$coefficients[1])
> odds_i / (1 + odds_i)
(Intercept)
0.5
> # convert stable="stable" to probability
> odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
> odds_p / (1 + odds_p)
(Intercept)
0.3671875
>
The probability of autolander use when a shuttle is unstable is 0.5, and decreases to 0.37 when the shuttle is stable.
The second approach to generate probabilities is to use the predict() function.
# convert to probabilities with the predict() function
predict(fit,data.frame(stable="0"),type="response")
predict(fit,data.frame(stable="1"),type="response")
Note that the output matches the manually calculated probabilities.
> # convert to probabilities with the predict() function
> predict(fit,data.frame(stable="0"),type="response")
1
0.5
> predict(fit,data.frame(stable="1"),type="response")
1
0.3671875
>
Applying this to the OP data
We can apply these steps to the glm() output from the OP as follows.
coefficients <- c(-1.1455,-0.1322)
exp(coefficients)
odds_i <- exp(coefficients[1])
odds_i / (1 + odds_i)
# convert nonRSEvents = 1 to probability
odds_p <- exp(coefficients[1]) * exp(coefficients[2])
odds_p / (1 + odds_p)
# simulate up to 10 nonRSEvents prior to RS
coef_df <- data.frame(nonRSEvents=0:10,
intercept=rep(-1.1455,11),
nonRSEventSlope=rep(-0.1322,11))
coef_df$nonRSEventValue <- coef_df$nonRSEventSlope *
coef_df$nonRSEvents
coef_df$intercept_exp <- exp(coef_df$intercept)
coef_df$slope_exp <- exp(coef_df$nonRSEventValue)
coef_df$odds <- coef_df$intercept_exp * coef_df$slope_exp
coef_df$probability <- coef_df$odds / (1 + coef_df$odds)
# print the odds & probabilities by number of nonRSEvents
coef_df[,c(1,7:8)]
...and the final output.
> coef_df[,c(1,7:8)]
nonRSEvents odds probability
1 0 0.31806 0.24131
2 1 0.27868 0.21794
3 2 0.24417 0.19625
4 3 0.21393 0.17623
5 4 0.18744 0.15785
6 5 0.16423 0.14106
7 6 0.14389 0.12579
8 7 0.12607 0.11196
9 8 0.11046 0.09947
10 9 0.09678 0.08824
11 10 0.08480 0.07817
>

Post-hoc test for glmer

I'm analysing my binomial dataset with R using a generalized linear mixed model (glmer, lme4-package). I wanted to make the pairwise comparisons of a certain fixed effect ("Sound") using a Tukey's post-hoc test (glht, multcomp-package).
Most of it is working fine, but one of my fixed effect variables ("SoundC") has no variance at all (96 times a "1" and zero times a "0") and it seems that the Tukey's test cannot handle that. All pairwise comparisons with this "SoundC" give a p-value of 1.000 whereas some are clearly significant.
As a validation I changed one of the 96 "1"'s to a "0" and after that I got normal p-values again and significant differences where I expected them, whereas the difference had actually become smaller after my manual change.
Does anybody have a solution? If not, is it fine to use the results of my modified dataset and report my manual change?
Reproducible example:
Response <- c(1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,
0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,
1,1,0,1,1,0,1,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,0,1)
Data <- data.frame(Sound=rep(paste0('Sound',c('A','B','C')),22),
Response,
Individual=rep(rep(c('A','B'),2),rep(c(18,15),2)))
# Visual
boxplot(Response ~ Sound,Data)
# Mixed model
library (lme4)
model10 <- glmer(Response~Sound + (1|Individual), Data, family=binomial)
# Post-hoc analysis
library (multcomp)
summary(glht(model10, mcp(Sound="Tukey")))
This is verging on a CrossValidated question; you are definitely seeing complete separation, where there is a perfect division of your response into 0 vs 1 results. This leads to (1) infinite values of the parameters (they're only listed as non-infinite due to computational imperfections) and (2) crazy/useless values of the Wald standard errors and corresponding $p$ values (which is what you're seeing here). Discussion and solutions are given here, here, and here, but I'll illustrate a little more below.
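As a quick diagnostic (not part of the original fixes), you can see the separation in the reproducible example by cross-tabulating the response against the fixed effect:
with(Data, table(Sound, Response)) # the SoundC row contains no zeros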
To be a statistical grouch for a moment: you really shouldn't be trying to fit a random effect with only 3 levels anyway (see e.g. http://glmm.wikidot.com/faq) ...
Firth-corrected logistic regression:
library(logistf)
L1 <- logistf(Response~Sound*Individual,data=Data,
contrasts.arg=list(Sound="contr.treatment",
Individual="contr.sum"))
coef se(coef) p
(Intercept) 3.218876e+00 1.501111 2.051613e-04
SoundSoundB -4.653960e+00 1.670282 1.736123e-05
SoundSoundC -1.753527e-15 2.122891 1.000000e+00
IndividualB -1.995100e+00 1.680103 1.516838e-01
SoundSoundB:IndividualB 3.856625e-01 2.379919 8.657348e-01
SoundSoundC:IndividualB 1.820747e+00 2.716770 4.824847e-01
Standard errors and p-values are now reasonable (p-value for the A vs C comparison is 1 because there is literally no difference ...)
Mixed Bayesian model with weak priors:
library(blme)
model20 <- bglmer(Response~Sound + (1|Individual), Data, family=binomial,
fixef.prior = normal(cov = diag(9,3)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.711485 2.233667 0.7662221 4.435441e-01
## SoundSoundB -5.088002 1.248969 -4.0737620 4.625976e-05
## SoundSoundC 2.453988 1.701674 1.4421024 1.492735e-01
The specification diag(9,3) of the fixed-effect variance-covariance matrix produces
$$
\left(
\begin{array}{ccc}
9 & 0 & 0 \\
0 & 9 & 0 \\
0 & 0 & 9
\end{array}
\right)
$$
In other words, the 3 specifies the dimension of the matrix (equal to the number of fixed-effect parameters), and the 9 specifies the variance -- this corresponds to a standard deviation of 3, or a 95% range of about $\pm 6$, which is quite large/weak/uninformative for logit-scaled responses.
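You can confirm what that specification expands to directly in R:
diag(9, 3) # a 3x3 matrix with 9 on the diagonal, 0 elsewhere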
These results are roughly consistent with the Firth-corrected fit (even though the model is quite different):
library(multcomp)
summary(glht(model20, mcp(Sound="Tukey")))
## Estimate Std. Error z value Pr(>|z|)
## SoundB - SoundA == 0 -5.088 1.249 -4.074 0.000124 ***
## SoundC - SoundA == 0 2.454 1.702 1.442 0.309216
## SoundC - SoundB == 0 7.542 1.997 3.776 0.000397 ***
As I said above, I would not recommend a mixed model in this case anyway ...

R: logistic regression using frequency table, cannot find correct Pearson Chi Square statistics

I implemented logistic regression on the following data frame and got reasonable results (the same as using STATA). But the Pearson chi square and degrees of freedom I got from R are very different from STATA's, which in turn gave me a very small p-value. Also, I cannot get the area under the ROC curve. Could anyone help me find out why residuals() does not work as expected on a glm() fit with prior weights, and how to get the area under the ROC curve?
Following is my code and output.
1. Data
Here is my data frame test_data, y is outcome, x1 and x2 are covariates:
y x1 x2 freq
0 No 0 268
0 No 1 14
0 Yes 0 109
0 Yes 1 1
1 No 0 31
1 No 1 6
1 Yes 0 45
1 Yes 1 6
I generated this data frame from the original data by counting the occurrences of each covariate pattern, and stored the counts in a new variable, freq.
2. GLM Model
Then I did the logistic regression as:
logit=glm(y~x1+x2, data=test_data, family=binomial, weights=freq)
Output shows:
Deviance Residuals:
1 2 3 4 5 6 7 8
-7.501 -3.536 -8.818 -1.521 11.957 3.501 10.409 2.129
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2010 0.1892 -11.632 < 2e-16 ***
x1 1.3538 0.2516 5.381 7.39e-08 ***
x2 1.6261 0.4313 3.770 0.000163 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 457.35 on 7 degrees of freedom
Residual deviance: 416.96 on 5 degrees of freedom
AIC: 422.96
Number of Fisher Scoring iterations: 5
Coefficients are the same as STATA.
3. Chi Square Statistics
when I tried to calculate the Pearson chi square:
pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)
I got 488, instead of the 1.3 given by STATA. Also, the degrees of freedom given by R are chisq_dof = df.residual(logit) = 5, instead of 1, so I got an extremely small p-value (~e^-100).
4. Discrimination
Then I calculated the area under ROC curve as:
library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf$y, fitted(logit))$A
The output is:
[1] 0.5
Warning message:
In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") :
cannot compute exact p-value with ties
Thanks!
I figured out how to solve this problem eventually. The data set above should first be summarised into covariate patterns; then the Pearson chi square can be calculated from its definition. The R code follows:
# extract covariate patterns
library(dplyr)
temp=test_data %>% group_by(x1, x2) %>% summarise(m=sum(freq),y=sum(freq*y))
temp$pred=fitted(logit)[1:4] # one fitted probability per covariate pattern
# calculate Pearson chi square
temp=mutate(temp, pearson=(y-m*pred)/sqrt(m*pred*(1-pred)))
pearson_chi2 = with(temp, sum(pearson^2))
temp_dof = 4-(2+1) #dof=J-(p+1)
# calculate p-value
pchisq(pearson_chi2, temp_dof, lower.tail=F)
The result of p-value is 0.241941, which is same as STATA.
In order to calculate the AUC, we should first expand the covariate patterns back to the "original" data, then use the expanded data to get the AUC. Note that we have 392 "0"s and 88 "1"s in the frequency table. My code follows:
# expand observation
y_expand=c(rep(0,392),rep(1,88))
logit_mf = model.frame(logit)
logit_mf$pred = fitted(logit)
logit_mf$freq = test_data$freq
# expand prediction
yhat_expand = with(logit_mf, rep(pred, freq))
library(verification)
roc.area(y_expand, yhat_expand)$A
AUC=0.6760, same as that of STATA.
