P value difference between FISHER exact test and GLM - r

I have a dateset plink.raw and I am testing if a maker CN00020133 (has three level 0, 1, 2) is associated with phenotype5. I want to compare 0 vs 1 and 2 vs 1 using GLM or fisher extract test:
table(plink.raw$phenotype5,plink.raw$CN00020133)
1 0 2
0 3559 0 7
1 14806 54 123
tested using GLM, I can see the p value for 0 vs 1 is 0.912894.
plink.raw$CN00020133 <- factor(plink.raw$CN00020133, levels=c("1","0","2"))
univariate=glm(phenotype5 ~ relevel(CN00020133,ref ="1"), family = binomial, data = plink.raw)
summary(univariate)
Call:
glm(formula = phenotype5 ~ relevel(CN00020133, ref = "1"),
family = binomial, data = plink.raw)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4173 0.6564 0.6564 0.6564 0.6564
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.42555 0.01867 76.361 < 2e-16 ***
relevel(CN00020133, ref = "1")0 13.14051 120.12616 0.109 0.912894
relevel(CN00020133, ref = "1")2 1.44072 0.38902 3.703 0.000213 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 18158 on 18548 degrees of freedom
Residual deviance: 18114 on 18546 degrees of freedom
AIC: 18120
Number of Fisher Scoring iterations: 13
But if I tested it using fisher exact test, p value for 0 vs 1 is 4.618e-06.
Convictions <- matrix(c(0, 54, 3559, 14806), nrow = 2,dimnames = list(c("control", "case"),c("del/dup", "normal_copy")))
fisher.test(Convictions, alternative = "less")
Fisher's Exact Test for Count Data
data: Convictions
p-value = 9.048e-06
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
0.0000000 0.2374411
sample estimates:
odds ratio 0

You have a case of complete separation, also known as The Hauck-Donner effect occurs. If you CN00020133==0, all of them have 1 as phenotype, and this makes it hard to estimate the standard error of the coefficient. There's a fair amount of material on it, for example Alexej's blog, post by Brian Ripley, Ben Bolker's notes.
If you need to test for significance of the effect of "1", one solution is to use the likelihood ratio test:
df = rbind(
data.frame(phenotype=rep(0:1,c(3559,14806)),CN00020133="1"),
data.frame(phenotype=rep(0:1,c(0,54)),CN00020133="0")
)
anova(glm(phenotype ~ CN00020133,data=df,family=binomial),test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: phenotype
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 18418 18082
CN00020133 1 23.227 18417 18059 1.44e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This gives you different p-value compared to fisher test because it's a different distribution (binomial vs hyper-geometric). But more or less you can conclude there is an added effect of "1" using "0" as reference.
There is a implementation of Firth's logistic regression in R, you can try and see but I must say I am not very familiar with this:
library("logistf")
logistf(phenotype ~ CN00020133,data=df,family=binomial)

Related

Logistic regression from R returning values greater than one

I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n=7896. All variables are binary with 0 and 1 for false and true except BMI, which is continuous numeric. For gender, male=1 and female=0.
When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I get values greater than one quite often for very standard individuals. I apologize for the large code block, but I thought more information may be helpful.
library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)
logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")
summary(logistic)
hoslem.test(logistic$y, fitted(logistic))
confint(logistic)
min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)
e=2.71828
The output is as follows:
Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 +
HT1993, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0362 -1.0513 -0.7831 1.1844 1.8807
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.346104 0.158043 -14.845 < 2e-16 ***
Gen1 -0.748286 0.048398 -15.461 < 2e-16 ***
Smoke19931 -0.059342 0.064606 -0.919 0.358
BMI1993 0.084056 0.006005 13.997 < 2e-16 ***
HC19931 0.388217 0.047820 8.118 4.72e-16 ***
HT19931 0.341375 0.058423 5.843 5.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10890 on 7895 degrees of freedom
Residual deviance: 10309 on 7890 degrees of freedom
AIC: 10321
Number of Fisher Scoring iterations: 4
Hosmer and Lemeshow goodness of fit (GOF) test
data: logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1 -0.84336906 -0.65364134
Smoke19931 -0.18619647 0.06709748
BMI1993 0.07233866 0.09588198
HC19931 0.29454661 0.48200673
HT19931 0.22690608 0.45595006
[1] 18
[1] 26
[1] 43
A non-smoking female w/ median BMI (26), hypertension, and high cholesterol yields the following:
e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)
[1] 1.7664
I think the issue is related somehow to BMI considering that is the only variable that is numeric. Does anyone know why this regression produces probabilities greater than 1?
By default, family = "binomial" uses the logit link function (see ?family). So the probability you're looking for is 1.7664 / (1+1.7664).

Replicate EViews' MA() function in R regression

I'm taking some models that were built in EViews and putting them into R. I'm having trouble replicating EViews' MA function.
I tried using the lag of the regressions residuals, but this isn't quite the same. I've seen some mentions that this is an ARIMA regression.. Is there no way to replicate MA from EViews in an lm regression?
For example in R:
set.seed(2)
a = data.frame(a = 1:6,
b = runif(6, 0.0, 1.0),
c = runif(6, 0.0, 1.0))
fit_C = lm(c ~ a + b, data = a)
a$C.pred = predict.lm(fit_C, a)
a$C.resid = a$c - a$C.pred
fit_C = lm(c ~ a + b + lag(C.resid, 1), data = a)
summary(fit_C)
Outputs:
Call:
lm(formula = c ~ a + b + lag(C.resid, 1), data = a)
Residuals:
1 2 3 4 5 6
-1.779e-17 -1.131e-17 5.474e-17 -5.218e-18 -1.959e-17 -8.320e-19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.327e-01 4.279e-17 1.011e+16 <2e-16 ***
a -3.998e-02 1.353e-17 -2.954e+15 <2e-16 ***
b 2.889e-01 7.278e-17 3.969e+15 <2e-16 ***
lag(C.resid, 1) 1.000e+00 8.241e-17 1.213e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.389e-17 on 2 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 5.444e+31 on 3 and 2 DF, p-value: < 2.2e-16
Where in EViews:
Dependent Variable: C01
Method: Least Squares
Date: 09/18/18 Time: 10:24
Sample: 1 6
Included observations: 6
Convergence achieved after 9 iterations
MA Backcast: 0
Variable Coefficient Std. Error t-Statistic Prob.
C 0.892941 0.147320 6.061254 0.0262
A -0.101365 0.041651 -2.433684 0.1354
B 0.063370 0.257874 0.245740 0.8288
MA(1) -0.982901 0.058536 -16.79134 0.0035
R-squared 0.933603 Mean dependent var 0.462030
Adjusted R-squared 0.834008 S.D. dependent var 0.250812
S.E. of regression 0.102186 Akaike info criterion -1.489321
Sum squared resid 0.020884 Schwarz criterion -1.628148
Log likelihood 8.467963 Hannan-Quinn criter. -2.045057
F-statistic 9.373951 Durbin-Watson stat 2.907407
Prob(F-statistic) 0.097923
Inverted MA Roots .98
How do I replicate the MA(1) variable in R?

smooth.spline in glm with NAs in dataset

I am trying to use a smooth.spline transformation for my explanatory variables in glm (logit regression).
I get the error because smooth.spline cannot work with NAs.
Here is my code:
LogitModel <- glm(dummy~ smooth.spline(A) + B + C
,family = binomial(link = "logit"), data = mydata)
How can I handle that (without changing mydata?)
You can use generalized additive models (GAM) which include splines naturally. For example, you can use gam package, as a side-effect they are handling NA's. Please see the code below:
library(gam)
set.seed(123)
data(kyphosis)
# simulation of NA
NAs <- matrix(c(sample(81, 4), 1:4), byrow = FALSE, ncol = 2)
kyphosis_NA[NAs] <- NA
# gam
m_NA <- gam(Kyphosis ~ s(Age,4) + Number + Start, family = binomial, data=kyphosis_NA)
summary(m_NA)
Output:
Call: gam(formula = Kyphosis ~ s(Age, 4) + Number + Start, family = binomial,
data = kyphosis)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.21622 -0.50581 -0.24260 -0.06758 2.36573
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 83.2345 on 80 degrees of freedom
Residual Deviance: 53.452 on 74 degrees of freedom
AIC: 67.452
Number of Local Scoring Iterations: 9
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
s(Age, 4) 1 0.037 0.0368 0.0442 0.834140
Number 1 4.682 4.6816 5.6109 0.020460 *
Start 1 8.869 8.8694 10.6301 0.001683 **
Residuals 74 61.743 0.8344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Anova for Nonparametric Effects
Npar Df Npar Chisq P(Chi)
(Intercept)
s(Age, 4) 3 5.8327 0.12
Number
Start

How to compare slopes in R

I am performing an ANCOVA so as to test what is the relationship between body size (covariate, logLCC) and different head measures (response variable, logLP) in each sex (cathegorical variable, sexo).
I got the slopes for each sex in the lm and I would like to compare them to 1. More specifically, I would like to know if the slopes are significantly higher or less than 1, or if they are equal to 1, as this would have different biological meanings in their allometric relationships.
Here is my code:
#Modelling my lm#
> lm.logLP.sexo.adu<-lm(logLP~logLCC*sexo, data=ADU)
> anova(lm.logLP.sexo.adu)
Analysis of Variance Table
Response: logLP
Df Sum Sq Mean Sq F value Pr(>F)
logLCC 1 3.8727 3.8727 3407.208 < 2.2e-16 ***
sexo 1 0.6926 0.6926 609.386 < 2.2e-16 ***
logLCC:sexo 1 0.0396 0.0396 34.829 7.563e-09 ***
Residuals 409 0.4649 0.0011
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Obtaining slopes#
> lm.logLP.sexo.adu$coefficients
(Intercept) logLCC sexoM logLCC:sexoM
-0.1008891 0.6725818 -1.0058962 0.2633595
> lm.logLP.sexo.adu1<-lstrends(lm.logLP.sexo.adu,"sexo",var="logLCC")
> lm.logLP.sexo.adu1
sexo logLCC.trend SE df lower.CL upper.CL
H 0.6725818 0.03020017 409 0.6132149 0.7319487
M 0.9359413 0.03285353 409 0.8713585 1.0005241
Confidence level used: 0.95
#Comparing slopes#
> pairs(lm.logLP.sexo.adu1)
contrast estimate SE df t.ratio p.value
H - M -0.2633595 0.04462515 409 -5.902 <.0001
#Checking whether the slopes are different than 1#
#Computes Summary with statistics
> s1<-summary(lm.logLP.sexo.adu)
> s1
Call:
lm(formula = logLP ~ logLCC * sexo, data = ADU)
Residuals:
Min 1Q Median 3Q Max
-0.13728 -0.02202 -0.00109 0.01880 0.12468
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.10089 0.12497 -0.807 0.42
logLCC 0.67258 0.03020 22.271 < 2e-16 ***
sexoM -1.00590 0.18700 -5.379 1.26e-07 ***
logLCC:sexoM 0.26336 0.04463 5.902 7.56e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03371 on 409 degrees of freedom
Multiple R-squared: 0.9083, Adjusted R-squared: 0.9076
F-statistic: 1350 on 3 and 409 DF, p-value: < 2.2e-16
#Computes t-student H0: intercept=1. The estimation of coefficients and their s.d. are in s1$coefficients
> t1<-(1-s1$coefficients[2,1])/s1$coefficients[2,2]
#Calculates two tailed probability
> pval<- 2 * pt(abs(t1), df = df.residual(lm.logLP.sexo.adu), lower.tail = FALSE)
> print(pval)
[1] 3.037231e-24
I saw this whole process in several threads here. But all that I can understand is that my slopes are just different from 1.
How could I check that they are greater or smaller than 1?
EDITED
Solved!
#performs one-side test H0=slope bigger than 1
pval<-pt(t1, df = df.residual(lm.logLP.sexo.adu), lower.tail = FALSE)
#performs one-side test H0=slope smaller than 1
pval<-pt(t1, df = df.residual(lm.logLP.sexo.adu), lower.tail = TRUE)
Also, tests should be performed in single-sex models.
How could I check that they are greater or smaller than 1?
As in this post, this post, and as your in question, you can make Wald test which you compute by
t1<-(1-s1$coefficients[2,1])/s1$coefficients[2,2]
Alternatively, use the vcov and coef function to make the code more readable
fit <- lm.logLP.sexo.adu
t1<-(1-coef(fit)[1])/vcov(fit)[1, 1]
The Wald test gives you t-statistics which can be used to make both a two-sided or one-sided test. Thus, you can drop the abs and set the lower.tail argument according to which tail you want to test in.

Inference about Slope coefficient in R

By default lm summary test slope coefficient equal to zero. My question is very basic. I want to know how to test slope coefficient equal to non-zero value. One approach could be to use confint but this does not provide p-value. I also wonder how to do one-sided test with lm.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
summary(lm.D9)
Call:
lm(formula = weight ~ group)
Residuals:
Min 1Q Median 3Q Max
-1.0710 -0.4938 0.0685 0.2462 1.3690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0320 0.2202 22.850 9.55e-15 ***
groupTrt -0.3710 0.3114 -1.191 0.249
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.07308, Adjusted R-squared: 0.02158
F-statistic: 1.419 on 1 and 18 DF, p-value: 0.249
confint(lm.D9)
2.5 % 97.5 %
(Intercept) 4.56934 5.4946602
groupTrt -1.02530 0.2833003
Thanks for your time and effort.
as #power says, you can do by your hand.
here is an example:
> est <- summary.lm(lm.D9)$coef[2, 1]
> se <- summary.lm(lm.D9)$coef[2, 2]
> df <- summary.lm(lm.D9)$df[2]
>
> m <- 0
> 2 * abs(pt((est-m)/se, df))
[1] 0.2490232
>
> m <- 0.2
> 2 * abs(pt((est-m)/se, df))
[1] 0.08332659
and you can do one-side test by omitting 2*.
UPDATES
here is an example of two-side and one-side probability:
> m <- 0.2
>
> # two-side probability
> 2 * abs(pt((est-m)/se, df))
[1] 0.08332659
>
> # one-side, upper (i.e., greater than 0.2)
> pt((est-m)/se, df, lower.tail = FALSE)
[1] 0.9583367
>
> # one-side, lower (i.e., less than 0.2)
> pt((est-m)/se, df, lower.tail = TRUE)
[1] 0.0416633
note that sum of upper and lower probabilities is exactly 1.
Use the linearHypothesis function from car package. For instance, you can check if the coefficient of groupTrt equals -1 using.
linearHypothesis(lm.D9, "groupTrt = -1")
Linear hypothesis test
Hypothesis:
groupTrt = - 1
Model 1: restricted model
Model 2: weight ~ group
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.7075
2 18 8.7292 1 1.9782 4.0791 0.05856 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The smatr package has a slope.test() function with which you can use OLS.
In addition to all the other good answers, you could use an offset. It's a little trickier with categorical predictors, because you need to know the coding.
lm(weight~group+offset(1*(group=="Trt")))
The 1* here is unnecessary but is put in to emphasize that you are testing against the hypothesis that the difference is 1 (if you want to test against a hypothesis of a difference of d, then use d*(group=="Trt")
You can use t.test to do this for your data. The mu parameter sets the hypothesis for the difference of group means. The alternative parameter lets you choose between one and two-sided tests.
t.test(weight~group,var.equal=TRUE)
Two Sample t-test
data: weight by group
t = 1.1913, df = 18, p-value = 0.249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2833003 1.0253003
sample estimates:
mean in group Ctl mean in group Trt
5.032 4.661
t.test(weight~group,var.equal=TRUE,mu=-1)
Two Sample t-test
data: weight by group
t = 4.4022, df = 18, p-value = 0.0003438
alternative hypothesis: true difference in means is not equal to -1
95 percent confidence interval:
-0.2833003 1.0253003
sample estimates:
mean in group Ctl mean in group Trt
5.032 4.661
Code up your own test. You know the estimated coeffiecient and you know the standard error. You could construct your own test stat.

Resources