What does NA in odds ratio mean?

I am currently working on landing page testing where both the independent and dependent variables are logical. I want to check which of these variables, when true, is a major factor in a conversion.
Essentially we are testing multiple variations of a single element. For example, we have three different images; if image 1 is TRUE for a row, the other two image variables are FALSE.
I used logistic regression to conduct this test. When I looked at the odds ratio output, I ended up with a lot of NAs. I am not sure how to interpret them or how to fix them.
Below is the sample dataset; the actual data has 18,000+ rows.
classifier1 <- glm(formula = Target ~ .,
                   family = binomial,
                   data = Dataset)
This is the output.
Does this mean I need more data? Is there some other way to conduct multivariate landing page testing?

It looks like two or more of your variables (columns) are perfectly collinear. Your setup makes this easy to get: if exactly one of the three image indicators is TRUE in every row, each indicator is fully determined by the other two (the classic dummy-variable trap), and glm drops the redundant ones, reporting NA for their coefficients. Try removing the redundant columns.
You can see this with a toy data.frame filled with random content:
n <- 20
# Five random logical predictor columns, n rows
y <- matrix(sample(c(TRUE, FALSE), 5 * n, replace = TRUE), ncol = 5)
colnames(y) <- letters[1:5]
z <- as.data.frame(y)
# Alternating 0/1 target
z$target <- rep(0:1, 2 * n)[1:nrow(z)]
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
In the summary you can see that everything is OK:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.89808 -0.48166 -0.00004 0.64134 1.89222
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.3679 4700.1462 -0.005 0.9962
aTRUE 3.2286 1.6601 1.945 0.0518 .
bTRUE 20.2584 4700.1459 0.004 0.9966
cTRUE 0.7928 1.3743 0.577 0.5640
dTRUE 17.0438 4700.1460 0.004 0.9971
eTRUE 2.9238 1.6658 1.755 0.0792 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 14.867 on 14 degrees of freedom
AIC: 26.867
Number of Fisher Scoring iterations: 18
But if you make two columns perfectly correlated, as below, and then refit the generalized linear model:
z$a <- z$b
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
you can observe NAs, as below:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.66621 -1.01173 0.00001 1.06907 1.39309
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.8718 3243.8340 -0.006 0.995
aTRUE 18.7777 3243.8339 0.006 0.995
bTRUE NA NA NA NA
cTRUE 0.3544 1.0775 0.329 0.742
dTRUE 17.1826 3243.8340 0.005 0.996
eTRUE 1.1952 1.2788 0.935 0.350
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 19.996 on 15 degrees of freedom
AIC: 29.996
Number of Fisher Scoring iterations: 17
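If you want to find the offending columns in the real data before fitting, a minimal sketch (using the Dataset and Target names from your glm call) is to check the rank of the model matrix:
# Build the design matrix that glm would use
X <- model.matrix(Target ~ ., data = Dataset)
qrX <- qr(X)
# Columns beyond the matrix rank (in pivot order) are linear
# combinations of the others; their coefficients come out as NA
colnames(X)[qrX$pivot[-seq_len(qrX$rank)]]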

Related

Likelihood ratio test: 'models were not all fitted to the same size of dataset'

I'm an absolute R beginner and need some help with my likelihood ratio tests for my univariate analyses. Here's the code:
#Univariate analysis for conscientiousness (categorical)
library(lmtest)  # for lrtest()
fit <- glm(BCS_Bin ~ Conscientiousness_cat, data = dat, family = binomial)
summary(fit)
#Likelihood ratio test
fit0 <- glm(BCS_Bin ~ 1, data = dat, family = binomial)
summary(fit0)
lrtest(fit, fit0)
The results are:
Call:
glm(formula = BCS_Bin ~ Conscientiousness_cat, family = binomial,
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8847 -0.8847 -0.8439 1.5016 1.5527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.84933 0.03461 -24.541 <2e-16 ***
Conscientiousness_catLow 0.11321 0.05526 2.049 0.0405 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7962.1 on 6439 degrees of freedom
Residual deviance: 7957.9 on 6438 degrees of freedom
(1963 observations deleted due to missingness)
AIC: 7961.9
Number of Fisher Scoring iterations: 4
And:
Call:
glm(formula = BCS_Bin ~ 1, family = binomial, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8524 -0.8524 -0.8524 1.5419 1.5419
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.82535 0.02379 -34.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10251 on 8337 degrees of freedom
Residual deviance: 10251 on 8337 degrees of freedom
(65 observations deleted due to missingness)
AIC: 10253
Number of Fisher Scoring iterations: 4
For my LRT:
Error in lrtest.default(fit, fit0) :
models were not all fitted to the same size of dataset
I understand that this is happening because different numbers of observations are missing? That's because the data come from a large questionnaire, and many more dropouts had occurred by the question assessing my predictor variable (conscientiousness) than by the outcome variable (body condition score/BCS). So I simply have more data for BCS than for conscientiousness, for example (and it's producing the same error for many of my other variables too).
In order to run the likelihood ratio test, the model with just the intercept has to be fit to the same observations as the model that includes Conscientiousness_cat. So, you need the subset of the data that has no missing values for Conscientiousness_cat:
BCS_bin_subset = BCS_bin[complete.cases(BCS_bin[,"Conscientiousness_cat"]), ]
You can run both models on this subset of the data and the likelihood ratio test should then run without error. For example:
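A minimal sketch (assuming BCS_bin is the data frame the question's models were fit to, with lmtest loaded for lrtest):
# Refit both models on the same rows, then compare
fit <- glm(BCS_Bin ~ Conscientiousness_cat, data = BCS_bin_subset, family = binomial)
fit0 <- glm(BCS_Bin ~ 1, data = BCS_bin_subset, family = binomial)
lrtest(fit, fit0)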
In your case, you could also do:
BCS_bin_subset = BCS_bin[!is.na(BCS_bin$Conscientiousness_cat), ]
However, it's nice to have complete.cases handy when you want a subset of a data frame with no missing values across multiple variables.
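For instance, a one-line sketch (BCS_bin_sub is just an illustrative name) keeping only rows complete on both the outcome and the predictor:
BCS_bin_sub <- BCS_bin[complete.cases(BCS_bin[, c("BCS_Bin", "Conscientiousness_cat")]), ]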
Another option, which is more convenient if you're going to run multiple models but also more complex, is to first fit whatever model uses the largest number of variables from BCS_bin (since that model will exclude the largest number of observations due to missingness), and then use the update function to reduce it to models with fewer variables. We just need to make sure that update uses the same observations each time, which we do using a wrapper function defined below. Here's an example using the built-in mtcars data frame:
library(lmtest)
dat = mtcars
# Create some missing values in mtcars
dat[1, "wt"] = NA
dat[5, "cyl"] = NA
dat[7, "hp"] = NA
# Wrapper function to ensure the same observations are used for each
# updated model as were used in the first model
# From https://stackoverflow.com/a/37341927/496488
update_nested <- function(object, formula., ..., evaluate = TRUE){
  update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
m1 = lm(mpg ~ wt + cyl + hp, data=dat)
m2 = update_nested(m1, . ~ . - wt) # Remove wt
m3 = update_nested(m1, . ~ . - cyl) # Remove cyl
m4 = update_nested(m1, . ~ . - wt - cyl) # Remove wt and cyl
m5 = update_nested(m1, . ~ . - wt - cyl - hp) # Remove all three variables (i.e., model with intercept only)
lrtest(m5,m4,m3,m2,m1)
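(This works because a fitted lm/glm object stores the model frame it was actually estimated on, i.e. the rows remaining after NA removal, in object$model, so every updated model is refit on exactly those observations and lrtest sees models fitted to the same dataset.)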

Find value of covariates given a probability in logistic regression

I have the model
am.glm = glm(formula=am ~ hp + I(mpg^2), data=mtcars, family=binomial)
which gives
> summary(am.glm)
Call:
glm(formula = am ~ hp + I(mpg^2), family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5871 -0.5376 -0.1128 0.1101 1.6937
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.71428 8.45330 -2.214 0.0268 *
hp 0.04689 0.02367 1.981 0.0476 *
I(mpg^2) 0.02811 0.01273 2.207 0.0273 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 20.385 on 29 degrees of freedom
AIC: 26.385
Number of Fisher Scoring iterations: 7
Given a value of hp, I would like to find the values of mpg that would lead to a 50% probability of am.
I haven't managed to find anything built in that outputs such predictions, but I have managed to code something:
#Coefficients
glm.intercept <- as.numeric(coef(am.glm)[1])
glm.hp.beta <- as.numeric(coef(am.glm)[2])
glm.mpg.sq.beta <- as.numeric(coef(am.glm)[3])
#Constants
prob <- 0.9
c <- log(prob/(1-prob))
hp <- 120
# Solve intercept + beta_hp*hp + beta_mpg2*mpg^2 = logit(prob) for mpg.
# polyroot() takes coefficients in increasing order of power; the model
# has no linear-in-mpg term, so that coefficient is 0.
polyroot(c(glm.intercept + glm.hp.beta*hp - c, 0, glm.mpg.sq.beta))
Is there a more elegant solution? Perhaps a predict function equivalent?
Interesting problem!
How about the solution below? Basically, create newdata in which your variable of interest samples the range of observed values, predict for this vector of values, and find the minimum value that meets your criterion.
# Your desired threshold
prob = 0.5
# Create a sampling grid: fixed hp, mpg spanning its observed range
df_new <- data.frame(
  hp = rep(120, times = 100),
  mpg = seq(from = range(mtcars$mpg)[1],
            to = range(mtcars$mpg)[2],
            length.out = 100))
# Predict on the probability scale (type = "response"; the default
# link scale would return log-odds, not probabilities)
df_new$am <- predict(am.glm, newdata = df_new, type = "response")
# Find the lowest mpg whose predicted probability exceeds the threshold
df_new[min(which(df_new$am > prob)), 'mpg']
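If you want the exact crossing point rather than the nearest grid value, a minimal sketch (assuming, as the grid search above does, that the predicted probability is monotone in mpg over the observed range):
# Difference between predicted probability and the target probability
f <- function(mpg, hp = 120, p = 0.5) {
  predict(am.glm, newdata = data.frame(hp = hp, mpg = mpg),
          type = "response") - p
}
# The root is the mpg giving exactly a 50% predicted probability
uniroot(f, interval = range(mtcars$mpg))$root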

How to calculate R logistic regression standard error values manually? [duplicate]

I want to know how the standard error values from a logistic regression in R are calculated manually.
I took some sample data and fit a binomial logistic regression to it:
data = data.frame(x = c(1,2,3,4,5,6,7,8,9,10), y = c(0,0,0,1,1,1,0,1,1,1))
model = glm(y ~ x, data = data, family = binomial(link = "logit"))
My summary of the model is below, and I have no idea how the standard errors were calculated:
> summary(model)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9367 -0.5656 0.2641 0.6875 1.2974
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9265 2.0601 -1.421 0.1554
x 0.6622 0.4001 1.655 0.0979 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13.4602 on 9 degrees of freedom
Residual deviance: 8.6202 on 8 degrees of freedom
AIC: 12.62
Number of Fisher Scoring iterations: 5
It would be great if someone could answer this... thanks in advance.
You can calculate these as the square roots of the diagonal elements of the unscaled covariance matrix returned by summary(model):
sqrt(diag(summary(model)$cov.unscaled)*summary(model)$dispersion)
# (Intercept) x
# 2.0600893 0.4000937
For your model, the dispersion parameter is 1, so the last term (summary(model)$dispersion) could be dropped if you want.
To get this unscaled covariance matrix yourself, you do
X <- model.matrix(model)  # the design matrix used in the fit
fittedVals <- model$fitted.values
W <- diag(fittedVals*(1 - fittedVals))  # IRLS weights p(1 - p)
solve(t(X) %*% W %*% X)  # (X'WX)^-1, the unscaled covariance matrix
# (Intercept) x
# (Intercept) 4.2439753 -0.7506158
# x -0.7506158 0.1600754
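Taking square roots of the diagonal then reproduces the standard errors shown in the summary:
sqrt(diag(solve(t(X) %*% W %*% X)))
# (Intercept)           x
#   2.0600893   0.4000937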

A glm with interactions for overdispersed rates

I have measurements obtained from 2 groups (a and b), where each group has the same 3 levels (x, y, z). The measurements are counts out of totals (i.e., rates), but in group a there cannot be zeros, whereas in group b there can (hard-coded in the example below).
Here's my example data.frame:
set.seed(3)
df <- data.frame(count = c(rpois(15,5),rpois(15,5),rpois(15,3),
rpois(15,7.5),rpois(15,2.5),rep(0,15)),
group = as.factor(c(rep("a",45),rep("b",45))),
level = as.factor(rep(c(rep("x",15),rep("y",15),rep("z",15)),2)))
#add total - fixed for all
df$total <- rep(max(df$count)*2,nrow(df))
I'm interested in quantifying, for each level x, y, z, whether there is any difference between the (average) measurements of a and b, and if there is, whether it is statistically significant.
From what I understand, a Poisson GLM for rates seems appropriate for these types of data. In my case a negative binomial GLM may be even more appropriate, since my data are overdispersed (I tried to create that in my example data to some extent, but in my real data it is definitely the case).
Following the answer I got to a previous post, I went with:
library(dplyr)
library(MASS)
df %>%
  mutate(interactions = paste0(group, ":", level),
         interactions = ifelse(group == "a", "a", interactions)) -> df2
df2$interactions = as.factor(df2$interactions)
fit <- glm.nb(count ~ interactions + offset(log(total)), data = df2)
> summary(fit)
Call:
glm.nb(formula = count ~ interactions + offset(log(total)), data = df2,
init.theta = 41.48656798, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.40686 -0.75495 -0.00009 0.46892 2.28720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02047 0.07824 -25.822 < 2e-16 ***
interactionsb:x 0.59336 0.13034 4.552 5.3e-06 ***
interactionsb:y -0.28211 0.17306 -1.630 0.103
interactionsb:z -20.68331 2433.94201 -0.008 0.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(41.4866) family taken to be 1)
Null deviance: 218.340 on 89 degrees of freedom
Residual deviance: 74.379 on 86 degrees of freedom
AIC: 330.23
Number of Fisher Scoring iterations: 1
Theta: 41.5
Std. Err.: 64.6
2 x log-likelihood: -320.233
I'd expect the difference between a and b for level z to be significant. However, the Std. Error for level z is enormous, and hence the p-value is nearly 1.
My question is whether the model I'm using is set up correctly to answer my question (mainly through the use of the interactions factor)?

Extracting model equation from glm function in R

I've made a logistic regression to combine two independent variables in R using the pROC package, and I obtain this:
summary(fit)
#Call: glm(formula = Case ~ X + Y, family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.5751 -0.8277 -0.6095 1.0701 2.3080
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.153731 0.538511 -0.285 0.775281
#X -0.048843 0.012856 -3.799 0.000145 ***
#Y 0.028364 0.009077 3.125 0.001780 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 287.44 on 241 degrees of freedom
#Residual deviance: 260.34 on 239 degrees of freedom
#AIC: 266.34
#Number of Fisher Scoring iterations: 4
fit
#Call: glm(formula = Case ~ X + Y, family = "binomial", data = data)
#Coefficients:
# (Intercept) X Y
# -0.15373 -0.04884 0.02836
#Degrees of Freedom: 241 Total (i.e. Null); 239 Residual
#Null Deviance: 287.4
#Residual Deviance: 260.3 AIC: 266.3
Now I need to extract some information from these results, but I'm not sure how to do it.
First, I need the model equation: supposing that fit is a combined predictor called CP, could it be CP = -0.15 - 0.05X + 0.03Y?
Then, the combined predictor resulting from the regression should have a median value, so that I can compare the medians of the two groups, Cases and Controls, which I used to fit the regression (in other words, my X and Y variables are N-dimensional with N = N1 + N2, where N1 = number of Controls, for which Case = 0, and N2 = number of Cases, for which Case = 1).
I hope to have explained everything.
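For what it's worth, a minimal sketch of both steps (using the fit object and data from the question; CP is the linear predictor, i.e. the log-odds, which the rounded equation above approximates):
# Linear predictor CP = -0.153731 - 0.048843*X + 0.028364*Y per observation
data$CP <- predict(fit, type = "link")
# Median of CP within Controls (Case = 0) and Cases (Case = 1)
tapply(data$CP, data$Case, median)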
