Interaction terms for an incomplete design in R

I'm trying to fit a GLM to some data, and I suspect there should be an interaction between two of the explanatory variables (one categorical and one discrete): all the non-zero values of the discrete variable occur in the "1" state of the categorical variable, which is partly why I expect an interaction.
When I put the interaction in the glm call (var1*var2), the summary just shows NA for the interaction term (var1:var2).
Any help would be appreciated!
Thank you!
Edit: here is a mock example of my issue
a <- data.frame(y = c(0, 1, 2, 3),
                var1 = c(0, 1, 1, 1),
                var2 = c(0, 0, 1, 2))
a.glm <- glm(y ~ var1*var2, family=poisson, data = a)
summary(a.glm)
This comes up in the console:
Call:
glm(formula = y ~ var1 * var2, family = poisson, data = a)
Deviance Residuals:
       1         2         3         4
-0.00002  -0.08284   0.12401  -0.04870

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -22.303  42247.166    0.00     1.00
var1          22.384  42247.166    0.00     1.00
var2           0.522      0.534    0.98     0.33
var1:var2         NA         NA      NA       NA
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.498681 on 3 degrees of freedom
Residual deviance: 0.024614 on 1 degrees of freedom
AIC: 13.63
Number of Fisher Scoring iterations: 20
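Note the line "Coefficients: (1 not defined because of singularities)": in the mock data, var2 is non-zero only where var1 is 1, so the var1:var2 column is identical to the var2 column and R drops the aliased term. A quick base-R check on the mock data above makes this visible:

X <- model.matrix(y ~ var1 * var2, data = a)
all(X[, "var2"] == X[, "var1:var2"])  # TRUE: the interaction column duplicates var2
qr(X)$rank < ncol(X)                  # TRUE: the design matrix is rank-deficient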

Related

What does NA in odds ratio mean?

I am currently working on landing-page testing where both the independent and dependent variables are logical variables. I wanted to check which of these variables, when true, is a major factor for a conversion.
So basically we are testing multiple variations of a single variable. For example, we have three different images; if image 1 is true for one row, the other two variables are false.
I used logistic regression to conduct this test. When I looked at the odds-ratio output, I ended up with a lot of NAs. I am not sure how to interpret them or how to rectify them.
Below is the sample dataset. The actual data has 18000+ rows.
classifier1 <- glm(formula = Target ~ .,
                   family = binomial,
                   data = Dataset)
This is the output.
Does this mean I need more data? Is there some other way to conduct multivariate landing page testing?
It looks like two or more of your variables (columns) are perfectly correlated. Try removing the redundant columns.
You can see this with a toy data.frame filled with random content:
n <- 20
# five random TRUE/FALSE predictor columns named a..e
y <- matrix(sample(c(TRUE, FALSE), 5 * n, replace = TRUE), ncol = 5)
colnames(y) <- letters[1:5]
z <- as.data.frame(y)
# alternating 0/1 target
z$target <- rep(0:1, 2 * n)[1:nrow(z)]
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
In the summary you can see that everything is OK:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.89808 -0.48166 -0.00004  0.64134  1.89222

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.3679  4700.1462  -0.005   0.9962
aTRUE         3.2286     1.6601   1.945   0.0518 .
bTRUE        20.2584  4700.1459   0.004   0.9966
cTRUE         0.7928     1.3743   0.577   0.5640
dTRUE        17.0438  4700.1460   0.004   0.9971
eTRUE         2.9238     1.6658   1.755   0.0792 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 14.867 on 14 degrees of freedom
AIC: 26.867
Number of Fisher Scoring iterations: 18
But if you make two columns perfectly correlated, as below, and then fit the generalized linear model again:
z$a <- z$b
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
you will observe NAs in the output, as below:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.66621 -1.01173  0.00001  1.06907  1.39309

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.8718  3243.8340  -0.006    0.995
aTRUE        18.7777  3243.8339   0.006    0.995
bTRUE             NA         NA      NA       NA
cTRUE         0.3544     1.0775   0.329    0.742
dTRUE        17.1826  3243.8340   0.005    0.996
eTRUE         1.1952     1.2788   0.935    0.350
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 19.996 on 15 degrees of freedom
AIC: 29.996
Number of Fisher Scoring iterations: 17
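If it is not obvious which columns are redundant in a larger dataset, you can check the design matrix directly; a quick base-R sketch, reusing z and m from above:

X <- model.matrix(target ~ ., data = z)
qr(X)$rank < ncol(X)           # TRUE: some columns are linearly dependent
names(which(is.na(coef(m))))   # the terms R dropped ("bTRUE" here)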

Categorical variable in logistic regression in R

How do I include a categorical variable in a binary logistic regression in R? I want to test the influence of professional field (student, worker, teacher, self-employed) on the probability of purchasing a product.
In my example y is a binary variable (1 for buying the product, 0 for not buying).
- x1: gender (0 = male, 1 = female)
- x2: age (between 20 and 80)
- x3: the categorical variable (1 = student, 2 = worker, 3 = teacher, 4 = self-employed)
set.seed(123)
y<-round(runif(100,0,1))
x1<-round(runif(100,0,1))
x2<-round(runif(100,20,80))
x3<-round(runif(100,1,4))
test<-glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)
If I put x3 (the professional field) into the regression above as a plain numeric variable, I get the wrong estimates/interpretation for x3.
What do I have to do to get the right estimates for the categorical variable (x3)?
Thanks a lot
I suggest you set x3 as a factor variable; there is no need to create dummies yourself:
set.seed(123)
y <- round(runif(100,0,1))
x1 <- round(runif(100,0,1))
x2 <- round(runif(100,20,80))
x3 <- factor(round(runif(100,1,4)),labels=c("student", "worker", "teacher", "self-employed"))
test <- glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)
Here is the summary of your model:
Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial(link = "logit"))
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.4665 -1.1054 -0.9639  1.1979  1.4044

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.464751   0.806463   0.576    0.564
x1               0.298692   0.413875   0.722    0.470
x2              -0.002454   0.011875  -0.207    0.836
x3worker        -0.807325   0.626663  -1.288    0.198
x3teacher       -0.567798   0.615866  -0.922    0.357
x3self-employed -0.715193   0.756699  -0.945    0.345
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 135.98 on 94 degrees of freedom
AIC: 147.98
Number of Fisher Scoring iterations: 4
In any case, I suggest you study this post on R-bloggers:
https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/
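One more detail: R takes the first factor level ("student" here) as the reference category, so each x3 coefficient is a contrast against students. If you want a different baseline, you can relevel the factor before refitting; for example (test2 is just a new object name):

x3 <- relevel(x3, ref = "worker")   # make "worker" the reference category
test2 <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
summary(test2)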

How to analyze data with a binary response and two categorical variables in R

I have a set of data with a binary response (0 and 1) and two categorical variables (one with two levels and the other with four levels).
library(data.table)
data <- data.table(Factor1 = rep(c("A", "B", "C", "D"), each = 36),
                   Factor2 = rep(c(rep("Red", 18), rep("Blue", 18)), 4),
                   Response = rep(c(rep(1, 11), rep(0, 7), rep(0, 18)), 4))
I've been trying to analyze this with glm(), but I'm not sure it's the best way.
model<-glm(Response~Factor1+Factor2,family = binomial(),data=data)
summary(model)
Call:
glm(formula = Response ~ Factor1 + Factor2, family = binomial(),
    data = data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.37438  -0.00008  -0.00008   0.99245   0.99245

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.957e+01  1.267e+03  -0.015    0.988
Factor1B     8.942e-15  6.838e-01   0.000    1.000
Factor1C     7.681e-15  6.838e-01   0.000    1.000
Factor1D     7.345e-15  6.838e-01   0.000    1.000
Factor2Red   2.002e+01  1.267e+03   0.016    0.987
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 177.264 on 143 degrees of freedom
Residual deviance: 96.228 on 139 degrees of freedom
AIC: 106.23
Number of Fisher Scoring iterations: 18
According to this, none of the coefficients are significant. But when I look at the data, there is evidently a difference between "Red" and "Blue".
data[,sum(Response),by=c("Factor1","Factor2")]
   Factor1 Factor2 V1
1:       A     Red 11
2:       A    Blue  0
3:       B     Red 11
4:       B    Blue  0
5:       C     Red 11
6:       C    Blue  0
7:       D     Red 11
8:       D    Blue  0
I was expecting the coefficient Factor2Red to be significant, but it was not. I think that may be because of the high standard error for this coefficient.
If I check the odds ratios, I see that the value for this coefficient is very high. But I do not know if that is enough to say that there is a significant effect of being red or blue.
exp(cbind(coef(model)))
                    [,1]
(Intercept) 3.181005e-09
Factor1B    1.000000e+00
Factor1C    1.000000e+00
Factor1D    1.000000e+00
Factor2Red  4.940037e+08
Would you recommend another way to analyze this?
Factor2 Red vs. Blue is significant. I believe the logistic model is unstable because every Response for Factor2 = Blue is 0 (complete separation), which is why the estimate and standard error for Factor2Red blow up. You can run Fisher's exact test instead -- see the documentation at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html
Try this:
fisher.test(data$Factor2, data$Response, conf.level = 0.95)$conf.int
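You can also tabulate the data first and pass the resulting 2x2 table to fisher.test; the table itself makes the separation plain (Blue has no successes at all):

tab <- table(data$Factor2, data$Response)
tab
#         0  1
#  Blue  72  0
#  Red   28 44
fisher.test(tab)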
Here is an informative plot:
library(ggplot2)
data$Factor1Factor2 <- interaction(data$Factor1, data$Factor2)
ggplot(data, aes(x = Factor1Factor2, y = Response, fill = Factor1)) +
geom_boxplot()

R duplicating predictor variables with glm and categorical variables

I am building a glm in R with categorical predictors and a binary response. My data looks like this (but much bigger and with multiple predictors):
y <- c(1,1,1,0,0) #response
x <- c(0,0,0,1,2) #predictor
Since this data is categorical (but it is represented by numbers), I did this:
y <- as.factor(y)
x <- as.factor(x)
And then I built my model:
g1 <- glm(y~x, family=binomial(link="logit"))
But the details of the model are the following:
g1
Call: glm(formula = y ~ x, family = binomial(link = "logit"))
Coefficients:
(Intercept)           x1           x2
      24.57       -49.13       -49.13
Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 6
And the summary is:
summary(g1)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
        1          2          3          4          5
6.547e-06  6.547e-06  6.547e-06 -6.547e-06 -6.547e-06

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    24.57   75639.11       0        1
x1            -49.13  151278.15       0        1
x2            -49.13  151278.15       0        1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.7301e+00 on 4 degrees of freedom
Residual deviance: 2.1434e-10 on 2 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 23
What I don't understand is why R has split the x predictor into x1 and x2. What do x1 and x2 mean?
I also need to write down the model explicitly with the estimates, something of the form y ~ B0 + B1*x, so I am stuck now because x has been split in two and there are no original variables called x1 and x2...
Thanks for your help!
This happens because you have made x a factor. This factor has three levels (0, 1 and 2). When you put a categorical variable in a regression model, one way of coding it is to use a reference category. In this case R has chosen the 0 level as the reference category. The coefficients of x1 and x2 are then the differences between level 0 and level 1, and between level 0 and level 2, respectively.
This is pretty standard in regression, so you shouldn't find it too surprising. Perhaps you were just confused about how R named the coefficients.
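You can see the coding directly in the design matrix R builds from the factor; a quick illustration with the same x:

x <- as.factor(c(0, 0, 0, 1, 2))
model.matrix(~ x)
#   (Intercept) x1 x2
# 1           1  0  0
# 2           1  0  0
# 3           1  0  0
# 4           1  1  0
# 5           1  0  1

So the fitted model is of the form y ~ B0 + B1*x1 + B2*x2, where x1 and x2 are 0/1 indicators for levels 1 and 2 (level 0 is absorbed into the intercept).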

What alternative ways are there to specify binomial successes/trials in a formula?

Suppose you are modelling binomial data where each response is a number of successes (y) out of a number of trials (N), with some explanatory variables (a and b). There are a few functions that do this kind of thing, and they all seem to use different methods to specify y and N.
In glm, you do glm(cbind(y, N-y) ~ a + b, family = binomial, data = d) (matrix of successes/failures on the LHS)
In inla, you do inla(y~a+b, Ntrials=d$N, data=d) (specify number of trials separately)
In glmmBUGS, you do glmmBUGS(y+N~a+b,data=d) (specify success + trials as terms on LHS)
When programming new methods, I've always thought it best to follow what glm does, since that's where people would normally first encounter binomial response data. However, I can never remember if it's cbind(y, N-y) or cbind(y, N) - and I usually seem to have successes/number of trials in my data rather than successes/number of failures - YMMV.
Other approaches are possible, of course. For example, using a function on the RHS to mark whether the variable is the number of trials or the number of failures:
myblm( y ~ a + b + Ntrials(N), data=d)
myblm( y ~ a + b + Nfails(M), data=d) # if your dataset has succ/fail variables
or defining an operator to just do a cbind, so you can do:
myblm( y %of% N ~ a + b, data=d)
thus attaching some meaning to the LHS making it explicit.
Has anyone got any better ideas? What's the right way to do this?
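For what it's worth, the %of% idea works as-is with plain glm, because R evaluates function calls (including user-defined infix operators) on the left-hand side of a formula. A minimal sketch with made-up toy data:

`%of%` <- function(y, N) cbind(y, N - y)

# toy data, just to show the operator runs with plain glm
d <- data.frame(y = c(2, 5, 1), N = c(4, 8, 3), a = c(0.1, 0.5, 0.9))
fit <- glm(y %of% N ~ a, family = binomial, data = d)
coef(fit)

Defining the operator in the same environment as the formula is what makes the lookup succeed.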
You can also let the response be a fraction, in which case you need to supply the weights. The weights do not go in the formula argument, but it takes almost the same number of keystrokes as if they did. Here is an example:
> set.seed(73574836)
> x <- runif(10)
> n <- sample.int(10, 2)
> y <- sapply(mapply(rbinom, size = 1, n, (1 + exp(1 - x))^-1), function(x)
+ sum(x == 1))
> df <- data.frame(y = y, frac = y / n, x = x, weights = n)
> df
    y  frac      x weights
1   2 1.000 0.9051       2
2   5 0.625 0.3999       8
3   1 0.500 0.4649       2
4   4 0.500 0.5558       8
5   0 0.000 0.8932       2
6   3 0.375 0.1825       8
7   1 0.500 0.1879       2
8   4 0.500 0.5041       8
9   0 0.000 0.5070       2
10  3 0.375 0.3379       8
>
> # the following two fits are identical
> summary(glm(cbind(y, weights - y) ~ x, binomial(), df))
Call:
glm(formula = cbind(y, weights - y) ~ x, family = binomial(),
    data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.731  -0.374   0.114   0.204   1.596

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.5135 on 9 degrees of freedom
Residual deviance: 9.3639 on 8 degrees of freedom
AIC: 28.93
Number of Fisher Scoring iterations: 3
> summary(glm(frac ~ x, binomial(), df, weights = weights))
Call:
glm(formula = frac ~ x, family = binomial(), data = df, weights = weights)
Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.731  -0.374   0.114   0.204   1.596

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.5135 on 9 degrees of freedom
Residual deviance: 9.3639 on 8 degrees of freedom
AIC: 28.93
Number of Fisher Scoring iterations: 3
The reason the above works comes down to what glm actually does for binomial outcomes. It computes a fraction for each observation and a weight associated with the observation, regardless of how you specify the outcome. Here is a snippet from ?glm which gives a hint of what is going on in the estimation:
If a binomial glm model was specified by giving a two-column response,
the weights returned by prior.weights are the total numbers of cases
(factored by the supplied case weights) and the component y of the
result is the proportion of successes.
Alternatively, you can make a wrapper for glm.fit or glm using model.frame. See the ... argument in ?model.frame
... for model.frame methods, a mix of further arguments such as
data, na.action, subset to pass to the default method. Any
additional arguments (such as offset and weights or other named
arguments) which reach the default method are used to create further
columns in the model frame, with parenthesised names such as
"(offset)".
Comment
I saw Ben Bolker's comment afterwards. The above is what he points out.
I like this method from the glm documentation:
For binomial and quasibinomial families the response can also be
specified as a factor (when the first level denotes failure and all
others success)
This comports well with the way successes and failures often arise in my experience: one outcome is a catch-all (e.g. "didn't vote") and there are a variety of ways to achieve the other (e.g. "voted for A", "voted for B"). I hope it's clear from the way I'm phrasing this that "success" and "failure" in glm's sense can be assigned arbitrarily, so that the first level corresponds to "failure" and all the other levels count as "success".
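As a toy illustration of that factor form (all names made up), with one row per trial and the catch-all outcome as the first level:

votes <- data.frame(
  # first level ("didn't vote") counts as failure; every other level as success
  choice = factor(c("didn't vote", "voted for A", "voted for B", "didn't vote"),
                  levels = c("didn't vote", "voted for A", "voted for B")),
  age = c(23, 35, 41, 58)
)
glm(choice ~ age, family = binomial, data = votes)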
From R's help page on glm:
"...or as a two-column matrix with the columns giving the numbers of successes and failures"
So it has to be cbind(Y, N-Y).
