I am building a glm in R with categorical predictors and a binary response. My data is like this (but much bigger and with multiple predictors):
y <- c(1,1,1,0,0) #response
x <- c(0,0,0,1,2) #predictor
Since this data is categorical (but it is represented by numbers), I did this:
y <- as.factor(y)
x <- as.factor(x)
And then I built my model:
g1 <- glm(y~x, family=binomial(link="logit"))
But the details of the model are the following:
g1
Call: glm(formula = y ~ x, family = binomial(link = "logit"))
Coefficients:
(Intercept) x1 x2
24.57 -49.13 -49.13
Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 6
And the summary is:
summary(g1)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
1 2 3 4 5
6.547e-06 6.547e-06 6.547e-06 -6.547e-06 -6.547e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 24.57 75639.11 0 1
x1 -49.13 151278.15 0 1
x2 -49.13 151278.15 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.7301e+00 on 4 degrees of freedom
Residual deviance: 2.1434e-10 on 2 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 23
What I don't understand is: why has R split the x predictor into x1 and x2? What do x1 and x2 mean?
I also need to explicitly write down the model with the estimates, something of the form y ~ B0 + B1*x, so I am stuck now because x has been split into two and there are no original variables called x1 and x2...
Thanks for your help!
This happens because you have made x a factor. This factor has three levels (0, 1, and 2). When you put a categorical variable in a regression model, one common way of coding it is to use a reference category. In this case R has chosen the 0 level as the reference category. The coefficients of x1 and x2 are then the differences between level 0 and level 1, and between level 0 and level 2, respectively.
This is pretty standard in regression so you shouldn't find it too surprising. Perhaps you were just confused about how R named the coefficients.
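To write the model out explicitly, note that x1 and x2 are just indicator (dummy) variables created from the factor. Here is a small sketch of the fitted equation, and of how to pick a different reference level yourself with relevel() before refitting (the refit object g2 is just a new name for illustration):
# x1 = 1 when x is "1" (else 0), x2 = 1 when x is "2" (else 0); x = "0" is the reference,
# so the fitted model can be written as
#   logit(P(y = 1)) = 24.57 - 49.13 * x1 - 49.13 * x2
# To choose the reference level yourself, relevel the factor and refit:
x <- relevel(x, ref = "1")
g2 <- glm(y ~ x, family = binomial(link = "logit"))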
I'm trying to fit a GLM on some data and I feel like there should be an interaction term between two of the explanatory variables (one categorical and one discrete) but all the non-zero instances of the discrete variable occur on the "1" state of the categorical variable (partly why I feel like there should be an interaction).
When I put the interaction in the glm (var1*var2), it just shows NA for the interaction term (var1:var2) in the summary output.
Any help would be appreciated!
Thank you!
Edit: here is a mock example of my issue
a <- data.frame(y = c(0, 1, 2, 3),
                var1 = c(0, 1, 1, 1),
                var2 = c(0, 0, 1, 2))
a.glm <- glm(y ~ var1*var2, family=poisson, data = a)
summary(a.glm)
This comes up in the console:
Call:
glm(formula = y ~ var1 * var2, family = poisson, data = a)
Deviance Residuals:
1 2 3 4
-0.00002 -0.08284 0.12401 -0.04870
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.303 42247.166 0.00 1.00
var1 22.384 42247.166 0.00 1.00
var2 0.522 0.534 0.98 0.33
var1:var2 NA NA NA NA
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.498681 on 3 degrees of freedom
Residual deviance: 0.024614 on 1 degrees of freedom
AIC: 13.63
Number of Fisher Scoring iterations: 20
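For a mock data set like this, one way to see where the NA comes from is to inspect the design matrix; here var1 * var2 is identical to var2, so the interaction column is perfectly collinear and gets dropped. A small illustration using the objects defined above:
# Look at the columns glm() actually uses
model.matrix(~ var1 * var2, data = a)
# the var1:var2 column equals the var2 column, which is why its coefficient is NA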
In the process of studying logistic regression using caret's mdrr data, some questions arose.
I created a full model using a total of 19 variables, and I have questions about the notation of the categorical variable.
In my regression model, the categorical variables are:
nDB : 0 or 1 or 2
nR05 : 0 or 1
nR10 : 1 or 2
I created a full model using glm, but I do not know why the names of the categorical variables have one of the category values appended to them.
-------------------------------------------------------------------------------
glm(formula = mdrrClass ~ ., family = binomial, data = train)
#Coefficients:
#(Intercept) nDB1 nDB2 nX nR051 nR101 nBnz2
#5.792e+00 5.287e-01 -3.103e-01 -2.532e-01 -9.291e-02 9.259e-01 -2.108e+00
#SPI BLI PW4 PJI2 Lop BIC2 VRA1
#3.222e-05 -1.201e+01 -3.754e+01 -5.467e-01 1.010e+00 -5.712e+00 -2.424e-04
# PCR H3D FDI PJI3 DISPm DISPe G.N..N.
# -6.397e-02 -4.360e-04 3.458e+01 -6.579e+00 -5.690e-02 2.056e-01 -7.610e-03
#Degrees of Freedom: 263 Total (i.e. Null); 243 Residual
#Null Deviance: 359.3
#Residual Deviance: 232.6 AIC: 274.6
-------------------------------------------------------------------------------
The results above show that nDB appears with a number attached (nDB1, nDB2), and the same happens for nR05 and nR10 (nR051, nR101).
I am wondering why numbers are attached as above.
When you have categorical predictors in any regression model, you need to create dummy variables. R does this for you, and what you see in the output are the contrasts.
Your variable nDB has 3 levels: 0, 1, 2
One of those needs to be chosen as the reference level (R chose 0 for you in this case, but this can also be specified manually). Then dummy variables are created to compare every other level against your reference level: 0 vs 1 and 0 vs 2.
R names these dummy variables nDB1 and nDB2. nDB1 is for the 0 vs 1 contrast, and nDB2 is for the 0 vs 2 contrast. The numbers after the variable names are just to indicate which contrast you're looking at
The coefficient values are interpreted as the difference in your y (outcome) value between groups 0 and 1 (nDB1), and separately between groups 0 and 2 (nDB2). In other words, what change in the outcome would you expect when moving from one group to the other?
Your other categorical variables have 2 levels and are just a simpler case of the above
For example, nR05 only has 0 and 1 as values. 0 was chosen as your reference, and because there's only one possible contrast here, a single dummy variable is created comparing 0 vs 1. In the output that dummy variable is called nR051.
This is always the case for categorical variables, especially when they are not binary (like your nDB): it is so that you know which value each coefficient refers to. For the nDB variable the model has created two new variables: nDB1, which equals 1 if nDB = 1 and 0 if nDB = 0 or nDB = 2, and nDB2, which equals 1 if nDB = 2 and 0 otherwise.
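If it helps, you can see these dummy columns directly with model.matrix(); a small sketch with a made-up nDB vector:
nDB <- factor(c(0, 1, 2, 2, 0, 1))  # made-up values, just to illustrate the coding
model.matrix(~ nDB)
# the columns nDB1 and nDB2 are 1 only where nDB is 1 or 2 respectively;
# rows where nDB is 0 (the reference level) are 0 in both columns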
To analyze a binary variable (whose values would be TRUE / FALSE, 0/1, or YES / NO) according to a quantitative explanatory variable, a logistic regression can be used.
Consider for example the following data, where x is the age of 40 people and y the variable indicating whether they bought a death metal album in the last 5 years (1 if "yes", 0 if "no").
Graphically, we can see that the older people are, the less likely they are to buy death metal.
Logistic regression is a special case of the Generalized Linear Model (GLM).
With a classical linear regression model, we consider the following model:
Y = αX + β
The expectation of Y is therefore predicted as follows:
E(Y) = αX + β
Here, because of the binary distribution of Y, the above relations can not apply. To "generalize" the linear model, we therefore consider that
g(E(Y)) = αX + β
where g is a link function.
In this case, for a logistic regression, the link function corresponds to the logit function:
logit(p) = log(p / (1 - p))
Note that this logit function transforms a value p between 0 and 1 (such as a probability, for example) into a value between -∞ and +∞.
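In R the logit and its inverse are available as qlogis() and plogis(), so this is easy to check:
qlogis(0.5)    # 0
qlogis(0.9)    # log(0.9 / 0.1) = about 2.197
plogis(2.197)  # back to roughly 0.9, always between 0 and 1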
Here's how to do the logistic regression under R:
myreg=glm(y~x, family=binomial(link=logit))
summary(myreg)
glm(formula = y ~ x, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8686 -0.7764 0.3801 0.8814 2.0253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.9462 1.9599 3.034 0.00241 **
## x -0.1156 0.0397 -2.912 0.00360 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 52.925 on 39 degrees of freedom
## Residual deviance: 39.617 on 38 degrees of freedom
## AIC: 43.617
##
## Number of Fisher Scoring iterations: 5
We obtain the following model:
logit(E(Y)) = -0.12X + 5.95
and we note that the (negative) influence of age on the purchase of death metal albums is significant at the 5% level (Pr(>|z|) < 5%).
Thus, logistic regression is often used to bring out risk factors (like age, but also BMI, sex, and so on).
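To turn the fitted equation into predicted probabilities, apply the inverse logit to the linear predictor; for instance, using the myreg object fitted above:
predict(myreg, newdata = data.frame(x = c(40, 70)), type = "response")
# equivalently by hand: plogis(5.9462 - 0.1156 * c(40, 70))
# roughly 0.79 for a 40-year-old and 0.10 for a 70-year-old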
I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by column medv is more than 25 or not.
Below is my code for logistic regression.
high.medv <- ifelse(Boston$medv > 25, "Y", "N") # apply the condition to medv and store the result in a new variable called "high.medv"
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
Also, I need the following:
Now I'm required to use the misclassification rate as the measure of error for the two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
but I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you determine whether the result is positive or negative.
When you predict your outcomes on the test data set, you estimate probabilities of the response variable being either 1 or 0. Therefore, you need to tell it where you are going to cut, i.e. the threshold at which the prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g., true positives and false positives against each other, and then choose the trade-off that is best for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1 # roughly the first 2/3 of the (simulated) rows become the training set
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use misclassification error to evaluate your model, first you need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates = roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # shows the ratio of true negatives over all negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # shows the misclassifications given the 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can be computed as the number of observations you classified correctly divided by the size of your sample.
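For completeness, a minimal sketch of turning the confusion table above into an accuracy and a misclassification rate (using dat from above):
conf <- table(dat$y, dat$jj)             # confusion matrix at the 0.7 threshold
accuracy <- sum(diag(conf)) / sum(conf)  # (86 + 164) / 334, about 0.75
misclassification_rate <- 1 - accuracy   # about 0.25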
How do I implement a categorical variable in a binary logistic regression in R? I want to test the influence of professional field (student, worker, teacher, self-employed) on the probability of a purchase of a product.
In my example y is a binary variable (1 for buying a product, 0 for not buying).
- x1: gender (0 male, 1 female)
- x2: age (between 20 and 80)
- x3: the categorical variable (1 = student, 2 = worker, 3 = teacher, 4 = self-employed)
set.seed(123)
y<-round(runif(100,0,1))
x1<-round(runif(100,0,1))
x2<-round(runif(100,20,80))
x3<-round(runif(100,1,4))
test<-glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)
If I include x3 (the professional field) in my regression as above, I get the wrong estimates/interpretation for x3.
What do I have to do to get the right influence/estimates for the categorical variable (x3)?
Thanks a lot
I suggest you set x3 as a factor variable; there is no need to create dummies manually:
set.seed(123)
y <- round(runif(100,0,1))
x1 <- round(runif(100,0,1))
x2 <- round(runif(100,20,80))
x3 <- factor(round(runif(100,1,4)),labels=c("student", "worker", "teacher", "self-employed"))
test <- glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)
Here is the output of your model:
Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4665 -1.1054 -0.9639 1.1979 1.4044
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.464751 0.806463 0.576 0.564
x1 0.298692 0.413875 0.722 0.470
x2 -0.002454 0.011875 -0.207 0.836
x3worker -0.807325 0.626663 -1.288 0.198
x3teacher -0.567798 0.615866 -0.922 0.357
x3self-employed -0.715193 0.756699 -0.945 0.345
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 135.98 on 94 degrees of freedom
AIC: 147.98
Number of Fisher Scoring iterations: 4
In any case, I suggest you study this post on R-bloggers:
https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/
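If it helps with interpretation, the factor-level coefficients can also be turned into odds ratios relative to the reference level (student here); a small sketch using the test model above:
exp(coef(test))  # odds ratios; e.g. exp(-0.807) is about 0.45 for worker vs student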
Suppose you are modelling binomial data where each response is a number of successes (y) out of a number of trials (N), with some explanatory variables (a and b). There are a few functions that do this kind of thing, and they all seem to use different methods to specify y and N.
In glm, you do glm(cbind(y,N-y)~a+b, data = d) (matrix of success/fail on LHS)
In inla, you do inla(y~a+b, Ntrials=d$N, data=d) (specify number of trials separately)
In glmmBUGS, you do glmmBUGS(y+N~a+b,data=d) (specify success + trials as terms on LHS)
When programming new methods, I've always thought it best to follow what glm does, since that's where people would normally first encounter binomial response data. However, I can never remember if it's cbind(y, N-y) or cbind(y, N) - and I usually seem to have success/number of trials in my data rather than success/number of failures - YMMV.
Other approaches are possible of course. For example using a function on the RHS to mark whether the variable is number of trials or number of fails:
myblm( y ~ a + b + Ntrials(N), data=d)
myblm( y ~ a + b + Nfails(M), data=d) # if your dataset has succ/fail variables
or defining an operator to just do a cbind, so you can do:
myblm( y %of% N ~ a + b, data=d)
thus attaching some meaning to the LHS and making it explicit.
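For what it's worth, such an operator is only a line or two; a sketch of one possible definition (the name %of% is just the notation proposed above):
"%of%" <- function(y, N) cbind(y, N - y)  # build the success/failure matrix glm() expects
# glm(y %of% N ~ a + b, family = binomial, data = d) then behaves like the cbind() form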
Has anyone got any better ideas? What's the right way to do this?
You can also let y be the fraction of successes, in which case you need to supply the weights. The weights are not in the formula argument, but it takes almost the same number of keystrokes as if they were. Here is an example:
> set.seed(73574836)
> x <- runif(10)
> n <- sample.int(10, 2)
> y <- sapply(mapply(rbinom, size = 1, n, (1 + exp(1 - x))^-1), function(x)
+ sum(x == 1))
> df <- data.frame(y = y, frac = y / n, x = x, weights = n)
> df
y frac x weights
1 2 1.000 0.9051 2
2 5 0.625 0.3999 8
3 1 0.500 0.4649 2
4 4 0.500 0.5558 8
5 0 0.000 0.8932 2
6 3 0.375 0.1825 8
7 1 0.500 0.1879 2
8 4 0.500 0.5041 8
9 0 0.000 0.5070 2
10 3 0.375 0.3379 8
>
> # the following two fits are identical
> summary(glm(cbind(y, weights - y) ~ x, binomial(), df))
Call:
glm(formula = cbind(y, weights - y) ~ x, family = binomial(),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.731 -0.374 0.114 0.204 1.596
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.416 0.722 -0.58 0.56
x 0.588 1.522 0.39 0.70
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.5135 on 9 degrees of freedom
Residual deviance: 9.3639 on 8 degrees of freedom
AIC: 28.93
Number of Fisher Scoring iterations: 3
> summary(glm(frac ~ x, binomial(), df, weights = weights))
Call:
glm(formula = frac ~ x, family = binomial(), data = df, weights = weights)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.731 -0.374 0.114 0.204 1.596
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.416 0.722 -0.58 0.56
x 0.588 1.522 0.39 0.70
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.5135 on 9 degrees of freedom
Residual deviance: 9.3639 on 8 degrees of freedom
AIC: 28.93
Number of Fisher Scoring iterations: 3
The reason the above works comes down to what glm actually does for binomial outcomes. It computes a fraction for each observation and a weight associated with the observation, regardless of how you specify the outcome. Here is a snippet from ?glm which gives a hint of what is going on in the estimation:
If a binomial glm model was specified by giving a two-column response,
the weights returned by prior.weights are the total numbers of cases
(factored by the supplied case weights) and the component y of the
result is the proportion of successes.
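You can check this quoted behaviour directly on a fit like the one above; a quick sketch:
fit <- glm(cbind(y, weights - y) ~ x, binomial(), df)
fit$prior.weights  # the total number of trials per observation (2 or 8 here)
fit$y              # the proportion of successes, i.e. df$frac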
Alternatively, you can make a wrapper for glm.fit or glm using model.frame. See the ... argument in ?model.frame
... for model.frame methods, a mix of further arguments such as
data, na.action, subset to pass to the default method. Any
additional arguments (such as offset and weights or other named
arguments) which reach the default method are used to create further
columns in the model frame, with parenthesised names such as
"(offset)".
Comment
I saw Ben Bolker's comment afterwards. The above is what he points out.
I like this method from the glm documentation:
For binomial and quasibinomial families the response can also be
specified as a factor (when the first level denotes failure and all
others success)
This comports well with the way successes and failures often arise in my experience. One is a catch-all (e.g. "didn't vote") and there are a variety of ways to achieve the other (e.g. "voted for A", "voted for B"). I hope it's clear from the way I'm phrasing this that "success" and "failure" as defined by glm can be defined arbitrarily so that the first level corresponds to a "failure" and all the other levels are a "success".
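A minimal sketch of that factor-response form, assuming a hypothetical data frame votes with columns turnout, age, and income, where the first level of turnout is "didn't vote" (failure) and all other levels count as success:
votes$turnout <- factor(votes$turnout)  # hypothetical data; first level denotes failure
levels(votes$turnout)                   # check which level comes first
fit <- glm(turnout ~ age + income, family = binomial, data = votes)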
From R's help page on glm:
"...or as a two-column matrix with the columns giving the numbers of successes and failures"
So it has to be cbind(Y, N-Y).