GLM function for Logistic Regression: what is the default predicted outcome?

I am relatively new to R modelling and came across the glm function. I am interested in logistic regression using the family 'binomial'. My question: when my dependent variable can take one of two possible outcomes, say 'positive' and 'negative', for which outcome are the estimates computed? Does the model predict the log odds of a 'positive' or of a 'negative' outcome by default? Also, what is the default outcome considered for estimation when the dependent variable is
Yes or No
1 or 2
Pass or Fail
etc.?
Is there a rule by which R selects this default? Is there a way to override it manually? Please clarify.

It's in the details of ?binomial:
For the ‘binomial’ and ‘quasibinomial’ families the response can
be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not
having the first level (and hence usually of having the
second level).
(Added note: the first level is usually the alphabetically first value, since that is how R orders factor levels by default.)
As a numerical vector with values between ‘0’ and ‘1’,
interpreted as the proportion of successful cases (with the
total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the
number of successes and the second the number of failures.
So the probability predicted is the probability of "success", i.e. of the second level of the factor, or the probability of a 1 in the numeric case.
From your examples:
Yes or No: the default will be to treat "No" as the failure (because it comes first alphabetically), but you can use my_data$my_factor <- relevel(my_data$my_factor, "Yes") to make "Yes" the first level (so that "No" becomes the modelled 'success').
1 or 2: this will either fail or produce bogus results. Either make the variable into a factor ("1" will be treated as the first level, i.e. the failure) or subtract 1 to get a 0/1 variable (or use 2 - x if you want 2 to be treated as the failure).
Pass or Fail: see "Yes or No" ...
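To see this in action, here is a minimal sketch with made-up data (the variable names are hypothetical):
set.seed(42)
d <- data.frame(x = rnorm(100))
# "No"/"Yes" outcome; alphabetical ordering makes "No" the first level
d$resp <- factor(ifelse(runif(100) < plogis(d$x), "Yes", "No"))
levels(d$resp)                  # "No" "Yes": the model predicts P(Yes)
m1 <- glm(resp ~ x, data = d, family = binomial)
# flip the reference level so the model predicts P(No) instead
d$resp2 <- relevel(d$resp, ref = "Yes")
m2 <- glm(resp2 ~ x, data = d, family = binomial)
all.equal(coef(m1), -coef(m2), tolerance = 1e-6)  # TRUE: coefficients just flip sign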

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. In the observed data many balances are zero, but lm predicts nonzero values for them.
To overcome this I created a new variable that is zero when two conditions X and Y are true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so that it correctly returns zero when the proposition is true and the usual prediction when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I() means "as is", i.e. * is interpreted in the usual numerical sense, not in its formula-construction sense).
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value is zero; if Balzero == 1, the predicted value is beta1*Balance.
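Here is a minimal sketch with simulated data (the data are made up; y stands in for the response, as above):
set.seed(1)
d <- data.frame(Balance = runif(100, 0, 1000),
                Balzero = rbinom(100, 1, 0.5))
d$y <- 0.3 * d$Balzero * d$Balance + rnorm(100)
fit <- lm(y ~ -1 + Balzero:Balance, data = d)
coef(fit)                               # single slope, close to 0.3
all(predict(fit)[d$Balzero == 0] == 0)  # TRUE: exactly zero when Balzero == 0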
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

Negative Binomial model offset seems to be creating a 2 level factor

I am trying to fit a negative binomial model and run a pairwise comparison using emmeans. The data have two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data, which produces the expected result of offset() using random sample sizes of 15, 17, or 20, and bad.data, which uses a sample size of either 15 or 20 and seems to turn the sample size into a two-level factor. The bad.data pairwise comparison produces far more comparisons than the good.data one, even though they should produce the same number.
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)

# make data that works
data.frame(site = c(rep("A", 24), rep("B", 24), rep("C", 24),
                    rep("D", 24), rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 17, 20), 24 * 5, TRUE),  # more than 2 sample sizes...
           bad = sample(1:7, 24 * 5, TRUE,
                        c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> good.data

# make data that doesn't work
data.frame(site = c(rep("A", 24), rep("B", 24), rep("C", 24),
                    rep("D", 24), rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 20), 24 * 5, TRUE),  # only 2 sample sizes...
           bad = sample(1:7, 24 * 5, TRUE,
                        c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> bad.data

# fit models
good.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.good

bad.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.bad

# pairwise comparisons
emmeans::emmeans(mod.good,
                 pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>%
  as.data.frame()
emmeans::emmeans(mod.bad,
                 pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>%
  as.data.frame()
First, I think you should look up how to use emmeans. The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. The first thing emmeans does is set up a reference grid consisting of all combinations of:
the levels of each factor
the average of each numeric predictor; except that if a numeric predictor has just two different values, both of its values are included.
It is that exception you have run up against. Since num_sample has just two values, 15 and 20, both values are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. It has nothing to do with the offsets you specify in emmeans-related functions; it has to do with the fact that num_sample is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
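A quick way to see this (a sketch, assuming mod.bad from the question) is to inspect the reference grid directly:
emmeans::ref_grid(mod.bad)                # num_sample kept at 15 and 20
emmeans::ref_grid(mod.bad, cov.keep = 1)  # num_sample averaged instead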
To be honest, I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24, 2); the 'incorrect' number, 1128, is choose(48, 2)), but I would say you should probably follow the guidance in the "offsets" section of one of the emmeans vignettes, where it says:
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice is to set the offset to 0 (i.e., make predictions etc. for a sample size of 1), but in this case I don't think it matters.
get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
  emmeans::emmeans(m, pairwise ~ trt_time:pre_trt:storage_time,
                   offset = 0) |>
    get_contr()
}
nrow(cfun(mod.good))  ## 276
nrow(cfun(mod.bad))   ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

Are predicted probabilities from glm models probabilities of 0 or 1?

My response variable, status has two values, 1 for alive, 0 for dead.
I have built a model like this: model <- glm(status ~ ., train_data, family = 'binomial'). I use predict(model, test_data, type = 'response'), which gives me a vector of predicted probabilities, like this:
0.02 0.04 0.1
Are these probabilities of someone being alive (i.e. status == 1) or someone being dead (i.e. status == 0)?
I'm pretty sure these are probabilities of someone being alive, but is this always the case? Is there a way to specify this directly in the predict() function?
From ?binomial:
For the ‘binomial’ and ‘quasibinomial’ families the response can
be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not
having the first level (and hence usually of having the
second level).
As a numerical vector with values between ‘0’ and ‘1’,
interpreted as the proportion of successful cases (with the
total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the
number of successes and the second the number of failures.
If status is numeric with values of 0 or 1, the "total number of cases" is assumed to be 1; that is, each observation represents the failure (0) or success (1) of a single individual. The predicted probability is always the "probability of 1": 0 always means "failure" and 1 always means "success".
There is no way to change this in predict(), as far as I know: if you want to flip the probabilities you need to use 1 - status rather than status as your response variable.
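A minimal sketch with simulated data confirming the direction, and the response-flipping trick:
set.seed(123)
d <- data.frame(x = rnorm(200))
d$status <- rbinom(200, 1, plogis(d$x))  # 1 = alive, 0 = dead
m <- glm(status ~ x, data = d, family = binomial)
head(predict(m, type = "response"))      # P(status == 1), i.e. P(alive)
# to get P(dead) instead, flip the response variable:
m2 <- glm(I(1 - status) ~ x, data = d, family = binomial)
head(predict(m2, type = "response"))     # P(status == 0)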

Generalized linear model vs Generalized additive model

I'm trying to follow this paper: Using a data science approach to predict cocaine use frequency from depressive symptoms, where they use glm and gam with the Beck Depression Inventory. I found a similar dataset to test those models, but I'm having a hard time with both. For example, I have two variables, d64a and d64b, coded 1, 2, 3, 4, meaning they are ordinal. Also, in the paper y2 only takes the value 1, but I also have an extra variable (which could serve as the dependent variable, the proportion of consumption).
For the GAM model I have:
b<-gam(y2~s(d64a)+s(d64b),data=DATOS2)
but I have the following error:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
A term has fewer unique covariate combinations than specified maximum degrees of freedom
Meanwhile, for the glm I have:
d <- glm(y2 ~ d64a + d64b, data = DATOS2)
Since d64a and d64b are ordinal, do I need to use factor()?
The error message tells you that one or both of d64a and d64b have fewer than nine unique values.
By default s(...) will create a basis with nine functions, and you get this error if there are fewer than nine unique values in the covariate.
Check which covariates are affected using:
length(unique(DATOS2$d64a))
length(unique(DATOS2$d64b))
and see how many unique values each covariate you wish to include has. Then set the k argument to the number returned above if it is less than nine. For example, if the checks above return 5 and 7 unique values, you would indicate this by setting k as follows:
b <- gam(y2 ~ s(d64a, k = 5) + s(d64b, k = 7), data = DATOS2)
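For instance, a sketch with made-up data (names follow the question) that reproduces and then avoids the error:
library(mgcv)
set.seed(1)
DATOS2 <- data.frame(d64a = sample(1:4, 200, TRUE),
                     d64b = sample(1:4, 200, TRUE))
DATOS2$y2 <- rbinom(200, 1, 0.5)
# gam(y2 ~ s(d64a) + s(d64b), data = DATOS2)  # errors: only 4 unique values
b <- gam(y2 ~ s(d64a, k = 4) + s(d64b, k = 4), data = DATOS2)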

lm()$assign: what is it?

What is the assign attribute of a linear model fit? It's supposed to somehow provide the position of the response term, but in practice it seems to enumerate all the coefficients in the model. My understanding is that assign is a carryover from S and is not supported by glm(). I need to extract the equivalent information for a glm, but I don't understand what the implementation does for lm and can't seem to find the source code either. The help file for lm.fit says, unhelpfully:
non-null fits will have components assign, effects and (unless not requested) qr relating to the linear fit, for use by extractor functions such as summary and effects
You find this in help("model.matrix"), which creates these values:
There is an attribute "assign", an integer vector with an entry for
each column in the matrix giving the term in the formula which gave
rise to the column. Value 0 corresponds to the intercept (if any), and
positive values to terms in the order given by the term.labels
attribute of the terms structure corresponding to object.
So, it maps the columns of the design matrix back to the terms in the formula.
Each number in $assign identifies the formula term that the corresponding coefficient belongs to. If a predictor is categorical with 3 levels, you will see its number (3 - 1) = 2 times in $assign. Example:
data(mpg, package = "ggplot2")
m = lm(cty ~ hwy + class,data = mpg)
m$assign
[1] 0 1 2 2 2 2 2 2
# Note how there are six 2's representing the indicator variables
# for the various 'class' levels. (class has 7 levels)
Quantitative predictors get a single entry each (hwy in the example above), since each is represented by one column in the design matrix.
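For a glm fit, which does not store $assign as a component, one way to recover the same mapping (a sketch, reusing the mpg data from above) is to read the "assign" attribute off the model matrix:
g <- glm(cty ~ hwy + class, data = mpg, family = gaussian)
attr(model.matrix(g), "assign")
## same mapping as above: 0 1 2 2 2 2 2 2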
