Are predicted probabilities from glm models probabilities of 0 or 1?

My response variable, status, has two values: 1 for alive, 0 for dead.
I have built a model like this: model <- glm(status ~ ., train_data, family = 'binomial'). I then use predict(model, test_data, type = 'response'), which gives me a vector of predicted probabilities, like this:
0.02 0.04 0.1
Are these probabilities of someone being alive (i.e. status == 1) or someone being dead (i.e. status == 0)?
I'm pretty sure these are the probabilities of someone being alive, but is this always the case? Is there a way to specify this directly in the predict() function?

From ?binomial:
For the ‘binomial’ and ‘quasibinomial’ families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between ‘0’ and ‘1’, interpreted as the proportion of successful cases (with the total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
If status is numeric with values of 0 or 1, the "total number of cases" is assumed to be 1 (i.e., each observation is the failure (0) or success (1) of a single individual). (The probability is always "probability of 1", i.e. 0 always means "failure" and 1 always means "success".)
There is no way to change this in predict(), as far as I know: if you wanted to flip the probabilities you would need to use 1-status rather than status as your response variable.
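A minimal sketch with simulated data (the predictor x is made up; status follows the question) illustrating that predict(..., type = 'response') returns the probability of status == 1, and that flipping the response flips the probabilities:
set.seed(1)
d <- data.frame(x = rnorm(200))
d$status <- rbinom(200, size = 1, prob = plogis(2 * d$x))  # 1 = alive, 0 = dead
m_alive <- glm(status ~ x, data = d, family = binomial)
m_dead  <- glm(I(1 - status) ~ x, data = d, family = binomial)
p_alive <- predict(m_alive, type = "response")   # P(status == 1), i.e. alive
p_dead  <- predict(m_dead,  type = "response")   # P(status == 0), i.e. dead
max(abs(p_alive - (1 - p_dead)))                 # essentially zero: the two are complementary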

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. In the observed data, many balances are zero, but the lm still predicts nonzero values for them.
To overcome this I created a new variable that is zero when two conditions are true:
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
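A minimal sketch with simulated data (x is a stand-in predictor, not from the original question; Balzero is constructed as in the question) showing that the -1 + Balzero:x specification forces predictions to zero whenever Balzero == 0:
set.seed(1)
d <- data.frame(x = runif(100, 0, 10))
d$Balzero <- rbinom(100, 1, 0.7)
d$y <- d$Balzero * (3 * d$x) + rnorm(100)
fit <- lm(y ~ -1 + Balzero:x, data = d)
coef(fit)                                                # single slope, roughly 3
predict(fit, newdata = data.frame(Balzero = 0, x = 5))   # exactly 0
predict(fit, newdata = data.frame(Balzero = 1, x = 5))   # roughly 3 * 5 = 15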
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

How to know probability output by a model corresponds to which class?

I am studying the classification chapter of An Introduction to Statistical Learning with Applications in R. In Section 4.7.3, Linear Discriminant Analysis (lab), the model is applied to a dataset named Smarket to predict whether the stock market goes up or down. The total numbers of down and up predictions are obtained with the lines sum(lda.pred$posterior[, 1] >= .5) and sum(lda.pred$posterior[, 1] < .5), and the writers note that
Notice that the posterior probability output by the model corresponds to
the probability that the market will decrease
and then, to verify this, these lines of code were written:
lda.pred$posterior [1:20 , 1]
which gives
the posterior probabilities of the first 20 observations
and
lda.class [1:20]
which gives the classes corresponding to the probabilities above.
Also, I wrote this line of code (thanks to the ISLR online course):
data.frame(lda.pred)[1:20, ]
which gives the classes and the corresponding probabilities. Here it can be seen that observations with probabilities < 0.5 are classified as the down class and observations with probabilities >= 0.5 are classified as the up class.
This is all a bit confusing to me. My question is: in the first case, how do we know that when the probability is greater than or equal to 0.5 the prediction is down? Using the contrasts() function, it can be seen that R has created a dummy variable with a 1 for Up, which means that the values correspond to the probability of the market going up, rather than down. And in the second case, why are observations with probabilities >= 0.5 classified as up? Don't the first and second cases contradict each other?
You are predicting both: the posterior probability of "Down" is 1 minus the posterior probability of "Up". It just so happens that the class "Down" is stored in the first column of lda.pred$posterior (lda.pred$posterior[1:20, 1]), while the probabilities of "Up" are stored in the second column (lda.pred$posterior[1:20, 2]).
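A quick way to check which class each column refers to is to look at the column names of the posterior matrix: they are the class labels themselves. A minimal sketch following the Smarket lab (variable names as in the book):
library(MASS)
library(ISLR)    # provides the Smarket data used in the lab
train    <- Smarket$Year < 2005
lda.fit  <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket[!train, ])
colnames(lda.pred$posterior)   # "Down" "Up": column 1 is P(Down), column 2 is P(Up)
head(lda.pred$posterior)       # each row sums to 1
head(lda.pred$class)           # the predicted class is whichever column has the larger probability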

Generalized linear model vs Generalized additive model

I'm trying to follow this paper: Using a data science approach to predict cocaine use frequency from depressive symptoms, where they use glm and gam with the Beck Depression Inventory. I found a similar dataset to test those models, but I'm having a hard time with both of them. For example, I have two variables, d64a and d64b, which are coded 1, 2, 3, 4, meaning they are ordinal. Also, in the paper y2 only takes the value 1, but I also have an extra variable (which could be used as the dependent variable: the proportion of consumption).
For the GAM model I have:
b<-gam(y2~s(d64a)+s(d64b),data=DATOS2)
but I have the following error:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
A term has fewer unique covariate combinations than specified maximum degrees of freedom
Meanwhile for the glm, I have the following:
d<-glm(y2~d64a+d64b,data=DATOS2)
Also, since d64a and d64b are ordinal, I don't know whether I have to use factor().
The error message tells you that one or both of d64a and d64b do not have 9 (nine) unique values.
By default s(...) will create a basis with nine functions. You get this error if there are fewer than nine unique values in the covariate.
Check which covariates are affected using:
length(unique(d64a))
length(unique(d64b))
and see what the number of unique values is for each of the covariates you wish to include. Then set the k argument to the number returned above if it is less than nine. For example, assume the checks above returned 5 and 7 unique values; you would then indicate this by setting k as follows:
b <- gam(y2 ~ s(d64a, k = 5) + s(d64b, k = 7), data = DATOS2)
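If you have several such covariates, a quick way to check them all at once (assuming they live in DATOS2, as in your code) is:
sapply(DATOS2[c("d64a", "d64b")], function(x) length(unique(x)))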

How does SMOTE create new data from categorical data?

I have used SMOTE in R to create new data, and this worked fine. When I did further research on how exactly SMOTE works, I couldn't find an answer to how SMOTE handles categorical data.
In the paper, an example is shown (page 10) with just numeric values. But I still do not know how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That is indeed an important thing to be aware of. In the paper you are referring to, Sections 6.1 and 6.2 describe possible procedures for the cases of mixed nominal-continuous and purely nominal variables. However, DMwR does not use anything like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll now briefly explain the procedure.
The summary is that the order of factor levels matters and that currently there seems to be a bug regarding factor variables which makes things work oppositely. That is, if we want to find an observation close to one with a factor level "A", then anything other than "A" is treated as "close" and those with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case of perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the undersampled class (say, 50 rows) and proceed as follows.
1. Matrix T contains all but the class variable. Columns corresponding to the continuous variables remain unchanged, while factors or characters are coerced into integers. This means that the order of factor levels is essential.
2. Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 present ones (i = 1, ..., 50).
3. We scale the data by xd <- scale(T, T[i, ], ranges) so that xd shows deviations from the i-th observation. E.g., for i = 1 we may have
# [,1] [,2]
# [1,] 0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for i = 2, 3 is smaller than for i = 1, but that the factor levels of i = 2, 3 are "higher".
4. Then by running for (a in nomatr) xd[, a] <- xd[, a] == 0 we ignore most of the information in the second column related to factor-level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
5. Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It is purposefully 2:(k + 1), as the first element should be i itself (its distance should be zero). However, due to point 4 the first element is not always i in this case, which confirms the bug.
6. Now we create the n-th new observation, similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
Meaning that the neighbour has lower values in terms of both variables.
The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination of the i-th observation and the neighbour. This line applies to the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], which means that the new observation will have the same factor level as the i-th observation with 50% probability, and the same as the chosen neighbour with the other 50%. So this is a kind of discrete interpolation.
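A minimal, self-contained sketch (not the actual DMwR source; T_mat, nomatr and ranges are stand-ins for the objects described above) that reproduces steps 3-6 on a toy matrix with one continuous column and one integer-coded factor column, and makes the distance bug from point 4 visible:
T_mat  <- cbind(cont = c(1.0, 1.2, 1.5, 3.0),
                fac  = c(2L,   2L,  1L,  2L))   # integer-coded factor levels
nomatr <- 2                                     # index of the "nominal" column
ranges <- apply(T_mat, 2, function(x) diff(range(x)))
i  <- 1
xd <- scale(T_mat, T_mat[i, ], ranges)          # step 3: deviations from observation i
for (a in nomatr) xd[, a] <- xd[, a] == 0       # step 4: same level -> 1, different -> 0
dd <- drop(xd^2 %*% rep(1, ncol(xd)))           # step 5: squared "distances" from observation i
order(dd)                                       # observation 3 (different level!) ranks
                                                # closer to i than observation 2 (same level)
k    <- 2
kNNs <- order(dd)[2:(k + 1)]
neig <- sample(1:k, 1)
difs <- T_mat[kNNs[neig], ] - T_mat[i, ]
new_obs <- T_mat[i, ] + runif(1) * difs         # step 6: interpolation for the continuous column
for (a in nomatr)                               # factor column: i's level or the neighbour's,
  new_obs[a] <- c(T_mat[kNNs[neig], a],         # each with probability 0.5
                  T_mat[i, a])[1 + round(runif(1), 0)]
new_obs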

GLM function for Logistic Regression: what is the default predicted outcome?

I am relatively new to modelling in R and I came across the glm() function. I am interested in logistic regression using the family 'binomial'. My question is: when my dependent variable can take one of two possible outcomes, say 'positive' and 'negative', which outcome are the estimates computed for by default? Does the model predict the log odds of a 'positive' or of a 'negative' outcome? Also, what is the default outcome used for estimation when the dependent variable is
Yes or No
1 or 2
Pass or Fail
etc. ?
Is there a rule by which R selects this default? Is there a way to override it manually? Please clarify.
It's in the details of ?binomial:
For the ‘binomial’ and ‘quasibinomial’ families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level). (Added note: the first level is usually the first level alphabetically, since this is how R orders factor levels by default.)
As a numerical vector with values between ‘0’ and ‘1’, interpreted as the proportion of successful cases (with the total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
So the probability predicted is the probability of "success", i.e. of the second level of the factor, or the probability of a 1 in the numeric case.
From your examples:
Yes or No: the default will be to treat "No" as a failure (because of alphabetical ordering), but you can use my_data$my_factor <- relevel(my_data$my_factor, "Yes") to make "Yes" the first level (i.e. the "failure").
1 or 2: this will either fail or produce bogus results. Either make the variable into a factor ("1" will be treated as the first level) or subtract 1 to get a 0/1 variable (or use 2 - x if you want 2 to be treated as the failure).
Pass or Fail: see "Yes or No" ...
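A minimal sketch with a made-up Yes/No outcome (variable names are hypothetical) showing how to check, and change, which level is treated as the "failure":
set.seed(1)
d <- data.frame(x = rnorm(100))
d$resp <- factor(ifelse(runif(100) < plogis(d$x), "Yes", "No"))
levels(d$resp)                      # "No" "Yes": "No" is the first level (failure),
                                    # so glm models the probability of "Yes"
m1 <- glm(resp ~ x, data = d, family = binomial)
d$resp2 <- relevel(d$resp, "Yes")   # make "Yes" the first level,
m2 <- glm(resp2 ~ x, data = d, family = binomial)   # so now "No" is the modelled event
coef(m1)
coef(m2)                            # approximately the same magnitudes, opposite signs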
