Why is predict.glmnet not predicting probabilities? - r

I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:
set.seed(1)
OBP <- rnorm(50, mean=1, sd=.2)
HR.PCT <- rnorm(50, mean=1, sd=.2)
AGE <- rnorm(50, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
df.train <- data.frame(OBP, HR.PCT, AGE, CONF)
df.train <- df.train[order(-OBP),]
df.train$MADE.MAJORS <- 0
df.train$MADE.MAJORS[1:10] <- 1
OBP <- rnorm(10, mean=1, sd=.2)
HR.PCT <- rnorm(10, mean=1, sd=.2)
AGE <- rnorm(10, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))
df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)
I then used glmnet to perform the lasso with logistic regression and generate predictions. I want the predictions to be in the form of probabilities (that is, between 0 and 1).
library(glmnet)
train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)
test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")
cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")
Here are the predictions:
> preds
            1
1  -3.2589440
2  -0.4435265
3   3.9646670
4   0.3772816
5   0.9952887
6  -7.3555661
7   0.2283675
8  -2.3871317
9  -8.1632749
10 -1.3563051
> cv.preds
            1
1   0.1568839
2   0.3630938
3   0.7435941
4   0.4808428
5   0.5261076
6  -0.1431655
7   0.4123054
8   0.2207381
9  -0.1446941
10  0.2962391
I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.
Why are the predictions from predict.glmnet (the preds vector) not in the form of probabilities? I put the preds values through the inverse logit function and got reasonable probabilities. Was that correct?
The predictions from predict.cv.glmnet (the cv.preds vector) mostly look like probabilities, but some of them are negative. Why is this?
When I use the glmnet function to create the glmmod object, I include the family="binomial" argument to indicate that I'm using logistic regression. However, when I use the cv.glmnet function to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?
Similarly, when I use the predict.cv.glmnet function, I'm not able to specify logistic regression. Does this function produce the predictions that I want?

I am not 100% sure about the following, because the package does seem to operate counter to its documentation, as you've noticed, but it may give some indication of whether your thinking is on the right path.
Question 1
Yes, you're right. Note that,
> predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="link")
            1
1  -3.2589440
2  -0.4435265
3   3.9646670
4   0.3772816
5   0.9952887
6  -7.3555661
7   0.2283675
8  -2.3871317
9  -8.1632749
10 -1.3563051
which is the same output as with type="response". Thus, putting the values through the inverse logit function is the right way to get the probabilities. As to why this happens, I have no clue; perhaps a bug. (One plausible explanation: for family="binomial", the fitted object has class c("lognet", "glmnet"), and calling predict.glmnet directly bypasses S3 dispatch to the binomial-specific method, so type="response" is never applied. Calling the generic predict() instead should respect it.)
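For instance, a quick check (not from the original answer; plogis() is base R's inverse logit):
probs <- plogis(preds) # same as exp(preds)/(1 + exp(preds))
Alternatively, calling the generic predict() rather than predict.glmnet() lets S3 dispatch reach the binomial-specific method, which should honour type="response" directly:
probs2 <- predict(glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")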
Questions 2-4
For the cv.preds, you're getting something along the lines of probabilities because cv.glmnet defaults to the Gaussian family, so you fitted a linear (identity-link) model. In order to fit a logit link, you should specify the family parameter, namely:
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1, family="binomial")
> cv.preds
            1
1  -10.873290
2    1.299113
3   15.812671
4    3.622259
5    5.621857
6  -24.826551
7    1.734000
8   -5.420878
9  -26.160403
10  -4.496020
In this case, cv.preds will output along the real line and you can put those values through the inverse logit to get the probabilities.
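For example, with the refitted binomial cv.glmmod above, asking for the response scale directly:
cv.preds.prob <- predict(cv.glmmod, newx=test.mtx, s="lambda.min", type="response")
should return probabilities between 0 and 1, and should match plogis(cv.preds) applied to the link-scale values shown above.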

Related

Implement null distribution for gbm interaction strength

I am trying to determine which interactions in a gbm model are significant using the method described in Friedman and Popescu 2008 https://projecteuclid.org/euclid.aoas/1223908046. My gbm is a classification model with 9 different classes.
I'm struggling with how to translate Section 8.3 into code to run in R.
I think the overall process is to:
1. Train a version of the model with max.depth = 1
2. Simulate response data from this model
3. Train a new model on this data with max.depth the same as the real model
4. Get interaction strength for this model
5. Repeat steps 1-4 to create a null distribution of interaction strengths
The part that I find most confusing is implementing equations 48 and 49 (you will have to look at the linked article, since I can't reproduce them here).
This is what I think I understand but please correct me if I'm wrong:
y_i is a new vector of the response that we will use to train a new model which will provide the null distribution of interaction statistics.
F_A(x_i) is the prediction from a version of the gbm model trained with max.depth = 1
b_i is a probability between 0 and 1 based on the prediction from the additive model F_A(x_i)
Questions
What is subscript i? Is it the number of iterations in the bootstrap?
How is each artificial data set different from the others?
Are we subbing the Pr(b_i = 1) into equation 48?
How can this be done with multinomial classification?
How would one implement this in R? Preferably using the gbm package.
Any ideas or references are welcome!
Overall, the process is an elegant way to neutralise the interaction effects in y by permuting/redistributing the extra contribution that the interactions make to the model. That extra contribution can be captured by the margins between a full and an additive model.
What is subscript i? Is it the number of iterations in the bootstrap?
It is the index of samples. There are N samples in each iteration.
How is each artificial data set different from the others?
The predictors X are the same across data sets. The response values Y~ are different due to random permutation of margins in equation 47 and random realisation (for categorical outcomes only) in equation 48.
Are we subbing the Pr(b_i = 1) into equation 48?
Yes, if the outcome Y is binary.
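(In R terms, that binary realisation would be a one-liner along these lines, where F_additive_pred and margin are hypothetical names for the additive model's log-odds and the permuted margins:
Y_art[i] <- rbinom(1, 1, prob=plogis(F_additive_pred[i] + margin[i]))
i.e. draw a Bernoulli outcome with the probability implied by equation 48.)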
How can this be done with multinomial classification?
One way is to randomly permute the margins in the log-odds of each category, followed by a random realisation according to the probabilities from the additive model.
How would one implement this in R? Preferably using the gbm package.
I tried to implement it in line with your overall process.
Firstly, simulate a training data set {X1, X2, Y} of size N=200, where Y has three categories (Y1, Y2, Y3) realised from probabilities determined by X1 and X2. The interaction part X1*X2 enters Y1, while the additive parts enter Y2 and Y3.
set.seed(1)
N <- 200
X1 <- rnorm(N) # 2 predictors
X2 <- rnorm(N)
#log-odds for 3 categories
Y1 <- 2*X1*X2 + rnorm(N, sd=1/10) # interaction
Y2 <- X1^2 + rnorm(N, sd=1/10) #additive
Y3 <- X2^2 + rnorm(N, sd=1/10) #additive
Y <- rep(NA, N) # Multinomial outcome with 3 categories
for (i in 1:N)
{
  prob <- 1 / (1 + exp(-c(Y1[i],Y2[i],Y3[i]))) # inverse logit of each log-odds
  Y[i] <- which.max(rmultinom(1, 10000, prob=prob)) # realisation from prob
}
Y <- factor(Y)
levels(Y) <- c('Y1','Y2','Y3')
table(Y)
#Y1 Y2 Y3
#38 75 87
dat = data.frame(Y, X1, X2)
head(dat)
#    Y         X1         X2
# 1 Y2 -0.6264538  0.4094018
# 2 Y3  0.1836433  1.6888733
# 3 Y3 -0.8356286  1.5865884
# 4 Y2  1.5952808 -0.3309078
# 5 Y3  0.3295078 -2.2852355
# 6 Y3 -0.8204684  2.4976616
Train a full model and an additive model, with max.depth = 2 and 1 respectively.
library(gbm)
n.trees <- 100
F_full <- gbm(Y ~ ., data=dat, distribution='multinomial', n.trees=n.trees, cv.folds=3,
              interaction.depth=2) # consider interactions
F_additive <- gbm(Y ~ ., data=dat, distribution='multinomial', n.trees=n.trees, cv.folds=3,
                  interaction.depth=1) # ignore interactions
# use improved prediction as interaction strength
interaction_strength_original <- min(F_additive$cv.error) - min(F_full$cv.error)
interaction_strength_original
# 0.1937891
Simulate response data from this model.
#randomly permute margins (residuals) of log-odds to remove any interaction effects
margin <- predict(F_full, n.trees=gbm.perf(F_full, plot.it=FALSE), type='link')[,,1] -
          predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link')[,,1]
margin <- apply(margin, 2, sample) #independent permutation for each category (Y1, Y2, Y3)
Y_art <- rep(NA, N) #response values of an artificial dataset
for (i in 1:N)
{
  prob <- predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link',
                  newdata=dat[i,])
  prob <- prob + margin[i,] # equation (47)
  prob <- 1 / (1 + exp(-prob))
  Y_art[i] <- which.max(rmultinom(1, 1000, prob=prob)) # similar to random realisation in equation (49)
}
Y_art <- factor(Y_art)
levels(Y_art) = c('Y1','Y2','Y3')
table(Y_art)
#Y1 Y2 Y3
#21 88 91
Train a new model on this artificial data with max.depth (2) the same as for the real model.
F_full_art <- gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
                  data=data.frame(Y_art, X1, X2), interaction.depth=2)
F_additive_art <- gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
                      data=data.frame(Y_art, X1, X2), interaction.depth=1)
Get interaction strength for this model
interaction_strength_art <- min(F_additive_art$cv.error) - min(F_full_art$cv.error)
interaction_strength_art
# 0.01323959 (much smaller than interaction_strength_original in step 1)
Repeat steps 2-4 to create a null distribution of interaction strengths. As expected, the interaction effects are much lower (-0.0527 to 0.0421) in the neutralised data sets than in the original training data set (0.1938).
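The loop below assumes the code from steps 2-4 has been wrapped into a helper function; step_2_to_4() is not defined in the original, so here is a minimal sketch that simply collects the code from those steps:
step_2_to_4 <- function()
{
  # step 2: permute margins of the log-odds and simulate a neutralised response
  margin <- predict(F_full, n.trees=gbm.perf(F_full, plot.it=FALSE), type='link')[,,1] -
            predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link')[,,1]
  margin <- apply(margin, 2, sample)
  additive_logodds <- predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link')[,,1]
  Y_art <- rep(NA, N)
  for (i in 1:N)
  {
    prob <- 1 / (1 + exp(-(additive_logodds[i,] + margin[i,])))
    Y_art[i] <- which.max(rmultinom(1, 1000, prob=prob))
  }
  Y_art <- factor(Y_art, levels=1:3, labels=c('Y1','Y2','Y3'))
  # steps 3-4: refit both models on the artificial data and score the improvement
  F_full_art <- gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
                    data=data.frame(Y_art, X1, X2), interaction.depth=2)
  F_additive_art <- gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
                        data=data.frame(Y_art, X1, X2), interaction.depth=1)
  min(F_additive_art$cv.error) - min(F_full_art$cv.error)
}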
interaction_strength_art <- NULL
for (j in 1:10)
{
  print(j)
  interaction_strength_art <- c(interaction_strength_art, step_2_to_4())
}
summary(interaction_strength_art)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# -0.052648 -0.019415  0.001124 -0.004310  0.012759  0.042058
interaction_strength_original
# 0.1937891

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
a <- rnorm() is measured at the subject level (e.g. nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm() is measured at the observation level and correlated at a given r with a
the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
the effect size of the fixed effects can be set at a predefined level (e.g. dCohen=0.5)
I have played with various packages like powerlmm, simstudy, and simr but have still failed to find a working solution that accommodates the number of parameters I'd like to define beforehand.
Also, for my own learning purposes, I'd prefer a base R method to a package solution.
The closest example I found is a blog post by Ben Ogorek, "Hierarchical linear models and lmer", which looks great, but I can't figure out how to control for the parameters listed above.
Any help would be appreciated.
Also, if there is a package I don't know of that can do this type of simulation, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors of different lengths? I'm not sure, so I'll sample 350 values (nObs*nSubjects) and throw away most of them for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma), e.g. if sigma=2, theta=2, then the residual std dev is 2 and the among-subject std dev is 4
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- c(betavec[1], betavec[-1]*sdvec) ## scale slopes by sd; keep the intercept so beta matches the three fixed-effect columns
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
                    mu=c(0,0),
                    Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~a+b+(1|Subject),
                 newdata=dd,
                 newparams=list(beta=beta_sc, theta=theta, sigma=1),
                 family=gaussian)[[1]]
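As a sanity check (not part of the original answer), one can refit the model and compare the recovered estimates to the inputs; the estimates will vary randomly around the targets:
fit <- lmer(y ~ a + b + (1|Subject), data=dd)
summary(fit) # fixed effects should be near beta_sc
VarCorr(fit) # among-subject sd should be near theta*sigma, residual sd near sigma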

Normal distribution has higher probability for multimodality than unimodality

I am trying to analyze a mixture model using the mixtools package; in other words, I would like to analyze whether my data follow a uni-, bi- or multimodal distribution.
For simplicity here an example:
library(mixtools)
# creating an artificial normal distribution
mydata <- rnorm(1000, 1750, 60)
#defining the cuts and preparing it for calculations
cutp <- seq(1600, 2300, by=25)
mult <- makemultdata(mydata, cuts = cutp)
comp <- multmixmodel.sel(mult, comps = 1:3, epsilon = 0.01)
#plotting the data (in this case 2 subpopulations)
mixmdl = normalmixEM(mydata, k=2, maxit=50000)
plot(mixmdl,which=2)
lines(density(mydata), lty=2, lwd=2)
Now as a result for 'comp', I get:
           1         2          3 Winner
AIC     -Inf -94.04097 -124.04097      2
BIC     -Inf -35.04097  -35.04097      2
CAIC    -Inf -64.54097  -79.54097      2
ICL     -Inf -35.04097  -35.04097      2
Loglik  -Inf -35.04097  -35.04097      2
In my very limited understanding of this kind of analysis, I expected to see 1 as the 'winner' (since I produced a single normal distribution).
However, as you can see, I get infinite values for 1, and identical values of BIC, ICL and Loglik for 2 and 3. This speaks against a single normal distribution and suggests an equal or higher probability that the data are bi- or multimodal. Since I used a normal distribution to start with, I would expect the highest probability for 1 and at least some differences between 2 and 3. What confuses me most are the identical values for 2 and 3 in some of the tests.
So my question is: why does my approach fail to recognize the distribution as Gaussian and instead classify it as bi-/multimodal?
I do not know a lot about the mixtools package, but I gave what you did a try, and I did not come to the same conclusion.
When I fit a two-component multinomial mixture model (which is what you are doing with multmixmodel.sel), the second component is essentially non-existent; its posterior probability is almost zero.
set.seed(1)
mydata <- rnorm(1000, 1750, 60)
cutp <- seq(min(mydata), max(mydata), by=25) # cuts now span the full data range (the original cutp started at 1600, above the sample minimum)
mult <- makemultdata(mydata, cuts = cutp)
multmod2 <- multmixEM(mult, k=2)
multmod2$posterior
#      comp.1        comp.2
# [1,]      1 1.980052e-226
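One can also inspect the estimated mixing proportions, which multmixEM returns as lambda; they should put essentially all weight on a single component:
multmod2$lambda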
When I fit mixture models to the original data, the single component is selected each time.
library(mclust)
fit <- Mclust(mydata)
fit
#'Mclust' model object:
# best model: univariate normal (X) with 1 components
library(EMMIX)
# Available from
#https://people.smp.uq.edu.au/GeoffMcLachlan/mix_soft/EMMIX_R/
fit_1 <- EMMIX(mydata, g=1)
fit_2 <- EMMIX(mydata, g=2)
c(fit_1$bic, fit_2$bic)
# [1] 11108.02 11128.67
#(BIC selects the one component model)

Create a binary outcome with random forest

I have a dataset that looks like this:
TEAM1         TEAM2      EXPG1   EXPG2  Gewonnen
ADO Den Haag  Groningen  1.5950  1.2672         1
I now try to predict the column Gewonnen based on EXPG1 and EXPG2. Therefore I created a training set and a test set, and am fitting the following model (all using caret):
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
I can't make a confusion matrix now, because the predictions are continuous rather than 0/1 class labels. You can see this when I do:
pred <- predict(modFit, testing)
head(pred)
# [1] 0.5324000 0.7237333 0.2811333 0.8231000 0.8299333 0.9792000
Because I want to make a confusion matrix, I could turn these into 0/1 values myself, but I have the feeling that there should be an option to do this in the model as well.
Any thoughts on what I should change in this model to get 0/1 values? I couldn't find it in the documentation:
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
First of all, as Tim Biegeleisen says, you should convert your Gewonnen variable to a factor (in both training & test sets), if it is not already:
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen <- as.factor(testing$Gewonnen)
After that, the type option in the caret function predict determines what type of response you get for a binary classification problem, i.e. class labels or probabilities. Here is a reproducible example from the caret documentation using the Sonar dataset from the package mlbench:
library(caret)
library(mlbench)
data(Sonar)
str(Sonar$Class)
# Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
modFit <- train(Class ~ ., data=training, method="rf", prox=TRUE)
pred <- predict(modFit, testing, type="prob") # for class probabilities
head(pred)
#        M     R
# 5  0.442 0.558
# 10 0.276 0.724
# 11 0.096 0.904
# 12 0.360 0.640
# 20 0.654 0.346
# 21 0.522 0.478
pred2 <- predict(modFit, testing, type="raw") # for class labels
head(pred2)
# [1] R R R R M M
# Levels: M R
For the confusion matrix, you will need class labels (i.e. pred2 above):
confusionMatrix(pred2, testing$Class)
# Confusion Matrix and Statistics
#           Reference
# Prediction  M  R
#          M 25  6
#          R  2 18
This answer is a bit speculative, as you omitted some critical details about your data set and I have not worked extensively with the caret package. That being said, it appears that you are running random forests in regression mode, which means that you end up with a continuous prediction: the response value can be 0, 1, or anything in between. If your Gewonnen column only has values of 0 or 1, and you want the predicted values to behave the same way, then you can try turning Gewonnen into a categorical variable. As this article discusses, this should tell random forests to run in classification mode instead of regression mode.
training$Gewonnen <- as.factor(training$Gewonnen) # convert the column in your data frame (and likewise for testing)
This builds the random forest as you did before, and you should have the responses you want.
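If you would rather keep the regression-mode model, a simple workaround (not from the answers above) is to threshold the continuous predictions yourself; the 0.5 cutoff is an assumption you may want to tune:
pred_class <- factor(ifelse(pred > 0.5, 1, 0), levels=c(0, 1))
confusionMatrix(pred_class, factor(testing$Gewonnen, levels=c(0, 1)))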

Fitted values for multinom in R: Coefficients for Reference Category?

I'm using the function multinom from the nnet package to run a multinomial logistic regression.
In multinomial logistic regression, as I understand it, the coefficients are the changes in the log of the ratio of the probability of a response over the probability of the reference response (i.e., ln(P(i)/P(r))=B1+B2*X... where i is one response category, r is the reference category, and X is some predictor).
However, fitted(multinom(...)) produces estimates for each category, even the reference category r.
EDIT Example:
set.seed(1)
library(nnet)
DF <- data.frame(X = as.numeric(rnorm(30)),
                 Y = factor(sample(letters[1:5], 30, replace=TRUE)))
DF$Y <- relevel(DF$Y, ref="a") # ensure a is the reference category
model <- multinom(Y ~ X, data = DF)
coef(model)
#   (Intercept)           X
# b   0.1756835  0.55915795
# c  -0.2513414 -0.31274745
# d   0.1389806 -0.12257963
# e  -0.4034968  0.06814379
head(fitted(model))
#           a         b          c         d         e
# 1 0.2125982 0.2110692 0.18316042 0.2542913 0.1388810
# 2 0.2101165 0.1041655 0.26694618 0.2926508 0.1261210
# 3 0.2129182 0.2066711 0.18576567 0.2559369 0.1387081
# 4 0.1733332 0.4431170 0.08798363 0.1685015 0.1270647
# 5 0.2126573 0.2102819 0.18362323 0.2545859 0.1388516
# 6 0.1935449 0.3475526 0.11970164 0.2032974 0.1359035
head(DF)
#            X Y
# 1 -0.3271010 a
To calculate the predicted probability ratio between response b and response a for row 1, we calculate exp(0.1756835+0.55915795*(-0.3271010))=0.9928084. And I see that this corresponds to the fitted P(b)/P(a) for row 1 (0.2110692/0.2125982=0.9928084).
Is the fitted probability for the reference category calculated algebraically (e.g., 0.2110692/exp(0.1756835+0.55915795*(-0.3271010)))?
Is there a way to obtain the equation for the predicted probability of the reference category?
I had the same question, and after looking around I think the solution is:
Given 3 classes a, b, c and the fitted(model) probabilities pa, pb, pc output by the algorithm, you can reconstruct those probabilities from these 3 equations:
log(pb/pa) = beta1*X
log(pc/pa) = beta2*X
pa + pb + pc = 1
where beta1, beta2 are the rows of the output of coef(model), and X is your input data with a leading 1 for the intercept.
Playing with those equations you get to:
pb = exp(beta1*X)/(1+exp(beta1*X)+exp(beta2*X))
pc = exp(beta2*X)/(1+exp(beta1*X)+exp(beta2*X))
pa = 1 - pb - pc
The key here is that in the help file for multinom() it says that "A log-linear model is fitted, with coefficients zero for the first class."
So that means the predicted values for the reference class can be calculated directly assuming that the coefficients for class "a" are both zero. For example, for the sample row given above, we could calculate the predicted probability for class "a" using the softmax transform:
exp(0+0)/(exp(0+0) + exp(0.1756835 + 0.55915795*(-0.3271010)) + exp(-0.2513414 + (-0.31274745)*(-0.3271010)) + exp(0.1389806 + (-0.12257963)*(-0.3271010)) + exp(-0.4034968 + 0.06814379*(-0.3271010)))
or perhaps more simply, using non-hard-coded numbers, we can calculate the entire set of probabilities for the first row of data as:
softMax <- function(x){
  expx <- exp(x)
  return(expx/sum(expx))
}
coefs <- rbind(c(0,0), coef(model))
linear.predictor <- as.vector(coefs%*%c(1,-0.3271010))
softMax(linear.predictor)
FWIW: the example in the original question does not reproduce exactly for me; my seed gives different random deviates. So I have reproduced the example fresh, with my calculations below.
library(nnet)
set.seed(1)
DF <- data.frame(
  X = as.numeric(rnorm(30)),
  Y = factor(sample(letters[1:5], 30, replace=TRUE)))
DF$Y <- relevel(DF$Y, ref="a") # ensure a is the reference category
model <- multinom(Y ~ X, data = DF)
coef(model)
## (Intercept) X
## b -0.33646439 1.200191e-05
## c -0.36390688 -1.773889e-01
## d -0.45197598 1.049034e+00
## e -0.01418543 3.076309e-01
DF[1,]
## X Y
## 1 -0.6264538 c
fitted.values(model)[1,]
## a b c d e
## 0.27518921 0.19656378 0.21372240 0.09076844 0.22375617
coefs <- rbind(c(0,0), coef(model))
linear.predictor <- as.vector(coefs%*%c(1,DF[1,"X"]))
softMax(linear.predictor)
## [1] 0.27518921 0.19656378 0.21372240 0.09076844 0.22375617
