Have others seen a discontinuity in e1071::predict.svm() probabilities?

I have been using e1071::svm(...,probability=TRUE) in R to fit a binary SVM classifier and then using predict.svm() to get the probabilities for both the training sample and a test sample. When I converted the probabilities to log(odds) and plotted them against the decision.values, I found there was a discontinuity in the predictions:
Plot of log(odds) = log(prob/(1-prob)) vs. Decision Values
This is happening with other models as well: whenever the probability is below about 0.25%, there is consistently a gap from log(odds) = -5.98 to -10.86. Note that this does not occur at a fixed decision.value (which varies with the model). I believe it may also be happening at high probabilities (>99%).
The red and green lines are linear fits for the predictions with log(odds) < -8 and > -8, respectively. The coefficients of the latter agree with the probA and probB outputs returned with the svm object. I have seen other cases where the gap occurs from +5.98 to +10.86 (only).
Here is the example using the iris dataset:
require("datasets")
require("e1071")
iris$is.setosa <- as.numeric(iris$Species=="setosa")
set.seed(8675309)
fit <- svm(
is.setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iris,probability=T,cost=0.01,kernel="linear",type="C-classification")
preds <- predict(fit,prob=TRUE,newdata=iris,decision=T)
DVs <- attr(preds,"decision.values")[,1]
probs <- attr(preds,"probabilities")[,"1"]
logodds <- log(probs/(1-probs))
plot(DVs,logodds,xlab="decision.values",ylab="log(odds)",main="IRIS dataset")
cat("Coefficents of probability model reported by svm():\n")
print(fit[c("probA","probB")])
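## For reference: libsvm's Platt scaling gives p = 1/(1 + exp(probA*d + probB)),
## so the implied log(odds) is the straight line -probA*d - probB. Overlaying it
## on the plot above is a quick check (flip the signs if your factor ordering
## treats the other class as positive):
abline(a = -fit$probB, b = -fit$probA, col = "blue", lty = 2)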
fit <- lm(logodds ~ DVs,subset=which(logodds> -8))
cat("fit of logodds ~ DVs when log(odds) greater than -8:\n")
print(summary(fit))
abline(fit,col="green",lty=3)
fit <- lm(logodds ~ DVs,subset=which(logodds< -8))
abline(fit,col="red",lty=3)
Has anyone else seen this behavior? Any idea what might be causing it? Thanks!

Related

Plotting random effects with lmer and sjPlot

I am currently running a mixed effects model using lmer in which random slopes and correlated random intercepts are estimated. After fitting the model I would like to plot the results, allowing for random slopes and intercepts as well as one overall fixed line. Here is how I currently implement it:
library(lme4)
library(readr)   # for read_csv()
library(sjPlot)
df <- read_csv("anonymized_test.csv")
m1 <- lmer(DV ~ IV + (1 + IV | iso), data = df)
plot_model(m1, type = "pred",
           terms = c("IV", "iso"),
           pred.type = "re", ci.lvl = NA)
This is the result:
This is not what I expected: based on the extracted random effects of the model, there should be some negative and positive slopes in addition to the random intercepts.
The problem is that sjPlot seems to plot only the random intercepts. Looking at an older vignette of sjPlot, this functionality seems to have been implemented in a now-deprecated function (see here). The question is how do I get this functionality back? Thanks for any insight.
This is actually straightforward, even without the sjPlot package. We can extract the fixed and random effects with fixef() and ranef() as fe and re and combine them in a plot. Both fe and re have an intercept and a slope, which are added together.
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fe <- fixef(fm1)              # overall (fixed-effect) intercept and slope
re <- ranef(fm1)$Subject      # per-subject deviations from fe
clr <- rainbow(nrow(re))      # define n colors
par(mfrow=c(1, 2))
plot(Reaction ~ Days, sleepstudy, col=clr[as.numeric(Subject)], main='Pred w/ points')
lapply(seq_len(nrow(re)), \(x) abline(fe[1] + re[x, 1], fe[2] + re[x, 2], col=clr[x]))
plot(Reaction ~ Days, sleepstudy, col=clr[as.numeric(Subject)], main='Pred w/o points', type='n')
lapply(seq_len(nrow(re)), \(x) abline(fe[1] + re[x, 1], fe[2] + re[x, 2], col=clr[x]))
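A possibly simpler variant (a small sketch reusing fm1 and clr from above): coef() returns the per-subject coefficients with the fixed and random effects already added together, so the lines can be drawn directly from it.
cf <- coef(fm1)$Subject   # one row per Subject: (Intercept) and Days, fixef + ranef combined
plot(Reaction ~ Days, sleepstudy, type='n', main='Pred via coef()')
for (i in seq_len(nrow(cf))) abline(cf[i, 1], cf[i, 2], col=clr[i])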
However, I also get the random slopes when using sjPlot. I am not sure what went wrong on your end; perhaps you are using an outdated version?
sjPlot::plot_model(fm1, type="pred", terms=c("Days","Subject"), pred.type="re", ci.lvl=NA)
# Warning message:
# In RColorBrewer::brewer.pal(n, pal) :
# n too large, allowed maximum for palette Set1 is 9
# Returning the palette you asked for with that many colors

Categorical Regression with Centered Levels

R's standard way of doing regression on categorical variables is to select one factor level as a reference level and constrain the effect of that level to be zero. Instead of constraining a single level's effect to be zero, I'd like to constrain the sum of the coefficients to be zero.
I can hack together coefficient estimates for this manually after fitting the model the standard way:
x <- lm(data = mtcars, mpg ~ factor(cyl))
z <- c(coef(x), "factor(cyl)4" = 0)
y <- mean(z[-1])
z[-1] <- z[-1] - y
z[1] <- z[1] + y
z
## (Intercept) factor(cyl)6 factor(cyl)8 factor(cyl)4
## 20.5021645 -0.7593074 -5.4021645 6.1614719
But that leaves me without standard error estimates for the former reference level that I just added as an explicit effect, and I need to have those as well.
I did some searching and found the contrasts functions, and tried
lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
but this still only produces two effect estimates. Is there a proper way to change which constraint R uses for linear regression on categorical variables?
I think I've figured it out. Using contrasts actually is the right way to go about it; you just need to do a little work to get the results into a convenient form. Here's the fit:
fit <- lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
Then the contrast matrix cs <- contr.sum(levels(factor(mtcars$cyl))) is used to recover the effect estimates and their standard errors.
The effect estimates just come from multiplying the contrast matrix by the effect estimates lm spits out, like so:
cs %*% coef(fit)[-1]
The standard errors come from the contrast matrix and the variance-covariance matrix of the coefficients; they are the square roots of the diagonal of the transformed covariance matrix:
sqrt(diag(cs %*% vcov(fit)[-1,-1] %*% t(cs)))
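Putting the pieces together (a sketch using the fit from above; the rows come out labeled with the cyl levels 4, 6 and 8):
cs <- contr.sum(levels(factor(mtcars$cyl)))   # sum-to-zero contrast matrix, rows named "4", "6", "8"
data.frame(effect = drop(cs %*% coef(fit)[-1]),
           se     = sqrt(diag(cs %*% vcov(fit)[-1, -1] %*% t(cs))))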

How to plot glm model coefficients with abline in R?

I'm struggling to plot the coefficients of a glm model using abline. Let's take this simple 2D example:
d <- iris[51:150, c(3:4,5)]
d[,3] <- factor(d[,3])
plot(d[,1:2], col=d[,3])
The glm model yields 4 coefficients:
m <- glm(formula = Species~Petal.Length*Petal.Width, data = d, family = "binomial")
m$coefficients
# (Intercept) Petal.Length Petal.Width Petal.Length:Petal.Width
# -131.23813 22.93553 63.63527 -10.63606
How to plot those with a simple abline?
Binomial models are usually not set up like this. You will usually have a single 0/1 response variable (i.e. you predict whether a sample belongs to a given species). Perhaps because you only have two species included in your model, it still seems to work (this is not the case when all three species are included).
The second trick is to predict with type="response" and round those values to get discrete predictions:
d$pred <- factor(levels(d[,3])[round(predict(m, type="response"))+1])
plot(d[,1:2], col=d[,3])
points(d[,1:2], col=d$pred, pch=4)
Here I've added an "X" for the predictions. If the colors match, the prediction was correct; I count 5 samples where the prediction was incorrect.
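If you do want a line that abline can draw, one option (a sketch, assuming the d from the example above) is to refit without the interaction term, so that the 0.5-probability boundary is linear in Petal.Length and Petal.Width; solving the logit for Petal.Width gives the intercept and slope:
m2 <- glm(Species ~ Petal.Length + Petal.Width, data = d, family = "binomial")
cf <- coef(m2)
plot(d[,1:2], col=d[,3])
# decision boundary where the logit equals 0: Petal.Width = -(cf[1] + cf[2]*Petal.Length)/cf[3]
abline(a = -cf[1]/cf[3], b = -cf[2]/cf[3], lty = 2)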

Generating predictive simulations from a multilevel model with random intercepts

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.
I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:
mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:
set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Sepal.Width = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
stringsAsFactors=FALSE)
Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:
library(arm)
n.sims = 1000 # set number of simulations
n.tilde = nrow(fakeiris) # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"]) # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims) # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde)) # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris@coef[s,], sim.fakeiris@sigma[s]) } # use matrix multiplication to fill that array
That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.
Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
Okay, that works, but now what? I run:
sim.fakeiris.lmer <- sim(mod1, n.sims)
When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:
#fixedef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)
#ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)
#sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations
I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.
Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?
One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current development release is available for download from GitHub:
install.packages("devtools")
devtools::install_github("jknowles/merTools")
To get the median and a 95% credible interval of 100 simulations:
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median")
By default, predictInterval includes the residual variation, but you can turn that feature off with:
out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median",
include.resid.var=FALSE)
Hope this helps!
This might help: it doesn't use sim(), but instead uses mvrnorm() to draw new coefficients from the sampling distribution of the fixed-effect parameters, and then uses a bit of internal machinery (setBeta0) to reassign the internal values of the fixed-effect coefficients. The random-effect values are automatically resampled by simulate.merMod via the default argument re.form=NA. However, the residual variance is not resampled; it is held fixed across the simulations, which isn't 100% realistic.
In your use case, you would specify newdata=fakeiris.
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object, n=1, newdata=NULL, ...) {
  v <- vcov(object)                       # covariance matrix of the fixed effects
  b <- fixef(object)                      # fixed-effect estimates
  betapars <- MASS::mvrnorm(n, mu=b, Sigma=v)
  npred <- if (is.null(newdata)) {
    length(predict(object))
  } else nrow(newdata)
  res <- matrix(NA, npred, n)
  for (i in 1:n) {
    object@pp$setBeta0(betapars[i,])      # reassign the internal fixed-effect values
    res[,i] <- simulate(object, newdata=newdata, ...)[[1]]
  }
  return(res)
}
ss <- simfun(mod1,100)
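The result is an npred x n matrix (one row per observation, one column per simulation), so it can be summarized per case much as with y.tilde in the question, for example:
sim.means <- rowMeans(ss)                            # mean simulated prediction per case
sim.ci <- apply(ss, 1, quantile, c(0.025, 0.975))    # 95% interval per case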

'predict' gives different results than manually using the coefficients from 'summary'

Let me state my confusion with the help of an example:
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
# Using the coefficients' values from summary(model), prediction is done for the next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is this: the two sets of predictions for the last 30 values differ, maybe only to a small extent, but they do differ. Why is that? Shouldn't they be exactly the same?
The difference is really small, and I think it is just due to the precision of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the full coefficient values and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason your predictions differ is that predict() uses the coefficients at full precision, whereas in your "manual" calculation you are using only five decimal places. The summary() function doesn't display the complete values of your coefficients but rounds them to make the output more readable.
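As a quick check (using the fit object from the cleaned-up code above), you can print the coefficients with more digits than summary() shows, and confirm that the full-precision manual calculation matches predict():
print(coef(fit), digits = 15)
all.equal(drop(cbind(1, as.matrix(test.df[, 1:3])) %*% coef(fit)),
          predict(fit, test.df))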
