Comparing GLM models using predict - r

Suppose I have two models created by calling glm() on the same data but with different formulas and/or families. Now I want to decide which model is better by predicting on new, unseen data. Something like this:
mod1 <- glm(formula1, family1, data)
mod2 <- glm(formula2, family2, data)
mu1 <- predict(mod1, newdata, type = "response")
mu2 <- predict(mod2, newdata, type = "response")
How can I tell which of the predictions mu1 or mu2 is better?
Is there some simple command to compute the log likelihood of a prediction?

It would be easier to answer this with a reproducible example.
It often makes more sense to choose a family a priori rather than according to goodness of fit. For example, if you have count (non-negative integer) responses with no obvious upper bound, your only real choice that lies strictly within the exponential family is Poisson.
set.seed(101)
x <- runif(1000)
mu <- exp(1+2*x)
y <- rgamma(1000,shape=3,scale=mu/3)
d <- data.frame(x,y)
New data:
nd <- data.frame(x=runif(100))
nd$y <- rgamma(100,shape=3,scale=exp(1+2*nd$x)/3)
Fit Gamma and Gaussian:
mod1 <- glm(y~x,family=Gamma(link="log"),data=d)
mod2 <- glm(y~x,family=gaussian(link="log"),data=d)
Predictions:
mu1 <- predict(mod1, newdata=nd, type="response")
mu2 <- predict(mod2, newdata=nd, type="response")
Extract shape/scale parameters:
sigma <- sqrt(summary(mod2)$dispersion)
shape <- MASS::gamma.shape(mod1)$alpha
Root mean squared error:
rmse <- function(x1,x2) sqrt(mean((x1-x2)^2))
rmse(mu1,nd$y) ## 5.845
rmse(mu2,nd$y) ## 5.842
Negative log likelihoods:
-sum(dgamma(nd$y,shape=shape,scale=mu1/shape,log=TRUE)) ## 276.84
-sum(dnorm(nd$y,mean=mu2,sd=sigma,log=TRUE)) ## 318.4
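The comparison above can be wrapped in a small helper so that both models are scored the same way on the held-out data. This is a minimal sketch, not part of the original answer: heldout_nll is a hypothetical name, and the Gamma shape is taken as 1/dispersion from summary(), a moment-based stand-in for MASS::gamma.shape() that will give slightly different numbers.

```r
set.seed(101)
# Training data: Gamma response with a log-linear mean
d <- data.frame(x = runif(1000))
d$y <- rgamma(1000, shape = 3, scale = exp(1 + 2 * d$x) / 3)
# Held-out data from the same process
nd <- data.frame(x = runif(100))
nd$y <- rgamma(100, shape = 3, scale = exp(1 + 2 * nd$x) / 3)

fit_gamma <- glm(y ~ x, family = Gamma(link = "log"), data = d)
fit_gauss <- glm(y ~ x, family = gaussian(link = "log"), data = d)

# Negative log-likelihood of newdata$y under the fitted model.
# Only the two families used here are handled (an assumption of this sketch).
heldout_nll <- function(fit, newdata) {
  mu  <- predict(fit, newdata = newdata, type = "response")
  phi <- summary(fit)$dispersion  # estimated dispersion
  y   <- newdata$y
  switch(family(fit)$family,
    Gamma    = -sum(dgamma(y, shape = 1 / phi, scale = mu * phi, log = TRUE)),
    gaussian = -sum(dnorm(y, mean = mu, sd = sqrt(phi), log = TRUE)),
    stop("family not handled in this sketch"))
}

nll <- c(gamma = heldout_nll(fit_gamma, nd), gaussian = heldout_nll(fit_gauss, nd))
nll
```

Lower is better. The two values are comparable here because both families define a density for the same response on the same scale; that would not hold if one model were fitted to a transformed response.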

Related

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit){
  # One-step prediction errors are defined as e = y[t] - yhat (see the bsts documentation linked above),
  # so yhat can be recovered as y - e.
  bsts.pred.errors <- bsts.prediction.errors(fit, burn=SuggestBurn(0.1, fit))$in.sample
  # subtract each draw's errors from the observed series
  predictions <- t(apply(bsts.pred.errors, 1, function(e){fit$original.series-e}))
  return(predictions)
}
Another sums the contributions of all model components at time t:
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions  # array: MCMC iteration x component x time
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the prediction matrix with zeros, one row per post-burn-in draw
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  for (comp in seq_len(ncomp)) {
    # state contributions of this component after dropping the burn-in draws:
    # a (niter - burn) x nobs matrix (assumes burn > 0)
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across components
    predictions <- predictions + compX
  }
  return(predictions)
}
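The loop in get_yhats2 is a sum over the component dimension of a three-way array. Here is a minimal sketch of the same computation with apply(), run on a made-up array standing in for fit$state.contributions. The array contents, its sizes, the burn-in value, and the name sum_contributions are all assumptions for illustration.

```r
# Mock stand-in for fit$state.contributions: iteration x component x time
set.seed(42)
niter <- 20; ncomp <- 2; nobs <- 12
X <- array(rnorm(niter * ncomp * nobs), dim = c(niter, ncomp, nobs))
burn <- 2  # hypothetical burn-in; in the question this comes from SuggestBurn()

sum_contributions <- function(X, burn) {
  # rows to keep; setdiff() also behaves sensibly when burn is 0
  keep <- setdiff(seq_len(dim(X)[1]), seq_len(burn))
  # sum over the component dimension, keeping iteration and time
  apply(X[keep, , , drop = FALSE], c(1, 3), sum)
}
predictions <- sum_contributions(X, burn)

# Loop version, as in get_yhats2, for comparison
loop_sum <- matrix(0, nrow = niter - burn, ncol = nobs)
for (comp in seq_len(ncomp)) loop_sum <- loop_sum + X[-seq_len(burn), comp, ]

all.equal(predictions, loop_sum)
```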
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare predictions using each of the functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
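To make the filtering/smoothing distinction concrete, here is a minimal sketch on a hand-rolled local level model. It is not bsts code: the variances are treated as known, the prior is an arbitrary diffuse choice, and all names are hypothetical. The filter produces one-step predictions that use only past data, while the smoother revisits each state using the whole series, so the two give different "fitted values".

```r
set.seed(1)
n <- 100
sigma_obs <- 1      # observation noise sd (assumed known)
sigma_state <- 0.5  # state innovation sd (assumed known)
state <- cumsum(rnorm(n, sd = sigma_state))
y <- state + rnorm(n, sd = sigma_obs)

a_pred <- p_pred <- numeric(n)  # mean/variance of state[i] given y[1..i-1]
a_filt <- p_filt <- numeric(n)  # mean/variance of state[i] given y[1..i]
a <- 0; p <- 1e6                # diffuse prior on the initial state

# Kalman filter: forward pass, looks only backward in time
for (i in seq_len(n)) {
  a_pred[i] <- a
  p_pred[i] <- p
  k <- p / (p + sigma_obs^2)          # Kalman gain
  a_filt[i] <- a + k * (y[i] - a)
  p_filt[i] <- (1 - k) * p
  a <- a_filt[i]                      # random-walk transition
  p <- p_filt[i] + sigma_state^2
}

# Rauch-Tung-Striebel smoother: backward pass, uses all data
a_smooth <- a_filt
for (i in (n - 1):1) {
  g <- p_filt[i] / p_pred[i + 1]
  a_smooth[i] <- a_filt[i] + g * (a_smooth[i + 1] - a_pred[i + 1])
}

one_step_error  <- y - a_pred    # analogous to prediction errors
smooth_residual <- y - a_smooth  # analogous to y minus summed state contributions
c(filter = mean(one_step_error[-1]^2), smoother = mean(smooth_residual^2))
```

The smoothed residuals are smaller because each smoothed state has already seen the observation it is compared with, which is why the two methods in the question cannot agree.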

How to recover fitted values from BSTS poisson model (in R)?

I am trying to recover in-sample predictions (fitted values) from a bsts model with a Poisson response, using the bsts package in R. The code below fails with the error "Prediction errors are not supported for Poisson or logit models."
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log10(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(Y, state.specification=ss, niter=150, family='poisson')
bsts.prediction.errors(bsts.model)
Is there a way to retrieve predictions on model-training data with a poisson model in bsts?
One way to do it is to extract the contribution of each model component at time t and sum them:
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions  # array: MCMC iteration x component x time
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the prediction matrix with zeros, one row per post-burn-in draw
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  for (comp in seq_len(ncomp)) {
    # state contributions of this component after dropping the burn-in draws:
    # a (niter - burn) x nobs matrix (assumes burn > 0)
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across components
    predictions <- predictions + compX
  }
  return(predictions)
}
get_yhats2(bsts.model)
But I also posted a separate question showing that this method did not necessarily match my expectations, even in the Gaussian case.
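One thing worth checking with this approach: assuming the Poisson family in bsts uses a log link, so that the state contributions live on the log scale of the rate (I believe this is the case, but please confirm it in ?bsts), the summed contributions need to be exponentiated to get fitted values on the count scale. The sketch below uses made-up matrices in place of real model output, and every name in it is hypothetical.

```r
set.seed(7)
niter <- 30   # pretend number of post-burn-in draws
nobs  <- 24   # pretend number of time points

# Made-up log-scale component draws (iteration x time)
log_trend <- matrix(log(100) + 0.02 * seq_len(nobs), niter, nobs, byrow = TRUE) +
  rnorm(niter * nobs, sd = 0.05)
log_seasonal <- matrix(0.1 * sin(2 * pi * seq_len(nobs) / 12), niter, nobs, byrow = TRUE)

# What summing the state contributions would give: the log rate per draw
log_rate_draws <- log_trend + log_seasonal

# Back-transform each draw first, then summarise across draws
fitted_draws <- exp(log_rate_draws)
fitted_mean  <- colMeans(fitted_draws)
fitted_mean
```

Exponentiating draw by draw and then averaging is deliberate: exp() of the averaged log rate is a different (smaller or equal) quantity than the average of the exponentiated draws.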

How to minimize RMSE of a regression using optim()?

I need to minimize the RMSE of a linear regression whose weights depend on several parameters.
I tried using optim(), but it fails with the error "missing or negative weights not allowed". As far as I can tell the weights are not negative or missing, because the output of the function changes when the parameters change, which means the weights are being used in the regression.
library(tidyverse)
library(MLmetrics)
library(modelr)
ff1 <- function(a){
  data1 <- sim1
  a1 <- a[1]
  a2 <- a[2]
  # weights are a linear combination of x and y
  data1$w <- a1*data1$x + a2*data1$y
  fit <- lm(y ~ x, data = data1, weights = w)
  x2 <- data.frame(x=data1$x)
  yy <- data.frame(fit = predict(fit,x2))
  data1$fit <- yy$fit
  # RMSE of the fitted values against the observed response
  rmse1 <- RMSE(data1$fit,data1$y)
  return(rmse1)
}
ff1(c(1,1))
ff1(c(1,50))
sol <- optim(c(1,1),ff1)
I tried several methods, but they produce the same error.
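A likely cause, offered as a guess: optim() is free to try negative values of a[1] and a[2], and for those the weights a1*x + a2*y can go negative, which lm() rejects. One possible workaround is to have the objective return Inf for parameter values that give invalid weights; the default Nelder-Mead method tolerates this as long as the starting point is valid. The sketch below uses simulated data as a stand-in for sim1, and rmse_weighted is a hypothetical name.

```r
set.seed(1)
# Simulated stand-in for modelr::sim1 (an assumption of this sketch)
d <- data.frame(x = rep(1:10, each = 3))
d$y <- 10 + 2 * d$x + rnorm(nrow(d), sd = 2)

rmse_weighted <- function(a, data = d) {
  w <- a[1] * data$x + a[2] * data$y
  # lm() refuses negative or missing weights, so flag such parameters as infeasible
  if (anyNA(w) || any(w < 0) || all(w == 0)) return(Inf)
  fit <- lm(y ~ x, data = data, weights = w)
  sqrt(mean((data$y - predict(fit, newdata = data))^2))
}

start_value <- rmse_weighted(c(1, 1))
sol <- optim(c(1, 1), rmse_weighted)  # default method: Nelder-Mead
sol$par
sol$value
```

Because rescaling all weights by a positive constant does not change a weighted least-squares fit, only the ratio of the two parameters matters here, so the reported sol$par is one of many equivalent solutions.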

predict() gives wrong matrix for bs(); how to predict linear regression?

I've run into a problem with the function bs().
library(ISLR)
library(ggplot2)
library(caret)
data(Wage)
#summary(Wage)
set.seed(123)
inTrain <- createDataPartition(Wage$wage, p = 0.7, list = F)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
library(splines)
bsBasis <- bs(training$age, df=3)
bsBasis[1:12,]
lm1 <- lm(wage ~ bsBasis, data=training)
lm1$coefficients
## (Intercept) bsBasis1 bsBasis2 bsBasis3
## 60.22 93.39 51.05 47.28
plot(training$age, training$wage, pch=19, cex=0.5)
points(training$age, predict(lm1, newdata=training), col="red", pch=19, cex=0.5)
predict(bsBasis, age=testing$age)
predict(bsBasis, age=testing$age) returns a 2012x3 matrix, while testing$age has only 988 rows. The result is also identical to bsBasis.
My questions are:
What is predict(bsBasis, age=testing$age) actually doing?
How do I use bsBasis to correctly predict wage in the test data?
Your question 1
Use newx. Check ?predict.bs for its arguments.
x <- runif(100)
b <- bs(x, df = 3)
predict(b, newx = c(0.2, 0.5))
Different predict methods may behave differently. Here, no matter which variable you pass to bs() (age, sex, height, etc.), the argument for new values in predict.bs() is always newx. Your call passes age=, which predict.bs() does not recognize, so newx is missing and the original basis is returned unchanged.
Your question 2
You don't really need to form bsBasis explicitly. When you use splines in a regression, lm and predict.lm take care of constructing the spline basis and evaluating it on new data for you.
lm1 <- lm(wage ~ bs(age, df = 3), data=training)
predict(lm1, newdata = testing)
Note the argument in predict.lm is newdata.
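Both points can be checked on a small simulated data set. This is a minimal sketch: the data frames train and test and their columns are made up to mirror the Wage example, not taken from it.

```r
library(splines)
set.seed(123)
# Made-up stand-ins for the training and testing sets
train <- data.frame(age = runif(200, 18, 80))
train$wage <- 60 + 0.5 * train$age + rnorm(200, sd = 5)
test <- data.frame(age = runif(50, 25, 75))  # kept inside the training range

basis <- bs(train$age, df = 3)

# 'age' is swallowed by '...', newx is missing, so the training basis comes back
unchanged <- predict(basis, age = test$age)
# With newx the basis is evaluated at the new ages
new_basis <- predict(basis, newx = test$age)
dim(unchanged)   # rows match the training data
dim(new_basis)   # rows match the test data

# Putting bs() in the formula lets predict() rebuild the basis for new data
fit <- lm(wage ~ bs(age, df = 3), data = train)
pred <- predict(fit, newdata = test)
length(pred)
```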

R: How to compute AUC and ROC curve for a 'bgeva' object/model?

My data have a binary response with rare events, so I would like to improve the forecast by fitting a bgeva model instead of a gam model. To assess its predictive accuracy and compare it with the other models I tried, I need to calculate the AUC and plot a ROC curve.
The problem is that my code, which works with glm and gam, does not work with a bgeva object. Specifically, calling predict() raises the error:
no applicable method for 'predict' applied to an object of class "bgeva"
and I could not find a solution by searching online.
Here is a simple example from the bgeva package, together with the code I used to calculate the AUC and plot the ROC curve for glm and gam objects:
library(bgeva)
set.seed(0)
n <- 1500
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
f1 <- function(x) (cos(pi*2*x)) + sin(pi*x)
f2 <- function(x) (x+exp(-30*(x-0.5)^2))
y <- as.integer(rlogis(n, location = -6 + 2*x1 + f1(x2) + f2(x3), scale = 1) > 0)
dataSim <- data.frame(y,x1,x2,x3)
################
# bgeva model: #
################
out <- bgeva(y ~ x1 + s(x2) + s(x3))
# AUC for bgeva (does not work)##################################
library(ROCR)
pred <- as.numeric(predict(out, type="response", newdata=dataSim))
rp <- prediction(pred, dataSim$y)
auc <- performance(rp, "auc")@y.values[[1]]
auc
################
# gam model: #
################
library(mgcv)
out_gam <- gam(y ~ x1 + s(x2) + s(x3), family=binomial(link=logit))
# AUC and ROC for gam (the same code, works with gam) ############
pred_gam <- as.numeric(predict(out_gam, type="response"))
rp_gam <- prediction(pred_gam, dataSim$y)
auc_gam <- performance(rp_gam, "auc")@y.values[[1]]
auc_gam
roc_gam <- performance( rp_gam, "tpr", "fpr")
plot(roc_gam)
You can compute the predictions from the gam fit stored inside the bgeva object:
pred <- as.numeric(predict(out$gam.fit, type="response", newdata=dataSim))
With the rest of your example unchanged, this gives:
> auc
[1] 0.7840645
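If you want to sanity-check an AUC without depending on a predict method or on ROCR, the AUC can also be computed directly from any vector of scores using the rank (Mann-Whitney) formula. This is a minimal base-R sketch; auc_rank is a hypothetical helper name, and the logistic model below is only a stand-in for whatever model produced the scores.

```r
# AUC = probability that a random positive gets a higher score than a random negative
auc_rank <- function(score, label) {
  pos <- label == 1
  n1 <- sum(pos)    # number of positives
  n0 <- sum(!pos)   # number of negatives
  # rank() averages ties, which yields the usual tie-corrected AUC
  (sum(rank(score)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Stand-in example: scores from a plain logistic regression on simulated data
set.seed(0)
n <- 500
x <- runif(n)
y <- rbinom(n, size = 1, prob = plogis(-2 + 3 * x))
fit <- glm(y ~ x, family = binomial)
auc_glm <- auc_rank(fitted(fit), y)
auc_glm
```

Any numeric score that orders the observations can be passed in, including predictions extracted from out$gam.fit as above.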
